[seek-dev] (no subject)
Bertram Ludaescher
ludaesch at sdsc.edu
Thu Sep 23 20:34:52 PDT 2004
Jing:
In your queries you have SQL, nested queries, and some XML/XPath
stuff.
What is the SQL+XML/XPath mix that you are using? Where does it come
from?
My first suspicion for the overly long running times would be a "bad"
query plan, i.e., using e.g. multiple scans over the data, no indexes
etc.
For example, your high-level query includes a range query (some
spatial attribute is between 10.0 and 31.5)
By the time this query hits the query engine, nothing of this spatial
range aspects seems visible to the DB engine and instead I see a
nested query with a "LIKE" etc.
Why isn't there a straight SQL query? And how do you use SQL to query
XML?
I don't understand this model.. Where can one learn more about this
model/system?
Bertram
>>>>> "JT" == Jing Tao <tao at nceas.ucsb.edu> writes:
JT>
JT> Hi, matt, steve and chris:
JT> This afternoon I played a little bit with the query steve submit to
JT> ecogrid node and found something about number and string.
JT>
JT> Metacat stores the metadata in "nodedata" field in xml_nodes table. The
JT> data type for "nodedata" is char. If you want to compare a number in
JT> query, db should do casting.
JT>
JT> basically, steve's query will look like:
JT> AND
JT> condition1 northBoundingCoordinate<=31.5
JT> condition2 southBoundingCoordinate>=10.0
JT> condition3 surname=joye
JT> AND
JT>
JT> In metacat log, the time consumer is the main query, taking 271 seconds
JT> (extened query is trivial, only take 3 seconds). I post the main query
JT> to sql client, it took 267.4 seconds to run. So the result is almost same.
JT>
JT> Then I did some other tests and try to figure it out. Here is some testing:
JT>
JT> Query 1: northBoundingCoordinate<=31.5
JT>
JT> SQL query look like(simplied, only show different part to quey 2,3):
JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5 AND
JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE
JT> 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'
JT>
JT> Time: 236 seconds(No too much faster than whole query) appro 100 documents
JT>
JT>
JT> Query 2: southBoundingCoordinate>=10.0
JT> SQL query look like(simplied, only show differenc to query 1, 3)
JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0 AND
JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE
JT> 'geographicCoverage/boundingCoordinates/southBoundingCoordinate')
JT>
JT> Time: 114.9 seconds appro 100 docs
JT>
JT> Query 3:
JT> Condition surname=joye
JT>
JT> SQL query looks like(only show different part to query 1, 2)
JT> SELECT DISTINCT docid FROM xml_nodes WHERE UPPER(nodedata) = 'JOYE' AND
JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE
JT> '/eml/dataset/creator/individualName/surName'
JT>
JT> Time: 2.8 seconds(Much faster) 4 docs
JT>
JT> Oh, Third one(2.8 secs) is much faster than first one(236 secs) and second
JT> one(114.9). Postgres have some problem in casting :)
JT>
JT> I also run following query:
JT>
JT> Query 4: northBoundingCoordinate<=31.5
JT> The condition is as same as query 1, but the sql query are little bit
JT> different. It add something to get rid of non-number cells (this is
JT> metacat does now):
JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5 AND
JT> UPPER(nodedata) = LOWER(nodedata) AND LTRIM(nodedata) != ' ' AND nodedata
JT> IS NOT NULL AND parentnodeid IN (SELECT nodeid FROM xml_index WHERE path
JT> LIKE 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'
JT>
JT> Time: 247.1 seconds(Almost to query 1 - 236 seconds)
JT>
JT>
JT> Query 5: southBoundingCoordinate>=10.0
JT> The condition is as same as query 2, but the sql query are little bit
JT> different. It add some thing got get rid of nonnumber of cells.
JT>
JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0 AND
JT> UPPER(nodedata) = LOWER(nodedata) AND nodedata IS NOT NULL AND
JT> LTRIM(nodedata) != ' ' parentnodeid IN (SELECT nodeid FROM xml_index WHERE
JT> path LIKE
JT> 'geographicCoverage/boundingCoordinates/southBoundingCoordinate'
JT>
JT> Time: 14.1 seconds (much faster than query 2 - 114.9 secs)
JT>
JT> The difference between query 2 and query 5 is, query 5 gets rid of
JT> non-number cell in search space. And query 5(14 secs) is much faster than
JT> query 2(114 secs). So it seems progres has a hard time to ignore
JT> non-number char during the casting(from char to number).
JT>
JT> But why query 4 (247 seconds) and query 1(236 seconds) no big difference
JT> even we add something to get rid of nonnumber cell in query 4? I guess the
JT> our adding is not completed, it doesn't get rid of all non-number char in
JT> cell. We need fixed that.
JT>
JT> I also try to run those queries in Oracle in ecoinfo. But I got a
JT> "ORA-01722: invalid number" error. Durane got the same problem in
JT> numberica search. Those prove our metacat query couldn't remove all
JT> non-number character too.
JT>
JT> It seems posgres can ingore non-number character in cell during
JT> string-number casting, but it is very expensive. Oracle just throw a sql
JT> excpeiton.
JT>
JT> Hope this is helpful.
JT>
JT> Jing
JT>
JT>
JT> --
JT> Jing Tao
JT> National Center for Ecological
JT> Analysis and Synthesis (NCEAS)
JT> 735 State St. Suite 204
JT> Santa Barbara, CA 93101
JT>
JT> _______________________________________________
JT> seek-dev mailing list
JT> seek-dev at ecoinformatics.org
JT> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
More information about the Seek-dev
mailing list