[seek-dev] (no subject)

Thu Sep 23 20:34:52 PDT 2004

Jing:

In your queries you have SQL, nested queries, and some XML/XPath
stuff.

What is the SQL+XML/XPath mix that you are using? Where does it come
from? 

My first suspicion for the overly long running times would be a "bad"
query plan, i.e., using e.g. multiple scans over the data, no indexes
etc.

For example, your high-level query includes a range query (some
spatial attribute is between 10.0 and 31.5)

By the time this query hits the query engine, nothing of this spatial
range aspects seems visible to the DB engine and instead I see a
nested query with a "LIKE" etc.

Why isn't there a straight SQL query? And how do you use SQL to query
XML? 

I don't understand this model.. Where can one learn more about this
model/system?

Bertram

>>>>> "JT" == Jing Tao <tao at nceas.ucsb.edu> writes:
JT> 
JT> Hi, matt, steve and chris:
JT> This afternoon I played a little bit with the query steve submit to 
JT> ecogrid node and found something about number and string.
JT> 
JT> Metacat stores the metadata in "nodedata" field in xml_nodes table. The 
JT> data type for "nodedata" is char. If you want to compare a number in 
JT> query, db should do casting.
JT> 
JT> basically, steve's query will look like:
JT> AND 
JT>  condition1 northBoundingCoordinate<=31.5
JT>  condition2 southBoundingCoordinate>=10.0
JT>  condition3 surname=joye
JT> AND
JT> 
JT> In metacat log, the time consumer is the main query, taking 271 seconds 
JT> (extened  query is trivial, only take 3 seconds). I post the main query 
JT> to sql client, it took 267.4 seconds to run. So the result is almost same.
JT> 
JT> Then I did some other tests and try to figure it out. Here is some testing:
JT> 
JT> Query 1: northBoundingCoordinate<=31.5
JT> 
JT> SQL query look like(simplied, only show different part to quey 2,3):
JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5   AND 
JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
JT> 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'  
JT> 
JT> Time: 236 seconds(No too much faster than whole query) appro 100 documents
JT> 
JT> 
JT> Query 2: southBoundingCoordinate>=10.0
JT> SQL query look like(simplied, only show differenc to query 1, 3)
JT>  SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0  AND 
JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
JT> 'geographicCoverage/boundingCoordinates/southBoundingCoordinate')
JT> 
JT> Time: 114.9 seconds appro 100 docs
JT> 
JT> Query 3:
JT> Condition surname=joye
JT> 
JT> SQL query looks like(only show different part to query 1, 2)
JT> SELECT DISTINCT docid FROM xml_nodes WHERE UPPER(nodedata) = 'JOYE' AND 
JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
JT> '/eml/dataset/creator/individualName/surName'  
JT> 
JT> Time: 2.8 seconds(Much faster) 4 docs
JT> 
JT> Oh, Third one(2.8 secs) is much faster than first one(236 secs) and second 
JT> one(114.9). Postgres have some problem in casting :)
JT> 
JT> I also run following query:
JT> 
JT> Query 4: northBoundingCoordinate<=31.5
JT> The condition is as same as query 1, but the sql query are little bit 
JT> different. It add something to get rid of non-number cells (this is 
JT> metacat does now):
JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5  AND 
JT> UPPER(nodedata) = LOWER(nodedata) AND LTRIM(nodedata) != ' '  AND nodedata 
JT> IS NOT NULL AND parentnodeid IN (SELECT nodeid FROM xml_index WHERE path 
JT> LIKE 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'  
JT> 
JT> Time: 247.1 seconds(Almost to query 1 - 236 seconds)
JT> 
JT> 
JT> Query 5: southBoundingCoordinate>=10.0
JT> The condition is as same as query 2, but the sql query are little bit 
JT> different. It add some thing got get rid of nonnumber of cells.
JT> 
JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0 AND 
JT> UPPER(nodedata) = LOWER(nodedata)  AND nodedata IS NOT NULL AND  
JT> LTRIM(nodedata) != ' ' parentnodeid IN (SELECT nodeid FROM xml_index WHERE 
JT> path LIKE 
JT> 'geographicCoverage/boundingCoordinates/southBoundingCoordinate' 
JT> 
JT> Time: 14.1 seconds (much faster than query 2 - 114.9 secs)
JT> 
JT> The difference between query 2 and query 5 is, query 5 gets rid of 
JT> non-number cell in search space. And query 5(14 secs) is much faster than
JT> query 2(114 secs). So it seems progres has a hard time to ignore 
JT> non-number char during the casting(from char to number).
JT> 
JT> But why query 4 (247 seconds) and query 1(236 seconds) no big difference 
JT> even we add something to get rid of nonnumber cell in query 4? I guess the 
JT> our adding is not completed, it doesn't get rid of all non-number char in 
JT> cell. We need fixed that.
JT> 
JT> I also try to run those queries in Oracle in ecoinfo. But I got a 
JT> "ORA-01722: invalid number" error. Durane got the same problem in 
JT> numberica search. Those prove our metacat query couldn't remove all 
JT> non-number character too.
JT> 
JT> It seems posgres can ingore non-number character in cell during 
JT> string-number casting, but it is very expensive. Oracle just throw a sql 
JT> excpeiton. 
JT> 
JT> Hope this is helpful.
JT> 
JT> Jing
JT> 
JT> 
JT> -- 
JT> Jing Tao
JT> National Center for Ecological
JT> Analysis and Synthesis (NCEAS)
JT> 735 State St. Suite 204
JT> Santa Barbara, CA 93101
JT> 
JT> _______________________________________________
JT> seek-dev mailing list
JT> seek-dev at ecoinformatics.org
JT> http://www.ecoinformatics.org/mailman/listinfo/seek-dev