[seek-dev] (no subject)

Thu Sep 23 22:16:29 PDT 2004

Hi Bertram,

THose queries are from Metacat.  Metacat is a hybrid thing, where we 
store XML data in a relational database by modeling the DOM tree instead 
of the schema, and our query language is an XPath-like language.  So, 
the metacat engine does the query translation from the XPath syntax to 
the relational model.

The advantage of this model is that we can store arbitrary XML documents 
in a db and validate them against their schemas without any changes to 
our model.  So as EML has gone through its 13 changes, and when we 
wanted to support new models like FGDC, GML, or NBII, there are no 
changes required to metacat -- it just works.

Performance is a huge problem (some power function of document size), 
for several reasons.  First, the DOM is inherently hierarchical, which 
is not a strength of the relational model.  So, we have some pretty 
nasty queries that come out when you start to walk the parent-child 
relations in the tree.  We partially sove this by maintaining a 
path-index, which allows us to do much simpler string-matching 
operations to find the right nodes.  Second, all nodedata is stored in a 
single attribute as a varchar, and so xml fields that are not character 
need to be casted to their numeric forms when a comparison is being 
done.  Practically speaking this happens rarely with metadata (its 
almost all text), but Steve's case is the main exception -- geographic 
coordinates.

I've worked on rewriting the queries for over a month, but I am far from 
an expert.  Some of my analyis of the query plan revealed some 
full-table scans, but fixing those didn't help much with overall query 
speed.  We also make extensive use of the 'IN' operator, which I've read 
can be inefficient when the list gets long.  But it seems more recent 
versions of oracle do better with it, and when I rewrote the query as a 
series of subqueries creating temporary tables, I still saw no 
performance gains.

The last thing I've considered is building in some EML-specific hacks 
that allow at least common EML queries to be faster, but its somewhat 
repugnant to me as it doesn't solve the general problem.  Sid Garg is 
going to continue to work on this over the next several months to try to 
get some major gains....we'll see.  I'd love to talk to you about this 
in more detail, but there are a lot of details, and email would be an 
inefficient way to get through it all.  Maybe we could chat in Kansas 
about it?

Matt

Bertram Ludaescher wrote:
> 
> Jing:
> 
> In your queries you have SQL, nested queries, and some XML/XPath
> stuff.
> 
> What is the SQL+XML/XPath mix that you are using? Where does it come
> from? 
> 
> My first suspicion for the overly long running times would be a "bad"
> query plan, i.e., using e.g. multiple scans over the data, no indexes
> etc.
> 
> For example, your high-level query includes a range query (some
> spatial attribute is between 10.0 and 31.5)
> 
> By the time this query hits the query engine, nothing of this spatial
> range aspects seems visible to the DB engine and instead I see a
> nested query with a "LIKE" etc.
> 
> Why isn't there a straight SQL query? And how do you use SQL to query
> XML? 
> 
> I don't understand this model.. Where can one learn more about this
> model/system?
> 
> Bertram
> 
> 
> 
> 
>>>>>>"JT" == Jing Tao <tao at nceas.ucsb.edu> writes:
> 
> JT> 
> JT> Hi, matt, steve and chris:
> JT> This afternoon I played a little bit with the query steve submit to 
> JT> ecogrid node and found something about number and string.
> JT> 
> JT> Metacat stores the metadata in "nodedata" field in xml_nodes table. The 
> JT> data type for "nodedata" is char. If you want to compare a number in 
> JT> query, db should do casting.
> JT> 
> JT> basically, steve's query will look like:
> JT> AND 
> JT>  condition1 northBoundingCoordinate<=31.5
> JT>  condition2 southBoundingCoordinate>=10.0
> JT>  condition3 surname=joye
> JT> AND
> JT> 
> JT> In metacat log, the time consumer is the main query, taking 271 seconds 
> JT> (extened  query is trivial, only take 3 seconds). I post the main query 
> JT> to sql client, it took 267.4 seconds to run. So the result is almost same.
> JT> 
> JT> Then I did some other tests and try to figure it out. Here is some testing:
> JT> 
> JT> Query 1: northBoundingCoordinate<=31.5
> JT> 
> JT> SQL query look like(simplied, only show different part to quey 2,3):
> JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5   AND 
> JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> JT> 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'  
> JT> 
> JT> Time: 236 seconds(No too much faster than whole query) appro 100 documents
> JT> 
> JT> 
> JT> Query 2: southBoundingCoordinate>=10.0
> JT> SQL query look like(simplied, only show differenc to query 1, 3)
> JT>  SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0  AND 
> JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> JT> 'geographicCoverage/boundingCoordinates/southBoundingCoordinate')
> JT> 
> JT> Time: 114.9 seconds appro 100 docs
> JT> 
> JT> Query 3:
> JT> Condition surname=joye
> JT> 
> JT> SQL query looks like(only show different part to query 1, 2)
> JT> SELECT DISTINCT docid FROM xml_nodes WHERE UPPER(nodedata) = 'JOYE' AND 
> JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> JT> '/eml/dataset/creator/individualName/surName'  
> JT> 
> JT> Time: 2.8 seconds(Much faster) 4 docs
> JT> 
> JT> Oh, Third one(2.8 secs) is much faster than first one(236 secs) and second 
> JT> one(114.9). Postgres have some problem in casting :)
> JT> 
> JT> I also run following query:
> JT> 
> JT> Query 4: northBoundingCoordinate<=31.5
> JT> The condition is as same as query 1, but the sql query are little bit 
> JT> different. It add something to get rid of non-number cells (this is 
> JT> metacat does now):
> JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5  AND 
> JT> UPPER(nodedata) = LOWER(nodedata) AND LTRIM(nodedata) != ' '  AND nodedata 
> JT> IS NOT NULL AND parentnodeid IN (SELECT nodeid FROM xml_index WHERE path 
> JT> LIKE 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'  
> JT> 
> JT> Time: 247.1 seconds(Almost to query 1 - 236 seconds)
> JT> 
> JT> 
> JT> Query 5: southBoundingCoordinate>=10.0
> JT> The condition is as same as query 2, but the sql query are little bit 
> JT> different. It add some thing got get rid of nonnumber of cells.
> JT> 
> JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0 AND 
> JT> UPPER(nodedata) = LOWER(nodedata)  AND nodedata IS NOT NULL AND  
> JT> LTRIM(nodedata) != ' ' parentnodeid IN (SELECT nodeid FROM xml_index WHERE 
> JT> path LIKE 
> JT> 'geographicCoverage/boundingCoordinates/southBoundingCoordinate' 
> JT> 
> JT> Time: 14.1 seconds (much faster than query 2 - 114.9 secs)
> JT> 
> JT> The difference between query 2 and query 5 is, query 5 gets rid of 
> JT> non-number cell in search space. And query 5(14 secs) is much faster than
> JT> query 2(114 secs). So it seems progres has a hard time to ignore 
> JT> non-number char during the casting(from char to number).
> JT> 
> JT> But why query 4 (247 seconds) and query 1(236 seconds) no big difference 
> JT> even we add something to get rid of nonnumber cell in query 4? I guess the 
> JT> our adding is not completed, it doesn't get rid of all non-number char in 
> JT> cell. We need fixed that.
> JT> 
> JT> I also try to run those queries in Oracle in ecoinfo. But I got a 
> JT> "ORA-01722: invalid number" error. Durane got the same problem in 
> JT> numberica search. Those prove our metacat query couldn't remove all 
> JT> non-number character too.
> JT> 
> JT> It seems posgres can ingore non-number character in cell during 
> JT> string-number casting, but it is very expensive. Oracle just throw a sql 
> JT> excpeiton. 
> JT> 
> JT> Hope this is helpful.
> JT> 
> JT> Jing
> JT> 
> JT> 
> JT> -- 
> JT> Jing Tao
> JT> National Center for Ecological
> JT> Analysis and Synthesis (NCEAS)
> JT> 735 State St. Suite 204
> JT> Santa Barbara, CA 93101
> JT> 
> JT> _______________________________________________
> JT> seek-dev mailing list
> JT> seek-dev at ecoinformatics.org
> JT> http://www.ecoinformatics.org/mailman/listinfo/seek-dev

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------