[seek-dev] (no subject)

Ferdinando Villa ferdinando.villa at uvm.edu
Fri Sep 24 06:36:36 PDT 2004


Hi Matt,

would you consider having the query generator use types from a schema or
ontology describing the EML storage model? I had exactly your
implementation before, using varchars for everything, and similar
problems - now I get the types from the OWL or RDFS ontologies, store
each literal field in its own table with a 'value' field of the
appropriate type, keep a catalog of the literal tables handy, and have
an XML file that tells the runtime which specific SQL types, indexes,
and operators to pick for each condition you express in the generic
constraint language (which now directly constrains OWL/RDF individuals,
not XML anymore). It's simpler than it sounds, and runtimes have
improved tenfold - also because I can now use PostGIS and GiST indexes
in Postgres for bounding-box-based queries when the target 'literal' is
a WKT polygon or other shape. Despite having all the IN operators too
and the occasional 5k SELECT, the only time I get a slow query is when a
user manages to intersect the USA with all other countries....
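The per-type literal-table layout described above can be sketched roughly like this, using in-memory SQLite for illustration; the table and column names here are hypothetical, not the actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One table per literal type, each with a 'value' field of the right type,
# plus a catalog mapping ontology properties to their literal table.
# (Hypothetical names, invented data.)
cur.executescript("""
CREATE TABLE literal_double (id INTEGER PRIMARY KEY, subject TEXT, value REAL);
CREATE TABLE literal_string (id INTEGER PRIMARY KEY, subject TEXT, value TEXT);
CREATE TABLE literal_catalog (property TEXT PRIMARY KEY, literal_table TEXT);
CREATE INDEX idx_double_value ON literal_double (value);
""")

cur.executemany("INSERT INTO literal_catalog VALUES (?, ?)",
                [("northBoundingCoordinate", "literal_double"),
                 ("surName", "literal_string")])
cur.executemany("INSERT INTO literal_double (subject, value) VALUES (?, ?)",
                [("doc1", 28.4), ("doc2", 44.1)])

# The runtime looks up the table for the constrained property, then issues
# a typed, index-friendly comparison -- no string-to-number cast at query time.
table, = cur.execute(
    "SELECT literal_table FROM literal_catalog WHERE property = ?",
    ("northBoundingCoordinate",)).fetchone()
rows = cur.execute(f"SELECT subject FROM {table} WHERE value <= 31.5").fetchall()
# doc1 matches
```

The key point is that the type decision happens once, at schema-generation time, instead of once per row at query time.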

ferdinando

On Fri, 2004-09-24 at 01:16, Matt Jones wrote:
> Hi Bertram,
> 
> Those queries are from Metacat.  Metacat is a hybrid thing, where we 
> store XML data in a relational database by modeling the DOM tree instead 
> of the schema, and our query language is an XPath-like language.  So, 
> the metacat engine does the query translation from the XPath syntax to 
> the relational model.
> 
> The advantage of this model is that we can store arbitrary XML documents 
> in a db and validate them against their schemas without any changes to 
> our model.  So as EML has gone through its 13 changes, and when we 
> wanted to support new models like FGDC, GML, or NBII, there are no 
> changes required to metacat -- it just works.
> 
> Performance is a huge problem (some power function of document size), 
> for several reasons.  First, the DOM is inherently hierarchical, which 
> is not a strength of the relational model.  So, we have some pretty 
> nasty queries that come out when you start to walk the parent-child 
> relations in the tree.  We partially solve this by maintaining a 
> path-index, which allows us to do much simpler string-matching 
> operations to find the right nodes.  Second, all nodedata is stored in a 
> single attribute as a varchar, and so XML fields that are not character 
> data need to be cast to their numeric forms when a comparison is being 
> done.  Practically speaking this happens rarely with metadata (it's 
> almost all text), but Steve's case is the main exception -- geographic 
> coordinates.
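The two-table layout Matt describes can be sketched with a toy version in in-memory SQLite (columns simplified, values invented): the path-index turns the tree walk into string matching, but numeric comparisons still force a cast on the text column.

```python
import sqlite3

# Toy sketch of the Metacat layout: a DOM-style node table plus a path
# index, with all values stored in a varchar-like column. The real schema
# has more fields; the data here is invented.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE xml_index (nodeid INTEGER, path TEXT);
CREATE TABLE xml_nodes (nodeid INTEGER, parentnodeid INTEGER, nodedata TEXT);
""")
cur.execute("INSERT INTO xml_index VALUES (1, "
            "'geographicCoverage/boundingCoordinates/northBoundingCoordinate')")
cur.executemany("INSERT INTO xml_nodes VALUES (?, ?, ?)",
                [(10, 1, "28.4"), (11, 1, "44.1")])

# The path index lets us find the right nodes by string matching instead of
# walking parent-child joins, but the numeric comparison still has to cast
# the text column row by row.
rows = cur.execute("""
    SELECT nodeid FROM xml_nodes
    WHERE CAST(nodedata AS REAL) <= 31.5
      AND parentnodeid IN (SELECT nodeid FROM xml_index
                           WHERE path LIKE '%northBoundingCoordinate')
""").fetchall()
```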
> 
> I've worked on rewriting the queries for over a month, but I am far from 
> an expert.  Some of my analysis of the query plan revealed some 
> full-table scans, but fixing those didn't help much with overall query 
> speed.  We also make extensive use of the 'IN' operator, which I've read 
> can be inefficient when the list gets long.  But it seems more recent 
> versions of Oracle do better with it, and when I rewrote the query as a 
> series of subqueries creating temporary tables, I still saw no 
> performance gains.
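One standard alternative to a long IN list is a plain join; a sketch follows (in-memory SQLite, invented data). As Matt notes, whether any such rewrite actually helps depends entirely on the optimizer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE xml_index (nodeid INTEGER, path TEXT);
CREATE TABLE xml_nodes (nodeid INTEGER, parentnodeid INTEGER, nodedata TEXT);
INSERT INTO xml_index VALUES (1, '/eml/dataset/creator/individualName/surName');
INSERT INTO xml_nodes VALUES (10, 1, 'joye');
INSERT INTO xml_nodes VALUES (11, 2, 'smith');
""")

# Original shape: a subquery feeding an IN list.
in_rows = cur.execute("""
    SELECT nodeid FROM xml_nodes
    WHERE parentnodeid IN (SELECT nodeid FROM xml_index
                           WHERE path LIKE '%surName')
""").fetchall()

# Equivalent join: same result set; sometimes, but not always, a better plan.
join_rows = cur.execute("""
    SELECT DISTINCT n.nodeid
    FROM xml_nodes n JOIN xml_index i ON n.parentnodeid = i.nodeid
    WHERE i.path LIKE '%surName'
""").fetchall()
```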
> 
> The last thing I've considered is building in some EML-specific hacks 
> that allow at least common EML queries to be faster, but it's somewhat 
> repugnant to me as it doesn't solve the general problem.  Sid Garg is 
> going to continue to work on this over the next several months to try to 
> get some major gains....we'll see.  I'd love to talk to you about this 
> in more detail, but there are a lot of details, and email would be an 
> inefficient way to get through it all.  Maybe we could chat in Kansas 
> about it?
> 
> Matt
> 
> Bertram Ludaescher wrote:
> > 
> > Jing:
> > 
> > In your queries you have SQL, nested queries, and some XML/XPath
> > stuff.
> > 
> > What is the SQL+XML/XPath mix that you are using? Where does it come
> > from? 
> > 
> > My first suspicion for the overly long running times would be a "bad"
> > query plan, i.e., using e.g. multiple scans over the data, no indexes
> > etc.
> > 
> > For example, your high-level query includes a range query (some
> > spatial attribute is between 10.0 and 31.5)
> > 
> > By the time this query hits the query engine, nothing of this spatial
> > range aspect seems visible to the DB engine, and instead I see a
> > nested query with a "LIKE" etc.
> > 
> > Why isn't there a straight SQL query? And how do you use SQL to query
> > XML? 
> > 
> > I don't understand this model. Where can one learn more about this
> > model/system?
> > 
> > Bertram
> > 
> > 
> > 
> > 
> >>>>>>"JT" == Jing Tao <tao at nceas.ucsb.edu> writes:
> > 
> > JT> 
> > JT> Hi Matt, Steve, and Chris:
> > JT> This afternoon I played a little bit with the query Steve submitted to the 
> > JT> EcoGrid node and found something about numbers and strings.
> > JT> 
> > JT> Metacat stores the metadata in the "nodedata" field of the xml_nodes table. The 
> > JT> data type for "nodedata" is char. If you want to compare a number in a 
> > JT> query, the DB has to do a cast.
> > JT> 
> > JT> Basically, Steve's query looks like:
> > JT> AND 
> > JT>  condition1 northBoundingCoordinate<=31.5
> > JT>  condition2 southBoundingCoordinate>=10.0
> > JT>  condition3 surname=joye
> > JT> AND
> > JT> 
> > JT> In the Metacat log, the time consumer is the main query, taking 271 seconds 
> > JT> (the extended query is trivial, taking only 3 seconds). I posted the main query 
> > JT> to a SQL client, and it took 267.4 seconds to run. So the results are almost the same.
> > JT> 
> > JT> Then I did some other tests to try to figure it out. Here are the tests:
> > JT> 
> > JT> Query 1: northBoundingCoordinate<=31.5
> > JT> 
> > JT> The SQL query looks like (simplified, showing only the part that differs from queries 2 and 3):
> > JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5   AND 
> > JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> > JT> 'geographicCoverage/boundingCoordinates/northBoundingCoordinate')
> > JT> 
> > JT> Time: 236 seconds (not much faster than the whole query); approx. 100 documents
> > JT> 
> > JT> 
> > JT> Query 2: southBoundingCoordinate>=10.0
> > JT> The SQL query looks like (simplified, showing only the difference from queries 1 and 3):
> > JT>  SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0  AND 
> > JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> > JT> 'geographicCoverage/boundingCoordinates/southBoundingCoordinate')
> > JT> 
> > JT> Time: 114.9 seconds; approx. 100 docs
> > JT> 
> > JT> Query 3:
> > JT> Condition surname=joye
> > JT> 
> > JT> The SQL query looks like (showing only the part that differs from queries 1 and 2):
> > JT> SELECT DISTINCT docid FROM xml_nodes WHERE UPPER(nodedata) = 'JOYE' AND 
> > JT> parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> > JT> '/eml/dataset/creator/individualName/surName')
> > JT> 
> > JT> Time: 2.8 seconds (much faster); 4 docs
> > JT> 
> > JT> The third one (2.8 secs) is much faster than the first (236 secs) and the 
> > JT> second one (114.9 secs). Postgres has some problem with casting :)
> > JT> 
> > JT> I also ran the following queries:
> > JT> 
> > JT> Query 4: northBoundingCoordinate<=31.5
> > JT> The condition is the same as query 1, but the SQL query is a little bit 
> > JT> different. It adds something to get rid of non-number cells (this is 
> > JT> what Metacat does now):
> > JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5  AND 
> > JT> UPPER(nodedata) = LOWER(nodedata) AND LTRIM(nodedata) != ' '  AND nodedata 
> > JT> IS NOT NULL AND parentnodeid IN (SELECT nodeid FROM xml_index WHERE path 
> > JT> LIKE 'geographicCoverage/boundingCoordinates/northBoundingCoordinate')
> > JT> 
> > JT> Time: 247.1 seconds (almost the same as query 1's 236 seconds)
> > JT> 
> > JT> 
> > JT> Query 5: southBoundingCoordinate>=10.0
> > JT> The condition is the same as query 2, but the SQL query is a little bit 
> > JT> different. It adds something to get rid of non-number cells.
> > JT> 
> > JT> SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0 AND 
> > JT> UPPER(nodedata) = LOWER(nodedata)  AND nodedata IS NOT NULL AND  
> > JT> LTRIM(nodedata) != ' ' AND parentnodeid IN (SELECT nodeid FROM xml_index WHERE 
> > JT> path LIKE 
> > JT> 'geographicCoverage/boundingCoordinates/southBoundingCoordinate')
> > JT> 
> > JT> Time: 14.1 seconds (much faster than query 2's 114.9 secs)
> > JT> 
> > JT> The difference between query 2 and query 5 is that query 5 gets rid of 
> > JT> non-number cells in the search space. And query 5 (14 secs) is much faster than
> > JT> query 2 (114 secs). So it seems Postgres has a hard time ignoring 
> > JT> non-number characters during the cast (from char to number).
> > JT> 
> > JT> But why is there no big difference between query 4 (247 seconds) and 
> > JT> query 1 (236 seconds), even though we added something to get rid of 
> > JT> non-number cells in query 4? I guess our filter is incomplete; it doesn't 
> > JT> get rid of all non-number characters in the cells. We need to fix that.
> > JT> 
> > JT> I also tried to run those queries in Oracle on ecoinfo, but I got an 
> > JT> "ORA-01722: invalid number" error. Durane got the same problem in 
> > JT> numeric search. That proves our Metacat query couldn't remove all 
> > JT> non-number characters either.
> > JT> 
> > JT> It seems Postgres can ignore non-number characters in a cell during 
> > JT> string-to-number casting, but it is very expensive. Oracle just throws a 
> > JT> SQL exception.
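Jing's fix -- guard the cast by filtering non-numeric cells first -- can be illustrated like this (in-memory SQLite rather than Postgres or Oracle, and the GLOB guard is a stand-in for Metacat's actual UPPER/LOWER/LTRIM filter, which as noted above is incomplete):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE xml_nodes (nodedata TEXT)")
cur.executemany("INSERT INTO xml_nodes VALUES (?)",
                [("28.4",), ("not a number",), ("44.1",)])

# Filter non-numeric rows *before* the numeric comparison. The GLOB pattern
# is a crude illustrative guard, not Metacat's real one. Note that SQLite
# silently casts non-numeric text to 0.0 rather than erroring like Oracle,
# so here the guard matters for correctness as well as speed.
rows = cur.execute("""
    SELECT nodedata FROM xml_nodes
    WHERE nodedata GLOB '[0-9]*'
      AND CAST(nodedata AS REAL) <= 31.5
""").fetchall()
```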
> > JT> 
> > JT> Hope this is helpful.
> > JT> 
> > JT> Jing
> > JT> 
> > JT> 
> > JT> -- 
> > JT> Jing Tao
> > JT> National Center for Ecological
> > JT> Analysis and Synthesis (NCEAS)
> > JT> 735 State St. Suite 204
> > JT> Santa Barbara, CA 93101
> > JT> 
> > JT> _______________________________________________
> > JT> seek-dev mailing list
> > JT> seek-dev at ecoinformatics.org
> > JT> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
-- 
Ferdinando Villa, Ph.D., Associate Research Professor, Ecoinformatics
Gund Institute for Ecological Economics and Dept of Botany, Univ. of Vermont
http://ecoinformatics.uvm.edu



