[seek-dev] (no subject)

Bing Zhu bzhu at sdsc.edu
Fri Sep 24 10:36:49 PDT 2004


I believe we have the similar issue in SRB when I tried to compose more
complicated queries for WhyWhere data and had problem dealing with 'integer'
type meta data.

Note the following condition example which a typical query part in our
query document.

   <condition operator="LESS THAN OR EQUALS"
concept="northBoundingCoordinate">31.5</condition>

The missing piece of information to translate it into underneath data
management
system such as SRB is the attribute "datatype". If we can have this
information
in our query document, the problem of "northBoundingCoordinate<=31.5" can be
avoided.


Bing





-----Original Message-----
From: seek-dev-admin at ecoinformatics.org
[mailto:seek-dev-admin at ecoinformatics.org]On Behalf Of Jing Tao
Sent: Thursday, September 23, 2004 7:49 PM
To: jones at nceas.ucsb.edu
Cc: seek-dev at ecoinformatics.org
Subject: [seek-dev] (no subject)


Hi, matt, steve and chris:

This afternoon I played a little bit with the query steve submit to
ecogrid node and found something about number and string.

Metacat stores the metadata in "nodedata" field in xml_nodes table. The
data type for "nodedata" is char. If you want to compare a number in
query, db should do casting.

basically, steve's query will look like:
AND
 condition1 northBoundingCoordinate<=31.5
 condition2 southBoundingCoordinate>=10.0
 condition3 surname=joye
AND

In metacat log, the time consumer is the main query, taking 271 seconds
(extened  query is trivial, only take 3 seconds). I post the main query
to sql client, it took 267.4 seconds to run. So the result is almost same.

Then I did some other tests and try to figure it out. Here is some testing:

Query 1: northBoundingCoordinate<=31.5

SQL query look like(simplied, only show different part to quey 2,3):
SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5   AND
parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE
'geographicCoverage/boundingCoordinates/northBoundingCoordinate'

Time: 236 seconds(No too much faster than whole query) appro 100 documents


Query 2: southBoundingCoordinate>=10.0
SQL query look like(simplied, only show differenc to query 1, 3)
 SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0  AND
parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE
'geographicCoverage/boundingCoordinates/southBoundingCoordinate')

Time: 114.9 seconds appro 100 docs

Query 3:
Condition surname=joye

SQL query looks like(only show different part to query 1, 2)
SELECT DISTINCT docid FROM xml_nodes WHERE UPPER(nodedata) = 'JOYE' AND
parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE
'/eml/dataset/creator/individualName/surName'

Time: 2.8 seconds(Much faster) 4 docs

Oh, Third one(2.8 secs) is much faster than first one(236 secs) and second
one(114.9). Postgres have some problem in casting :)

I also run following query:

Query 4: northBoundingCoordinate<=31.5
The condition is as same as query 1, but the sql query are little bit
different. It add something to get rid of non-number cells (this is
metacat does now):
SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5  AND
UPPER(nodedata) = LOWER(nodedata) AND LTRIM(nodedata) != ' '  AND nodedata
IS NOT NULL AND parentnodeid IN (SELECT nodeid FROM xml_index WHERE path
LIKE 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'

Time: 247.1 seconds(Almost to query 1 - 236 seconds)


Query 5: southBoundingCoordinate>=10.0
The condition is as same as query 2, but the sql query are little bit
different. It add some thing got get rid of nonnumber of cells.

SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0 AND
UPPER(nodedata) = LOWER(nodedata)  AND nodedata IS NOT NULL AND
LTRIM(nodedata) != ' ' parentnodeid IN (SELECT nodeid FROM xml_index WHERE
path LIKE
'geographicCoverage/boundingCoordinates/southBoundingCoordinate'

Time: 14.1 seconds (much faster than query 2 - 114.9 secs)

The difference between query 2 and query 5 is, query 5 gets rid of
non-number cell in search space. And query 5(14 secs) is much faster than
query 2(114 secs). So it seems progres has a hard time to ignore
non-number char during the casting(from char to number).

But why query 4 (247 seconds) and query 1(236 seconds) no big difference
even we add something to get rid of nonnumber cell in query 4? I guess the
our adding is not completed, it doesn't get rid of all non-number char in
cell. We need fixed that.

I also try to run those queries in Oracle in ecoinfo. But I got a
"ORA-01722: invalid number" error. Durane got the same problem in
numberica search. Those prove our metacat query couldn't remove all
non-number character too.

It seems posgres can ingore non-number character in cell during
string-number casting, but it is very expensive. Oracle just throw a sql
excpeiton.

Hope this is helpful.

Jing


--
Jing Tao
National Center for Ecological
Analysis and Synthesis (NCEAS)
735 State St. Suite 204
Santa Barbara, CA 93101

_______________________________________________
seek-dev mailing list
seek-dev at ecoinformatics.org
http://www.ecoinformatics.org/mailman/listinfo/seek-dev




More information about the Seek-dev mailing list