[seek-dev] Metacase Performance and SQL Optimization (WAS: (no subject))

Fri Sep 24 10:27:32 PDT 2004

On Fri, 2004-09-24 at 13:00, Steve Tekell wrote:
> Typed Parameters?:
> (Maybe this is the same as what someone else mentioned).
> I have used a generic parameter table which had different columns for
> different datatypes.  Something like
> PARAM_NAME varchar
> PARAM_TYPE_ID int (1=string, 2=int,....)
> PARAM_STRING_VALUE varchar
> PARAM_INT_VALUE int
> PARAM_DECIMAL_VALUE decimal
> PARAM_DATETIME_VALUE datetime
> This avoids casting and prevents invalid data, although I am not sure the
> feasibility in the case of metacat.
> 

pretty similar to what I was mentioning, except I use a separate table
for each possible literal (e.g. literal_nn) with one field of the right
type for the value, and the id of the object it relates to. Advantages
are that you only store the literals that are really there in the
original material, making the table compact and fast to search into, you
don't fill up your permanent storage with indexed blanks, and more
importantly, it works a lot better for stuff like polygon-based GIS
data, allowing to overcome limitations in most DB implementations that
constrain the amount or types of columns to only one 'special' column
per table, or only one index, or require the column to be the last one
in the table, or admit only one kind of shape, you name it. That's what
did it for me. There is some bookkeeping required (you need to know in
what table to search each field named in the query) but it's not a lot
different than finding out which column to search for, and an in-memory
hash, serialized between invocations, takes care of it nicely and is
easily mantained automatically. Tables only get created when needed,
which is also nice - you don't have to forecast the possible types in
advance, which is good for porting the db when you finally buy the gis
extensions....

ferdinando

> Steve
> 
> > -----Original Message-----
> > From: seek-dev-admin at ecoinformatics.org 
> > [mailto:seek-dev-admin at ecoinformatics.org] On Behalf Of Shawn Bowers
> > Sent: Friday, September 24, 2004 8:16 AM
> > To: Jing Tao
> > Cc: jones at nceas.ucsb.edu; seek-dev at ecoinformatics.org
> > Subject: Re: [seek-dev] (no subject)
> > 
> > 
> > Just FYI: it is usually much better to write a join than a 
> > nested subquery if you want the query to run fast.  Most 
> > query optimizers don't do very well at optimizing nested 
> > subqueries ... so you might try writing the joins below as 
> > explicit joins.
> > 
> > shawn
> > 
> > On Thu, 23 Sep 2004, Jing Tao wrote:
> > 
> > > Hi, matt, steve and chris:
> > >
> > > This afternoon I played a little bit with the query steve submit to 
> > > ecogrid node and found something about number and string.
> > >
> > > Metacat stores the metadata in "nodedata" field in xml_nodes table. 
> > > The data type for "nodedata" is char. If you want to 
> > compare a number 
> > > in query, db should do casting.
> > >
> > > basically, steve's query will look like:
> > > AND
> > >  condition1 northBoundingCoordinate<=31.5
> > >  condition2 southBoundingCoordinate>=10.0
> > >  condition3 surname=joye
> > > AND
> > >
> > > In metacat log, the time consumer is the main query, taking 271 
> > > seconds (extened  query is trivial, only take 3 seconds). I 
> > post the 
> > > main query to sql client, it took 267.4 seconds to run. So 
> > the result is almost same.
> > >
> > > Then I did some other tests and try to figure it out. Here 
> > is some testing:
> > >
> > > Query 1: northBoundingCoordinate<=31.5
> > >
> > > SQL query look like(simplied, only show different part to quey 2,3):
> > > SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5   AND
> > > parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> > > 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'
> > >
> > > Time: 236 seconds(No too much faster than whole query) appro 100 
> > > documents
> > >
> > >
> > > Query 2: southBoundingCoordinate>=10.0 SQL query look 
> > like(simplied, 
> > > only show differenc to query 1, 3)  SELECT DISTINCT docid FROM 
> > > xml_nodes WHERE nodedata >= 10.0  AND parentnodeid IN 
> > (SELECT nodeid 
> > > FROM xml_index WHERE path LIKE
> > > 'geographicCoverage/boundingCoordinates/southBoundingCoordinate')
> > >
> > > Time: 114.9 seconds appro 100 docs
> > >
> > > Query 3:
> > > Condition surname=joye
> > >
> > > SQL query looks like(only show different part to query 1, 2) SELECT 
> > > DISTINCT docid FROM xml_nodes WHERE UPPER(nodedata) = 'JOYE' AND 
> > > parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> > > '/eml/dataset/creator/individualName/surName'
> > >
> > > Time: 2.8 seconds(Much faster) 4 docs
> > >
> > > Oh, Third one(2.8 secs) is much faster than first one(236 secs) and 
> > > second one(114.9). Postgres have some problem in casting :)
> > >
> > > I also run following query:
> > >
> > > Query 4: northBoundingCoordinate<=31.5 The condition is as same as 
> > > query 1, but the sql query are little bit different. It add 
> > something 
> > > to get rid of non-number cells (this is metacat does now):
> > > SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5  AND
> > > UPPER(nodedata) = LOWER(nodedata) AND LTRIM(nodedata) != ' '  AND 
> > > nodedata IS NOT NULL AND parentnodeid IN (SELECT nodeid 
> > FROM xml_index 
> > > WHERE path LIKE 
> > 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'
> > >
> > > Time: 247.1 seconds(Almost to query 1 - 236 seconds)
> > >
> > >
> > > Query 5: southBoundingCoordinate>=10.0 The condition is as same as 
> > > query 2, but the sql query are little bit different. It add 
> > some thing 
> > > got get rid of nonnumber of cells.
> > >
> > > SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0 AND
> > > UPPER(nodedata) = LOWER(nodedata)  AND nodedata IS NOT NULL AND
> > > LTRIM(nodedata) != ' ' parentnodeid IN (SELECT nodeid FROM 
> > xml_index 
> > > WHERE path LIKE 
> > > 'geographicCoverage/boundingCoordinates/southBoundingCoordinate'
> > >
> > > Time: 14.1 seconds (much faster than query 2 - 114.9 secs)
> > >
> > > The difference between query 2 and query 5 is, query 5 gets rid of 
> > > non-number cell in search space. And query 5(14 secs) is 
> > much faster 
> > > than query 2(114 secs). So it seems progres has a hard time 
> > to ignore 
> > > non-number char during the casting(from char to number).
> > >
> > > But why query 4 (247 seconds) and query 1(236 seconds) no big 
> > > difference even we add something to get rid of nonnumber 
> > cell in query 
> > > 4? I guess the our adding is not completed, it doesn't get 
> > rid of all 
> > > non-number char in cell. We need fixed that.
> > >
> > > I also try to run those queries in Oracle in ecoinfo. But I got a
> > > "ORA-01722: invalid number" error. Durane got the same problem in 
> > > numberica search. Those prove our metacat query couldn't remove all 
> > > non-number character too.
> > >
> > > It seems posgres can ingore non-number character in cell during 
> > > string-number casting, but it is very expensive. Oracle 
> > just throw a 
> > > sql excpeiton.
> > >
> > > Hope this is helpful.
> > >
> > > Jing
> > >
> > >
> > >
> > 
> > _______________________________________________
> > seek-dev mailing list
> > seek-dev at ecoinformatics.org
> > http://www.ecoinformatics.org/mailman/listinfo/seek-dev
> > 
> 
> _______________________________________________
> seek-dev mailing list
> seek-dev at ecoinformatics.org
> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
-- 
Ferdinando Villa, Ph.D., Associate Research Professor, Ecoinformatics
Gund Institute for Ecological Economics and Dept of Botany, Univ. of Vermont
http://ecoinformatics.uvm.edu