[seek-dev] Metacase Performance and SQL Optimization (WAS: (no subject))

Fri Sep 24 10:00:36 PDT 2004

Optimizing SQL: EXISTS vs. IN

also correlated subqueries using EXISTS will outperform nested subqueries
using IN.  The larger the resultset of the subquery the more extreme this
is.

(At least this is true with relational data where you have columns for
different fields rather than field name as a value like in Metacat.)

Generally using IN, forces a it to build a list of results whereas using
EXISTS will allow it to check as it goes and use existing indexes without
have to first calculate the list.

I've spent much time optimizing SQL queries for large databases and found
correlated subqueries using EXISTS to be very helpful.  Especially when you
were using IN because joins were difficult or impossible.  It might be
easier to migrate existing queries to correlated subqueries using EXISTS
than to joins.

Optimizing SQL: JOINs - FORM vs. WHERE

One note on optimizing with JOINs.  This may not be to useful here but I
thought I'd mention it.  With full SQL-92 syntax, you can create JOINs in
both the WHERE clause or the FROM clause.  Some databases only support them
in the WHERE clause.  Databases (at least good ones) will execute JOINs in
the most optimal order.  That is if you have several JOINs in your WHERE
clause, it won't execute them in the order declared, but rather use
statistics to determine the most beneficial plan.  However, if you have
JOINs in both your FROM clause and your WHERE clause, it will use those in
the FROM first and then those in WHERE (as well as other WHERE clauses).
This give you both an opportunity to optimize a query yourself when you know
better as well as screw up the databases ability to optimize it when you
don't.

Typed Parameters?:
(Maybe this is the same as what someone else mentioned).
I have used a generic parameter table which had different columns for
different datatypes.  Something like
PARAM_NAME varchar
PARAM_TYPE_ID int (1=string, 2=int,....)
PARAM_STRING_VALUE varchar
PARAM_INT_VALUE int
PARAM_DECIMAL_VALUE decimal
PARAM_DATETIME_VALUE datetime
This avoids casting and prevents invalid data, although I am not sure the
feasibility in the case of metacat.

Steve

> -----Original Message-----
> From: seek-dev-admin at ecoinformatics.org 
> [mailto:seek-dev-admin at ecoinformatics.org] On Behalf Of Shawn Bowers
> Sent: Friday, September 24, 2004 8:16 AM
> To: Jing Tao
> Cc: jones at nceas.ucsb.edu; seek-dev at ecoinformatics.org
> Subject: Re: [seek-dev] (no subject)
> 
> 
> Just FYI: it is usually much better to write a join than a 
> nested subquery if you want the query to run fast.  Most 
> query optimizers don't do very well at optimizing nested 
> subqueries ... so you might try writing the joins below as 
> explicit joins.
> 
> shawn
> 
> On Thu, 23 Sep 2004, Jing Tao wrote:
> 
> > Hi, matt, steve and chris:
> >
> > This afternoon I played a little bit with the query steve submit to 
> > ecogrid node and found something about number and string.
> >
> > Metacat stores the metadata in "nodedata" field in xml_nodes table. 
> > The data type for "nodedata" is char. If you want to 
> compare a number 
> > in query, db should do casting.
> >
> > basically, steve's query will look like:
> > AND
> >  condition1 northBoundingCoordinate<=31.5
> >  condition2 southBoundingCoordinate>=10.0
> >  condition3 surname=joye
> > AND
> >
> > In metacat log, the time consumer is the main query, taking 271 
> > seconds (extened  query is trivial, only take 3 seconds). I 
> post the 
> > main query to sql client, it took 267.4 seconds to run. So 
> the result is almost same.
> >
> > Then I did some other tests and try to figure it out. Here 
> is some testing:
> >
> > Query 1: northBoundingCoordinate<=31.5
> >
> > SQL query look like(simplied, only show different part to quey 2,3):
> > SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5   AND
> > parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> > 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'
> >
> > Time: 236 seconds(No too much faster than whole query) appro 100 
> > documents
> >
> >
> > Query 2: southBoundingCoordinate>=10.0 SQL query look 
> like(simplied, 
> > only show differenc to query 1, 3)  SELECT DISTINCT docid FROM 
> > xml_nodes WHERE nodedata >= 10.0  AND parentnodeid IN 
> (SELECT nodeid 
> > FROM xml_index WHERE path LIKE
> > 'geographicCoverage/boundingCoordinates/southBoundingCoordinate')
> >
> > Time: 114.9 seconds appro 100 docs
> >
> > Query 3:
> > Condition surname=joye
> >
> > SQL query looks like(only show different part to query 1, 2) SELECT 
> > DISTINCT docid FROM xml_nodes WHERE UPPER(nodedata) = 'JOYE' AND 
> > parentnodeid IN (SELECT nodeid FROM xml_index WHERE path LIKE 
> > '/eml/dataset/creator/individualName/surName'
> >
> > Time: 2.8 seconds(Much faster) 4 docs
> >
> > Oh, Third one(2.8 secs) is much faster than first one(236 secs) and 
> > second one(114.9). Postgres have some problem in casting :)
> >
> > I also run following query:
> >
> > Query 4: northBoundingCoordinate<=31.5 The condition is as same as 
> > query 1, but the sql query are little bit different. It add 
> something 
> > to get rid of non-number cells (this is metacat does now):
> > SELECT DISTINCT docid FROM xml_nodes WHERE nodedata <= 31.5  AND
> > UPPER(nodedata) = LOWER(nodedata) AND LTRIM(nodedata) != ' '  AND 
> > nodedata IS NOT NULL AND parentnodeid IN (SELECT nodeid 
> FROM xml_index 
> > WHERE path LIKE 
> 'geographicCoverage/boundingCoordinates/northBoundingCoordinate'
> >
> > Time: 247.1 seconds(Almost to query 1 - 236 seconds)
> >
> >
> > Query 5: southBoundingCoordinate>=10.0 The condition is as same as 
> > query 2, but the sql query are little bit different. It add 
> some thing 
> > got get rid of nonnumber of cells.
> >
> > SELECT DISTINCT docid FROM xml_nodes WHERE nodedata >= 10.0 AND
> > UPPER(nodedata) = LOWER(nodedata)  AND nodedata IS NOT NULL AND
> > LTRIM(nodedata) != ' ' parentnodeid IN (SELECT nodeid FROM 
> xml_index 
> > WHERE path LIKE 
> > 'geographicCoverage/boundingCoordinates/southBoundingCoordinate'
> >
> > Time: 14.1 seconds (much faster than query 2 - 114.9 secs)
> >
> > The difference between query 2 and query 5 is, query 5 gets rid of 
> > non-number cell in search space. And query 5(14 secs) is 
> much faster 
> > than query 2(114 secs). So it seems progres has a hard time 
> to ignore 
> > non-number char during the casting(from char to number).
> >
> > But why query 4 (247 seconds) and query 1(236 seconds) no big 
> > difference even we add something to get rid of nonnumber 
> cell in query 
> > 4? I guess the our adding is not completed, it doesn't get 
> rid of all 
> > non-number char in cell. We need fixed that.
> >
> > I also try to run those queries in Oracle in ecoinfo. But I got a
> > "ORA-01722: invalid number" error. Durane got the same problem in 
> > numberica search. Those prove our metacat query couldn't remove all 
> > non-number character too.
> >
> > It seems posgres can ingore non-number character in cell during 
> > string-number casting, but it is very expensive. Oracle 
> just throw a 
> > sql excpeiton.
> >
> > Hope this is helpful.
> >
> > Jing
> >
> >
> >
> 
> _______________________________________________
> seek-dev mailing list
> seek-dev at ecoinformatics.org
> http://www.ecoinformatics.org/mailman/listinfo/seek-dev
>