[seek-kr-sms] Comments on Shawn Bowers [Notes from VLDB] message posted on Sept 06, 2004

Thu Oct 7 21:40:45 PDT 2004

Dear Shawn,

I accidentally found your notes (published at
http://www.ecoinformatics.org/pipermail/seek-kr-sms/2004-September/000251.ht
ml) regarding our paper entitled: 

Deepavali Bhagwat, Laura Chiticariu, Wang Chiew Tan, Gaurav Vijayvargiya: 
An Annotation Management System for Relational Databases. VLDB 2004

I would like to comment on some points you made in these notes. I hope these
comments will help you understand our work better.

     [experiments
     datasets: TPCH database test queries: spj
     queries, varied joins from 0 to 4 and selected attributes 1, 3,
     and 5
     compare sql vs psql default vs psql default-all

     [this experiment stuff is required at vldb; obviously here it
     doesn't mean anything ... they aren't comparing it to
     anything. This is something that bugs me: it is good to discuss
     implementation and experiments, but unless you are comparing
     against something that exists, there is no need to spend a 1/3 of
     a paper on it]]

I disagree completely with this comment. First of all, there is no such
requirement that "experiment stuff is required at vldb". VLDB has a known
tradition of accepting papers that may be completely theory, systems or
theory+systems.

Second, in our experimental section we do draw a comparison (as you
correctly noticed from the VLDB talk) between SQL and pSQL, the new language
that we propose. To the best of our knowledge, ours is the first
implementation of an annotation management system for relational databases
that allows a user to specify how annotations should propagate (via pSQL
queries). As pSQL is a newly introduced language, SQL was the only language
we could compare pSQL with in terms of execution performance (hence we do
compare our language against something that exists already). The difference
between a pSQL query Q and its "corresponding SQL query" Q' (i.e., the SQL
query obtained after all PROPAGATE clauses in Q are removed - please refer
to our paper-section 2, second paragraph) is that Q propagates annotations
to the result, while Q' doesn’t. Thus, our main goal was to evaluate exactly
this overhead introduced by propagating annotations to the result. 

Thirdly, our experimental evaluations show that pSQL queries with
Default-All clause perform much worst compared to their corresponding SQL
queries. On the good side though, pSQL with Default clause performs much
better compared to pSQL Default-All. At best a pSQL default query performed
about the same as its corresponding SQL query and on the average, pSQL
queries with Default clause performed only 40% worst on the 100Mb dataset
when compared to their corresponding SQL queries. For larger datasets, this
40% tends to become even smaller (18% for 1Gb datasets - please refer to
Section 5 of our paper for a more detailed discussion). Given the many
benefits one may have from propagating annotations (tracing the provenance
and flow of data would be one example) and based on these empirical results,
we concluded that it is not unreasonable to propagate annotations under the
Default scheme (in other words, a user would not wait an unreasonable amount
of time to get the annotations along with the actual data instead of the
actual data only). I therefore fail to understand why you said our
experimental results "don't mean anything" and we aren't "comparing to
anything". 

Another point I want to make is that our experimental section only takes
about 1.5 pages from a total of 12 pages (not 1/3 of the paper as you have
incorrectly mentioned), although we wished we had more space to include
other experimental results.

   [A NOTE: I looked over their paper, reading it through about 3/4's of
   the way.  I think that there is a mistake in the paper. In
   particular, they are assuming that the query processor is
   non-optimizing.]

It doesn't matter whether the query engine is optimizing or not. Please see
the explanation below.

   [After seeing their talk, it really helped me to
   better understand their paper. One reason for the auxiliary queries
   is to help satisfy their semantics for default-all, which tries to
   scoop-up all the annotations for a data value. Here is an example:

     SELECT p.ID as ID, p.Name as Name
     FROM PIR p
     PROPAGATE DEFAULT-ALL

   The PIR table has attributes ID and Name.  The original table has
   two rows: (p332 {a7}, AB {a8}) and (p916 {a9}, AB {a10}), where {}'s
   denote the annotations of the cell. The result is a table again with
   two rows, but with more annotations: (p332 {a7}, AB {a8,a10}) and
   (p916 {a9}, AB {a8,a10}). The additional annotations on AB were
   "scooped-up".

   So, in their approach they use these auxilliary queries to do this
   "scooping".  They define the semantics of this approach using
   conjunctive queries, e.g., giving the following queries (according
   to their algorithm):

     q2(ID, Name) :- PIR(ID, Name).
     q2(ID, Name) :- PIR(ID, Name), PIR(ID', Name).
     q2(ID, Name) :- PIR(ID, Name), PIR(ID, Name').

   Or, using SQL:

     SELECT p.ID as ID, p.Name as Name
     FROM PIR p
     UNION
     SELECT p.ID as ID, p.Name as Name
     FROM PIR p, PIR p'
     WHERE p'.ID = p.ID
     UNION
     SELECT p.ID as ID, p.Name as Name
     FROM PIR p, PIR p''
     WHERE p''.Name = p.Name]

As described in section 3, given a pSQL query with Default-All clause, our
Generate-Query-Basis algorithm produces a set of pSQL queries with custom
propagate clause, and not a set of SQL queries. So for this example, the
algorithm produces the following three pSQL queries with custom scheme:

Q_0:
SELECT p.ID as ID, p.Name as Name
FROM PIR p
PROPAGATE p.ID TO ID, p.Name TO Name

Q_1:
SELECT p.ID as ID, p.Name as Name
FROM PIR p, PIR p1
WHERE p1.ID = p.ID
PROPAGATE p.ID TO ID, p.Name TO Name, p1.ID TO ID

Q_2:
SELECT p.ID as ID, p.Name as Name
FROM PIR p, PIR p2
WHERE p2.Name = p.Name
PROPAGATE p.ID TO ID, p.Name TO Name, p2.Name TO Name

   [But, the basis of their approach contradicts the approach: they
   claim that the three different queries are all equivalent, but that
   they also provide more annotations. First, there is something funny
   about there reasoning, and second, optimizers should recognize the
   equivalence, and presumably rewrite to the "smaller" query.]

We indeed say that the pSQL queries Q_0, Q_1 and Q_2 are equivalent, but
note that we make a clear distinction between the notion of equivalence and
annotation-equivalence (please refer to the Subsection Containment vs.
annotation-containment in Section 2 of the paper). If two pSQL queries Q_1
and Q_2 are equivalent (i.e., Q_1(D)=Q_2(D), for every database D), this
doesn't imply that Q_1 and Q_2 are annotation equivalent (i.e., they produce
the same annotated output on all databases). Please refer to the following
paper for more details about annotation-containment:

Wang Chiew Tan: Containment of Relational Queries with Annotation
Propagation. DBPL 2003: 37-53

   [This approach is also just plain wrong in the conjunctive form, which
   becomes clear when substituted with the first-order logic rep. when
   existenials are introduced over ID' and Name'. Ultimately, it isn't
   such a problem except that their entire theory is based on the
   conjunctive form. I haven't finished reading the paper, but I am
   guessing that the approach works because they do some other
   rewriting to extend the table itself with the annotations (they call
   it a naive representation of the annotations).]

The approach indeed works. Our translator rewrites the pSQL queries with
custom propagation scheme into SQL queries against our naïve storage scheme.
I am including below the rewritings for Q_0, Q_1 and Q_2 according to our
translation algorithm. Note that the naïve representation of the relation
PIR(ID, Name) is in fact PIR(ID, ID_a, Name, Name_a), where ID_a and Name_a
are additional columns used to store the annotations for the ID, and
respectively, Name columns. 

Q_0:
SELECT p.ID as ID, p.ID_a as ID_a, p.Name as Name, p.Name_a as Name_a FROM
PIR p

Q_1:
SELECT p.ID as ID, p.ID_a as ID_a, p.Name as Name, p.Name_a as Name_a FROM
PIR p, PIR p1 WHERE p1.ID = p.ID UNION SELECT p.ID as ID, p1.ID_a as ID_a,
p.Name as Name, NULL as Name_a FROM PIR p, PIR p1 WHERE p1.ID = p.ID

Q_2:
SELECT p.ID as ID, p.ID_a as ID_a, p.Name as Name, p.Name_a as Name_a FROM
PIR p, PIR p2 WHERE p2.Name = p.Name UNION SELECT p.ID as ID, NULL as ID_a,
p.Name as Name, p2.Name_a as Name_a FROM PIR p, PIR p2 WHERE p2.Name =
p.Name

Clearly, Q_0, Q_1 and Q_2 are not equivalent, hence it doesn't matter
whether or not the query engine is optimizing.

I apologize for posting this long e-mail but I felt it is important to point
out your misconceptions about our paper since you have posted your comments
about our paper on a public newsgroup.
Please let me know if you have further questions.

Bests,
Laura