[seek-kr-sms] Re: Comments on Shawn Bowers [Notes from VLDB] message posted on Sept 06, 2004

Fri Oct 8 12:43:06 PDT 2004

Dear Laura (and Colleagues),

I appreciate that you responded in detail to the email I sent to the
seek-kr-sms discussion group, and I also apologize that the email
was made publicly available, and that you saw my critiques in this way.
Here is some of the context surrounding the original email:

  - I attended VLDB, saw your presentation and a number of others, and
    tried to make fairly detailed notes (and some commentary) of most of
    the presentations I saw.

  - The notes were sent to a discussion group for members of a
    research project. This discussion group is primarily composed of
    people funded through the project. The email was meant for members of
    this group and not for the general public. In hindsight, I shouldn't
    have emailed the notes to the group (or should have at least deleted
    my commentary), being that the list is logged on the Web.

  - The notes I made, and especially the commentary, were meant for
    discussion, much like a class mailing list at a university, e.g., or
    reading group discussing research and papers. Some of the comments
    were critical (in fact, probably too much so) to spur discussion and
    were also "off-the-cuff" comments/doubts as I was writing the notes.

  - I spent additional time looking at your (and your colleagues) work
    because I find it really interesting, new, and important. I didn't get
    a chance to spend quite enough time getting into many of the details
    concerning your paper; and read through the paper only at the
    conference ... and so many of the comments are a bit half-baked and
    reflect me just trying to understand what you've done.

I hope that you didn't and don't take what I wrote as personal criticism.
The critiques were only directed at technical issues; and were really only
meant for the internal members of the seek-kr-sms list.  Also, and because
of this, we removed the emails from the web archive of the mailing list.

Since you did take the time to comment, I'd like to respond to some of
your comments. But, I understand if you are not eager to further
discuss/debate the technical issues ...

>      [this experiment stuff is required at vldb; obviously here it
>      doesn't mean anything ... they aren't comparing it to
>      anything. This is something that bugs me: it is good to discuss
>      implementation and experiments, but unless you are comparing
>      against something that exists, there is no need to spend a 1/3 of
>      a paper on it]]
>
> I disagree completely with this comment. First of all, there is no
> such requirement that "experiment stuff is required at vldb". VLDB has
> a known tradition of accepting papers that may be completely theory,
> systems or theory+systems.
>

You are right: There is no requirement for experimentation sections in
VLDB or any other database conference that I am aware of -- this was more
of a general (overly critical and somewhat cynical) comment that I don't
find the experiment sections that useful generally. Some of what I said
wasn't meant to be directed at your paper. I just find that often the
experiments are a bit misused. A bigger problem is that often experiments
discussed are not repeatable, given the information provided in the papers
...

> Second, in our experimental section we do draw a comparison (as you
> correctly noticed from the VLDB talk) between SQL and pSQL, the new
> language that we propose. To the best of our knowledge, ours is the
> first implementation of an annotation management system for relational
> databases that allows a user to specify how annotations should
> propagate (via pSQL queries).  As pSQL is a newly introduced language,
> SQL was the only language we could compare pSQL with in terms of
> execution performance (hence we do compare our language against
> something that exists already). The difference between a pSQL query Q
> and its "corresponding SQL query" Q' (i.e., the SQL query obtained
> after all PROPAGATE clauses in Q are removed - please refer to our
> paper-section 2, second paragraph) is that Q propagates annotations to
> the result, while Q' doesn^Rt. Thus, our main goal was to evaluate
> exactly this overhead introduced by propagating annotations to the
> result.

What you say makes sense ... that by comparing a pSQL query to an SQL
query you can potentially get an estimate of the overhead for your
implementation of annotation propogation.  I was under the impression that
your implementation takes a pSQL query, and rewrites the query into an
equivalent SQL query (that does propogate annotations).  One wouldn't
expect a normal SQL query to propogate annotations -- so in that sense, it
is a bit like comparing apples and oranges, which is where my confusion
came from -- but it makes sense that you would compare these with the
intention to discuss overhead.

As another note on deteriming the overhead, it seems like you could
actually make a stronger statement concerning overhead than the
experimental estimates. In particular, it seems like you can give a fairly
precise characterization of the overhead because you know precisely the
SQL-part of a pSQL query, and you know precisely how the pSQL queries are
rewritten into equivalent SQL expresssions. It probably gets somewhat
complex with query optimization, but perhaps that could be factored out of
the equation.  Just a thought.

> It doesn't matter whether the query engine is optimizing or not.
> Please see the explanation below.

My comments on the "accuracy" of your approach are a bit muddled. As I
said, I don't think these details change your results (i.e., I wasn't
arguing that your implementation is incorrect). I am just confused a bit
about your formalization.

Where I saw, and still see, a problem is in the way the conjunctive
queries are used in the formalization. My thinking is described below.
These comments are all based on Section 3 of your paper.

Given a default-all pSQL query Q, the rewriting of Q to an equivalent SQL
query (that "correctly" propogates annotations) B(Q) consists of two
steps: (1) computing the "representative" query Q0 of Q, (2) computing the
"auxilliary" queries Q1 ... Qn. This rewriting you call
"generate-query-basis".

My interpretation of the "representative" query Q0 is that for each tuple
produced by the query, each value of the tuple has all of the annotations
from the particular values used to generate the result.

Your example, in the CQ notation, is:

A0(x,y,z) :- Mapping_Table(w,x,u,v), SWISS-PROT(x,z), PIR(u,y).

So, e.g., for a tuple in A0, the x-value of the tuple takes annotations
from the corresponding x-value of the Mapping_Table AND the corresponding
x-value of the SWISS-PROT table that were used to generate the particular
result.

The "auxilliary" queries go one step further, by also including all the
annotations for the same x-values in the Mapping_Table and the SWISS-PROT
table. For example, the following may be the facts that "generated" a
particular A0 result tuple <5,0,0>.

Mapping_Table(0,5,0,0), SWISS-PROT(5,0), PIR(0,0).

But, there may also be a tuple/fact Mapping_Table(1,5,1,1), where there is
a different annotation for this occurrence of the value 5. The goal of the
"auxilliary" queries is to also include this annotation in the result
<5,0,0>. (This was what I meant by "scoop-up" before.)

Assuming this is an accurate interpretation of the "auxilliary" queries,
here is where I see a problem.

You then show that you generate a number of auxilliary CQ queries, e.g.,
here is the one you give for the above example.

A1(x,y,z) :- Mapping_Table(w,x,u,v), SWISS-PROT(x,z), PIR(u,y),
             Mapping_Table(w1,x,w2,w3).

I understand that this query is an equivalent query to A0. In fact, that
is where I think the problem comes in with your formalization.

Semantically, the query above states:

(FORALL x,y,z) [A1(x,y,z) =>
  (EXISTS w,u,v,w1,w2,w3)
  Mapping_Table(w,x,u,v) & SWISS-PROT(x,z) & PIR(u,y) &
  Mapping_Table(w1,x,w2,w3)]

In words, for every x, y, and z, A1(x,y,z) holds if there exists a w, u,
v, w1, w2, and w3 such that Mapping_Table(w,x,u,v), SWISS-PROT(x,z),
PIR(u,y), and Mapping_Table(w1,x,w2,w3) also holds. Clearly, if the first
mapping table formula is satisfied, then the second is as well.

Thus, because of the existential quantifiers over w1, w2, and w3, there is
no implied "iteration" over these values for the second mapping table
formula. Hence, they are not necessarily "collected" by the rule, as shown
in the first-order semantics of the rule.

This is what I was refering to by "non-optimizing" -- this is a known
optimization technique for datalog; I think the first paper that discusses
some of these issues is "Optimizing existential datalog queries", R.
Ramakrishnan et al., PODS, 1988.

Basically, I might be being dense, but I don't see how in your
formalization these auxilliary rules perform the desired result of the
"generate-query-basis" as stated, without giving a different
interpretation to CQ rules.  I think on the SQL side, the rewritings make
sense because you are adding the _a columns and so on...

As a side note: I think the "representative" query Q0 could also make a
useful propogation scheme in itself, and actually begs the question of the
purpose of the original annotations.  For example, if one gives an
annotation to a value v in column c in one tuple, and in a different tuple
a different annotation to v in c, what is the reason a user made, or a
database contains, these distinct annotations for the same value?  It
could just be a mistake, or possibly the data was "unioned" from multiple
sources. But could it also be that the tuple denotes some type of context
in which the annotation is applicable? If so, it might not be appropriate
to go further and do the auxilliary part ... From your experience looking
at annotations, why did you find the proprogate-all a more natural
semantics for interpreting the purpose of the annotations?

Again, sorry about the critical comments being made "googeable" and thanks
for responding.

Shawn