[seek-kr-sms] Notes from VLDB
Shawn Bowers
bowers at sdsc.edu
Mon Sep 6 09:17:05 PDT 2004
The last of the notes...
-----------------
Managing data from high-throughput genomic processing: a case study
(invited)
-------------------------------------------------------------------
Toby Bloom
The Broad Institute
cancer genomics / expression analysis
medical and pop. genomics/genotyping
chemical bio
cell circuitry / sequencing
The sequencing center
a manufacturing center
trying to track goop
data consistency, verification
the sequencing process
start with tubes of dna
go through processing, adding agents
wind up with dna in plastic plates 8x12, 16x20 array
1000's of arrays per day
many organisms at once, plates through many steps ...
adding, eventually, fluorescent dye, and what you get out is a
binary signal from a laser
clean it up -- via blast (verification)
a bunch of organisms
get out a sequence
and a set of quality scores (a likelihood score for each base)
distance and heights of paths
Data provenance
huge amount of data management is walking these provenance trees
(the steps from organism to a sequence)
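[a sketch of what walking such a provenance tree could look like in
SQL, assuming a hypothetical step(child_id, parent_id, step_name)
table recording which artifact was produced from which -- not their
schema, just an illustration:
    WITH RECURSIVE lineage(id, depth) AS (
        SELECT parent_id, 1 FROM step WHERE child_id = :read_id
      UNION ALL
        SELECT s.parent_id, l.depth + 1
        FROM step s JOIN lineage l ON s.child_id = l.id
    )
    SELECT * FROM lineage;  -- every ancestor of the read, back to the organism
]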
Lab monitoring
"find the processes that didn't complete"
which plates got stuck? and why? (e.g., the plate got dropped and
broke; it got stuck in a freezer somewhere; figuring out if a
process was restarted based on what you know of the process)
200,000 to 250,000 reads a week
50 billion bases
Troubleshooting queries
For example, find the reason quality dropped on Aug. 1st
get all reads from July to Sept (over 14 million reads)
show me read quality by deck by time (in 3 minute intervals) by
well geometry (slicing and dicing problem)
display it so i can see the trends ... (very difficult with so
many data points, with so much processing that is done on them)
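[a sketch of the kind of slice-and-dice query described above,
against a hypothetical reads(read_id, deck, well_row, well_col,
quality, finished_at) table (finished_at as epoch seconds) -- not
their schema:
    SELECT deck, well_row, well_col,
           (finished_at / 180) * 180 AS time_bucket,  -- 3-minute intervals
           AVG(quality)              AS avg_quality
    FROM reads
    WHERE finished_at BETWEEN :july_start AND :sept_end
    GROUP BY deck, well_row, well_col, (finished_at / 180) * 180
    ORDER BY time_bucket;
the hard part is less the query than rendering millions of resulting
points so that the trend is visible]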
things are a bit different from commercial apps, but it isn't
totally clear why/how (in terms of visualization / mining)
connections in the data visualization are interesting and sort of
unknown -- probably a lot of mining could be done here
Assembly
most of the bioinformatics code was written by bio grad students
in a mad rush
things are put in flat files, loaded into main memory, "crunched",
then written back out to a flat file, then sent out to public
servers
generally structured information, although not viewed by
biologists as useful for dbs
Iterate
genome fishing
giant versioning problem / configuration problem
added problem that everything goes to a standard database
nice thing is that there is a single standard, but everything
needs to still be consistent, and versioning is a major problem
doesn't look like research; just a hard problem
Some Stats
50 GB of raw sequence data per day; >200K reads/day
~50 billion dna nucleotide bases per year; maintain attributes
assoc. with each base; for reference: human genome is ~3 billion
bases
data maintained at the level of a "read" (~800 bases, >200 million
reads in the database)
fewer transactions with better data design
process evolution (research)
process changes every 6 months
acquisition rate doubles every 2 years
major process changes are expected regularly
e.g., throw out all the equipment and start over ($50 million down
the drain)
The research environment
users don't know what they need -- and they aren't supposed to
know yet
the results of any one exp. can cause drastic and very rapid
changes in all future plans
The structure of the data
... [slide flashed by too quickly to capture]
The result
The schemas change extremely quickly, which is the motivation for
flat files
Lots of schema ops needed: compute on old and new, migrate, merge,
etc.
Fixed schemas are not a good fit; too much work goes into changing
schemas
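[a hedged illustration of one such schema op -- computing over an old
and a new version of a table at once via a bridging view; the
reads_v1/reads_v2 tables and columns are made up, not theirs:
    CREATE VIEW reads_all AS
      SELECT read_id, quality,
             CAST(NULL AS integer) AS machine_id  -- old schema lacks machine_id
      FROM reads_v1
      UNION ALL
      SELECT read_id, quality, machine_id
      FROM reads_v2;
]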
Fuzzy hierarchies
bio data doesn't fit nice taxonomies
one example is organism hierarchies: how to keep a standard
taxonomy
Conclusion
research env. makes the schema change problem yet worse
data doesn't fit into a nice, well-structured tabular format all
the time
queries may be qualitatively different in this environment, not
only due to flexibility but also due to the size of the results
robustness, data provenance, data quality are harder issues in a
rapidly changing environment
Database challenges in the integration of biomedical data sets
(invited)
--------------------------------------------------------------
Rakesh Nagarajan (Washington University School of Medicine)
Looking at human disease
most diseases affect multiple levels (genes, etc)
Moving into a quantitative area (like in physics)
more high-throughput analysis
genome-wide microarrays
large scale re-sequencing
looking at 100's of proteins ...
How do you combine
functional genomics experiments
clinico-pathology parameters (patient info)
gene annotation
Gene annotation
biomedical KB of genes
reside in multiple publicly avail. biomed. dbs: unigene,
locuslink, dbsnp, refseq, homologene, gene ontology, pubmed, omim
integration challenges: data source heterogeneity and reliability,
syntactic and semantic equality, data synchronization
microarray profiling (functional genomics)
rna -> label them -> ... scan -> one sample on chip at a time ->
scan looking for red/green/blue ... -> analyze
individual chip: binary image, probe intensity data, summarized
expression data
mutation profiling
sequence 100s of patients and genes, see which have mutations, and
how they relate to the disease at hand
identifies dna-level genetic changes responsible for disease
development or susceptibility
proteomic profiling
similar: take samples, label them with dyes, mix, run a single 2d
gel, image the gel, find differentially expressed proteins, and
eventually identify the proteins
clinicopathology data
entire patient medical record
text or numbers; analog or digital signals; images and videos
biomedical data integration part 1
clinicopathology parameters to functional genomic experiments
metadata connects patient, sample, and data
functional genomic experiments
similar, connect up the metadata
approaches
link integration: start at a page, read about it, link to
additional info, surf
view integration: Kleisli or K2. you see all the data together
through queries (the data resides in the respective source
databases, with an environment built around these)
data warehouse
architecture
it is really hard to get all the data together, all at once
data refresh: data update daily (experimental data); annotation
weekly
speed: fast querying, less than 100 users, transfer large amounts
of data in real time
want to mine the data as an app
data warehouse plus link integration
the approach they chose
integration at the lowest level
needs schema integration, but not query integration (because of
warehouse)
arch. appears to be a data warehouse logical design for their
specific types of data
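[a sketch of the kind of lowest-level integration query such a
warehouse enables -- joining clinical parameters, expression results,
and gene annotation; the table and column names are hypothetical, not
their schema:
    SELECT p.diagnosis, g.gene_symbol,
           AVG(e.expression_level) AS avg_expression
    FROM patient p
    JOIN sample s            ON s.patient_id = p.patient_id
    JOIN expression_result e ON e.sample_id  = s.sample_id
    JOIN gene_annot g        ON g.probe_id   = e.probe_id
    GROUP BY p.diagnosis, g.gene_symbol;
]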
function express server
ETL: Unigene, homologene, locuslink
functional genomic data loaders (form for putting in chips)
like morpho for genes (but probably not as much metadata,
... can't tell)
neural net algorithm used for integration (?)
[he is blazing through these slides; hard to take good
notes... probably should look at the paper]
analysis and visualization applications
a portal for accessing the warehouse for scientists (lots of plots,
etc., links, create gene networks for text mining, expression of a
set of genes using clustering algorithms, literature-based
networks, some browsing via gene ontology) ... looks like a nice start
experiences
works well for them, but not scalable to a large user base
xml works well for them because of the schema flexibility
need syntactic and semantic integration ... mentions the use of
ontologies
future biomedical research
shows a big complex figure I can't read from here ...
basically a picture of a big process/business workflow with
various needs/tools
nationwide data integration
a project on data silos, grid connectivity, etc.
flashed a website, but couldn't catch it
questions
right now we just want to give the data to the biologists
...
The bloomba personal content database (invited)
-----------------------------------------------
Raymie Stata (UC Santa Cruz)
[why was this an invited talk?]
an outlook-like personal database (PCDB)
why?
proliferation of data (email, docs, photos, history, etc.)
proliferation of devices
proliferation of sharing
how it differs from longhorn
requirements of pcdb
scales well in item count
associative retrieval (search) vs. hierarchy
transaction-like semantics
transparent replication
"souped-up file system" not a scaled-down database
apply some db stuff to file systems
web search vs personal search
computing environment: web search is dedicated and controlled,
personal search is borrowed and "hostile" (virus checkers are at
the top of the list -- they delete files, etc. -- and users as well
[hmm, the users are hostile...])
fundamentally interactive
bloomba
pim built on a fast, scalable pcdb
email, contacts, calendar
search integrated organically into the ui
spamassassin, "smart groups", rss, ...
calendar sharing over file or web servers; palm sync as well
queries are everywhere
looks like an email system (because it is hard to get people to
use something totally new)
snuck in features:
can save queries, and they look like folders
bloomba search bar (searches all of your email, gigabytes of
email, and searches attachments as well)
can search in the message itself (like a name, etc.)
queries stored on calendar items
your inbox is actually a search running in the background
(queries are continuous)
everything goes through the database
pcdb functionality
data model: semi-structured, immutable docs; mutable collections
of tags (strings) on those
operations: insert, change tags, get, get summary, delete; atomic,
consistent, weakly isolated, durable; run (expanded) query, run
query counters
components
repository, resource managers (document, tag, index, summary, and
address managers currently)
engineering challenges
hide latency, in the extreme; defer as much as possible, and then:
defer it even longer
they store to disk instead of in main memory, so hiding latency is
very important in this app
don't have to do many background tasks (they defer maintenance
indefinitely ... measured in days)
reliability is less about consistency and more about getting the
user the document they need at that moment (what is needed is
accessibility, not so much consistency management)
query execution
requirements: continuous, support counting, scalable
solution: "delta approach" -- each query is handled by a thread;
threads are stateless (as much as possible) so they can be
scaled. the threads need to be in a position to compute a delta.
the ui is given a delta describing how to update the presentation
of the query result.
Design of query runner
mutator threads are used to update
the database is snapshotted before and after the change; the query
thread compares the two, and the difference becomes the delta
only the parts of the before and after that actually change are
presented to the threads (like compression -- the static stuff is
compressed away)
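[bloomba isn't sql-based, but in relational terms the delta for a
query is easy to picture; assuming hypothetical before/after
snapshots of one query's result:
    -- rows that entered the result between the two snapshots
    SELECT * FROM result_after
    EXCEPT
    SELECT * FROM result_before;
    -- rows that left the result
    SELECT * FROM result_before
    EXCEPT
    SELECT * FROM result_after;
the ui only has to apply these additions/removals instead of
redrawing the whole result]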
Conclusion
what do we need from the data model?
data model
pretty happy with it (e.g., they don't want to move to XML), but
there is a need for more complex relationships
availability
operate in a regime where recovery is required but hasn't
happened yet
compression
very important issue. compression seems to be a forgotten field
now in db research (it was hot, but not any more)
replication/distribution
a very wide open area for personal data [i agree with this
... not much work has gone on here for *personal*, cross device
data availability, access, searching, etc]
questions
their work is primarily ir-based, so the query language is very
much a standard google-style keyword search
biggest problem for corruption is virus checkers
An annotation management system for relational databases
--------------------------------------------------------
Laura Chiticariu (UC Santa Cruz)
Annotation management system
a system that is able to propagate meta-data along with the data
as the data is being moved around
main motivation: to trace the provenance and flow of data
trace annotations through transformations (a la data warehouses,
extract transform load--ETL)
the transformations here are queries
vision
NYRestaurants table (restaurant, cost, type, zip)
all restaurants view, cheap restaurants view
if certain cells in the source table have annotations, these should
be propagated through the view
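[a sketch of what the cheap restaurants view might look like in
their pSQL (the PROPAGATE clause is described below); names follow
the example above, but the exact view definition is mine:
    SELECT DISTINCT r.restaurant, r.zip
    FROM NYRestaurants r
    WHERE r.cost = 'cheap'
    PROPAGATE DEFAULT
    -- any annotation on a source cell that is copied into the view
    -- travels with it, e.g., an annotation on a restaurant's zip
    -- shows up on that zip in the view
]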
other apps
keep info not otherwise stored in current db design
highlight wrong data
security and quality metric
related work
not new idea ... Wang and Madnick [vldb'90]
lee, bressan and madnick [widm]
bernstein and bergstraesser [data eng. bulletin 99]
webdb 99 - superimposed paper
why-provenance: cui, widom, wiener [cww00]
annotations on genomic data
outline
pSQL queries
semantics
custom
default
default-all
select-from-where with an additional propagate clause that
describes how annotations should propagate
default, default-all, and custom clause
a psql query is a union of psql fragments
the custom scheme
based on user spec.
SELECT DISTINCT A
FROM R r
PROPAGATE r.A TO A
UNION
SELECT DISTINCT A
FROM R r
PROPAGATE r.B TO A
the default scheme
propagates annotations according to where the data is copied from
e.g., select B from R r would implicitly have propagate r.B to B
problematic for joins ... if you want all annotations to follow
select R.a, R.b, S.c
from R, S
where R.b = S.b
only R.b's annotations get propagated here ... not S.b's
the default-all scheme
propagates annotations according to where the data is copied from
in all equivalent formulations of the query
query basis
a query basis of Q, denoted B(Q), is a finite set of pSQL queries
(with the default propagation scheme)
generating a query basis
given R(A,B) and S(B,C),
and query
select r.A, s.B, s.C
from R r, S s
where r.B = s.B
propagate default-all
representative query Q0:
propagate r.A to A, s.B to B, s.C to C, r.B to B
auxiliary queries
have to read the paper because this example doesn't explain what
is happening ... the auxiliary queries aren't very clear from
the talk, need to read the paper
correctness of the algorithm
theorem: given a pSQL query Q with default-all, the algorithm
generates a query basis of Q
implementation
translator module
input: a psql query q
output: an sql query q' written against the naive storage scheme
q' is sent to the rdbms and executed
postprocessing
a naive storage scheme
for every attribute of every relation there is an additional
attribute for storing the annotations
conceivably, other possible storage schemes (explore that later)
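[a sketch of the naive scheme for the R(A,B) relation used above --
one extra annotation column per attribute; the column names are
illustrative, and how q' fills them in is in the paper:
    CREATE TABLE R (
      A       varchar,
      A_annot varchar,  -- annotations attached to the A cell
      B       varchar,
      B_annot varchar   -- annotations attached to the B cell
    );
]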
experiments
datasets: the TPC-H database; test queries: SPJ queries, varying
the number of joins from 0 to 4 and the number of selected
attributes over 1, 3, and 5
compare sql vs psql default vs psql default-all
[this experiment stuff is required at vldb; obviously here it
doesn't mean anything ... they aren't comparing it to
anything. This is something that bugs me: it is good to discuss
implementation and experiments, but unless you are comparing
against something that exists, there is no need to spend a 1/3 of
a paper on it]
A NOTE: I looked over their paper, reading about 3/4 of the way
through. I think that there is a mistake in the paper. In
particular, they are assuming that the query processor is
non-optimizing. Seeing their talk really helped me to better
understand the paper. One reason for the auxiliary queries
is to help satisfy their semantics for default-all, which tries to
scoop-up all the annotations for a data value. Here is an example:
SELECT p.ID as ID, p.Name as Name
FROM PIR p
PROPAGATE DEFAULT-ALL
The PIR table has attributes ID and Name. The original table has
two rows: (p332 {a7}, AB {a8}) and (p916 {a9}, AB {a10}), where {}'s
denote the annotations of the cell. The result is a table again with
two rows, but with more annotations: (p332 {a7}, AB {a8,a10}) and
(p916 {a9}, AB {a8,a10}). The additional annotations on AB were
"scooped-up".
So, in their approach they use these auxiliary queries to do this
"scooping". They define the semantics of this approach using
conjunctive queries, e.g., giving the following queries (according
to their algorithm):
q2(ID, Name) :- PIR(ID, Name).
q2(ID, Name) :- PIR(ID, Name), PIR(ID', Name).
q2(ID, Name) :- PIR(ID, Name), PIR(ID, Name').
Or, using SQL:
SELECT p.ID as ID, p.Name as Name
FROM PIR p
UNION
SELECT p.ID as ID, p.Name as Name
FROM PIR p, PIR p'
WHERE p'.ID = p.ID
UNION
SELECT p.ID as ID, p.Name as Name
FROM PIR p, PIR p''
WHERE p''.Name = p.Name
But the basis of their approach seems to contradict itself: they
claim that the three different queries are all equivalent, but that
they also provide more annotations. First, there is something funny
about their reasoning, and second, optimizers should recognize the
equivalence and presumably rewrite to the "smaller" query. This
approach also seems plain wrong in the conjunctive form, which
becomes clear when it is substituted with the first-order logic
representation and existentials are introduced over ID' and
Name'. Ultimately, it isn't
such a problem except that their entire theory is based on the
conjunctive form. I haven't finished reading the paper, but I am
guessing that the approach works because they do some other
rewriting to extend the table itself with the annotations (they call
it a naive representation of the annotations).
Symmetric relations and cardinality-bounded multisets in database
systems
-----------------------------------------------------------------
Kenneth Ross
symmetric relations and multisets
if r(a,b) then r(b,a); r is symmetric
s(a,b,c,d) then s also holds for all permutations of (a,b,c,d)
symmetric relations == multisets
bounded cardinality
pairs of items: people who have met
limited size groups sharing a common property: people on a soccer
team
conservative bounds: children of a person; employees with the same
manager
multisets in rdbms
first-normal form: (set-id, other columns), (set-id, member)
"which sets contain alic, bob, and carol?" need a 4-way join
object-relational db
non-first-normal form rep: (set-id, other cols, collection)
like varray in oracle
collection must be accessed as a whole
no indexing on elements of arrays (need scans)
collection elements not queryable like columns
user-defined relational multisets
(set-id, other cols, person1, person2, person3)
schema designer must: avoid redundancy (fix an ordering, e.g.,
store alice,bob,carol but not bob,alice,carol), and code updates to
respect the constraints, for each multiset op
user must: query mult. cols (bob could be in any column); k!
permutations for k cols.
this functionality should be provided by the database
approach
represent multisets as columns in a table
columns interchangeable
table type provided by db
one combined index for all symm. columns
automatic integrity checking on updates
technical contributions
symmetric table data structure
query optimization
inferring symmetric views
special syntax
kernel
create table meeting (
set-id integer,
....
symmetric multiset persons (person1, person2, person3) varchar
)
db automatically stores persons in lexical order
queries
select * from meeting where {alice, bob, carol} among persons
new "among" keyword; no need to explicitly list all permutations
of person1, person2, person3
create table tv (
set-id integer,
multiset slots {c1, c2, ..., c24} integer
)
select * from tv where {2, 4} among slots and not ({5} among slots)
slots is not symmetric; the proposed techniques are still useful
for queries that treat slots symmetrically
symmetric closure operator
compute all column-permutations of the symmetric attributes
if K is the kernel, then p(K) is the set of its permutations
implementing symmetric closure
generating permutations is easy
really want to implement the combination of selections and p(K),
i.e., sel_s(p(K))
allows the use of predicates in access plans (e.g., use the
combined index on the symmetric attributes)
even though p adds no expressive power, it is desirable as an
operator
optimization
try to push selections inside p
ok for symmetric conditions (e.g., x=y)
partial info pushed for nonsymmetric conditions (e.g., x-y > 7)
closure applied over fewer records; smaller intermediate results
push joins too
other rules (see paper)
example
m(p1, p2, l, d, t) = p_p1,p2(K)
p1 and p2 meet at location L on date D at time T
s(v, ...)
v is a suspect
r(w, ...)
w is a monitored location
s join_p1=v ( r join_w=l p_p1,p2(K) )
do join before symmetric closure (push join on R and K first)
then, push s join, but rewrite join condition as p1=v or p2=v
symmetric views
try to show that a view V is symmetric
a symmetrically materialized view can be stored efficiently as a
kernel
we can replace V by p(ker(V)) inside larger queries: enables more
query optimization options
inferring symmetry
uses the idea of containment mappings
employees sharing the same manager
Q(K1,K2) :- E(K1,M), E(K2,M).
Q^T(K1,K2) :- E(K2,M), E(K1,M).
where Q^T is the transposed query; the containment mapping here is
just the identity
employees whose managers are siblings
q(k1,k2) :- e(k1,m1), e(k2,m2), s(m1,m2)
q^t(k1,k2) :- e(k2,m1), e(k1,m2), s(m1,m2)
not contained unless you know the symmetry of s.
can get sound and complete mappings for certain queries
experimental evaluation
k(Y, x1, x2, x3), x1 <= x2 <= x3
500,000 rows
q1: select * from K where
(x1='foo' and x2='bar' and x3='baz') or
... all permutations
as a view:
create view v as
select y, x1, x2, x3 from K union
select y, x2, x1, x3 from K union ...
q3: select m1.x, m2.x, m3.x, s.y
from m m1, m m2, m m3, s
where m1.id=s.id and ...
q4: select * from K where {'foo','bar','baz'} among {x1,x2,x3}
rewrite to: select * from K where (x1='bar' and x2='baz' and
x3='foo')
[i was a little lost at the talk on why they can rewrite the query
to this; it works because the kernel stores the values in sorted
order (x1 <= x2 <= x3), so the among set can be sorted too and
matched against a single permutation ... and obviously they will do
a lot better!]
[but of course, they are going to tell us about it]
claim their approach scales: syntax linear in set/query size, etc.
conclusions: multisets as columns
[this approach really just introduces symmetric-multiset and
non-symmetric-multiset constraints over relations ... which is cool,
but they don't appear to view the problem in this way; so I wonder
if there are any problems with this type of constraint "interfering"
with constraints for reasoning -- they did mention that you can
infer for some queries -- seems weak on abstraction/theory and heavy
on trying it out; maybe the paper goes into more detail about the
theory side]
GridFields: Algebraic manipulations of scientific datasets
----------------------------------------------------------
Bill Howe
CORIE: sensor arrays and simulations used to verify, rescue,
transform, visualize
sensors produce 5 MB/day, simulations 5-10 GB/day
grid-equipped scientific datasets
1-d, 2-d, 3-d, etc. gridded data sets
examples
H = 2d horizontal grid
T = 1d time grid
V = 1d vertical grid
contributions
familar activities in a new domain:
logical data model with ops; derive some algebraic laws,
implement physical ops, devise exectuion strategies, query
language
model
grid topology: a collection of cells of various dimensions;
implicit or explicit incidence relationships
2-cells = {A,B}
1-cells = {m,n,p,q,r}
0-cells = ...
Grid + Bound data = GridField
total function over cells of dimension k
tuples of numeric primitives
two gridfields may share a grid
representation
attributes stored column-wise; values linked to cells positionally
(grid, k, attributes)
operations
bind(b) -- associate gridfields with data
intersection, cross product -- combine gridfields
restrict(r) -- keep only the cells satisfying a predicate r
aggregate (project, join, apply, slice) --
rewrite rules
cross commutes with restrict, apply
... flashed the slide
conventional approaches
relational limitations
3 tables: node data, cell data, incidence table
load time approaches data generation time
OLTP paradigm inappropriate for primarily read-only data
data redundancy due to trivial join dependency embedded in key
spatial db limitations
2 tables: node data (with point data type), cell data (polygon type)
incidence rel computed rather than explicitly recorded
geometry redundantly defined in nodes and cells
no concept of "grid"; impedance mismatch
3d visualization libraries
grid -> restrict -> new grid
VTK is a main app (extract geom, threshold, extract grid, ...,
types of restrict ops)
limitations distilled
grids aren't first class
topology conflated with geometry
equivalence obscured
"same topology, different geometry"
examples of optimization
H : (x,y,b) ~ x,y horizontal coords, b is depth
V : (z)
H cross V, restrict (z > b), bind(salinity data set), project on
x,y to view a region of interest
can push restrict, then take cross product, then restrict to
bottom, etc.
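[a sketch of what roughly the same pipeline looks like written
against the relational (node/cell/incidence) representation
mentioned above; purely illustrative table and column names:
    SELECT h.x, h.y, s.salinity
    FROM h_nodes h               -- horizontal grid nodes: (node_id, x, y, b)
    JOIN v_nodes v ON v.z > h.b  -- cross H with V, restricted to z > b
    JOIN salinity s ON s.h_node_id = h.node_id
                   AND s.v_node_id = v.node_id
    WHERE h.x BETWEEN :x0 AND :x1   -- region of interest
      AND h.y BETWEEN :y0 AND :y1;
this only touches the node data; preserving the cell/incidence
structure of the result would take yet more joins, which is part of
the limitation above]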
compared to vtk, and rdbms [... vtk outperforms until higher
selectivity]
transect (vertical slice)
another example ...
these are claimed as queries, but they are more like workflows
... dataflow-based manipulations of the grids; of course you can
do some optimization
related work
baumann 99, marathe + salem 97 (multidim. arrays)
data models for sci. data (butler 89, collins 91)
vtk
gis
conclusions
gridfield expressions provide succinct "recipes" for practical
data products
equivalences provide optimization opportunities
experiments with hand-optimized plans
[NOTE: there was an invited paper session on issues in data
warehousing that I also wanted to attend, but couldn't do both ...]