[seek-kr-sms] Notes from VLDB
Shawn Bowers
bowers at sdsc.edu
Mon Sep 6 09:17:05 PDT 2004
The last of the notes...
-----------------
Managing data from high-throughput genomic processing: a case study
(invited)
-------------------------------------------------------------------
Toby Bloom
The Broad Institute
cancer genomics / expression analysis
medical and pop. genomics/genotyping
chemical bio
cell circuitry / sequencing
The sequencing center
a manufacturing center
trying to track goop
data consistency, verification
the sequencing process
start with tubes of dna
go through processing, adding agents
wind up with dna in plastic plates 8x12, 16x20 array
1000's of arrays per day
many organisms at once, plates through many steps ...
adding, eventually, fluorescent dye, and what you get out is a
binary signal from a laser
clean it up -- via blast (verification)
a bunch of organisms
get out a sequence
and a set of quality scores (a likelihood score for each base)
distance and heights of paths
Data provenance
huge amount of data management is walking these provenance trees
(the steps from organism to a sequence)
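[a sketch of what walking such a provenance tree could look like in
SQL, assuming a hypothetical step(child_id, parent_id, step_name)
table recording which artifact was produced from which -- not their
schema, just an illustration:
    WITH RECURSIVE lineage(id, depth) AS (
        SELECT parent_id, 1 FROM step WHERE child_id = :read_id
      UNION ALL
        SELECT s.parent_id, l.depth + 1
        FROM step s JOIN lineage l ON s.child_id = l.id
    )
    SELECT * FROM lineage;  -- every ancestor of the read, back to the organism
]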
Lab monitoring
"find the processes that didn't complete"
which plates got stuck? and why? (e.g., the plate got dropped and
broke; it got stuck in a freezer somewhere; figuring out if a
process was restarted based on what you know of the process)
200,000 to 250,000 reads a week
50 billion bases
Troubleshooting queries
For example, find the reason quality dropped on Aug. 1st
get all reads from July to Sept (over 14 million reads)
show me read quality by deck by time (in 3 minute intervals) by
well geometry (slicing and dicing problem)
display it so i can see the trends ... (very difficult with so
many data points, with so much processing that is done on them)
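[a sketch of the kind of slice-and-dice query described above,
against a hypothetical reads(read_id, deck, well_row, well_col,
quality, finished_at) table (finished_at as epoch seconds) -- not
their schema:
    SELECT deck, well_row, well_col,
           (finished_at / 180) * 180 AS time_bucket,  -- 3-minute intervals
           AVG(quality)              AS avg_quality
    FROM reads
    WHERE finished_at BETWEEN :july_start AND :sept_end
    GROUP BY deck, well_row, well_col, (finished_at / 180) * 180
    ORDER BY time_bucket;
the hard part is less the query than rendering millions of resulting
points so that the trend is visible]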
things are a bit different from commercial apps, but it isn't
totally clear why/how (in terms of visualization / mining)
connections in the data visualization are interesting and sort of
unknown -- probably a lot of mining could be done here
Assembly
most of the bioinformatics code was written by bio grad students
in a mad rush
things are put in flat files, loaded into main memory, "crunched",
then written back out to a flat file, then sent out to public
servers
generally structured information, although not viewed by
biologists as useful for dbs
Iterate
genome fishing
giant versioning problem / configuration problem
added problem that everything goes to a standard database
nice thing is that there is a single standard, but everything
needs to still be consistent, and versioning is a major problem
doesn't look like research; just a hard problem
Some Stats
50 GB of raw sequence data per day; >200K reads/day
~50 billion dna nucleotide bases per year; maintain attributes
assoc. with each base; for reference: human genome is ~3 billion
bases
data maintained at the level of a "read" (~800 bases, >200 million
reads in the database)
fewer transactions with better data design
process evolution (research)
process changes every 6 months
acquisition rate doubles every 2 years
major process changes are expected regularly
e.g., throw out all the equipment and start over ($50 million down
the drain)
The research environment
users don't know what they need -- and they aren't supposed to
know yet
the results of any one exp. can cause drastic and very rapid
changes in all future plans
The structure of the data
... [slide flashed by too quickly to capture]
The result
The schemas change extremely quickly, which is the motivation for
flat files
Lots of schema ops needed: compute on old and new, migrate, merge,
etc.
Fixed schemas are not a good fit; too much work goes into changing
schemas
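[a hedged illustration of one such schema op -- computing over an old
and a new version of a table at once via a bridging view; the
reads_v1/reads_v2 tables and columns are made up, not theirs:
    CREATE VIEW reads_all AS
      SELECT read_id, quality,
             CAST(NULL AS integer) AS machine_id  -- old schema lacks machine_id
      FROM reads_v1
      UNION ALL
      SELECT read_id, quality, machine_id
      FROM reads_v2;
]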
Fuzzy hierarchies
bio data doesn't fit nice taxonomies
one example is organism hierarchies: how to keep a standard
taxonomy
Conclusion
research env. makes the schema change problem yet worse
data doesn't fit into a nice, well-structured tabular format all
the time
queries may be qualitatively different in this environment, not
only due to flexibility but also due to the size of the results
robustness, data provenance, data quality are harder issues in a
rapidly changing environment
Database challenges in the integration of biomedical data sets
(invited)
--------------------------------------------------------------
Rakesh Nagarajan (Washington University School of Medicine)
Looking at human disease
most diseases affect multiple levels (genes, etc)
Moving into a quantitative area (like in physics)
more high-throughput analysis
genome-wide microarrays
large scale re-sequencing
looking at 100's of proteins ...
How do you combine
functional genomics experiments
clinico-pathology parameters (patient info)
gene annotation
Gene annotation
biomedical KB of genes
reside in multiple publicly avail. biomed. dbs: unigene,
locuslink, dbsnp, refseq, homologene, gene ontology, pubmed, omim
integration challenges: data source heterogeneity and reliability,
syntactic and semantic equality, data synchronization
microarray profiling (functional genomics)
rna -> label them -> ... scan -> one sample on chip at a time ->
scan looking for red/green/blue ... -> analyze
individual chip: binary image, probe intensity data, summarized
expression data
mutation profiling
sequence 100s of patients and genes, see which have mutations, and
how they relate to the disease at hand
identifies dna-level genetic changes responsible for disease
development or susceptibility
proteomic profiling
similar: take samples, label them with dyes, mix, run a single 2d
gel, image the gel, find differentially expressed proteins, and
eventually identify the proteins
clinicopathology data
entire patient medical record
text or numbers; analog or digital signals; images and videos
biomedical data integration part 1
clinicopathology parameters to functional genomic experiments
metadata connects patient, sample, and data
functional genomic experiments
similar, connect up the metadata
approaches
link integration: start at a page, read about it, link to
additional info, surf
view integration: Kleisli or K2. you see all the data together
through queries (the data resides in the respective source
databases, with an environment built around these)
data warehouse
architecture
it is really hard to get all the data together, all at once
data refresh: data update daily (experimental data); annotation
weekly
speed: fast querying, less than 100 users, transfer large amounts
of data in real time
want to mine the data as an app
data warehouse plus link integration
the approach they chose
integration at the lowest level
needs schema integration, but not query integration (because of
warehouse)
arch. appears to be a data warehouse logical design for their
specific types of data
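[a sketch of the kind of lowest-level integration query such a
warehouse enables -- joining clinical parameters, expression results,
and gene annotation; the table and column names are hypothetical, not
their schema:
    SELECT p.diagnosis, g.gene_symbol,
           AVG(e.expression_level) AS avg_expression
    FROM patient p
    JOIN sample s            ON s.patient_id = p.patient_id
    JOIN expression_result e ON e.sample_id  = s.sample_id
    JOIN gene_annot g        ON g.probe_id   = e.probe_id
    GROUP BY p.diagnosis, g.gene_symbol;
]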
function express server
ETL: Unigene, homologene, locuslink
functional genomic data loaders (form for putting in chips)
like morpho for genes (but probably not as much metadata,
... can't tell)
neural net algorithm used for integration (?)
[he is blazing through these slides; hard to take good
notes... probably should look at the paper]
analysis and visualization applications
a portal for accessing the warehouse for scientists (lots of plots,
etc., links, create gene networks for text mining, expression of a
set of genes using clustering algorithms, literature-based
networks, some browsing via gene ontology) ... looks like a nice start
experiences
works well for them, but not scalable to a large user base
xml works well for them because of the schema flexibility
need syntactic and semantic integration ... mentions the use of
ontologies
future biomedical research
shows a big complex figure I can't read from here ...
basically a picture of a big process/business workflow with
various needs/tools
nationwide data integration
a project on data silos, grid connectivity, etc.
flashed a website, but couldn't catch it
questions
right now we just want to give the data to the biologists
...
The bloomba personal content database (invited)
-----------------------------------------------
Raymie Stata (UC Santa Cruz)
[why was this an invited talk?]
an outlook-like personal database (PCDB)
why?
proliferation of data (email, docs, photos, history, etc.)
proliferation of devices
proliferation of sharing
how it differs from longhorn
requirements of pcdb
scales well in item count
associative retrieval (search) vs. hierarchy
transaction-like semantics
transparent replication
"souped-up file system" not a scaled-down database
apply some db stuff to file systems
web search vs personal search
computing environment: web search is dedicated and controlled,
personal search is borrowed and "hostile" (virus checkers are at
the top of the list -- they delete files, etc. -- and users as well
[hmm, the users are hostile...])
fundamentally interactive
bloomba
pim built on a fast, scalable pcdb
email, contacts, calendar
search integrated organically into the ui
spamassassin, "smart groups", rss, ...
calendar sharing over file or web servers; palm sync as well
queries are everywhere
looks like an email system (because it is hard to get people to
use something totally new)
snuck in features:
can save queries, and they look like folders
bloomba search bar (searches all of your email, gigabytes of
email, and searches attachments as well)
can search in the message itself (like a name, etc.)
queries stored on calendar items
your inbox is actually a search running in the background
(queries are continuous)
everything goes through the database
pcdb functionality
data model: semi-structured, immutable docs; mutable collections
of tags (strings) on those
operations: insert, change tags, get, get summary, delete; atomic,
consistent, weakly isolated, durable; run (expanded) query, run
query counters
components
repository, resource managers (document, tag, index, summary, and
address managers currently)
engineering challenges
hide latency, in the extreme; defer as much as possible, and then:
defer it even longer
they store to disk instead of in main memory, so hiding latency is
very important in this app
don't have to do many background tasks (they defer maintenance
indefinitely ... measured in days)
reliability is less about consistency and more about getting the
user the document they need at that moment (what is needed is
accessibility, not so much consistency management)
query execution
requirements: continuous, support counting, scalable
solution: "delta approach" -- each query is handled by a thread;
threads are stateless (as much as possible) so they can be
scaled. the threads need to be in a position to compute a delta.
the ui is given a delta describing how to update the presentation
of the query result.
Design of query runner
mutator threads are used to update
the database is snapshotted before and after the change; the query
thread compares the two, and the difference becomes the delta
only the parts of the before and after that actually change are
presented to the threads (like compression -- the static stuff is
compressed away)
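[bloomba isn't sql-based, but in relational terms the delta for a
query is easy to picture; assuming hypothetical before/after
snapshots of one query's result:
    -- rows that entered the result between the two snapshots
    SELECT * FROM result_after
    EXCEPT
    SELECT * FROM result_before;
    -- rows that left the result
    SELECT * FROM result_before
    EXCEPT
    SELECT * FROM result_after;
the ui only has to apply these additions/removals instead of
redrawing the whole result]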
Conclusion
what do we need from the data model?
data model
pretty happy with it (e.g., they don't want to move to XML), but
there is a need for more complex relationships
availability
operate in a regime where recovery is required but hasn't
happened yet
compression
very important issue. compression seems to be a forgotten field
now in db research (it was hot, but not any more)
replication/distribution
a very wide open area for personal data [i agree with this
... not much work has gone on here for *personal*, cross device
data availability, access, searching, etc]
questions
their work is primarily ir-based, so the query language is very
much a standard google-style keyword search
biggest problem for corruption is virus checkers
An annotation management system for relational databases
--------------------------------------------------------
Laura Chiticariu (UC Santa Cruz)
Annotation management system
a system that is able to propagate meta-data along with the data
as the data is being moved around
main motivation: to trace the provenance and flow of data
trace annotations through transformations (a la data warehouses,
extract transform load--ETL)
the transformations here are queries
vision
NYRestaurants table (restaurant, cost, type, zip)
all restaurants view, cheap restaurants view
if certain cells in the source table have annotations, these should
be propagated through the view
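[a sketch of what the cheap restaurants view might look like in
their pSQL (the PROPAGATE clause is described below); names follow
the example above, but the exact view definition is mine:
    SELECT DISTINCT r.restaurant, r.zip
    FROM NYRestaurants r
    WHERE r.cost = 'cheap'
    PROPAGATE DEFAULT
    -- any annotation on a source cell that is copied into the view
    -- travels with it, e.g., an annotation on a restaurant's zip
    -- shows up on that zip in the view
]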
other apps
keep info not otherwise stored in current db design
highlight wrong data
security and quality metric
related work
not new idea ... Wang and Madnick [vldb'90]
lee, bressan and madnick [widm]
bernstein and bergstraesser [data eng. bulletin 99]
webdb 99 - superimposed paper
why-provenance: cui, widom, wiener [cww00]
annotations on genomic data
outline
pSQL queries
semantics
custom
default
default-all
select-from-where with an additional propagate clause that
describes how annotations should propagate
default, default-all, and custom clause
a psql query is a union of psql fragments
the custom scheme
based on user spec.
SELECT DISTINCT A
FROM R r
PROPAGATE r.A TO A
UNION
SELECT DISTINCT A
FROM R r
PROPAGATE r.B TO A
the default scheme
propagates annotations according to where the data is copied from
e.g., select B from R r would implicitly have propagate r.B to B
problematic for joins ... if you want all annotations to follow
select R.a, R.b, S.c
from R, S
where R.b = S.b
only R.b's annotations get propagated here ... not S.b's
the default-all scheme
propagates annotations according to where the data is copied from
in all equivalent formulations of the query
query basis
a query basis of Q, denoted B(Q), is a finite set of pSQL queries
(with the default propagation scheme)
generating a query basis
given R(A,B) and S(B,C),
and query
select r.A, s.B, s.C
from R r, S s
where r.B = s.B
propagate default-all
representative query Q0:
propagate r.A to A, s.B to B, s.C to C, r.B to B
auxiliary queries
have to read the paper because this example doesn't explain what
is happening ... the auxiliary queries aren't very clear from
the talk, need to read the paper
correctness of the algorithm
theorem: given a pSQL query Q with default-all, the algorithm
generates a query basis of Q
implementation
translator module
input: a psql query q
output: an sql query q' written against the naive storage scheme
q' is sent to the rdbms and executed
postprocessing
a naive storage scheme
for every attribute of every relation there is an additional
attribute for storing the annotations
conceivably, other possible storage schemes (explore that later)
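[a sketch of the naive scheme for the R(A,B) relation used above --
one extra annotation column per attribute; the column names are
illustrative, and how q' fills them in is in the paper:
    CREATE TABLE R (
      A       varchar,
      A_annot varchar,  -- annotations attached to the A cell
      B       varchar,
      B_annot varchar   -- annotations attached to the B cell
    );
]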
experiments
datasets: the TPC-H database; test queries: SPJ queries, varying
the number of joins from 0 to 4 and the number of selected
attributes over 1, 3, and 5
compare sql vs psql default vs psql default-all
[this experiment stuff is required at vldb; obviously here it
doesn't mean anything ... they aren't comparing it to
anything. This is something that bugs me: it is good to discuss
implementation and experiments, but unless you are comparing
against something that exists, there is no need to spend a 1/3 of
a paper on it]
A NOTE: I looked over their paper, reading about 3/4 of the way
through. I think that there is a mistake in the paper. In
particular, they are assuming that the query processor is
non-optimizing. Seeing their talk really helped me to better
understand the paper. One reason for the auxiliary queries
is to help satisfy their semantics for default-all, which tries to
scoop-up all the annotations for a data value. Here is an example:
SELECT p.ID as ID, p.Name as Name
FROM PIR p
PROPAGATE DEFAULT-ALL
The PIR table has attributes ID and Name. The original table has
two rows: (p332 {a7}, AB {a8}) and (p916 {a9}, AB {a10}), where {}'s
denote the annotations of the cell. The result is a table again with
two rows, but with more annotations: (p332 {a7}, AB {a8,a10}) and
(p916 {a9}, AB {a8,a10}). The additional annotations on AB were
"scooped-up".
So, in their approach they use these auxiliary queries to do this
"scooping". They define the semantics of this approach using
conjunctive queries, e.g., giving the following queries (according
to their algorithm):
q2(ID, Name) :- PIR(ID, Name).
q2(ID, Name) :- PIR(ID, Name), PIR(ID', Name).
q2(ID, Name) :- PIR(ID, Name), PIR(ID, Name').
Or, using SQL:
SELECT p.ID as ID, p.Name as Name
FROM PIR p
UNION
SELECT p.ID as ID, p.Name as Name
FROM PIR p, PIR p'
WHERE p'.ID = p.ID
UNION
SELECT p.ID as ID, p.Name as Name
FROM PIR p, PIR p''
WHERE p''.Name = p.Name
But the basis of their approach seems to contradict itself: they
claim that the three different queries are all equivalent, but that
they also provide more annotations. First, there is something funny
about their reasoning, and second, optimizers should recognize the
equivalence and presumably rewrite to the "smaller" query. This
approach also seems plain wrong in the conjunctive form, which
becomes clear when it is substituted with the first-order logic
representation and existentials are introduced over ID' and
Name'. Ultimately, it isn't
such a problem except that their entire theory is based on the
conjunctive form. I haven't finished reading the paper, but I am
guessing that the approach works because they do some other
rewriting to extend the table itself with the annotations (they call
it a naive representation of the annotations).
Symmetric relations and cardinality-bounded multisets in database
systems
-----------------------------------------------------------------
Kenneth Ross
symmetric relations and multisets
if r(a,b) then r(b,a); r is symmetric
s(a,b,c,d) then s also holds for all permutations of (a,b,c,d)
symmetric relations == multisets
bounded cardinality
pairs of items: people who have met
limited size groups sharing a common property: people on a soccer
team
conservative bounds: children of a person; employees with the same
manager
multisets in rdbms
first-normal form: (set-id, other columns), (set-id, member)
"which sets contain alic, bob, and carol?" need a 4-way join
object-relational db
non-first-normal form rep: (set-id, other cols, collection)
like varray in oracle
collection must be accessed as a whole
no indexing on elements of arrays (need scans)
collection elements not queryable like columns
user-defined relational multisets
(set-id, other cols, person1, person2, person3)
schema designer must: avoid redundancy (fix an ordering, e.g.,
store alice,bob,carol but not bob,alice,carol), and code updates to
respect the constraints, for each multiset op
user must: query mult. cols (bob could be in any column); k!
permutations for k cols.
this functionality should be provided by the database
approach
represent multisets as columns in a table
columns interchangeable
table type provided by db
one combined index for all symm. columns
automatic integrity checking on updates
technical contributions
symmetric table data structure
query optimization
inferring symmetric views
special syntax
kernel
create table meeting (
set-id integer,
....
symmetric multiset persons (person1, person2, person3) varchar
)
db automatically stores persons in lexical order
queries
select * from meeting where {alice, bob, carol} among persons
new "among" keyword; no need to explicitly list all permutations
of person1, person2, person3
create table tv (
set-id integer,
multiset slots {c1, c2, ..., c24} integer
)
select * from tv where {2, 4} among slots and not ({5} among slots)
slots is not symmetric; the proposed techniques are still useful
for queries that treat slots symmetrically
symmetric closure operator
compute all column-permutations of the symmetric attributes
if K is the kernel, then p(K) is the set of its permutations
implementing symmetric closure
generating permutations is easy
really want to implement the combination of selections and p(K),
i.e., sel_s(p(K))
allows the use of predicates in access plans (e.g., use the
combined index on the symmetric attributes)
even though p adds no expressive power, it is desirable as an
operator
optimization
try to push selections inside p
ok for symmetric conditions (e.g., x=y)
partial info pushed for nonsymmetric conditions (e.g., x-y > 7)
closure applied over fewer records; smaller intermediate results
push joins too
other rules (see paper)
example
m(p1, p2, l, d, t) = p_p1,p2(K)
p1 and p2 meet at location L on date D at time T
s(v, ...)
v is a suspect
r(w, ...)
w is a monitored location
s join_p1=v ( r join_w=l p_p1,p2(K) )
do join before symmetric closure (push join on R and K first)
then, push s join, but rewrite join condition as p1=v or p2=v
symmetric views
try to show that a view V is symmetric
a symmetrically materialized view can be stored efficiently as a
kernel
we can replace V by p(ker(V)) inside larger queries: enables more
query optimization options
inferring symmetry
uses the idea of containment mappings
employees sharing the same manager
Q(K1,K2) :- E(K1,M), E(K2,M).
Q^T(K1,K2) :- E(K2,M), E(K1,M).
where Q^T is the transposed query; the containment mapping here is
just the identity
employees whose managers are siblings
q(k1,k2) :- e(k1,m1), e(k2,m2), s(m1,m2)
q^t(k1,k2) :- e(k2,m1), e(k1,m2), s(m1,m2)
not contained unless you know the symmetry of s.
can get sound and complete mappings for certain queries
experimental evaluation
k(Y, x1, x2, x3), x1 <= x2 <= x3
500,000 rows
q1: select * from K where
(x1='foo' and x2='bar' and x3='baz') or
... all permutations
as a view:
create view v as
select y, x1, x2, x3 from K union
select y, x2, x1, x3 from K union ...
q3: select m1.x, m2.x, m3.x, s.y
from m m1, m m2, m m3, s
where m1.id=s.id and ...
q4: select * from K where {'foo','bar','baz'} among {x1,x2,x3}
rewrite to: select * from K where (x1='bar' and x2='baz' and
x3='foo')
[i was a little lost at the talk on why they can rewrite the query
to this; it works because the kernel stores the values in sorted
order (x1 <= x2 <= x3), so the among set can be sorted too and
matched against a single permutation ... and obviously they will do
a lot better!]
[but of course, they are going to tell us about it]
claim their approach scales: syntax linear in set/query size, etc.
conclusions: multisets as columns
[this approach really just introduces symmetric-multiset and
non-symmetric-multiset constraints over relations ... which is cool,
but they don't appear to view the problem in this way; so I wonder
if there are any problems with this type of constraint "interfering"
with constraints for reasoning -- they did mention that you can
infer for some queries -- seems weak on abstraction/theory and heavy
on trying it out; maybe the paper goes into more detail about the
theory side]
GridFields: Algebraic manipulations of scientific datasets
----------------------------------------------------------
Bill Howe
CORIE: sensor arrays and simulations used to verify, rescue,
transform, visualize
sensors produce 5 MB/day, simulations 5-10 GB/day
grid-equipped scientific datasets
1-d, 2-d, 3-d, etc. gridded data sets
examples
H = 2d horizontal grid
T = 1d time grid
V = 1d vertical grid
contributions
familar activities in a new domain:
logical data model with ops; derive some algebraic laws,
implement physical ops, devise exectuion strategies, query
language
model
grid topology: a collection of cells of various dimensions;
implicit or explicit incidence relationships
2-cells = {A,B}
1-cells = {m,n,p,q,r}
0-cells = ...
Grid + Bound data = GridField
total function over cells of dimension k
tuples of numeric primitives
two gridfields may share a grid
representation
attributes stored column-wise; values linked to cells positionally
(grid, k, attributes)
operations
bind(b) -- associate gridfields with data
intersection, cross product -- combine gridfields
restrict(r) -- keep only the cells satisfying a predicate r
aggregate (project, join, apply, slice) --
rewrite rules
cross commutes with restrict, apply
... flashed the slide
conventional approaches
relational limitations
3 tables: node data, cell data, incidence table
load time approaches data generation time
OLTP paradigm inappropriate for primarily read-only data
data redundancy due to trivial join dependency embedded in key
spatial db limitations
2 tables: node data (with point data type), cell data (polygon type)
incidence rel computed rather than explicitly recorded
geometry redundantly defined in nodes and cells
no concept of "grid"; impedance mismatch
3d visualization libraries
grid -> restrict -> new grid
VTK is a main app (extract geom, threshold, extract grid, ...,
types of restrict ops)
limitations distilled
grids aren't first class
topology conflated with geometry
equivalence obscured
"same topology, different geometry"
examples of optimization
H : (x,y,b) ~ x,y horizontal coords, b is depth
V : (z)
H cross V, restrict (z > b), bind(salinity data set), project on
x,y to view a region of interest
can push restrict, then take cross product, then restrict to
bottom, etc.
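[a sketch of what roughly the same pipeline looks like written
against the relational (node/cell/incidence) representation
mentioned above; purely illustrative table and column names:
    SELECT h.x, h.y, s.salinity
    FROM h_nodes h               -- horizontal grid nodes: (node_id, x, y, b)
    JOIN v_nodes v ON v.z > h.b  -- cross H with V, restricted to z > b
    JOIN salinity s ON s.h_node_id = h.node_id
                   AND s.v_node_id = v.node_id
    WHERE h.x BETWEEN :x0 AND :x1   -- region of interest
      AND h.y BETWEEN :y0 AND :y1;
this only touches the node data; preserving the cell/incidence
structure of the result would take yet more joins, which is part of
the limitation above]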
compared to vtk, and rdbms [... vtk outperforms until higher
selectivity]
transect (vertical slice)
another example ...
these are claimed as queries, but they are more like workflows
... dataflow-based manipulations of the grids; of course you can
do some optimization
related work
baumann 99, marathe + salem 97 (multidim. arrays)
data models for sci. data (butler 89, collins 91)
vtk
gis
conclusions
gridfield expressions provide succinct "recipes" for practical
data products
equivalences provide optimization opportunities
experiments with hand-optimized plans
[NOTE: there was an invited paper session on issues in data
warehousing that I also wanted to attend, but couldn't do both ...]