[seek-kr-sms] More notes: first day of VLDB
Shawn Bowers
bowers at sdsc.edu
Wed Sep 1 07:51:53 PDT 2004
Paper: A framework for using materialized XPath views in XML query processing
-----------------------------------------------------------------------------
Motivation
for $i in collection("T")//order (node reference)
where $i/@price > 100 (data, typed value)
return <order> {$i//lineitem} </order> (copy of xml subtree)
Modeling XPath views
a view is an xpath expression, with one to four "extraction types"
data, copy, reference, and path
a view is modeled as a table containing one row per XML node, one
column per extraction type
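My own sketch (not from the paper) of what such a view table might look
like in code; the node accessors (node_id, typed_value, serialize,
root_path) are hypothetical names I made up:
    # one row per node matched by the view's XPath, one column per extraction type
    def materialize_view(matched_nodes):
        rows = []
        for node in matched_nodes:
            rows.append({
                "reference": node.node_id,     # node reference back into the document
                "data": node.typed_value,      # typed value ("data" extraction)
                "copy": node.serialize(),      # serialized subtree ("copy" extraction)
                "path": node.root_path(),      # root-to-node path ("path" extraction)
            })
        return rows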
Challenges
XPath query containment problem (Q \subseteq V) is a
necessary condition
coNP-complete for the XP{/, //, *, []} fragment of XPath
Compensation construction
if Q \neq V, we need a compensation Qc s.t. Q = Qc(V)
Does not always exist, e.g., cannot be answered from a "copy"
view V=//b
Multiple compensations are usually possible
Framework
XPath matching, detects potentially usable views (based on tree
mappings), Incomplete but sound and efficient
Takes V and Q
Initial Compensation, inverts the query, eliminates unnecessary
query conditions. V=/a[b]; Q=/a[b]/c => Q'=*/c
takes match matrix
Optimization
XPath matching
Step representation: a tree of XPath steps (XPS)
axis, test, predicate child, next child
single top-down pass
handles OR, all axes, value predicates
finds all possible mappings of V into Q
(It is very hard for me to hear this speaker ... he doesn't talk
loud enough ...)
Their matching seems overly complex ... need to see the paper
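My rough guess (mine, not the paper's) at the step representation, plus a
much-simplified top-down check; the real algorithm handles OR, all axes,
and value predicates, and enumerates all mappings rather than just one:
    class XPS:
        def __init__(self, axis, test, predicate=None, next_step=None):
            self.axis = axis            # e.g. "child", "descendant"
            self.test = test            # node test, e.g. "a" or "*"
            self.predicate = predicate  # predicate child (another XPS), or None
            self.next_step = next_step  # next step on the main path, or None

    def maps_into(v, q):
        """Can view step v be mapped onto query step q? (very rough)"""
        if v is None:
            return True
        if q is None:
            return False
        axis_ok = v.axis == q.axis or v.axis == "descendant"
        test_ok = v.test == "*" or v.test == q.test
        pred_ok = v.predicate is None or maps_into(v.predicate, q.predicate)
        return axis_ok and test_ok and pred_ok and maps_into(v.next_step, q.next_step)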
Initial compensation
identify compensation root (query tree node mapped by the view's
extraction)
construct inverted Q' and V' expressions
v=//a[b]//d[e] -> v' = self::d[e and ancestor::a[b]]
q=//a[b and c]//d[e]/f -> q'=self::d[e and ancestor::a[b and c]]/f
...
optimizing compensation (example continues ...)
and here come the heuristics: four of them, one each for data, path,
copy, and reference
experiments:
one view: v = //@*
various query sizes (n from 4 to 64)
qn=/a[@a1=1 and @a2=2 and ... @an=n]
fixed query, four attribute selections
... experiments just measure compensation time (as a fraction) against
the number of views and the query size
query time grows linearly with "query depth"
conclusion
xpath matching (disjunction, local and intra-document join value
predicates)
compensation construction
heuristic opt. uses all matching views
would like to extend to xquery views and queries, view selection
problem, cost-based compensation optimization
Paper: Schema-Free XQuery
-------------------------
Presenter: Yunyao Li
XQuery is powerful (flexible, accurate), but what if you don't know
the structure of the document (well .... then you are hosed, right?
unless you have semantics :)
Data evolution, heterogeneity
Use keyword search: discard all tags, e.g., "author Mary year title"
to search an XML document for "all titles and years that mary was an
author of a publication"
Keyword search: no knowledge of doc structure required (but really,
you need the tags). Doesn't distinguish tag names from values, cannot
express complex semantic knowledge, e.g., year < 2000. cannot
specify the desired result (as XML I think they mean, which doesn't
seem true, it would just be hard).
Schema-free xquery: exploit partial knowledge of the schema ...
specify nodes that are meaningfully related, e.g., author to Mary.
Observations: a node usually represents a real-world entity
two nodes are related by their lowest common ancestor
But, ..., not all lowest common ancestors are meaningful.
Define the lowest common ancestor.
bib.year
bib.book.{title, author}
bib.article.{title, author, author}
MLCAS (meaningful lowest common ancestor structure) is the most
specific xml structure in which the nodes are related
our proposal: nodes within a MLCAS are meaningfully related
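A small sketch of the plain LCA computation they build on (mine, just
walking parent pointers; MLCAS then keeps only the most specific,
meaningful ancestors):
    def ancestors(node):
        path = []
        while node is not None:
            path.append(node)
            node = node.parent          # assumes each node carries a parent pointer
        return path

    def lca(n1, n2):
        seen = set(id(a) for a in ancestors(n1))
        for a in ancestors(n2):         # walk up from n2; first hit is the lowest
            if id(a) in seen:
                return a
        return None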
Queries are written against these MLCASs, which are valid as long as
that structure doesn't change, even if the rest does. the queries are
expressed via xquery. Basically, this lets you write a query against
a particular known structure in the document, and their algorithm
rewrites the query to correctly access the structure in an actual
document.
MLCAS transformation: introduce mlcas-tagged variables in xquery
which specializes/rewrites the query to a valid xquery over the
specific document
insert an "expand" operation that expands terms like "writer" to
"author" and so on.
Define their algorithm for doing this ...
Future work: ontology-based term expansion, etc.
Title: On testing satisfiability of tree pattern queries
--------------------------------------------------------
User uses a stronger constraint than what they had in mind
Tree Pattern Queries (Twig queries)
How to test TPQ satisfiability?
They contribute classes of TPQ satisfiability computable in polynomial
time, and classes that are intractable
"pattern match": h : Q -> D, for query Q and database D.
To be satisfiable, h must preserve structural rels
Wildcards and disjunction introduce complexity
proposition: a TPQ w/o join conditions, possibly with wildcards, and
no value-based constraints is always sat; q is sat iff every query
node's local value-based constraints are sat
Approach
From q, build "constraint graph Gq"
Chase Gq using inference rules until no more inferences are
possible
q is sat iff the resulting graph contains no violations
Inference rules (w/o schema)
... see paper; there are 22 of these
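The general shape of the test as I understood it -- my own skeleton, with
the actual inference rules left abstract and the graph operations
(edges, add_edge, has_violation) invented for illustration:
    def is_satisfiable(graph, rules):
        changed = True
        while changed:                        # chase to a fixpoint
            changed = False
            for rule in rules:                # each rule derives new constraint edges
                for edge in rule(graph):
                    if edge not in graph.edges:
                        graph.add_edge(edge)
                        changed = True
        return not graph.has_violation()      # e.g. contradictory value constraints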
a TPQ containing node id constraints, but no wildcards, is sat iff
the chased constraint graph of q is violation-free
for a TPQ w/ node id constraints and value-based constraints w/o
disjunction, but no wildcards, testing whether q is sat can be done
in ptime
W.r.t. schemas
Title: Containment of nested XML queries
----------------------------------------
Presenter: Xin (Luna) Dong
Input D: An XML instance tree
Output q(D): An XML instance tree
Q(D) subset Q'(D) if there is a tree homomorphism
Q contained in Q' if for every input XML instance tree D, Q(D) subset
Q'(D)
Fragment of XML queries:
Conjunctive, no two sibling query blocks return the same tag
Have: /, *, []
For fanout: =1, PTIME, etc.
Deciding q contained in q'
standard technique: use canonical databases for all possible
bindings
there are an infinite number of canonical dbs for xml
a canonical answer: a minimal xml instance, with no two sibling
subtrees such that one is contained in the other
canonical answer: a minimal xml instance contained in the head tree
a canonical db is the minimal xml instance that generates the canonical
answer
my comment: this is another reason why xquery is way too complex
... to me, these results are negative
they argue that coNP-complete isn't really that bad ...
because in practice, we can analyze element cardinalities to reduce the
number of canonical answers for containment checking
use simulation mapping (levy suciu pods'97).
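My sketch of a simulation check between labeled trees (not necessarily
the paper's exact definition): pattern node u is simulated by node v if
the labels are compatible and every child of u is simulated by some
child of v:
    def simulated_by(u, v):
        if u.label != "*" and u.label != v.label:
            return False
        return all(any(simulated_by(cu, cv) for cv in v.children)
                   for cu in u.children)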
[short paper section]
Title: Efficient XML to SQL Query Translation
---------------------------------------------
Presenter: Jeffrey Naughton
Many apps use XML
Much data stored in rdbms
One solution to the mismatch: XML publishing
"pretend" relational data is XML, apps use XML queries, middleware
layer deals with XML to SQL translation
problem: inefficient sql can result
question: where and how should we fix this?
xml publishing
xml enablement layer (make relational data look like xml)
problem of suboptimal sql
unnecessary joins (only need suffix of them)
unions of complex queries equivalent to a simpler single query
where should you fix this
in the SQL optimizer? (fix the sql)
in the xml to sql translator? (don't generate bad sql)
Makes most sense to fix it at the sql query translation time
reason:
the equiv. to simpler sql depends on a comb. of: props of the xml
to rel mapping, and relational integrity constraints
by the time the rdbms optimizer gets the sql, useful info about
the mapping has been forgotten
does all of this matter?
experiment goal: compare sql queries optimized by our techniques
against sql queries produced by previous algorithms
"how many ads appear in area1"
youch! (naive sql is 50 lines; the optimized is a one-liner)
what about optimizing the sql after it is generated? theoretically
possible.
why translation-time optimization better?
no changes to sql optimizer
mapping information is explicitly available, which can focus search
for better queries
can pre-compute useful info. and substantially reduce the run-time
overhead of this optimization
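To make the "unnecessary joins" point concrete, a toy sketch of my own
(the schema, table names, and predicate are made up): if the mapping
plus key/foreign-key constraints guarantee that every row in the leaf
table has the required ancestors, a count query only needs the suffix
of the join path:
    def translate_count(path_tables, ancestors_guaranteed):
        # path_tables: relational tables along the XML path, root first
        if ancestors_guaranteed:
            # translation-time knowledge lets us keep only the last table
            return "SELECT COUNT(*) FROM %s WHERE area = 'area1'" % path_tables[-1]
        # otherwise, emit the full join chain the naive translation would produce
        return ("SELECT COUNT(*) FROM " + ", ".join(path_tables) +
                " WHERE ...  -- join conditions over the whole path")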
[this is the best talk i've seen so far in terms of presentation]
Title: The NEXT framework for logical xquery optimization
---------------------------------------------------------
Alin Deutsch
Motivation: XQuery needs a declarative language for optimization
xquery -> normalization -> nested xml tableau (NEXT) -> minimization
(first step in logical optimization)
tree patterns are used in next
normalization reduces multiple xquery syntactic features into next
deals with nesting, set semantics, bag semantics, existential
quantifiers, multiple/redundant navigation points.
next exploits a new groupby construct
the ccc algorithm: collapse and check containment
results: unique minimal equivalent of a NEXT, np-hard (or worse),
polynomial for acyclic tree patterns, efficient by careful
implementation
Panel:
Biological data management: research, practice, and opportunities
-----------------------------------------------------------------
Thodoros Topaloglou
Susan B. Davidson
H.V. Jagadish
Victor M. Markowitz
Evan Steeg
Mike Tyers
This was a great session, but it was packed and I had to stand, and so
didn't take notes.
Alon Halevy Keynote
-------------------
Data integration: a higher-level abstraction than query languages
like SQL
independence of source and location, data model, syntax, semantic
variations
Application area 1: business
EII apps: CRM, ERP, Portals
legacy, enterprise databases, services and applications
50% of all IT $$$ is spent here (my note: people in business complain
that this is wasted money)
Application area 2: science
Hundreds of biomed data sources available, growing rapidly
Application area 3: web
The semantic web: vision #2
knowledge sharing at web scale
web resources are described by ontologies
rich domain models; allow reasoning
rdf/owl emerging standards; owl-lite may actually be useful
issues
too complex for users?
killer apps?
scalability of reasoning?
proposal: let's build the sw bottom up
he is skeptical... about description logics, etc. (of course,
he also proposed these, and many of his papers suggest their
use, but ...)
New Decade(s), Old Problem
data integration is on every 5-year db introspective (Lowell,
2003 latest)
data integration is a necessity; competitive strategy, B2B, science
significant new twists on the problem: number of sources, types of
data, queries, and answers, user skills
it's plain hard! (my position too)
Why is it hard
system reasons: managing diff. platforms, query proc. across
multiple systems
social reasons: locating and capturing relevant data in the
enterprise; convincing people to share (data fiefdoms); privacy and
performance implications
logic reasons: schema and data heterogeneity; challenge
independent of integration architecture! (the system isn't the
issue; it's dealing with the heterogeneities)
Explaining the title
structures, semantics, statistics
if we understand the structures and semantics, we can solve the
problem
statistics can bridge the gap
statistics on structure, previous actions, data instances, via
machine learning
(every keynote has to cite a philosopher): nelson goodman is
Alon's choice.
Fact (structures), fiction (semantics), forecast (statistics)
system vision
design time
mediated schema (top)
data sources (bottom)
first step: build wrappers (around sources)
second step: build mappings from wrappers (sources) to schema
you need: mapping tools, merge, compose, model management (the
more "general" framework), mediation language
run time
given a query over mediated schema, need to reformulate it so
that it is expressed against the sources, optimize and execute
it. This is where xml fits in (hmmm)
he is going to touch on each of these in the rest of the talk and
especially talk about the mapping tool (where statistics come in),
and about the user who interfaces with the mediated schema
wrappers
lots of work on this (IMO: although, it isn't that great)
screen scraping web pages as an example
take an amazon web page and generate an xml document
successful since lots of companies sell these tools (lixto,
Fetch, XQWrap, Wrapper induction)
mediation language and query reformulation
a language for describing relationships between mediated schema
and wrapped schemas
logic
mappings as query expressions, relationships more than mappings
GAV: global as view. goes back to the eighties
query reformulation is very easy (note: in easy cases, if you
have constraints it gets much harder)
hard to manage: new sources are hard to add
LAV: local as view; GLAV: global/local as view
reverse approach. assume mediated schema is a data source, and
describe local schemas as views.
IMO: a warning. this is how it is usually described, but
technically, this is not completely true, because these are
really all just mappings, whether gav, lav, or glav, and there
are more details like sound, complete, and so on, that aren't
typically considered until things get more complex
the mediated schema is hard to maintain
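A tiny made-up illustration of the GAV/LAV difference, written
Datalog-style inside Python strings just to keep the two side by side
(the relation names are mine, not from the talk):
    # GAV: each mediated relation is defined as a view over the sources,
    # so reformulation is basically view unfolding
    gav = {"Movie(title, year)": "src1_films(title, year, director)"}

    # LAV: each source is described as a view over the mediated schema,
    # so reformulation means answering queries using views
    lav = {"src2_recent(title)": "Movie(title, year), year >= 2000"}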
peer data management systems
the deeper problem is determining what the mediated schema would
contain. trying to model everything is too hard to do. people
want to share subsets of data, but don't want to build the
mediated schema
peers can share data, p1 maps its schema to p2, p2 to p3, i can
get data from p1 and p3, by composing mappings. they use
gav/glav to give the mappings
piazza, hyperion, peerdb, local relational models, etc., etc.
challenges
semantics: careful about cycles (or else query answering
becomes undecidable)
optimization: compose paths, prune
manage networks: consistency, quality, caching
IMO: but, what are the fundamental assumptions in this
architecture, and are they really realistic?
model management
generic infrastructure to manage schemas
manipulate models and mappings as bulk objects
ops to create and compose mappings, merge and diff models
short operator scripts can solve schema integration, schema
evolution, reverse engineering, etc.
an algebra for model management
general opportunity for fundamental theory and systems work
IMO: ... but is it the right abstraction?
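A hypothetical flavor of such an operator script (the mm.* calls are
invented for illustration, not an actual API) -- e.g., carrying a
mapping forward after schema evolution:
    def remap_after_evolution(mm, s1, s1_evolved, map_s1_to_t):
        evo_map = mm.match(s1_evolved, s1)       # relate evolved schema to the old one
        return mm.compose(evo_map, map_s1_to_t)  # new mapping: evolved schema -> target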
query processing
problems: DQP, few statistics, if any, network behavior issues:
latency, burstiness
solution: adaptive query processing
stonebraker saw it coming in Ingres; revivals by Graefe (93) and
DeWitt (98)
recent ideas: query scrambling, eddies, adaptive data
partitioning, inter-query processing
challenges: reduce wasted work; great opportunity for some theory
XML
The "data integration appetizer"; it doesn't solve any integration
problems, it only makes it worse (IMO: i am beginning to agree
with ... cheap shots at business ... cheap shots at business this)
industry went ahead of research; nimble, enosys, xqrl, inspiration
from tukwila, mix, strudel/agora
some issues: designing the internal algebra; dealing with evolving
xquery standard
our community has served an impressive smorgasbord of xml
techniques
his quote: There is a whole conference now dedicated to xml ... it
is called "SIGMOD" -- didn't get a big laugh out of that one
Comments about commerce
until 5 years ago, data integration = data warehousing
since then: a wave of startups, nimble, enosys, metamatrix,
calixa, composite
big guys made announcements (IBM, BEA); delay ... big guys
released products
success: analysts have a new buzzword - eii; new addition to acronym
soup, along with eai (enterprise info. integration; enterprise app. integration)
mapping tool
half the effort on projects today is spent on these tools
semantic mappings: attributes in source to attributes in target
why is it so hard (automatic matching)?
schemas never fully capture their intended meaning: they are
just symbols and structures; automatic schema matching is
AI-complete (solve the problem, you have solved all the problems
in AI); our goal is to reduce the human effort.
comparison-based matching
build a model for every element in the schema; and compare models
models based on: names of elements, data types, data instances,
text descriptions, integrity constraints (Rahm Bernstein survey, 01)
insufficient information is the problem
key hypothesis: SSS
statistics offer clues about the semantics of the structures
statistics can be gleaned from collections of schemas and
mappings; a collection provides alternate representations of
domain concepts
human experts do this unconsciously
obtaining more evidence
IMO: this stuff is very naive ...
statistics for schema design patterns
the corpus: details
learn model ensemble for each corpus element
build a classifier that does machine learning and will tell
you how similar one attribute is to another attribute [Madhavan]
need training examples to generalize, then pick a machine
learning algorithm -- lots of these, each with different
properties. so, need many different algorithms (model
ensemble), employ a set of these and each makes a prediction
where to get the training data? it comes from the corpus,
gleaned from schema and mappings, and positive examples
(elements of the schema itself; previous mappings)
cluster
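My toy sketch of the "model ensemble" idea: several base models each
score a candidate attribute pair and the ensemble combines the scores
(the real system learns its models from the corpus of schemas and past
mappings; the two matchers here are simplistic stand-ins):
    def name_model(a, b):
        return 1.0 if a.lower() == b.lower() else 0.0

    def token_model(a, b):
        ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
        return len(ta & tb) / max(len(ta | tb), 1)

    def ensemble_score(a, b, models=(name_model, token_model)):
        return sum(m(a, b) for m in models) / len(models)

    # e.g. ensemble_score("author_name", "name") -> 0.25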
challenges
building corpora, in steps? shell of 1000, outer shell of
10,000, the grand expanse of 450,000?
engineering a corpus. tuning: removing noise, selecting content;
domain specific?
The user
hard to find information
find my vldb04 paper, and the powerpoint (maybe in an
attachment); find emails from my california friends, which ozsu
paper did i cite in my paper, etc
on the fly integration: who published at sigmod but was not
recently on the pc? data integration systems are built for
queries run over and over again, not necessarily for user queries
the deeper problem
we're not integrated in the user's habitat. the user comes to us,
we don't go to the user.
the most profound technologies are those that disappear - mark weiser
but we are very visible: schema always comes first; dichotomy
between structured and unstructured data (the structure chasm),
integration: only for high-volume needs
user-centered data management
bring the benefits of data management to users, in their habitat
create associations between disparate objects and data sources
(every piece of data should be somehow connected to some other
data)
create an ecology of cooperating personal Memex's (v. bush)
Semex (semantic explorer)
give user a domain model to work with (like a schema)
create associations between sources and domain model
shows a simple user interface
looks like a fancy bookmarks browser with keyword query
also contains links
and has association queries
with the domain model, can import sources into the domain model,
to query the domain model for info
Principles of user-centered dm
create associations for the user; every data item should be
associated with others
adapt to and learn about the user: use statistics to leverage
previous user activities
manipulate any kind of data
data and "schema" evolve over time
life-long data management
modeling ...
IMO: I think this is very close to some of the ideas we have for
SEEK ... and there are some interesting ideas for us as well.