[seek-kr-sms] More notes: first day of VLDB
Shawn Bowers
bowers at sdsc.edu
Wed Sep 1 07:51:53 PDT 2004
Paper: A framework for using materialized XPath views in XML query processing
-----------------------------------------------------------------------------
Motivation
for $i in collection("T")//order (node reference)
where $i/@price > 100 (data, typed value)
return <order> {$i//lineitem} </order> (copy of xml subtree)
Modeling XPath views
a view is an xpath expression, with one to four "extraction types"
data, copy, reference, and path
a view is modeled as a table containing one row per XML node, one
column per extraction type
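My own sketch (not from the paper) of what such a view table might look
like in code; the node accessors (node_id, typed_value, serialize,
root_path) are hypothetical names I made up:
    # one row per node matched by the view's XPath, one column per extraction type
    def materialize_view(matched_nodes):
        rows = []
        for node in matched_nodes:
            rows.append({
                "reference": node.node_id,     # node reference back into the document
                "data": node.typed_value,      # typed value ("data" extraction)
                "copy": node.serialize(),      # serialized subtree ("copy" extraction)
                "path": node.root_path(),      # root-to-node path ("path" extraction)
            })
        return rows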
Challenges
XPath query containment problem (Q \subseteq V) is a
necessary condition
coNP-complete for the XP{/, //, *, []} fragment of XPath
Compensation construction
if Q \neq V, we need a compensation Qc s.t. Q = Qc(V)
Does not always exist, e.g., cannot be answered from a "copy"
view V=//b
Multiple compensations are usually possible
Framework
XPath matching, detects potentially usable views (based on tree
mappings), Incomplete but sound and efficient
Takes V and Q
Initial Compensation, inverts the query, eliminates unnecessary
query conditions. V=/a[b]; Q=/a[b]/c => Q'=*/c
takes match matrix
Optimization
XPath matching
Step representation: a tree of XPath steps (XPS)
axis, test, predicate child, next child
single top-down pass
handles OR, all axes, value predicates
finds all possible mappings of V into Q
(It is very hard for me to hear this speaker ... he doesn't talk
loud enough ...)
Their matching seems overly complex ... need to see the paper
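My rough guess (mine, not the paper's) at the step representation, plus a
much-simplified top-down check; the real algorithm handles OR, all axes,
and value predicates, and enumerates all mappings rather than just one:
    class XPS:
        def __init__(self, axis, test, predicate=None, next_step=None):
            self.axis = axis            # e.g. "child", "descendant"
            self.test = test            # node test, e.g. "a" or "*"
            self.predicate = predicate  # predicate child (another XPS), or None
            self.next_step = next_step  # next step on the main path, or None

    def maps_into(v, q):
        """Can view step v be mapped onto query step q? (very rough)"""
        if v is None:
            return True
        if q is None:
            return False
        axis_ok = v.axis == q.axis or v.axis == "descendant"
        test_ok = v.test == "*" or v.test == q.test
        pred_ok = v.predicate is None or maps_into(v.predicate, q.predicate)
        return axis_ok and test_ok and pred_ok and maps_into(v.next_step, q.next_step)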
Initial compensation
identify compensation root (query tree node mapped by the view's
extraction)
construct inverted Q' and V' expressions
v=//a[b]//d[e] -> v' = self::d[e and ancestor::a[b]]
q=//a[b and c]//d[e]/f -> q'=self::d[e and ancestor::a[b and c]]/f
...
optimizing compensation (example continues ...)
and here come the heuristics: four of them, one each for data, path,
copy, and reference
experiments:
one view: v = //@*
various query sizes (n from 4 to 64)
qn=/a[@a1=1 and @a2=2 and ... @an=n]
fixed query, four attribute selections
... experiments just measure compensation time (as a fraction) against
the number of views and the query size
query time grows linearly with "query depth"
conclusion
xpath matching (disjunction, local and intra-document join value
predicates)
compensation construction
heuristic opt. uses all matching views
would like to extend to xquery views and queries, view selection
problem, cost-based compensation optimization
Paper: Schema-Free XQuery
-------------------------
Presenter: Yunyao Li
XQuery is powerful (flexible, accurate), but what if you don't know
the structure of the document (well .... then you are hosed, right?
unless you have semantics :)
Data evolution, heterogeneity
Use keyword search: discard all tags, e.g., "author Mary year title"
to search an XML document for "all titles and years that mary was an
author of a publication"
Keyword search: no knowledge of doc structure required (but really,
you need the tags). Doesn't distinguish tag names from values, cannot
express complex semantic knowledge, e.g., year < 2000. cannot
specify the desired result (as XML I think they mean, which doesn't
seem true, it would just be hard).
Schema-free xquery: exploit partial knowledge of the schema ...
specify nodes that are meaningfully related, e.g., author to Mary.
Observations: a node usually represents a real-world entity
two nodes are related by their lowest common ancestor
But, ..., not all lowest common ancestors are meaningful.
Define the lowest common ancestor.
bib.year
bib.book.{title, author}
bib.article.{title, author, author}
MLCAS (meaningful lowest common ancestor structure) is the most
specific xml structure in which the nodes are related
our proposal: nodes within a MLCAS are meaningfully related
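A small sketch of the plain LCA computation they build on (mine, just
walking parent pointers; MLCAS then keeps only the most specific,
meaningful ancestors):
    def ancestors(node):
        path = []
        while node is not None:
            path.append(node)
            node = node.parent          # assumes each node carries a parent pointer
        return path

    def lca(n1, n2):
        seen = set(id(a) for a in ancestors(n1))
        for a in ancestors(n2):         # walk up from n2; first hit is the lowest
            if id(a) in seen:
                return a
        return None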
Queries are written against these MLCASs, which are valid as long as
that structure doesn't change, even if the rest does. the queries are
expressed via xquery. Basically, this lets you write a query against
a particular known structure in the document, and their algorithm
rewrites the query to correctly access the structure in an actual
document.
MLCAS transformation: introduce mlcas-tagged variables in xquery
which specializes/rewrites the query to a valid xquery over the
specific document
insert an "expand" operation that expands terms like "writer" to
"author" and so on.
Define their algorithm for doing this ...
Future work: ontology-based term expansion, etc.
Title: On testing satisfiability of tree pattern queries
--------------------------------------------------------
User uses a stronger constraint than what they had in mind
Tree Pattern Queries (Twig queries)
How to test TPQ satisfiability?
They contribute classes of TPQ satisfiability computable in polynomial
time, and classes that are intractable
"pattern match": h : Q -> D, for query Q and database D.
To be satisfiable, h must preserve structural rels
Wildcards and disjunction introduce complexity
proposition: a TPQ w/o join conditions, possibly with wildcards, and
no value-based constraints is always sat; q is sat iff every query
node's local value-based constraints are sat
Approach
From q, build "constraint graph Gq"
Chase Gq using inference rules until no more inferences are
possible
q is sat iff the resulting graph contains no violations
Inference rules (w/o schema)
... see paper; there are 22 of these
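The general shape of the test as I understood it -- my own skeleton, with
the actual inference rules left abstract and the graph operations
(edges, add_edge, has_violation) invented for illustration:
    def is_satisfiable(graph, rules):
        changed = True
        while changed:                        # chase to a fixpoint
            changed = False
            for rule in rules:                # each rule derives new constraint edges
                for edge in rule(graph):
                    if edge not in graph.edges:
                        graph.add_edge(edge)
                        changed = True
        return not graph.has_violation()      # e.g. contradictory value constraints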
a TPQ containing node id constraints, but no wildcards, is sat iff
the chased constraint graph of q is violation-free
for a TPQ w/ node id constraints and value-based constraints w/o
disjunction, but no wildcards, testing whether q is sat can be done
in ptime
W.r.t. schemas
Title: Containment of nested XML queries
----------------------------------------
Presenter: Xin (Luna) Dong
Input D: An XML instance tree
Output q(D): An XML instance tree
Q(D) subset Q'(D) if there is a tree homomorphism
Q contained in Q' if for every input XML instance tree D, Q(D) subset
Q'(D)
Fragment of XML queries:
Conjunctive, no two sibling query blocks return the same tag
Have: /, *, []
For fanout: =1, PTIME, etc.
Deciding q contained in q'
standard technique: use canonical databases for all possible
bindings
there are an infinite number of canonical dbs for xml
a canonical answer: a minimal xml instance, with no two sibling
subtrees such that one is contained in the other
canonical answer: a minimal xml instance contained in the head tree
a canonical db is the minimal xml instance that generates the canonical
answer
my comment: this is another reason why xquery is way too complex
... to me, these results are negative
they argue that coNP-complete isn't really that bad ...
because in practice, we can analyze element cardinalities to reduce the
number of canonical answers for containment checking
use simulation mapping (levy suciu pods'97).
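My sketch of a simulation check between labeled trees (not necessarily
the paper's exact definition): pattern node u is simulated by node v if
the labels are compatible and every child of u is simulated by some
child of v:
    def simulated_by(u, v):
        if u.label != "*" and u.label != v.label:
            return False
        return all(any(simulated_by(cu, cv) for cv in v.children)
                   for cu in u.children)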
[short paper section]
Title: Efficient XML to SQL Query Translation
---------------------------------------------
Presenter: Jeffrey Naughton
Many apps use XML
Much data stored in rdbms
One solution to the mismatch: XML publishing
"pretend" relational data is XML, apps use XML queries, middleware
layer deals with XML to SQL translation
problem: inefficient sql can result
question: where and how should we fix this?
xml publishing
xml enablement layer (make relational data look like xml)
problem of suboptimal sql
unnecessary joins (only need suffix of them)
unions of complex queries equivalent to a simpler single query
where should you fix this
in the SQL optimizer? (fix the sql)
in the xml to sql translator? (don't generate bad sql)
Makes most sense to fix it at the sql query translation time
reason:
the equiv. to simpler sql depends on a comb. of: props of the xml
to rel mapping, and relational integrity constraints
by the time the rdbms optimizer gets the sql, useful info about
the mapping has been forgotten
does all of this matter?
experiment goal: compare sql queries optimized by our techniques
against sql queries produced by previous algorithms
"how many ads appear in area1"
youch! (naive sql is 50 lines; the optimized is a one-liner)
what about optimizing the sql after it is generated? theoretically
possible.
why translation-time optimization better?
no changes to sql optimizer
mapping information is explicitly available, which can focus search
for better queries
can pre-compute useful info. and substantially reduce the run-time
overhead of this optimization
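To make the "unnecessary joins" point concrete, a toy sketch of my own
(the schema, table names, and predicate are made up): if the mapping
plus key/foreign-key constraints guarantee that every row in the leaf
table has the required ancestors, a count query only needs the suffix
of the join path:
    def translate_count(path_tables, ancestors_guaranteed):
        # path_tables: relational tables along the XML path, root first
        if ancestors_guaranteed:
            # translation-time knowledge lets us keep only the last table
            return "SELECT COUNT(*) FROM %s WHERE area = 'area1'" % path_tables[-1]
        # otherwise, emit the full join chain the naive translation would produce
        return ("SELECT COUNT(*) FROM " + ", ".join(path_tables) +
                " WHERE ...  -- join conditions over the whole path")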
[this is the best talk i've seen so far in terms of presentation]
Title: The NEXT framework for logical xquery optimization
---------------------------------------------------------
Alin Deutsch
Motivation: XQuery needs a declarative language for optimization
xquery -> normalization -> nested xml tableau (NEXT) -> minimization
(first step in logical optimization)
tree patterns are used in next
normalization reduces multiple xquery syntactic features into next
deals with nesting, set semantics, bag semantics, existential
quantifiers, multiple/redundant navigation points.
next exploits a new groupby construct
the ccc algorithm: collapse and check containment
results: unique minimal equivalent of a NEXT, np-hard (or worse),
polynomial for acyclic tree patterns, efficient by careful
implementation
Panel:
Biological data management: research, practice, and opportunities
-----------------------------------------------------------------
Thodoros Topaloglou
Susan B. Davidson
H.V. Jagadish
Victor M. Markowitz
Evan Steeg
Mike Tyers
This was a great session, but it was packed and I had to stand, and so
didn't take notes.
Alon Halevy Keynote
-------------------
Data integration: a higher-level abstraction than query languages
like SQL
independence of source and location, data model, syntax, semantic
variations
Application area 1: business
EII apps: CRM, ERP, Portals
legacy, enterprise databases, services and applications
50% of all IT $$$ is spent here (my note: people in business complain
that this is wasted money)
Application area 2: science
Hundreds of biomed data sources available, growing rapidly
Application area 3: web
The semantic web: vision #2
knowledge sharing at web scale
web resources are described by ontologies
rich domain models; allow reasoning
rdf/owl emerging standards; owl-lite may actually be useful
issues
too complex for users?
killer apps?
scalability of reasoning?
proposal: let's build the sw bottom up
he is skeptical... about description logics, etc. (of course,
he also proposed these, and many of his papers suggest their
use, but ...)
New Decade(s), Old Problem
data integration is on every 5-year db introspective (Lowell,
2003 latest)
data integration is a necessity; competitive strategy, B2B, science
significant new twists on the problem: number of sources, types of
data, queries, and answers, user skills
it's plain hard! (my position too)
Why is it hard
system reasons: managing diff. platforms, query proc. across
multiple systems
social reasons: locating and capturing relevant data in the
enterprise; convincing people to share (data fiefdoms); privacy and
performance implications
logic reasons: schema and data heterogeneity; challenge
independent of integration architecture! (the system isn't the
issue; it's dealing with the heterogeneities)
Explaining the title
structures, semantics, statistics
if we understand the structures and semantics, we can solve the
problem
statistics can bridge the gap
statistics on structure, previous actions, data instances, via
machine learning
(every keynote has to cite a philosopher): nelson goodman is
Alon's choice.
Fact (structures), fiction (semantics), forecast (statistics)
system vision
design time
mediated schema (top)
data sources (bottom)
first step: build wrappers (around sources)
second step: build mappings from wrappers (sources) to schema
you need: mapping tools, merge, compose, model management (the
more "general" framework), mediation language
run time
given a query over mediated schema, need to reformulate it so
that it is expressed against the sources, optimize and execute
it. This is where xml fits in (hmmm)
he is going to touch on each of these in the rest of the talk and
especially talk about the mapping tool (where statistics come in),
and about the user who interfaces with the mediated schema
wrappers
lots of work on this (IMO: although, it isn't that great)
screen scraping web pages as an example
take an amazon web page and generate an xml document
successful since lots of companies sell these tools (lixto,
Fetch, XQWrap, Wrapper induction)
mediation language and query reformulation
a language for describing relationships between mediated schema
and wrapped schemas
logic
mappings as query expressions, relationships more than mappings
GAV: global as view. goes back to the eighties
query reformulation is very easy (note: in easy cases, if you
have constraints it gets much harder)
hard to manage: new sources are hard to add
LAV: local as view; GLAV: global/local as view
reverse approach. assume mediated schema is a data source, and
describe local schemas as views.
IMO: a warning. this is how it is usually described, but
technically, this is not completely true, because these are
really all just mappings, whether gav, lav, or glav, and there
are more details like sound, complete, and so on, that aren't
typically considered until things get more complex
the mediated schema is hard to maintain
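A tiny made-up illustration of the GAV/LAV difference, written
Datalog-style inside Python strings just to keep the two side by side
(the relation names are mine, not from the talk):
    # GAV: each mediated relation is defined as a view over the sources,
    # so reformulation is basically view unfolding
    gav = {"Movie(title, year)": "src1_films(title, year, director)"}

    # LAV: each source is described as a view over the mediated schema,
    # so reformulation means answering queries using views
    lav = {"src2_recent(title)": "Movie(title, year), year >= 2000"}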
peer data management systems
the deeper problem is determining what the mediated schema would
contain. trying to model everything is too hard to do. people
want to share subsets of data, but don't want to build the
mediated schema
peers can share data, p1 maps its schema to p2, p2 to p3, i can
get data from p1 and p3, by composing mappings. they use
gav/glav to give the mappings
piazza, hyperion, peerdb, local relational models, etc., etc.
challenges
semantics: careful about cycles (or else query answering
becomes undecidable)
optimization: compose paths, prune
manage networks: consistency, quality, caching
IMO: but, what are the fundamental assumptions in this
architecture, and are they really realistic?
model management
generic infrastructure to manage schemas
manipulate models and mappings as bulk objects
ops to create and compose mappings, merge and diff models
short operator scripts can solve schema integration, schema
evolution, reverse engineering, etc.
an algebra for model management
general opportunity for fundamental theory and systems work
IMO: ... but is it the right abstraction?
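A hypothetical flavor of such an operator script (the mm.* calls are
invented for illustration, not an actual API) -- e.g., carrying a
mapping forward after schema evolution:
    def remap_after_evolution(mm, s1, s1_evolved, map_s1_to_t):
        evo_map = mm.match(s1_evolved, s1)       # relate evolved schema to the old one
        return mm.compose(evo_map, map_s1_to_t)  # new mapping: evolved schema -> target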
query processing
problems: DQP, few statistics, if any, network behavior issues:
latency, burstiness
solution: adaptive query processing
stonebraker saw it coming in Ingres; revivals by Graefe (93) and
DeWitt (98)
recent ideas: query scrambling, eddies, adaptive data
partitioning, inter-query processing
challenges: reduce wasted work; great opportunity for some theory
XML
The "data integration appetizer"; it doesn't solve any integration
problems, it only makes it worse (IMO: i am beginning to agree
with ... cheap shots at business ... cheap shots at business this)
industry went ahead of research; nimble, enosys, xqrl, inspiration
from tukwila, mix, strudel/agora
some issues: designing the internal algebra; dealing with evolving
xquery standard
our community has served an impressive smorgasbord of xml
techniques
his quote: There is a whole conference now dedicated to xml ... it
is called "SIGMOD" -- didn't get a big laugh out of that one
Comments about commerce
until 5 years ago, data integration = data warehousing
since then: a wave of startups, nimble, enosys, metamatrix,
calixa, composite
big guys made announcements (IBM, BEA); delay ... big guys
released products
success: analysts have a new buzzword - eii; new addition to acronym
soup, along with eai (enterprise info. integration; enterprise app. integration)
mapping tool
half the effort on projects today is spent on these tools
semantic mappings: attributes in source to attributes in target
why is it so hard (automatic matching)?
schemas never fully capture their intended meaning: they are
just symbols and structures; automatic schema matching is
AI-complete (solve the problem, you have solved all the problems
in AI); our goal is to reduce the human effort.
comparison-based matching
build a model for every element in the schema; and compare models
models based on: names of elements, data types, data instances,
text descriptions, integrity constraints (Rahm Bernstein survey, 01)
insufficient information is the problem
key hypothesis: SSS
statistics offer clues about the semantics of the structures
statistics can be gleaned from collections of schemas and
mappings; a collection provides alternate representations of
domain concepts
human experts do this unconsciously
obtaining more evidence
IMO: this stuff is very naive ...
statistics for schema design patterns
the corpus: details
learn model ensemble for each corpus element
build a classifier that does machine learning and will tell
you how similar one attribute is to another attribute [Madhavan]
need training examples to generalize, then pick a machine
learning algorithm -- lots of these, each with different
properties. so, need many different algorithms (model
ensemble), employ a set of these and each makes a prediction
where to get the training data? it comes from the corpus,
gleaned from schema and mappings, and positive examples
(elements of the schema itself; previous mappings)
cluster
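My toy sketch of the "model ensemble" idea: several base models each
score a candidate attribute pair and the ensemble combines the scores
(the real system learns its models from the corpus of schemas and past
mappings; the two matchers here are simplistic stand-ins):
    def name_model(a, b):
        return 1.0 if a.lower() == b.lower() else 0.0

    def token_model(a, b):
        ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
        return len(ta & tb) / max(len(ta | tb), 1)

    def ensemble_score(a, b, models=(name_model, token_model)):
        return sum(m(a, b) for m in models) / len(models)

    # e.g. ensemble_score("author_name", "name") -> 0.25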
challenges
building corpora, in steps? shell of 1000, outer shell of
10,000, the grand expanse of 450,000?
engineering a corpus. tuning: removing noise, selecting content;
domain specific?
The user
hard to find information
find my vldb04 paper, and the powerpoint (maybe in an
attachment); find emails from my california friends, which ozsu
paper did i cite in my paper, etc
on the fly integration: who published at sigmod but was not
recently on the pc? data integration systems are built for
queries run over and over again, not necessarily for user queries
the deeper problem
we're not integrated in the user's habitat. the user comes to us,
we don't go to the user.
the most profound technologies are those that disappear - mark weiser
but we are very visible: schema always comes first; dichotomy
between structured and unstructured data (the structure chasm),
integration: only for high-volume needs
user-centered data management
bring the benefits of data management to users, in their habitat
create associations between disparate objects and data sources
(every piece of data should be somehow connected to some other
data)
create an ecology of cooperating personal Memex's (v. bush)
Semex (semantic explorer)
give user a domain model to work with (like a schema)
create associations between sources and domain model
shows a simple user interface
looks like a fancy bookmarks browser with keyword query
also contains links
and has association queries
with the domain model, can import sources into the domain model,
to query the domain model for info
Principles of user-centered dm
create associations for the user; every data item should be
associated with others
adapt to and learn about the user: use statistics to leverage
previous user activities
manipulate any kind of data
data and "schema" evolve over time
life-long data management
modeling ...
IMO: I think this is very close to some of the ideas we have for
SEEK ... and there are some interesting ideas for us as well.