[seek-kr-sms] more notes from wednesday
Shawn Bowers
bowers at sdsc.edu
Wed Sep 1 14:32:48 PDT 2004
Just a note ... bertram you might get a kick out of the last paper
shawn
---------------
Supporting ontology-based semantic matching in RDBMS
-----------------------------------------------------
NOTE: This was part of the industrial session.
Example: Restaurant guide application
"ONT_EXPAND" operator, returns a set of rows, where each row is
filled in as part of the ontology
Retrieves a subset of the ontology indirectly
ONT_RELATED, ONT_PATH, ONT_DISTANCE, ONT_INDEXTYPE
ONT_RELATED operator
takes term1, reltype, term2, ontoname; returns integer:
1 if term1 and term2 are related via reltype, otherwise 0
reltype can be isa, or boolean exprs. like isa or eqv
ONT_DISTANCE operator: takes a label, returns the length of the
shortest path
select * from served_food where ONT_RELATED( 'cuisine', 'isa',
'western', ...)
query ontology directly to find relationships
ont_expand: takes term1, reltype, term2, ontoname, returns a table
of <term1, property, term2, distance, path> tuples
nulls represent wildcards (ugh)
select * from table( ont_expand( null, 'isa', 'latin american',
'cuisine_ontology' ) )
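To make the operator concrete, here is a rough Python sketch (not Oracle's implementation; the toy cuisine terms and the edge table are invented) reading ont_expand as a reachability query over the isa hierarchy, one row per derived relationship, with None as the wildcard:

```python
from collections import deque

# Toy cuisine ontology, invented for illustration: child -> parents.
ISA = {
    "mexican": ["latin american"],
    "brazilian": ["latin american"],
    "latin american": ["western"],
}

def ont_expand(term1, reltype, term2):
    """Return <term1, property, term2, distance, path> rows; None = wildcard."""
    assert reltype == "isa"
    rows = []
    starts = [term1] if term1 else list(ISA)
    for start in starts:
        # BFS upward through the isa hierarchy from `start`.
        queue = deque([(start, 0, [start])])
        while queue:
            node, dist, path = queue.popleft()
            if dist > 0 and (term2 is None or node == term2):
                rows.append((start, "isa", node, dist, path))
            for parent in ISA.get(node, []):
                queue.append((parent, dist + 1, path + [parent]))
    return rows
```

With term1 left as the wildcard, this returns one row per term that is (transitively) a 'latin american' cuisine, which is what the query above asks for.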
implementation
simple rels handled by table lookup
transitive rels handled using oracle's CONNECT BY, which
allows traversing hierarchical data stored in tables
symmetric relationships handled as part of the above steps
more complex owl (DL and Lite) inferencing is handled by an
initial preprocessing phase done as part of loading the owl
ontology
NOTE: I was just reading in the owl api paper why this has
generally been considered a bad idea, mainly due to editing /
updating the ontology (which is of course inevitable)
speeding up these queries
an "extensible" indexing type. precompute transitive closures
looks like a regular create index statement, but add parameters
that you want to index (ontology=cuisine_ontology), e.g.
index table: (cuisine, row_id)
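A minimal sketch of the idea behind the index (toy data and function names invented, not Oracle's actual index structure): precompute the transitive closure once, so an ONT_RELATED-style check becomes a single lookup.

```python
# Toy isa hierarchy, invented for illustration: child -> direct parents.
ISA = {
    "mexican": {"latin american"},
    "latin american": {"western"},
}

def transitive_closure(edges):
    """Naive fixpoint: extend each term's ancestor set until nothing changes."""
    closure = {node: set(parents) for node, parents in edges.items()}
    changed = True
    while changed:
        changed = False
        for node, ancestors in closure.items():
            extra = set()
            for a in ancestors:
                extra |= closure.get(a, set())
            if not extra <= ancestors:
                ancestors |= extra
                changed = True
    return closure

CLOSURE = transitive_closure(ISA)    # precomputed at "index creation" time

def ont_related(term1, term2):
    # With the closure precomputed, this is just a table lookup.
    return 1 if term2 in CLOSURE.get(term1, set()) else 0
```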
performance results
oracle 9i, lots of terms (he went too fast through that slide...)
performance is linear for ont_expand
applications for semantic matching
homeland security application
a person x has rented a truck, and a person y has bought
fertilizers and x and y stay at the same address
select * from Activity_x, Activity_y
where x.action = 'rent' and y.action = 'buy' and ...
ont_related (x.object isa ...)
a life science example
... see paper
conclusions
ontologies must be stored in databases for efficient maintenance,
sharing, and scalability
equally important is the capability of ontology-based semantic
matching
future work: address ontology merging and supporting defined rules
NOTE: this is exactly the kind of integration GEON is doing ...
Similarity Search for Web Services
----------------------------------
Xin Luna Dong (UW)
This is "Woogle"
Web service search
first-generation web-service search engines do keyword search on
web-service descriptions (WSDL?)
binding point, web central, salcentral, web service of the day,
remote methods, etc.
they say this isn't enough
doesn't capture enough underlying semantics
does not accurately specify user's information needs
also, they don't return the individual services, only the one
description that covers all of them
need to "drill" down
how to improve this?
offer more flexibility via similarity based on semantics
(not just keywords)
example
four services, all take zip code (but different outputs)
some outputs are inputs to others
searching with woogle
returns list of operations with similar functionalities, a list
with similar inputs, a list with similar outputs, a list of those
that can be composed
elementary problems
operation matching: given a web-service operation, return a list
of similar operations
input/output matching: given the i/o of a web-service op, return a
list of web-service ops with similar i/os
can we apply previous work?
software comp. matching: require the knowledge of implementation;
we only know the interface
schema matching: similarity on diff. granularity; web-services are
more loosely related (NOTE: what? isn't wsdl just xml? why doesn't
this work? this doesn't make sense)
text-document matching: nope, this won't work either, they say: only
a couple of sentences, etc., and web services are structured (NOTE:
so why not the previous...)
NOTE: their reasoning here is a little hard to follow
Approach:
Part 1: Exploit structure (wsdl, op name + description, i/o
param names, etc.)
cluster i/o param names ...
Clustering parameter names
heuristic: parameter terms tend to express the same concept if
they occur together often
strategy: cluster parameter terms into concepts based on their
co-occurrence.
given terms p and q, similarity from p to q:
sim(p->q) = P(q|p)
ideal clustering: high cohesion and low correlation; cohesion
measures intra-cluster term simil. correlation measures the
inter-cluster term sim; cohesion/corr score = (avg(coh) /
avg(corr))
algorithm: a series of refinements of the classic agglomerative
clustering; basic agglomerative clustering: merge clusters ...
cohesion condition: each term in the result cluster is close to
most (half) of the other terms in the cluster. Refined algo: merge
clusters
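As a rough sketch of the idea (the toy operations, the 0.5 threshold, and all names below are invented, and this is basic agglomerative merging rather than their refined algorithm): estimate sim(p->q) = P(q|p) from co-occurrence counts, then merge clusters subject to the cohesion condition that each term is close to at least half of the others.

```python
from itertools import combinations

# Toy data, invented: each set is the parameter names of one operation.
OPERATIONS = [
    {"city", "state", "zip"},
    {"city", "state"},
    {"zip", "city"},
    {"temperature", "humidity"},
    {"temperature", "humidity", "wind"},
]

def cond_prob(p, q):
    """sim(p -> q) = P(q | p), estimated from co-occurrence counts."""
    with_p = [op for op in OPERATIONS if p in op]
    return sum(q in op for op in with_p) / len(with_p)

def sim(p, q):
    # Symmetrized so the merge test below treats p,q the same way.
    return min(cond_prob(p, q), cond_prob(q, p))

def cluster(terms, threshold=0.5):
    """Agglomerative merging with the cohesion condition: every term in a
    merged cluster must be close to at least half of the other terms."""
    clusters = [{t} for t in terms]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            cand = a | b
            if all(sum(sim(t, u) >= threshold for u in cand - {t})
                   >= len(cand - {t}) / 2 for t in cand):
                clusters.remove(a); clusters.remove(b); clusters.append(cand)
                merged = True
                break
    return clusters
```

On the toy data this separates the address-like parameters from the weather-like ones, which is the flavor of concept the paper is after.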
a bunch of examples ... they sound pretty good, but it's not
apparent how what they are doing actually solves these examples
(probably have to read the paper)
(WARNING: I am not a big fan of these kinds of "soft" approaches
for schema matching; so my comments may be a bit biased, but I
have found these approaches have big "wow" factors, so I am in the
minority a bit)
measure top-k precision
benchmark: 25 web-service ops; from several domains; with
diff. input/output, etc.
top-k precision for op matching is 85% (quoted as "pretty high").
ignoring structure or just text matching, results "not that good"
(WARNING: those statistically inclined should look away now)
measure precision/recall
8 web-services and 15 in/out
6 domains, diff. popularity, etc.
woogle (compared to only description or structure) is better
conclusions
defined primitives for web-service search
algorithms for sim. search on web-service ops
etc.
on-going work
template search
specify input and output and brief description of operations
(I want the input to be city state, output to be weather) -- isn't
this another "AI complete" problem? They claim it is pretty close
...
they claim what they have is better than google, since you can ask
questions like "what is the temperature in seattle" -- since
presumably they can compose and run the services
Data sharing through query translation in autonomous sources
------------------------------------------------------------
A. Kementsietsidis
A network of autonomous sources
each can have: its own data types (genes, proteins, articles); its
own schema; its own vocabulary (gene names, prot. names, etc.); its
own set of data-sharing sources (a subset of sources)
treat each source as an access
requirements: share data without a common schema and where schemas
haven't been reconciled; translate and propagate queries between
sources
use mapping tables (sigmod last year)
associates values between sources; can be multiple tables;
associate values within or across domains; recorded associations
are m:n in general; they support variables to express ...
want to use mapping tables in structured query translation
query lang: SPJ, no negation, selection is disjunction or
conjunctions of atoms
characterize translations: given q and q' and a mapping table m; we
define what it means for sound and complete translations
compute translations: propose efficient algorithms to compute
translations
with multiple mapping tables, decide which mapping tables to use
during translation. solution: let the query guide the selection
process
our algorithms manipulate queries and mapping tables. it proves
beneficial to have a uniform representation for both
constructs. solution: represent queries as T-queries (a tableau-like
rep. of queries that uses the same formalism as mapping tables)
reduce run-time execution costs
solution: reuse previous translations
seems pretty simple...
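A minimal sketch of the basic value-translation step (the mapping table, gene identifiers, and function name are all invented for illustration): a disjunctive selection on source A's vocabulary is rewritten into the corresponding selection on source B's vocabulary, with m:n associations fanning out as needed.

```python
# Toy mapping table: (A-value, B-value) pairs; associations are m:n.
MAPPING = [
    ("geneA1", "g-101"),
    ("geneA1", "g-102"),   # one A value maps to two B values
    ("geneA2", "g-103"),
]

def translate_selection(values):
    """Translate a disjunctive selection on source A's vocabulary into
    the corresponding selection on source B's vocabulary."""
    return sorted({b for a, b in MAPPING if a in values})

# q :  sigma_{gene IN ('geneA1')}            at source A
# q':  sigma_{gene IN ('g-101', 'g-102')}    at source B
```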
Linear Road: A stream data management benchmark
-----------------------------------------------
contribs
novel sdms benchmark, some implementations
inspiration
variable tolling
cars have gps, emit "position reports"
tolls = f(speed, congestion, accidents, ...)
linear road
every car emits a "position report" every 30 sec
set of 5 continuous, historical queries
each w/ max response time constraint
l-rating: number of expressways supported (1000 PRs/sec each)
challenges/solution
semantically valid input (streams should be meaningful
time-series)
input generated by traffic simulator (MITSim)
proper performance metrics (not "completion time" but "response
time")
L-rating: supported load while meeting requirements
many correct results possible (answer can depend on evolving
historical state)
validation tools
no query language standard (expressed in a high-level lang)
caveat
a stress test, not a real traffic model
simplistic assumptions: global clock, no position interpolation,
etc.
backdrop
linear city: 10 horizontal expressways, each with 100 mile-long
segments
3 lanes, 1 entrance and exit ramp
input
simulator produces 3 hrs of traffic data
each car: one position report every 30 seconds
data consists of: position reports (car, time, position data)
piggyback ...
toll alert query
car crosses into new segment
charge toll for seg #i; calculate toll for seg #i+1 based on
previous-minute stats: avg speed, # of cars, accidents nearby; 5
sec. response time
accident alert query
car emits 4 reports from same position -> "stopped"
two cars stopped at same position -> "accident"
alert all vehicles entering a segment ...
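The stopped/accident rules can be sketched directly over the report stream (a toy sketch; the tuple layout and `detect_accidents` name are invented, not from the benchmark):

```python
from collections import defaultdict

def detect_accidents(reports):
    """reports: iterable of (time, car, position) in stream order.
    A car is "stopped" after 4 consecutive reports from the same
    position; two cars stopped at one position means "accident"."""
    history = defaultdict(list)     # car -> positions reported so far
    stopped = defaultdict(set)      # position -> cars stopped there
    accidents = []
    for time, car, pos in reports:
        history[car].append(pos)
        if history[car][-4:] == [pos] * 4:   # 4 reports, same spot
            stopped[pos].add(car)
            if len(stopped[pos]) >= 2:
                accidents.append((time, pos))
    return accidents
```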
historical query
account balance (5 sec response)
what do i owe? more than 1 correct result (depends on when
answered)
daily expenditure (10 sec response)
what did i spend on xway x on day d? day d any day in last 10
weeks
travel time/toll estimate (30 sec. response)
predict travel from p1 to p2 on xway x, day d, time t. use last
10 weeks for prediction
implementations
Aurora
comments
some stuff doesn't make sense (like requiring two stationary cars
for a wreck); it is more of a stress test
Query languages and data models for database sequences and data
streams
---------------------------------------------------------------
Carlo Zaniolo
most dsms use sql; queries spanning both streams and dbs would be
easier. but sql suffers from major new problems: it was designed for
persistent data, not transient data
even for persistent data it is far from ideal. areas not supported:
data mining, sequence queries
this talk covers shortcomings of sql on data streams, and how to
overcome them
blocking operators
current dbms's make heavy use of blocking operators
characterize blocking and non-blocking operators
partial ordering
let s be a sequence [t1, ..., tn]
then [t1, ..., tk] is the presequence of s of length k, denoted
Sk
if for some k, L = Sk, we say that L is a presequence of S
presequence defines a partial order
generalizes to subset when order and duplicates are immaterial
the empty sequence is a subsequence of every other sequence
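A tiny sketch of the presequence relation just defined (function name invented):

```python
def is_presequence(l, s):
    """L is a presequence of S iff L equals some prefix S_k of S."""
    return list(s[:len(l)]) == list(l)
```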
theorem: queries can be expressed via nonblocking computations iff
they are monotonic
a query lang. L can express a given set of functions on its input;
to avoid blocking queries, only the monotonic functions expressible
by L should be allowed on data streams
but can ALL of them be expressed using the nb-operators of L?
L contains nb-operators and blocking operators: only the former can
be used on data streams -- so, can those express all the monotonic
queries expressible in L?
L nb-complete ...
example
bidstream(item, bidval, time)
select item
from bidstream
group by item
having sum(bidval) > 100000;
this is monotonic. it can be expressed in a language containing
suitable query ops, but not in sql-2, which is not nb-complete;
thus sql-2 is ill-suited for continuous queries on data streams
i think he is saying sum is implemented as a blocking operator,
but it doesn't have to be implemented this way
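To make that note concrete, here is a sketch (function name and toy stream invented) of the bidstream query computed nonblockingly: because the answer is monotonic (an item, once over the threshold, stays in the result), the sum can be maintained incrementally and each item emitted the moment its total crosses 100000, with no need to wait for end-of-stream.

```python
def over_threshold(bidstream, limit=100_000):
    """bidstream: iterable of (item, bidval, time) tuples."""
    totals, emitted = {}, []
    for item, bidval, _time in bidstream:
        totals[item] = totals.get(item, 0) + bidval
        if totals[item] > limit and item not in emitted:
            emitted.append(item)   # emit immediately; never retracted
    return emitted
```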
relational algebra
nonmonotonic ra operators: set difference and division
we are left with: select, project, join, and union. can these
express all FO monotonic queries?
some interesting temporal queries: coalesce and until
they are expressible in ra (by double negation)
they are monotonic
they cannot be expressed in nb-relational algebra
"review"
sql's lack of expressivity is a major problem for database-centric
apps
these problems significantly more serious for data streams since:
only monotonic; etc.
embedding sql in a prog. lang.
ops not expressible in sql can be expressed in pl
but cursors are 'pull-based' and not usable on streams
cannot hold the current tuple and all that follow waiting for
the pl to request it by a getNext statement
punctuation: source of the data inserts punctuations to denote the
end of certain kind of data, when the auction closes on item95, e.g.
queries finding the total volume of bids for item95 can then be
answered, but there is little benefit for our example query (because
you have to wait) ...
windows as synopses. approximations, e.g., compute the continuous
sum.
user defined aggregates as a solution
supported by Obj-Rel systems, where they are defined in an external
PL
Aurora supports a specific sublanguage
can be defined in SQL itself; native extensibility (but ... )
example
aggregate myavg(next int) : Real
( table state(tsum int, cnt int);
initialize: ( ... insert into state values (next, 1) );
iterate: ( update state set tsum = tsum + next, cnt = cnt + 1 ...);
terminate: insert into return select tsum/cnt from state
)
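For contrast, a sketch in Python (names invented) of what a nonblocking version of the same aggregate looks like: with no terminate step, results must be emitted from the iterate step, so the running average is produced after every input tuple instead of once at end-of-stream.

```python
def online_avg(stream):
    """Nonblocking running average: yields a result per input tuple."""
    tsum = cnt = 0
    for next_val in stream:
        tsum += next_val        # the ITERATE step
        cnt += 1
        yield tsum / cnt        # emit now: no TERMINATE needed
```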
theorem
sql with uda's defined in sql is turing complete --- i.e., it can
express every possible db query
this result holds if we allow both blocking and nb uda's
but what about data streams, where we can only use nb operators?
there are query languages that are turing complete on stored
data but not on data streams ...
this is called a "tumbling window" -- various kinds of windows
used with data streams
nb-udas
those with an empty, or missing, terminate clause
theorem: sql with nb-udas is nb-complete: it can express all
monotonic queries
nb-udas natively in sql: the cornerstone of their expressive stream
language ESL, which supports time-series queries, data mining, etc.
well
conclusion
db approach shows promises, but the challenges should not be
underestimated
we provided a formal analysis of the language problems of
sql-based cqls
proposed an effective solution: udas that make sql turing
complete, but nb-complete (on data streams)
ESL is up and running on the stream mill system
(http://wis.cs.ucla.edu)
here come the questions ...
dave maier and jennifer widom go to the mike...
as they walk there, dave says: "you go ahead; do you want me to
lift you up to the mike?" (the fun begins ... :-)
widom: agrees with approach; disagrees that sql is a stream
query language: quote "I haven't seen anyone propose this, have
you?" ... "no one has proposed sql semantics, maybe syntax, but not
semantics"
carlo: "then why use sql syntax?"
widom: "interesting results, but wrong wrapping"
carlo: "your comment shows some form of lack of thought"
carlo: "if i was an nsf reviewer i would scratch it"
widom: "oh, that was you!"
widom: "does esl extend atlas?"
... didn't get answer ....
widom: "ok, here comes mister punctuation"
dave: "assuming sql would be implemented in a certain way
underneath; would it be possible to translate it into operators
implemented in a different way?", i.e., as non-blocking...
carlo: "that is what we propose; they can do it"
dave: "then you have the problem of translating into nb
operators; then I have to do it"