[seek-kr-sms] more notes from wednesday
    Shawn Bowers 
    bowers at sdsc.edu
       
    Wed Sep  1 14:32:48 PDT 2004
    
    
  
Just a note ... bertram you might get a kick out of the last paper
shawn
---------------
Supporting semantics-based semantic matching in RDBMS
-----------------------------------------------------
NOTE: This was part of the industrial session.
  Example: Restaurant guid application
  "ONT_EXPAND" operator, returns a set of rows, where each row is
  filled in as part of the ontology
  Retrieves subset of onto indirectly
  ONT_RELATED, ONT_PATH, ONT_DISTANCE, ONT_INDEXTYPE
  ONT_RELATED operator
    takes term1, reltype, term, ontoname, returns integer
    if term1 and term2 are related via reltype, returns 1, otherwise 0
    reltype can be isa, or boolean exprs. like isa or eqv
  ONT_DISTANCE operator: takes label, returns length of shortest path
  length
    select * from served_food where ONT_RELATED( 'cuisine', 'isa',
      western'; ...)
    query ontology directly to find relationships
  ont_expand: takes term1, reltype, term2, ontoname, returns a table
  of <term1, property, term2, distance, path> tuples
     nulls represent wildcards (ugh)
  select * from table ( ont_expand( null, 'isa', 'latin american',
    cuisine_ontology)
  implementation
    simple rels handled by table lookup
    transfitive rels handled by using oracl'es connect by, which
    allows traversing hierarchical data stored in tables
    symmetric relationships handled as part of the above steps
    more complex owl (DL and lite) inferencing is handled by an
    initial phase of preprocessing dones as part of loading the owl
    ontology
      NOTE: I was just reading in the owl api paper why this has
      generally been considered a bad idea.  mainly due to editing the
      ontology / updating the onto (which is of course inevitable)
    speeding up these queries
      an "extensible" indexing type.  precompute transitive closures
      looks like a regular create index statement, but add parameters
      that you want to index (ontology=cuisine_ontology), e.g.
      index table: (cuisine, row_id)
  performance results
    oracle 9i, lots of terms (he went too fast through that slide...)
    performance is linear for ont_expand
  applications for semantic matching
    homeland security application
      a person x has rented a truck, and a person y has bought
      fertilizers and x and y stay at the same address
      select * from Activy_x, Activy_y
      where x.action = 'rent' and y.action = 'buy' and ...
      ont_related (x.object isa ...)
    a life science example
      ... see paper
  conclusions
    ontologies must be stored in databases for efficient maintenance,
    qsharing, and scalability
    equally important is the capability of ontology-based semantic
    matching
    future work: address ontology merging and supporting defined rules
  NOTE: this is exactly the kind of integration GEON is doing ...
Similarity Search for Web Services
----------------------------------
Xin Luna Dong (UW)
  This is "Woogle"
  Web service search
    first-generation web-service search engines do keyword search on
    web-service descriptions (WSDL?)
      binding point, web central, salcentral, web service of the day,
      remote methods, et.
      they say this isn't enough
    doesn't capture enough underlying semantics
    does not accurately specify user's information needs
    also, don't specify the services, only the description of all of
    them for a description
      need to "drill" down
  how to improve this?
    offer more flexibility by offering more similarity
    via semantics (not just keywords)
  example
    four services, all take zip code (but different outputs)
    some outputs are inputs to others
  searching with woogle
    returns list of operations with similar functionalities, a list
    with similar inputs, a list with similar outputs, a list of those
    that can be composed
  elementary problems
    operation matching: given a web-service operation, return a list
    of similar operations
    input/output matching: given the i/o of a web-service op, return a
    list of we service ops with similar i/os
  can we apply previous work?
    software comp. matching: require the knowledge of implementation;
    we only know the interface
    schema matching: similarity on diff. granularity; web-services are
    more loosely related (NOTE: what? isn't wsdl just xml? why doesn't
    this work? this doesn't make sense)
    text-document matching: nope this won't work either they say. only
    a coupld of sentences, etc. web services are structured (NOTE: so
    why not the previous...)
    NOTE: this is a little hard to follow; their reasoning here
  Approach:
    Part 1: Exploit structure (wsdl, op name + description, i/o
    param namess, etc.)
      cluster i/o param names ...
  Custering parameter names
    heuristic: parameter terms tend to express the same concept if
    they occur together often
    strategy: cluster parameter terms into concepts based on their
    co-occurence.
      given terms p and q, similarity from p to q:
        sim(p->q) = P(q|p)
      ideal clustering: high cohesion and low correlation; cohesion
      measures intra-cluster term simil. correlation measures the
      inter-cluster term sim; cohesion/corr score = (avg(coh) /
      avg(corr))
    algorithm: a series of refinements of the classic agglomerative
    clustering; basic agglomerative clustering: merge clusters ...
    cohesion condition: each term in the result cluster is close to
    most (half) of the other terms in the cluster. Refined algo: merge
    clusters
    a bunch of examples ... sound pretty good, but not apparent how
    what they are doing actually does this, haven't shown how what
    they do can solve these examples (probably have to read the paper)
    (WARNING: I am not a big fan of these kinds of "soft" approaches
    for schema matching; so my comments may be a bit biased, but I
    have found these approaches have big "wow" factors, so I am in the
    minority a bit)
  measure top-k precision
    benchmark: 25 web-service ops; from several domains; with
    diff. input/outpu, etc.
    top-k precision for op matching is 85% (quoted as "pretty high").
    ignoring structure or just text matching, results "not that good"
    (WARNING: those statistically inclined should look away now)
  measure precision/recall
    8 web-services and 15 in/out
    6 domains, diff. popularity, etc.
    woogle (compared to only description or structure) is better
  conclusions
    defined primitives for web-service search
    algorithms for sim. search on web-service ops
    etc.
  on-going work
    template search
    specify input and output and brief description of operations
    (I want the input to be city state, output to be weather) -- isn't
    this another "AI complete" problem? They claim it is pretty close
    ...
    they claim what they have is better than google, since you can ask
    questions like "what is the temperature in seattle" -- since
    presumably they can compose and run the services
Data sharing through query translation in autonomous sources
------------------------------------------------------------
A. Kementsietsidis
  A network of autonomous sources
  each can have: its own data types (genes, proteins, articles); it
  own different schema; its own vocabulary (gene names, prot. names,
  etc); its own set of data sharing sources (a subset of sources)
  treat each source as an access
  requirements: share data without a common schema and where schemas
  haven't been rreconciled; translate and propagate query between
  sources
  use mapping tables (sigmod last year)
    associates values between sources; can be multiple tables;
    associate values within or across domains; recorded assocaitions
    are m:n in general; they support variables to express
    want to use mapping tables in structured query translation
  query lang: SPJ, no negation, selection is disjunction or
  conjunctions of atoms
  characterize translations: given q and q' and a mapping table m; we
  define what it means for sound and complete translations
  compute translations: propose efficient algorithms to compute
  translations
  with multiple mapping tables, decide which mapping tables to use
  during translation. solution: let the query guide the selection
  process
  our algorithms manipulate queries and mapping tables. it proves
  beneficial to have a uniform representation for both
  constructs. solution: represent queries as T-queries (tableau-like
  rep. of queries that uses the same formalism with mapping tables)
  reduce run-time execution costs
  solution: reuse previous translations
  seems pretty simple...
Linear Road: A stream data management benchmark
-----------------------------------------------
  contribs
    novel sdms benchmark, some implementations
  inspiration
    variable tolling
      cars have gps, emit "position reports"
      tolls - f(speed, conjestion, accidents, ...)
  linear road
    every car emits a "position report" every 30 sec
    set of 5 continuous, historical queries
    each w/ max response time constraint
    l-rating: number of express ways (1000 PRs/sec)
  challenges/solution
    semantically valid input (streams should be meaningful
    time-series)
      input generated by traffic simulator (MITSim)
    proper performance metrics (not "completion time" but "response
    time")
      L-rating: supported load while meeting requirements
    many correct results possible (answer can depend on evolving
    historical state)
      validation tools
    no query language standard (expressed in a high-level lang)
  caveat
    a stress test, not a real traffic model
    simplistic assumptions: global clock, no position interpolation,
    etc.
  backdrop
    linear city: 10 horizontal express ways, with 100 mile long
    segments
    3 lanes, 1 entrance and exit ramp
  input
    simulator produces 3 hrs of traffic data
      each car: one position report every 30 seconds
    data consists of: position reports (car, time, position data)
    piggyback ...
  toll alert query
    car crosses into new segment
    charge toll for seg #i. calculate toll for seg#i+1 based on
    previous minute stats: avg speed, # of cars, accident near; 5
    sec. response time
  accident alert query
    car emits 4 reports from same position -> "stopped"
    two cars stopped at same position -> "accident"
    alert all vehicles entering a segment ...
  historical query
    account balance (5 sec reponse)
      what do i owe? more than 1 correct result (depends on when
      answered)
    daily expenditure (10 sec response)
      what did i spend on xway x on day d? day d any day in last 10
      weeks
    travel time/toll estimate (30 sec. response)
      predict travel from p1 to p2 on xway x, day d, time t. use last
      10 weeks for prediction
  implementations
    Aurora
  comments
    some stuff doesn't make sense: like two cars stationary for a
    wreck, it would be more of a stress-test
Query languages and data models for database sequences and data
streams
---------------------------------------------------------------
Carlo Zaniolo
  most dsms use sql; queries span both streams and dbs will be easier
  but sql suffers from major new problems; designed for persistent
  data, not transitive data
  for persistent data, far from ideal. areas not supported: data
  mining, sequence queries
  this talk covers shortcomings of sql on data streams, and how to
  overcome them
  blocking operators
    current dbms's make heavy use of blocking operators
    characterize blocking and non-blocking operators
  partial ordering
    let s be a sequence [t1, ..., tn]
    then, [t1, ..., tn] is the presequence of s, of length k, denoted
    Sk
    if for some k, L-Sk, we say that L is a presequence of S
    presequence  defines a partial order
    generalizes to subset when order and duplicates are immaterial
    the empty sequence is a subsequence of every other sequence
  theorem: queries can be expressed via nonblocking computations iff
  they are monotonic
  a query lang. L can express a given set of functions on its input;
  to avaoid blocking queries, only the monotonic functions expressible
  by L should be allowed on data streams
  But ALL of them are expressible using the nb-operators of L?
  L contains nb-operators and blocking operators: only the former can
  be used on data streams -- so, can those express all the monotonic
  queries expressible in L
  L nb-complete ...
  example
    bidstream(item, bidval, time)
    select item
    from bidstream
    group by item
    having sum(bidval) > 100000;
    this is monotonic. it can be expressed in a lnaguage containing
    suitable query ops, but not in sql-2; which is not nb-complete;
    thus it is ill-suited for continuous queries on data streams
      i think he is saying sum is implemented as a blocking operator,
      but it doesn't have to be implemented this way
  relational algebra
    nonmonotonic ra operators: set difference and division
    we are left with: select, project, join, and union. can these
    express all FO monotonic queries?
    some interesting temporal queries: coalesce and until
      they are expressible in ra (by double negation)
      they are monotonic
      they cannot be expressed in nb-relational algrebra
  "review"
    sql's lack of expressivity is a major problem for database-centric
    apps
    these problems significantly more serious for data streams since:
    only monotonic; etc.
  embedding sql in a prog. lang.
    ops not expressible in sql can be expressed in pl
    but cursors are 'pull-based' and not usable on streams
      cannot hold the current tuple and all that follow waiting for
      the pl to request it by a getNext statement
  punctuation: source of the data inserts punctuations to denote the
  end of certain kind of data, when the auction closes on item95, e.g.
    queries find the total volume of bids for item95 can then be
    answered, but little benefit for our example query (because you
    have to wait) ...
    windows as synopses.  approximations, e.g., compute the continuous
    sum.
  user defined aggregates as a solution
    supported by Obj-Rel systems, were they are defined in an external
    PL
    Aurora suppors specific sublanguage
    can be defined in SQL itself; native extensibility (but ... )
  example
    aggregrate myavg(next int) : Real
    ( table state(tsum int, cnt int);
      initialize: ( ... insert into state values (next, 1) )
      iterate: ( update state set tsum-tsum+next, cnt=cnt_1 ...)
      terminate: insert into return select tsum/cnt from state
    )
  theorem
    sql with uda's defined in sql is turing complete --- i.e., it can
    express every possible db query
    this result holds if we allow both blocking and nb uda's
    but what about data streams where we can only us nb operators?
      there are query languages that are turing complete on stored
      data but not on data streams ...
    this is called a "tumbling window" -- various kinds of windows
    used with data streams
  nb-udas
    those with an empty, or missing, terminate clause
    thereom: sql with nb-udas is nb-complete: it can express all
    monotonic queries
    nb-udas natively in sql: cornerstone of their expressive streams
    language ESL, that supports well times series queries, data
    mining, etc., etc.
  conclusion
    db approach shows promises, but the challenges should not be
    underestimated
    we provided a formal analysis of the language problems of
    sql-based cqls
    proposed an effective solution: udas that make sql turing
    complete, but nb-complete (on data streams)
  ESL is up and running on the stream mill system
  (http://wis.cs.ucla.edu)
  here come the questions ...
      dave maier and jennifer widom go to the mike...
      as they walk there, dave says: "you go ahead; do you want me to
      lift you up to the mike?" (the fun begins ... :-)
      widom: agrees with approach; disagrees that sql is a stream
      query language: quote "I haven't seen anyone propose this, have
      you?" ... "no one has proposed sql semantics, maybe syntax, but not
      semantics"
      carlo: "then why use sql syntax?"
      widom: "interesting results, but wrong wrapping"
      carlo: "your comment shows some form of lack of thought"
      carlo: "if i was an nsf reviewer i would scratch it"
      widom: "oh, that was you!"
      widom: "does esl extend atlas?"
      ... didn't get answer ....
      widom: "ok, here comes mister punctuation"
      dave: "assuming sql would be implemented in a certain way
      underneath; would it be possible to translate it into operators
      implemented in a different way?", i.e., as non-blocking...
      carlo: "that is what we propose; the can do it"
      dave: "then you have the problem of translating into nb
      operators; then I have to do it"
    
    
More information about the Seek-kr-sms
mailing list