[seek-kr-sms] more notes from wednesday
Shawn Bowers
bowers at sdsc.edu
Sat Sep 4 15:22:52 PDT 2004
On Fri, 3 Sep 2004, Bertram Ludaescher wrote:
>
> Seems like VLDB starts to be fun after all ;-)
>
> Thanks for sharing your notes!
>
> (and making them googable too ;-)
Yeah ... i left out the good stuff because of this.
shawn
>
> Bertram
>
> Shawn Bowers writes:
> >
> > Just a note ... bertram you might get a kick out of the last paper
> >
> > shawn
> >
> > ---------------
> >
> > Supporting ontology-based semantic matching in RDBMS
> > ----------------------------------------------------
> > NOTE: This was part of the industrial session.
> >
> > Example: Restaurant guide application
> >
> > "ONT_EXPAND" operator, returns a set of rows, where each row is
> > filled in as part of the ontology
> >
> > Retrieves a subset of the ontology indirectly
> >
> > ONT_RELATED, ONT_PATH, ONT_DISTANCE, ONT_INDEXTYPE
> >
> > ONT_RELATED operator
> >
> > takes term1, reltype, term2, ontoname, returns integer
> >
> > if term1 and term2 are related via reltype, returns 1, otherwise 0
> >
> > reltype can be isa, or boolean exprs. like isa or eqv
> >
> > ONT_DISTANCE operator: takes a label, returns the length of the
> > shortest path
> >
> > select * from served_food where ONT_RELATED( 'cuisine', 'isa',
> > 'western', ...)
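A rough sketch of what an ONT_RELATED-style operator computes (my reconstruction, not Oracle's implementation; the ontology encoding and data are hypothetical): treat the ontology as a directed graph per relationship type and test transitive reachability.

```python
from collections import deque

def ont_related(term1, reltype, term2, ontology):
    """Return 1 if term1 is related to term2 via reltype (followed
    transitively) in the ontology, encoded here as a dict mapping
    (term, reltype) -> list of related terms; 0 otherwise."""
    seen = set()
    queue = deque([term1])
    while queue:
        term = queue.popleft()
        if term == term2:
            return 1
        if term in seen:
            continue
        seen.add(term)
        queue.extend(ontology.get((term, reltype), []))
    return 0

# toy cuisine ontology (made-up data, not from the paper)
cuisine = {
    ('french', 'isa'): ['western'],
    ('italian', 'isa'): ['western'],
    ('provencal', 'isa'): ['french'],
}
```

So `ont_related('provencal', 'isa', 'western', cuisine)` returns 1 even though the relationship is indirect.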
> >
> > query ontology directly to find relationships
> >
> > ont_expand: takes term1, reltype, term2, ontoname, returns a table
> > of <term1, property, term2, distance, path> tuples
> >
> > nulls represent wildcards (ugh)
> >
> > select * from table( ont_expand( null, 'isa', 'latin american',
> > 'cuisine_ontology' ) )
> >
> > implementation
> >
> > simple rels handled by table lookup
> >
> > transitive rels handled by using oracle's CONNECT BY, which
> > allows traversing hierarchical data stored in tables
> >
> > symmetric relationships handled as part of the above steps
> >
> > more complex owl (DL and Lite) inferencing is handled by an
> > initial phase of preprocessing done as part of loading the owl
> > ontology
> >
> > NOTE: I was just reading in the owl api paper why this has
> > generally been considered a bad idea, mainly due to editing /
> > updating the ontology (which is of course inevitable)
> >
> > speeding up these queries
> >
> > an "extensible" indexing type. precompute transitive closures
> >
> > looks like a regular create index statement, but add parameters
> > that you want to index (ontology=cuisine_ontology), e.g.
> >
> > index table: (cuisine, row_id)
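The indexing idea, sketched in Python (my reconstruction; all names are made up): precompute the transitive closure at index-build time so each query-time check is a set lookup rather than a recursive traversal.

```python
def build_closure(edges):
    """Precompute the transitive closure of an is-a edge list:
    returns a dict mapping each term to the set of all its
    (direct and transitive) ancestors. Assumes an acyclic
    hierarchy, as is-a ontologies normally are."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    closure = {}

    def ancestors(term):
        if term in closure:
            return closure[term]
        result = set()
        for p in parents.get(term, ()):
            result.add(p)
            result |= ancestors(p)
        closure[term] = result
        return result

    for term in parents:
        ancestors(term)
    return closure

# index-build time: one pass over the ontology
closure = build_closure([('french', 'western'), ('provencal', 'french')])
# query time: a membership check instead of a recursive traversal
assert 'western' in closure['provencal']
```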
> >
> > performance results
> >
> > oracle 9i, lots of terms (he went too fast through that slide...)
> >
> > performance is linear for ont_expand
> >
> > applications for semantic matching
> >
> > homeland security application
> >
> > a person x has rented a truck, and a person y has bought
> > fertilizers and x and y stay at the same address
> >
> > select * from Activity_x, Activity_y
> > where x.action = 'rent' and y.action = 'buy' and ...
> > ont_related( x.object, 'isa', ... )
> >
> > a life science example
> >
> > ... see paper
> >
> >
> > conclusions
> >
> > ontologies must be stored in databases for efficient maintenance,
> > sharing, and scalability
> >
> > equally important is the capability of ontology-based semantic
> > matching
> >
> > future work: address ontology merging and supporting defined rules
> >
> > NOTE: this is exactly the kind of integration GEON is doing ...
> >
> > Similarity Search for Web Services
> > ----------------------------------
> > Xin Luna Dong (UW)
> >
> > This is "Woogle"
> >
> > Web service search
> >
> > first-generation web-service search engines do keyword search on
> > web-service descriptions (WSDL?)
> >
> > binding point, web central, salcentral, web service of the day,
> > remote methods, etc.
> >
> > they say this isn't enough
> >
> > doesn't capture enough underlying semantics
> >
> > does not accurately specify user's information needs
> >
> > also, results point at whole web-service descriptions, not the
> > individual operations within them
> >
> > need to "drill" down
> >
> > how to improve this?
> >
> > offer more flexibility via similarity search
> >
> > via semantics (not just keywords)
> >
> > example
> >
> > four services, all take zip code (but different outputs)
> >
> > some outputs are inputs to others
> >
> > searching with woogle
> >
> > returns list of operations with similar functionalities, a list
> > with similar inputs, a list with similar outputs, a list of those
> > that can be composed
> >
> > elementary problems
> >
> > operation matching: given a web-service operation, return a list
> > of similar operations
> >
> > input/output matching: given the i/o of a web-service op, return a
> > list of web-service ops with similar i/os
> >
> > can we apply previous work?
> >
> > software comp. matching: requires knowledge of the implementation;
> > we only know the interface
> >
> > schema matching: similarity on diff. granularity; web-services are
> > more loosely related (NOTE: what? isn't wsdl just xml? why doesn't
> > this work? this doesn't make sense)
> >
> > text-document matching: nope, this won't work either, they say. only
> > a couple of sentences, etc. web services are structured (NOTE: so
> > why not the previous...)
> >
> > NOTE: this is a little hard to follow; their reasoning here
> >
> > Approach:
> >
> > Part 1: Exploit structure (wsdl, op name + description, i/o
> > param names, etc.)
> >
> > cluster i/o param names ...
> >
> > Clustering parameter names
> >
> > heuristic: parameter terms tend to express the same concept if
> > they occur together often
> >
> > strategy: cluster parameter terms into concepts based on their
> > co-occurrence.
> >
> > given terms p and q, similarity from p to q:
> >
> > sim(p->q) = P(q|p)
> >
> > ideal clustering: high cohesion and low correlation; cohesion
> > measures intra-cluster term simil. correlation measures the
> > inter-cluster term sim; cohesion/corr score = (avg(coh) /
> > avg(corr))
> >
> > algorithm: a series of refinements of the classic agglomerative
> > clustering; basic agglomerative clustering: merge clusters ...
> >
> > cohesion condition: each term in the result cluster is close to
> > most (half) of the other terms in the cluster. Refined algo: merge
> > clusters
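A toy sketch of the co-occurrence similarity plus the cohesion-conditioned merge, as I understood the talk (the threshold and the merge order are my guesses; the paper surely differs):

```python
from itertools import combinations

def cond_prob(p, q, inputs):
    """sim(p -> q) = P(q | p): the fraction of parameter lists
    containing p that also contain q."""
    with_p = [s for s in inputs if p in s]
    if not with_p:
        return 0.0
    return sum(1 for s in with_p if q in s) / len(with_p)

def cohesive(cluster, inputs, threshold=0.5):
    """Cohesion condition: every term must be 'close' (sim above a
    threshold) to at least half of the other terms in the cluster."""
    for t in cluster:
        others = [u for u in cluster if u != t]
        close = sum(1 for u in others
                    if cond_prob(t, u, inputs) >= threshold)
        if close < len(others) / 2:
            return False
    return True

def cluster_terms(terms, inputs):
    """Greedy agglomerative clustering: repeatedly merge a pair of
    clusters whose union still satisfies the cohesion condition."""
    clusters = [{t} for t in terms]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            if cohesive(a | b, inputs):
                clusters.remove(a)
                clusters.remove(b)
                clusters.append(a | b)
                merged = True
                break
    return clusters

# toy example (hypothetical parameter lists from web-service ops)
inputs = [{'city', 'state'}, {'city', 'state', 'zip'}, {'temperature'}]
clusters = cluster_terms(['city', 'state', 'temperature'], inputs)
```

Here 'city' and 'state' co-occur often, so they end up in one concept cluster while 'temperature' stays separate.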
> >
> > a bunch of examples ... sound pretty good, but not apparent how
> > what they are doing actually does this, haven't shown how what
> > they do can solve these examples (probably have to read the paper)
> > (WARNING: I am not a big fan of these kinds of "soft" approaches
> > for schema matching; so my comments may be a bit biased, but I
> > have found these approaches have big "wow" factors, so I am in the
> > minority a bit)
> >
> > measure top-k precision
> >
> > benchmark: 25 web-service ops; from several domains; with
> > diff. inputs/outputs, etc.
> >
> > top-k precision for op matching is 85% (quoted as "pretty high").
> >
> > ignoring structure or just text matching, results "not that good"
> >
> > (WARNING: those statistically inclined should look away now)
> >
> > measure precision/recall
> >
> > 8 web-services and 15 in/out
> >
> > 6 domains, diff. popularity, etc.
> >
> > woogle (compared to only description or structure) is better
> >
> >
> > conclusions
> >
> > defined primitives for web-service search
> >
> > algorithms for sim. search on web-service ops
> >
> > etc.
> >
> > on-going work
> >
> > template search
> >
> > specify input and output and brief description of operations
> >
> > (I want the input to be city state, output to be weather) -- isn't
> > this another "AI complete" problem? They claim it is pretty close
> > ...
> >
> > they claim what they have is better than google, since you can ask
> > questions like "what is the temperature in seattle" -- since
> > presumably they can compose and run the services
> >
> >
> > Data sharing through query translation in autonomous sources
> > ------------------------------------------------------------
> > A. Kementsietsidis
> >
> > A network of autonomous sources
> >
> > each can have: its own data types (genes, proteins, articles); its
> > own schema; its own vocabulary (gene names, prot. names,
> > etc); its own set of data sharing sources (a subset of sources)
> >
> > treat each source as an access
> >
> > requirements: share data without a common schema and where schemas
> > haven't been reconciled; translate and propagate queries between
> > sources
> >
> > use mapping tables (sigmod last year)
> >
> > associates values between sources; can be multiple tables;
> > associate values within or across domains; recorded associations
> > are m:n in general; they support variables to express ...
> > want to use mapping tables in structured query translation
> >
> > query lang: SPJ, no negation, selection is disjunction or
> > conjunctions of atoms
> >
> > characterize translations: given q and q' and a mapping table m, we
> > define what it means for q' to be a sound/complete translation of q
> >
> > compute translations: propose efficient algorithms to compute
> > translations
> >
> > with multiple mapping tables, decide which mapping tables to use
> > during translation. solution: let the query guide the selection
> > process
> >
> > our algorithms manipulate queries and mapping tables. it proves
> > beneficial to have a uniform representation for both
> > constructs. solution: represent queries as T-queries (tableau-like
> > rep. of queries that uses the same formalism with mapping tables)
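A toy sketch of translating a selection through a mapping table (my reconstruction of the general idea only; the paper's T-query formalism, variables, and soundness/completeness machinery are much richer):

```python
def translate_selection(values, mapping):
    """Translate a disjunctive selection (a set of local values)
    into a peer's vocabulary via a mapping table, given as a list
    of (source_value, target_value) pairs (m:n in general).
    Sound in the sense that every returned value is associated
    with some requested value."""
    return {tgt for src, tgt in mapping if src in values}

# hypothetical gene-name mapping table between two sources
mapping = [('TP53', 'p53'), ('TP53', 'LFS1'), ('BRCA1', 'brca-1')]
```

E.g. a query on `TP53` at one source becomes a query on `p53 OR LFS1` at the peer.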
> >
> > reduce run-time execution costs
> >
> > solution: reuse previous translations
> >
> > seems pretty simple...
> >
> >
> >
> > Linear Road: A stream data management benchmark
> > -----------------------------------------------
> >
> > contribs
> >
> > novel sdms benchmark, some implementations
> >
> > inspiration
> >
> > variable tolling
> >
> > cars have gps, emit "position reports"
> >
> > tolls = f(speed, congestion, accidents, ...)
> >
> > linear road
> >
> > every car emits a "position report" every 30 sec
> >
> > set of 5 continuous, historical queries
> >
> > each w/ max response time constraint
> >
> > l-rating: number of expressways supported (1000 PRs/sec each)
> >
> > challenges/solution
> >
> > semantically valid input (streams should be meaningful
> > time-series)
> >
> > input generated by traffic simulator (MITSim)
> >
> > proper performance metrics (not "completion time" but "response
> > time")
> >
> > L-rating: supported load while meeting requirements
> >
> > many correct results possible (answer can depend on evolving
> > historical state)
> >
> > validation tools
> >
> > no query language standard (expressed in a high-level lang)
> >
> > caveat
> >
> > a stress test, not a real traffic model
> >
> > simplistic assumptions: global clock, no position interpolation,
> > etc.
> >
> > backdrop
> >
> > linear city: 10 horizontal expressways, each 100 miles long,
> > divided into 1-mile segments
> >
> > 3 lanes, 1 entrance and exit ramp
> >
> > input
> >
> > simulator produces 3 hrs of traffic data
> >
> > each car: one position report every 30 seconds
> >
> > data consists of: position reports (car, time, position data)
> > piggyback ...
> >
> >
> > toll alert query
> >
> > car crosses into new segment
> >
> > charge toll for seg #i. calculate toll for seg #i+1 based on
> > previous minute's stats: avg speed, # of cars, accidents nearby; 5
> > sec. response time
> >
> > accident alert query
> >
> > car emits 4 reports from same position -> "stopped"
> >
> > two cars stopped at same position -> "accident"
> >
> > alert all vehicles entering a segment ...
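The stopped/accident rule above can be sketched as a single pass over the position-report stream (my sketch only; the benchmark's validator is more involved, and this naively re-reports an ongoing accident on every further report):

```python
from collections import defaultdict, deque

def detect_accidents(position_reports):
    """position_reports: iterable of (car, time, position) in time
    order. A car emitting 4 consecutive reports from the same
    position is 'stopped'; two cars stopped at the same position
    is an 'accident'. Yields (time, position) accident events."""
    history = defaultdict(lambda: deque(maxlen=4))  # car -> last 4 positions
    stopped_at = defaultdict(set)                   # position -> stopped cars
    for car, time, pos in position_reports:
        h = history[car]
        if h and h[-1] != pos:
            stopped_at[h[-1]].discard(car)  # car moved on
        h.append(pos)
        if len(h) == 4 and len(set(h)) == 1:
            stopped_at[pos].add(car)
            if len(stopped_at[pos]) >= 2:
                yield (time, pos)

# two cars each emit four 30-second reports at position 10
reports = [(car, t * 30, 10) for t in range(4) for car in (1, 2)]
```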
> >
> > historical query
> >
> > account balance (5 sec response)
> >
> > what do i owe? more than 1 correct result (depends on when
> > answered)
> >
> > daily expenditure (10 sec response)
> >
> > what did i spend on xway x on day d? day d any day in last 10
> > weeks
> >
> > travel time/toll estimate (30 sec. response)
> >
> > predict travel from p1 to p2 on xway x, day d, time t. use last
> > 10 weeks for prediction
> >
> > implementations
> >
> > Aurora
> >
> > comments
> >
> > some stuff doesn't make sense, like needing two stationary cars
> > for a wreck; it is more of a stress test
> >
> >
> > Query languages and data models for database sequences and data
> > streams
> > ---------------------------------------------------------------
> > Carlo Zaniolo
> >
> > most dsms use sql; queries that span both streams and dbs will be
> > easier
> >
> > but sql suffers from major new problems; designed for persistent
> > data, not transient data
> >
> > for persistent data, far from ideal. areas not supported: data
> > mining, sequence queries
> >
> > this talk covers shortcomings of sql on data streams, and how to
> > overcome them
> >
> > blocking operators
> >
> > current dbms's make heavy use of blocking operators
> >
> > characterize blocking and non-blocking operators
> >
> > partial ordering
> >
> > let s be a sequence [t1, ..., tn]
> >
> > then, [t1, ..., tk] is the presequence of s of length k, denoted
> > Sk
> >
> > if, for some k, L = Sk, we say that L is a presequence of S
> >
> > presequence defines a partial order
> >
> > generalizes to subset when order and duplicates are immaterial
> >
> > the empty sequence is a presequence of every other sequence
> >
> > theorem: queries can be expressed via nonblocking computations iff
> > they are monotonic
> >
> > a query lang. L can express a given set of functions on its input;
> > to avoid blocking queries, only the monotonic functions expressible
> > by L should be allowed on data streams
> >
> > But are ALL of them expressible using the nb-operators of L?
> >
> > L contains nb-operators and blocking operators: only the former can
> > be used on data streams -- so, can those express all the monotonic
> > queries expressible in L?
> >
> > L nb-complete ...
> >
> > example
> >
> > bidstream(item, bidval, time)
> >
> > select item
> > from bidstream
> > group by item
> > having sum(bidval) > 100000;
> >
> > this is monotonic. it can be expressed in a language containing
> > suitable query ops, but not in sql-2, which is not nb-complete;
> > thus sql-2 is ill-suited for continuous queries on data streams
> >
> > i think he is saying sum is implemented as a blocking operator,
> > but it doesn't have to be implemented this way
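If so, the point can be illustrated with a nonblocking evaluation of the bidstream query (my sketch): emit an item as soon as its running sum crosses the threshold; since the answer only ever grows, nothing has to wait for end-of-stream.

```python
from collections import defaultdict

def hot_items(bidstream, threshold=100000):
    """Nonblocking evaluation of:
         SELECT item FROM bidstream GROUP BY item
         HAVING SUM(bidval) > 100000
    Emits each qualifying item as soon as its running sum crosses
    the threshold -- possible because the query is monotonic."""
    totals = defaultdict(int)
    emitted = set()
    for item, bidval, time in bidstream:
        totals[item] += bidval
        if totals[item] > threshold and item not in emitted:
            emitted.add(item)
            yield item
```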
> >
> > relational algebra
> >
> > nonmonotonic ra operators: set difference and division
> >
> > we are left with: select, project, join, and union. can these
> > express all FO monotonic queries?
> >
> > some interesting temporal queries: coalesce and until
> >
> > they are expressible in ra (by double negation)
> >
> > they are monotonic
> >
> > they cannot be expressed in nb-relational algebra
> >
> > "review"
> >
> > sql's lack of expressivity is a major problem for database-centric
> > apps
> >
> > these problems significantly more serious for data streams since:
> > only monotonic; etc.
> >
> > embedding sql in a prog. lang.
> >
> > ops not expressible in sql can be expressed in pl
> >
> > but cursors are 'pull-based' and not usable on streams
> >
> > cannot hold the current tuple and all that follow waiting for
> > the pl to request it by a getNext statement
> >
> > punctuation: source of the data inserts punctuations to denote the
> > end of certain kind of data, when the auction closes on item95, e.g.
> >
> > queries that find the total volume of bids for item95 can then be
> > answered, but there is little benefit for our example query
> > (because you still have to wait) ...
> >
> > windows as synopses. approximations, e.g., compute the continuous
> > sum.
> >
> > user defined aggregates as a solution
> >
> > supported by Obj-Rel systems, where they are defined in an external
> > PL
> >
> > Aurora supports a specific sublanguage
> >
> > can be defined in SQL itself; native extensibility (but ... )
> >
> > example
> >
> > aggregate myavg(next int) : Real
> > ( table state(tsum int, cnt int);
> >
> > initialize: ( ... insert into state values (next, 1) )
> >
> > iterate: ( update state set tsum = tsum + next, cnt = cnt + 1 ...)
> >
> > terminate: insert into return select tsum/cnt from state
> > )
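The initialize/iterate/terminate protocol of the myavg aggregate maps naturally onto a small class (my Python analog, not from the talk):

```python
class MyAvg:
    """Python analog of the myavg UDA: INITIALIZE sets up the
    state on the first tuple, ITERATE folds in each later tuple,
    TERMINATE produces the result. Blocking, because TERMINATE
    needs to have seen the whole input."""
    def initialize(self, first):
        self.tsum, self.cnt = first, 1

    def iterate(self, nxt):
        self.tsum += nxt
        self.cnt += 1

    def terminate(self):
        return self.tsum / self.cnt

def run_uda(uda, values):
    """Drive a UDA over a finite input, as a DBMS would."""
    it = iter(values)
    uda.initialize(next(it))
    for v in it:
        uda.iterate(v)
    return uda.terminate()
```

A nonblocking UDA would simply have no (or an empty) terminate step, emitting results from iterate instead.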
> >
> > theorem
> >
> > sql with uda's defined in sql is turing complete --- i.e., it can
> > express every possible db query
> >
> > this result holds if we allow both blocking and nb uda's
> >
> > but what about data streams where we can only use nb operators?
> >
> > there are query languages that are turing complete on stored
> > data but not on data streams ...
> >
> > this is called a "tumbling window" -- various kinds of windows
> > used with data streams
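A tumbling window can be sketched like this (my sketch; the window size is arbitrary): apply the blocking aggregate to fixed-size, non-overlapping chunks of the stream and reset between them.

```python
def tumbling_avg(stream, size):
    """Apply a blocking aggregate (average) over non-overlapping
    'tumbling' windows: emit one result per full window, then
    discard the window's state and start over. A trailing partial
    window produces no output."""
    window = []
    for x in stream:
        window.append(x)
        if len(window) == size:
            yield sum(window) / size
            window = []
```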
> >
> > nb-udas
> >
> > those with an empty, or missing, terminate clause
> >
> > theorem: sql with nb-udas is nb-complete: it can express all
> > monotonic queries
> >
> > nb-udas natively in sql: cornerstone of their expressive streams
> > language ESL, which supports time-series queries, data mining,
> > etc., etc., well
> >
> > conclusion
> >
> > db approach shows promise, but the challenges should not be
> > underestimated
> >
> > we provided a formal analysis of the language problems of
> > sql-based cqls
> >
> > proposed an effective solution: udas that make sql turing
> > complete, but nb-complete (on data streams)
> >
> > ESL is up and running on the stream mill system
> > (http://wis.cs.ucla.edu)
> >
> > here come the questions ...
> >
> > dave maier and jennifer widom go to the mike...
> >
> > as they walk there, dave says: "you go ahead; do you want me to
> > lift you up to the mike?" (the fun begins ... :-)
> >
> > widom: agrees with approach; disagrees that sql is a stream
> > query language: quote "I haven't seen anyone propose this, have
> > you?" ... "no one has proposed sql semantics, maybe syntax, but not
> > semantics"
> >
> > carlo: "then why use sql syntax?"
> >
> > widom: "interesting results, but wrong wrapping"
> >
> > carlo: "your comment shows some form of lack of thought"
> >
> > carlo: "if i was an nsf reviewer i would scratch it"
> >
> > widom: "oh, that was you!"
> >
> > widom: "does esl extend atlas?"
> >
> > ... didn't get answer ....
> >
> > widom: "ok, here comes mister punctuation"
> >
> > dave: "assuming sql would be implemented in a certain way
> > underneath; would it be possible to translate it into operators
> > implemented in a different way?", i.e., as non-blocking...
> >
> > carlo: "that is what we propose; they can do it"
> >
> > dave: "then you have the problem of translating into nb
> > operators; then I have to do it"
> >
> >
> >
> > _______________________________________________
> > seek-kr-sms mailing list
> > seek-kr-sms at ecoinformatics.org
> > http://www.ecoinformatics.org/mailman/listinfo/seek-kr-sms
>