[seek-kr-sms] more notes from wednesday
Bertram Ludaescher
ludaesch at sdsc.edu
Fri Sep 3 06:37:29 PDT 2004
Seems like VLDB starts to be fun after all ;-)
Thanks for sharing your notes!
(and making them googable too ;-)
Bertram
Shawn Bowers writes:
>
> Just a note ... bertram you might get a kick out of the last paper
>
> shawn
>
> ---------------
>
> Supporting semantics-based semantic matching in RDBMS
> -----------------------------------------------------
> NOTE: This was part of the industrial session.
>
> Example: Restaurant guide application
>
> "ONT_EXPAND" operator, returns a set of rows, where each row is
> filled in as part of the ontology
>
> Retrieves a subset of the ontology indirectly
>
> ONT_RELATED, ONT_PATH, ONT_DISTANCE, ONT_INDEXTYPE
>
> ONT_RELATED operator
>
> takes term1, reltype, term2, ontoname, returns integer
>
> if term1 and term2 are related via reltype, returns 1, otherwise 0
>
> reltype can be isa, or boolean exprs. like isa or eqv
>
> ONT_DISTANCE operator: takes label, returns length of shortest path
> length
>
> select * from served_food where ONT_RELATED( 'cuisine', 'isa',
> 'western', ...)
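> A minimal sketch of how such an ONT_RELATED-style check could work; the
> 'isa' edges and cuisine names below are invented, and this is not
> Oracle's actual implementation:

```python
# Toy 'isa' hierarchy: child -> parent (all values invented).
ISA_EDGES = {
    "french": "western",
    "italian": "western",
    "western": "cuisine",
    "mexican": "latin american",
    "latin american": "cuisine",
}

def ont_related(term1, reltype, term2):
    """Return 1 if term1 reaches term2 transitively via reltype, else 0,
    mirroring the operator's integer result."""
    if reltype != "isa":
        raise ValueError("only 'isa' is modeled in this sketch")
    cur = term1
    while cur in ISA_EDGES:          # walk up the hierarchy transitively
        cur = ISA_EDGES[cur]
        if cur == term2:
            return 1
    return 0

print(ont_related("french", "isa", "western"))   # 1
print(ont_related("mexican", "isa", "western"))  # 0
```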
>
> query ontology directly to find relationships
>
> ont_expand: takes term1, reltype, term2, ontoname, returns a table
> of <term1, property, term2, distance, path> tuples
>
> nulls represent wildcards (ugh)
>
> select * from table ( ont_expand( null, 'isa', 'latin american',
> 'cuisine_ontology') )
>
> implementation
>
> simple rels handled by table lookup
>
> transitive rels handled using Oracle's CONNECT BY, which
> allows traversing hierarchical data stored in tables
>
> symmetric relationships handled as part of the above steps
>
> more complex owl (DL and Lite) inferencing is handled by an
> initial preprocessing phase done as part of loading the owl
> ontology
>
> NOTE: I was just reading in the owl api paper why this has
> generally been considered a bad idea. mainly due to editing the
> ontology / updating the onto (which is of course inevitable)
>
> speeding up these queries
>
> an "extensible" indexing type. precompute transitive closures
>
> looks like a regular create index statement, but add parameters
> that you want to index (ontology=cuisine_ontology), e.g.
>
> index table: (cuisine, row_id)
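> The precompute-the-closure idea behind the extensible index can be
> sketched as follows (edge data invented):

```python
# Precompute the transitive closure of an 'isa' hierarchy once, so that
# ONT_RELATED / ONT_DISTANCE become simple lookups -- the gist of the
# extensible index described above. Edges are illustrative only.
from collections import deque

ISA_EDGES = {
    "french": ["western"],
    "western": ["cuisine"],
    "mexican": ["latin american"],
    "latin american": ["cuisine"],
}

def transitive_closure(edges):
    """Map each term to {ancestor: shortest distance} via BFS."""
    closure = {}
    for start in edges:
        dist = {}
        q = deque([(start, 0)])
        while q:
            node, d = q.popleft()
            for parent in edges.get(node, []):
                if parent not in dist:
                    dist[parent] = d + 1
                    q.append((parent, d + 1))
        closure[start] = dist
    return closure

idx = transitive_closure(ISA_EDGES)
print(idx["french"]["cuisine"])  # 2  (french -> western -> cuisine)
```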
>
> performance results
>
> oracle 9i, lots of terms (he went too fast through that slide...)
>
> performance is linear for ont_expand
>
> applications for semantic matching
>
> homeland security application
>
> a person x has rented a truck, and a person y has bought
> fertilizers and x and y stay at the same address
>
> select * from Activity_x, Activity_y
> where x.action = 'rent' and y.action = 'buy' and ...
> ont_related (x.object isa ...)
>
> a life science example
>
> ... see paper
>
>
> conclusions
>
> ontologies must be stored in databases for efficient maintenance,
> sharing, and scalability
>
> equally important is the capability of ontology-based semantic
> matching
>
> future work: address ontology merging and supporting defined rules
>
> NOTE: this is exactly the kind of integration GEON is doing ...
>
> Similarity Search for Web Services
> ----------------------------------
> Xin Luna Dong (UW)
>
> This is "Woogle"
>
> Web service search
>
> first-generation web-service search engines do keyword search on
> web-service descriptions (WSDL?)
>
> binding point, web central, salcentral, web service of the day,
> remote methods, etc.
>
> they say this isn't enough
>
> doesn't capture enough underlying semantics
>
> does not accurately specify user's information needs
>
> also, these engines don't search the services themselves, only
> their descriptions
>
> need to "drill" down
>
> how to improve this?
>
> offer more flexibility through similarity search
>
> via semantics (not just keywords)
>
> example
>
> four services, all take zip code (but different outputs)
>
> some outputs are inputs to others
>
> searching with woogle
>
> returns list of operations with similar functionalities, a list
> with similar inputs, a list with similar outputs, a list of those
> that can be composed
>
> elementary problems
>
> operation matching: given a web-service operation, return a list
> of similar operations
>
> input/output matching: given the i/o of a web-service op, return a
> list of web-service ops with similar i/os
>
> can we apply previous work?
>
> software comp. matching: require the knowledge of implementation;
> we only know the interface
>
> schema matching: similarity on diff. granularity; web-services are
> more loosely related (NOTE: what? isn't wsdl just xml? why doesn't
> this work? this doesn't make sense)
>
> text-document matching: nope, this won't work either, they say: only
> a couple of sentences, etc.; web services are structured (NOTE: so
> why not the previous...)
>
> NOTE: this is a little hard to follow; their reasoning here
>
> Approach:
>
> Part 1: Exploit structure (wsdl, op name + description, i/o
> param names, etc.)
>
> cluster i/o param names ...
>
> Clustering parameter names
>
> heuristic: parameter terms tend to express the same concept if
> they occur together often
>
> strategy: cluster parameter terms into concepts based on their
> co-occurrence.
>
> given terms p and q, similarity from p to q:
>
> sim(p->q) = P(q|p)
>
> ideal clustering: high cohesion and low correlation; cohesion
> measures intra-cluster term simil. correlation measures the
> inter-cluster term sim; cohesion/corr score = (avg(coh) /
> avg(corr))
>
> algorithm: a series of refinements of the classic agglomerative
> clustering; basic agglomerative clustering: merge clusters ...
>
> cohesion condition: each term in the result cluster is close to
> most (half) of the other terms in the cluster. Refined algo: merge
> clusters
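> The co-occurrence heuristic and the similarity measure can be sketched
> like this (the parameter lists are invented; the full refined
> clustering algorithm is in the paper):

```python
# Sketch of sim(p -> q) = P(q | p), estimated from how often parameter
# terms appear together in operation signatures. Signatures invented.
from itertools import combinations
from collections import Counter

signatures = [
    ["city", "state", "zip"],
    ["city", "state"],
    ["zip", "temperature"],
    ["city", "state", "country"],
]

# Count, per term, the number of operations it appears in, and per
# ordered pair, the number of operations containing both terms.
term_count = Counter(t for sig in signatures for t in set(sig))
pair_count = Counter()
for sig in signatures:
    for p, q in combinations(sorted(set(sig)), 2):
        pair_count[(p, q)] += 1
        pair_count[(q, p)] += 1

def sim(p, q):
    """sim(p -> q) = P(q | p): co-occurrences of p and q over occurrences of p."""
    return pair_count[(p, q)] / term_count[p]

print(sim("city", "state"))  # 1.0: 'state' occurs in every op that has 'city'
print(sim("zip", "city"))    # 0.5
```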
>
> a bunch of examples ... sound pretty good, but not apparent how
> what they are doing actually does this, haven't shown how what
> they do can solve these examples (probably have to read the paper)
> (WARNING: I am not a big fan of these kinds of "soft" approaches
> for schema matching; so my comments may be a bit biased, but I
> have found these approaches have big "wow" factors, so I am in the
> minority a bit)
>
> measure top-k precision
>
> benchmark: 25 web-service ops; from several domains; with
> diff. input/output, etc.
>
> top-k precision for op matching is 85% (quoted as "pretty high").
>
> ignoring structure or just text matching, results "not that good"
>
> (WARNING: those statistically inclined should look away now)
>
> measure precision/recall
>
> 8 web-services and 15 in/out
>
> 6 domains, diff. popularity, etc.
>
> woogle (compared to only description or structure) is better
>
>
> conclusions
>
> defined primitives for web-service search
>
> algorithms for sim. search on web-service ops
>
> etc.
>
> on-going work
>
> template search
>
> specify input and output and brief description of operations
>
> (I want the input to be city state, output to be weather) -- isn't
> this another "AI complete" problem? They claim it is pretty close
> ...
>
> they claim what they have is better than google, since you can ask
> questions like "what is the temperature in seattle" -- since
> presumably they can compose and run the services
>
>
> Data sharing through query translation in autonomous sources
> ------------------------------------------------------------
> A. Kementsietsidis
>
> A network of autonomous sources
>
> each can have: its own data types (genes, proteins, articles); it
> own different schema; its own vocabulary (gene names, prot. names,
> etc); its own set of data sharing sources (a subset of sources)
>
> treat each source as an access
>
> requirements: share data without a common schema and where schemas
> haven't been reconciled; translate and propagate queries between
> sources
>
> use mapping tables (sigmod last year)
>
> associates values between sources; can be multiple tables;
> associate values within or across domains; recorded associations
> are m:n in general; they support variables to express ...
>
> want to use mapping tables in structured query translation
>
> query lang: SPJ, no negation, selection is disjunction or
> conjunctions of atoms
>
> characterize translations: given q and q' and a mapping table m; we
> define what it means for sound and complete translations
>
> compute translations: propose efficient algorithms to compute
> translations
>
> with multiple mapping tables, decide which mapping tables to use
> during translation. solution: let the query guide the selection
> process
>
> our algorithms manipulate queries and mapping tables. it proves
> beneficial to have a uniform representation for both
> constructs. solution: represent queries as T-queries (tableau-like
> rep. of queries that uses the same formalism with mapping tables)
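> A minimal sketch of the translation idea, assuming a toy mapping table
> (all values invented):

```python
# Sketch: a selection over local gene names is rewritten into the
# peer's vocabulary by joining the query constants against the mapping
# table's recorded value associations (m:n in general).

# mapping table: (source_value, target_value) associations -- invented
MAPPING = [
    ("geneA", "g-0001"),
    ("geneA", "g-0002"),   # m:n: one local name maps to two peer ids
    ("geneB", "g-0003"),
]

def translate_selection(values):
    """Rewrite a selection 'name IN values' into the peer's vocabulary.
    Sound: every emitted value is associated with some queried value."""
    return sorted({t for (s, t) in MAPPING if s in values})

print(translate_selection({"geneA"}))  # ['g-0001', 'g-0002']
```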
>
> reduce run-time execution costs
>
> solution: reuse previous translations
>
> seems pretty simple...
>
>
>
> Linear Road: A stream data management benchmark
> -----------------------------------------------
>
> contribs
>
> novel sdms benchmark, some implementations
>
> inspiration
>
> variable tolling
>
> cars have gps, emit "position reports"
>
> tolls = f(speed, congestion, accidents, ...)
>
> linear road
>
> every car emits a "position report" every 30 sec
>
> set of 5 continuous, historical queries
>
> each w/ max response time constraint
>
> L-rating: number of expressways (1000 PRs/sec)
>
> challenges/solution
>
> semantically valid input (streams should be meaningful
> time-series)
>
> input generated by traffic simulator (MITSim)
>
> proper performance metrics (not "completion time" but "response
> time")
>
> L-rating: supported load while meeting requirements
>
> many correct results possible (answer can depend on evolving
> historical state)
>
> validation tools
>
> no query language standard (expressed in a high-level lang)
>
> caveat
>
> a stress test, not a real traffic model
>
> simplistic assumptions: global clock, no position interpolation,
> etc.
>
> backdrop
>
> linear city: 10 horizontal expressways, each 100 miles long and
> divided into segments
>
> 3 lanes, 1 entrance and exit ramp
>
> input
>
> simulator produces 3 hrs of traffic data
>
> each car: one position report every 30 seconds
>
> data consists of: position reports (car, time, position data)
> piggyback ...
>
>
> toll alert query
>
> car crosses into new segment
>
> charge toll for seg #i. calculate toll for seg#i+1 based on
> previous minute stats: avg speed, # of cars, accident near; 5
> sec. response time
>
> accident alert query
>
> car emits 4 reports from same position -> "stopped"
>
> two cars stopped at same position -> "accident"
>
> alert all vehicles entering a segment ...
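> The stopped-car and accident rules above can be sketched as follows
> (the position reports are invented; real Linear Road input comes from
> the MITSim simulator):

```python
# Sketch of the accident rule: a car reporting the same position 4
# consecutive times is "stopped"; two cars stopped at the same position
# is an "accident".
from collections import defaultdict

def detect_accidents(reports):
    """reports: iterable of (time, car, position). Returns accident positions."""
    run = {}                     # car -> (last position, consecutive count)
    stopped = defaultdict(set)   # position -> set of stopped cars
    accidents = set()
    for _, car, pos in sorted(reports):
        prev_pos, n = run.get(car, (None, 0))
        n = n + 1 if pos == prev_pos else 1
        run[car] = (pos, n)
        if n >= 4:               # 4 identical reports -> stopped
            stopped[pos].add(car)
            if len(stopped[pos]) >= 2:
                accidents.add(pos)
    return accidents

# Two cars each emit 4 reports from position 10 -> accident at 10.
reports = [(t, car, 10) for t in range(4) for car in ("a", "b")]
print(detect_accidents(reports))  # {10}
```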
>
> historical query
>
> account balance (5 sec response)
>
> what do i owe? more than 1 correct result (depends on when
> answered)
>
> daily expenditure (10 sec response)
>
> what did i spend on xway x on day d? day d any day in last 10
> weeks
>
> travel time/toll estimate (30 sec. response)
>
> predict travel from p1 to p2 on xway x, day d, time t. use last
> 10 weeks for prediction
>
> implementations
>
> Aurora
>
> comments
>
> some stuff doesn't make sense: e.g., two stationary cars counting
> as a wreck; it's more of a stress test
>
>
> Query languages and data models for database sequences and data
> streams
> ---------------------------------------------------------------
> Carlo Zaniolo
>
> most dsms use sql; queries spanning both streams and dbs will be easier
>
> but sql suffers from major new problems; designed for persistent
> data, not transient data
>
> even for persistent data, sql is far from ideal. areas not
> supported: data mining, sequence queries
>
> this talk covers shortcomings of sql on data streams, and how to
> overcome them
>
> blocking operators
>
> current dbms's make heavy use of blocking operators
>
> characterize blocking and non-blocking operators
>
> partial ordering
>
> let s be a sequence [t1, ..., tn]
>
> then, [t1, ..., tk] is the presequence of s of length k, denoted
> Sk
>
> if for some k, L = Sk, we say that L is a presequence of S
>
> presequence defines a partial order
>
> generalizes to subset when order and duplicates are immaterial
>
> the empty sequence is a subsequence of every other sequence
>
> theorem: queries can be expressed via nonblocking computations iff
> they are monotonic
>
> a query lang. L can express a given set of functions on its input;
> to avoid blocking queries, only the monotonic functions expressible
> by L should be allowed on data streams
>
> But are ALL of them expressible using the nb-operators of L?
>
> L contains nb-operators and blocking operators: only the former can
> be used on data streams -- so, can those express all the monotonic
> queries expressible in L?
>
> L nb-complete ...
>
> example
>
> bidstream(item, bidval, time)
>
> select item
> from bidstream
> group by item
> having sum(bidval) > 100000;
>
> this is monotonic. it can be expressed in a language containing
> suitable query ops, but not in sql-2, which is not nb-complete;
> thus it is ill-suited for continuous queries on data streams
>
> i think he is saying sum is implemented as a blocking operator,
> but it doesn't have to be implemented this way
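> A sketch of that point: because the query is monotonic, SUM ... HAVING
> can be evaluated non-blockingly, emitting each item the moment its
> running total crosses the threshold (bid values invented):

```python
# Non-blocking evaluation of: select item from bidstream
#                             group by item having sum(bidval) > 100000
def qualifying_items(bidstream, threshold=100000):
    """Yield each item once, as soon as its running sum(bidval) exceeds
    the threshold -- no need to wait for end-of-stream."""
    totals, emitted = {}, set()
    for item, bidval, _time in bidstream:
        totals[item] = totals.get(item, 0) + bidval
        if totals[item] > threshold and item not in emitted:
            emitted.add(item)
            yield item

stream = [("vase", 60000, 1), ("car", 50000, 2), ("vase", 50000, 3)]
print(list(qualifying_items(stream)))  # ['vase']
```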
>
> relational algebra
>
> nonmonotonic ra operators: set difference and division
>
> we are left with: select, project, join, and union. can these
> express all FO monotonic queries?
>
> some interesting temporal queries: coalesce and until
>
> they are expressible in ra (by double negation)
>
> they are monotonic
>
> they cannot be expressed in nb-relational algebra
>
> "review"
>
> sql's lack of expressivity is a major problem for database-centric
> apps
>
> these problems are significantly more serious for data streams
> since only monotonic queries are allowed; etc.
>
> embedding sql in a prog. lang.
>
> ops not expressible in sql can be expressed in pl
>
> but cursors are 'pull-based' and not usable on streams
>
> cannot hold the current tuple and all that follow waiting for
> the pl to request it by a getNext statement
>
> punctuation: source of the data inserts punctuations to denote the
> end of certain kind of data, when the auction closes on item95, e.g.
>
> queries finding the total volume of bids for item95 can then be
> answered, but there is little benefit for our example query
> (because you have to wait) ...
>
> windows as synopses. approximations, e.g., compute the continuous
> sum.
>
> user defined aggregates as a solution
>
> supported by Obj-Rel systems, where they are defined in an external
> PL
>
> Aurora supports a specific sublanguage
>
> can be defined in SQL itself; native extensibility (but ... )
>
> example
>
> aggregate myavg(next int) : Real
> ( table state(tsum int, cnt int);
>
>   initialize: ( ... insert into state values (next, 1) )
>
>   iterate: ( update state set tsum=tsum+next, cnt=cnt+1 ...)
>
>   terminate: insert into return select tsum/cnt from state
> )
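> The aggregate above, restated as a Python sketch to show the
> initialize/iterate/terminate structure:

```python
# A non-empty TERMINATE is what makes the aggregate blocking: the
# result is only produced once the input ends.
class MyAvg:
    def __init__(self):
        self.state = None            # mirrors "table state(tsum, cnt)"

    def step(self, next_val):
        if self.state is None:       # INITIALIZE: seed the state table
            self.state = [next_val, 1]
        else:                        # ITERATE: fold in each tuple
            self.state[0] += next_val
            self.state[1] += 1

    def terminate(self):             # TERMINATE: runs at end of input
        tsum, cnt = self.state
        return tsum / cnt

agg = MyAvg()
for v in [10, 20, 30]:
    agg.step(v)
print(agg.terminate())  # 20.0
```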
>
> theorem
>
> sql with uda's defined in sql is turing complete --- i.e., it can
> express every possible db query
>
> this result holds if we allow both blocking and nb uda's
>
> but what about data streams where we can only use nb operators?
>
> there are query languages that are turing complete on stored
> data but not on data streams ...
>
> this is called a "tumbling window" -- various kinds of windows
> used with data streams
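> Of the various window kinds mentioned, a tumbling window can be
> sketched as (window size and values invented):

```python
# Tumbling window: the stream is cut into fixed-size, non-overlapping
# windows and an aggregate is emitted at each window boundary, so
# results flow even for otherwise-blocking aggregates.
def tumbling_avg(stream, size):
    """Emit the average of each consecutive, non-overlapping window.
    A trailing partial window is dropped in this sketch."""
    window = []
    for x in stream:
        window.append(x)
        if len(window) == size:
            yield sum(window) / size
            window = []

print(list(tumbling_avg([1, 2, 3, 4, 5, 6], 3)))  # [2.0, 5.0]
```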
>
> nb-udas
>
> those with an empty, or missing, terminate clause
>
> theorem: sql with nb-udas is nb-complete: it can express all
> monotonic queries
>
> nb-udas natively in sql: cornerstone of their expressive stream
> language ESL, which supports time-series queries, data mining,
> etc., well
>
> conclusion
>
> db approach shows promise, but the challenges should not be
> underestimated
>
> we provided a formal analysis of the language problems of
> sql-based cqls
>
> proposed an effective solution: udas that make sql turing
> complete (on stored data) and nb-complete (on data streams)
>
> ESL is up and running on the stream mill system
> (http://wis.cs.ucla.edu)
>
> here come the questions ...
>
> dave maier and jennifer widom go to the mike...
>
> as they walk there, dave says: "you go ahead; do you want me to
> lift you up to the mike?" (the fun begins ... :-)
>
> widom: agrees with approach; disagrees that sql is a stream
> query language: quote "I haven't seen anyone propose this, have
> you?" ... "no one has proposed sql semantics, maybe syntax, but not
> semantics"
>
> carlo: "then why use sql syntax?"
>
> widom: "interesting results, but wrong wrapping"
>
> carlo: "your comment shows some form of lack of thought"
>
> carlo: "if i was an nsf reviewer i would scratch it"
>
> widom: "oh, that was you!"
>
> widom: "does esl extend atlas?"
>
> ... didn't get answer ....
>
> widom: "ok, here comes mister punctuation"
>
> dave: "assuming sql would be implemented in a certain way
> underneath; would it be possible to translate it into operators
> implemented in a different way?", i.e., as non-blocking...
>
> carlo: "that is what we propose; they can do it"
>
> dave: "then you have the problem of translating into nb
> operators; then I have to do it"
>
>
>
> _______________________________________________
> seek-kr-sms mailing list
> seek-kr-sms at ecoinformatics.org
> http://www.ecoinformatics.org/mailman/listinfo/seek-kr-sms