[seek-kr-sms] more notes from wednesday
Shawn Bowers
bowers at sdsc.edu
Sat Sep 4 15:22:52 PDT 2004
On Fri, 3 Sep 2004, Bertram Ludaescher wrote:
>
> Seems like VLDB starts to be fun after all ;-)
>
> Thanks for sharing your notes!
>
> (and making them googable too ;-)
Yeah ... i left out the good stuff because of this.
shawn
>
> Bertram
>
> Shawn Bowers writes:
> >
> > Just a note ... bertram you might get a kick out of the last paper
> >
> > shawn
> >
> > ---------------
> >
> > Supporting ontology-based semantic matching in RDBMS
> > ----------------------------------------------------
> > NOTE: This was part of the industrial session.
> >
> > Example: Restaurant guide application
> >
> > "ONT_EXPAND" operator, returns a set of rows, where each row is
> > filled in as part of the ontology
> >
> > Retrieves a subset of the ontology indirectly
> >
> > ONT_RELATED, ONT_PATH, ONT_DISTANCE, ONT_INDEXTYPE
> >
> > ONT_RELATED operator
> >
> > takes term1, reltype, term2, ontoname, returns integer
> >
> > if term1 and term2 are related via reltype, returns 1, otherwise 0
> >
> > reltype can be isa, or boolean exprs. like isa or eqv
> >
> > ONT_DISTANCE operator: takes a label, returns the length of the
> > shortest path
> >
> > select * from served_food where ONT_RELATED( 'cuisine', 'isa',
> > 'western', ...)
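A rough sketch of what an ONT_RELATED-style operator computes (my reconstruction, not Oracle's implementation; the ontology encoding and data are hypothetical): treat the ontology as a directed graph per relationship type and test transitive reachability.

```python
from collections import deque

def ont_related(term1, reltype, term2, ontology):
    """Return 1 if term1 is related to term2 via reltype (followed
    transitively) in the ontology, encoded here as a dict mapping
    (term, reltype) -> list of related terms; 0 otherwise."""
    seen = set()
    queue = deque([term1])
    while queue:
        term = queue.popleft()
        if term == term2:
            return 1
        if term in seen:
            continue
        seen.add(term)
        queue.extend(ontology.get((term, reltype), []))
    return 0

# toy cuisine ontology (made-up data, not from the paper)
cuisine = {
    ('french', 'isa'): ['western'],
    ('italian', 'isa'): ['western'],
    ('provencal', 'isa'): ['french'],
}
```

So `ont_related('provencal', 'isa', 'western', cuisine)` returns 1 even though the relationship is indirect.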
> >
> > query ontology directly to find relationships
> >
> > ont_expand: takes term1, reltype, term2, ontoname, returns a table
> > of <term1, property, term2, distance, path> tuples
> >
> > nulls represent wildcards (ugh)
> >
> > select * from table( ont_expand( null, 'isa', 'latin american',
> > 'cuisine_ontology' ) )
> >
> > implementation
> >
> > simple rels handled by table lookup
> >
> > transitive rels handled by using oracle's CONNECT BY, which
> > allows traversing hierarchical data stored in tables
> >
> > symmetric relationships handled as part of the above steps
> >
> > more complex owl (DL and Lite) inferencing is handled by an
> > initial phase of preprocessing done as part of loading the owl
> > ontology
> >
> > NOTE: I was just reading in the owl api paper why this has
> > generally been considered a bad idea, mainly due to editing /
> > updating the ontology (which is of course inevitable)
> >
> > speeding up these queries
> >
> > an "extensible" indexing type. precompute transitive closures
> >
> > looks like a regular create index statement, but add parameters
> > that you want to index (ontology=cuisine_ontology), e.g.
> >
> > index table: (cuisine, row_id)
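The indexing idea, sketched in Python (my reconstruction; all names are made up): precompute the transitive closure at index-build time so each query-time check is a set lookup rather than a recursive traversal.

```python
def build_closure(edges):
    """Precompute the transitive closure of an is-a edge list:
    returns a dict mapping each term to the set of all its
    (direct and transitive) ancestors. Assumes an acyclic
    hierarchy, as is-a ontologies normally are."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    closure = {}

    def ancestors(term):
        if term in closure:
            return closure[term]
        result = set()
        for p in parents.get(term, ()):
            result.add(p)
            result |= ancestors(p)
        closure[term] = result
        return result

    for term in parents:
        ancestors(term)
    return closure

# index-build time: one pass over the ontology
closure = build_closure([('french', 'western'), ('provencal', 'french')])
# query time: a membership check instead of a recursive traversal
assert 'western' in closure['provencal']
```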
> >
> > performance results
> >
> > oracle 9i, lots of terms (he went too fast through that slide...)
> >
> > performance is linear for ont_expand
> >
> > applications for semantic matching
> >
> > homeland security application
> >
> > a person x has rented a truck, and a person y has bought
> > fertilizers and x and y stay at the same address
> >
> > select * from Activity_x, Activity_y
> > where x.action = 'rent' and y.action = 'buy' and ...
> > ont_related( x.object, 'isa', ... )
> >
> > a life science example
> >
> > ... see paper
> >
> >
> > conclusions
> >
> > ontologies must be stored in databases for efficient maintenance,
> > sharing, and scalability
> >
> > equally important is the capability of ontology-based semantic
> > matching
> >
> > future work: address ontology merging and supporting defined rules
> >
> > NOTE: this is exactly the kind of integration GEON is doing ...
> >
> > Similarity Search for Web Services
> > ----------------------------------
> > Xin Luna Dong (UW)
> >
> > This is "Woogle"
> >
> > Web service search
> >
> > first-generation web-service search engines do keyword search on
> > web-service descriptions (WSDL?)
> >
> > binding point, web central, salcentral, web service of the day,
> > remote methods, etc.
> >
> > they say this isn't enough
> >
> > doesn't capture enough underlying semantics
> >
> > does not accurately specify user's information needs
> >
> > also, results point at whole web-service descriptions, not the
> > individual operations within them
> >
> > need to "drill" down
> >
> > how to improve this?
> >
> > offer more flexibility via similarity search
> >
> > via semantics (not just keywords)
> >
> > example
> >
> > four services, all take zip code (but different outputs)
> >
> > some outputs are inputs to others
> >
> > searching with woogle
> >
> > returns list of operations with similar functionalities, a list
> > with similar inputs, a list with similar outputs, a list of those
> > that can be composed
> >
> > elementary problems
> >
> > operation matching: given a web-service operation, return a list
> > of similar operations
> >
> > input/output matching: given the i/o of a web-service op, return a
> > list of web-service ops with similar i/os
> >
> > can we apply previous work?
> >
> > software comp. matching: requires knowledge of the implementation;
> > we only know the interface
> >
> > schema matching: similarity on diff. granularity; web-services are
> > more loosely related (NOTE: what? isn't wsdl just xml? why doesn't
> > this work? this doesn't make sense)
> >
> > text-document matching: nope, this won't work either, they say. only
> > a couple of sentences, etc. web services are structured (NOTE: so
> > why not the previous...)
> >
> > NOTE: this is a little hard to follow; their reasoning here
> >
> > Approach:
> >
> > Part 1: Exploit structure (wsdl, op name + description, i/o
> > param names, etc.)
> >
> > cluster i/o param names ...
> >
> > Clustering parameter names
> >
> > heuristic: parameter terms tend to express the same concept if
> > they occur together often
> >
> > strategy: cluster parameter terms into concepts based on their
> > co-occurrence.
> >
> > given terms p and q, similarity from p to q:
> >
> > sim(p->q) = P(q|p)
> >
> > ideal clustering: high cohesion and low correlation; cohesion
> > measures intra-cluster term simil. correlation measures the
> > inter-cluster term sim; cohesion/corr score = (avg(coh) /
> > avg(corr))
> >
> > algorithm: a series of refinements of the classic agglomerative
> > clustering; basic agglomerative clustering: merge clusters ...
> >
> > cohesion condition: each term in the result cluster is close to
> > most (half) of the other terms in the cluster. Refined algo: merge
> > clusters
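A toy sketch of the co-occurrence similarity plus the cohesion-conditioned merge, as I understood the talk (the threshold and the merge order are my guesses; the paper surely differs):

```python
from itertools import combinations

def cond_prob(p, q, inputs):
    """sim(p -> q) = P(q | p): the fraction of parameter lists
    containing p that also contain q."""
    with_p = [s for s in inputs if p in s]
    if not with_p:
        return 0.0
    return sum(1 for s in with_p if q in s) / len(with_p)

def cohesive(cluster, inputs, threshold=0.5):
    """Cohesion condition: every term must be 'close' (sim above a
    threshold) to at least half of the other terms in the cluster."""
    for t in cluster:
        others = [u for u in cluster if u != t]
        close = sum(1 for u in others
                    if cond_prob(t, u, inputs) >= threshold)
        if close < len(others) / 2:
            return False
    return True

def cluster_terms(terms, inputs):
    """Greedy agglomerative clustering: repeatedly merge a pair of
    clusters whose union still satisfies the cohesion condition."""
    clusters = [{t} for t in terms]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            if cohesive(a | b, inputs):
                clusters.remove(a)
                clusters.remove(b)
                clusters.append(a | b)
                merged = True
                break
    return clusters

# toy example (hypothetical parameter lists from web-service ops)
inputs = [{'city', 'state'}, {'city', 'state', 'zip'}, {'temperature'}]
clusters = cluster_terms(['city', 'state', 'temperature'], inputs)
```

Here 'city' and 'state' co-occur often, so they end up in one concept cluster while 'temperature' stays separate.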
> >
> > a bunch of examples ... sound pretty good, but not apparent how
> > what they are doing actually does this, haven't shown how what
> > they do can solve these examples (probably have to read the paper)
> > (WARNING: I am not a big fan of these kinds of "soft" approaches
> > for schema matching; so my comments may be a bit biased, but I
> > have found these approaches have big "wow" factors, so I am in the
> > minority a bit)
> >
> > measure top-k precision
> >
> > benchmark: 25 web-service ops; from several domains; with
> > diff. inputs/outputs, etc.
> >
> > top-k precision for op matching is 85% (quoted as "pretty high").
> >
> > ignoring structure or just text matching, results "not that good"
> >
> > (WARNING: those statistically inclined should look away now)
> >
> > measure precision/recall
> >
> > 8 web-services and 15 in/out
> >
> > 6 domains, diff. popularity, etc.
> >
> > woogle (compared to only description or structure) is better
> >
> >
> > conclusions
> >
> > defined primitives for web-service search
> >
> > algorithms for sim. search on web-service ops
> >
> > etc.
> >
> > on-going work
> >
> > template search
> >
> > specify input and output and brief description of operations
> >
> > (I want the input to be city state, output to be weather) -- isn't
> > this another "AI complete" problem? They claim it is pretty close
> > ...
> >
> > they claim what they have is better than google, since you can ask
> > questions like "what is the temperature in seattle" -- since
> > presumably they can compose and run the services
> >
> >
> > Data sharing through query translation in autonomous sources
> > ------------------------------------------------------------
> > A. Kementsietsidis
> >
> > A network of autonomous sources
> >
> > each can have: its own data types (genes, proteins, articles); its
> > own schema; its own vocabulary (gene names, prot. names,
> > etc); its own set of data sharing sources (a subset of sources)
> >
> > treat each source as an access
> >
> > requirements: share data without a common schema and where schemas
> > haven't been reconciled; translate and propagate queries between
> > sources
> >
> > use mapping tables (sigmod last year)
> >
> > associates values between sources; can be multiple tables;
> > associate values within or across domains; recorded associations
> > are m:n in general; they support variables to express ...
> > want to use mapping tables in structured query translation
> >
> > query lang: SPJ, no negation, selection is disjunction or
> > conjunctions of atoms
> >
> > characterize translations: given q and q' and a mapping table m, we
> > define what it means for q' to be a sound/complete translation of q
> >
> > compute translations: propose efficient algorithms to compute
> > translations
> >
> > with multiple mapping tables, decide which mapping tables to use
> > during translation. solution: let the query guide the selection
> > process
> >
> > our algorithms manipulate queries and mapping tables. it proves
> > beneficial to have a uniform representation for both
> > constructs. solution: represent queries as T-queries (tableau-like
> > rep. of queries that uses the same formalism with mapping tables)
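A toy sketch of translating a selection through a mapping table (my reconstruction of the general idea only; the paper's T-query formalism, variables, and soundness/completeness machinery are much richer):

```python
def translate_selection(values, mapping):
    """Translate a disjunctive selection (a set of local values)
    into a peer's vocabulary via a mapping table, given as a list
    of (source_value, target_value) pairs (m:n in general).
    Sound in the sense that every returned value is associated
    with some requested value."""
    return {tgt for src, tgt in mapping if src in values}

# hypothetical gene-name mapping table between two sources
mapping = [('TP53', 'p53'), ('TP53', 'LFS1'), ('BRCA1', 'brca-1')]
```

E.g. a query on `TP53` at one source becomes a query on `p53 OR LFS1` at the peer.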
> >
> > reduce run-time execution costs
> >
> > solution: reuse previous translations
> >
> > seems pretty simple...
> >
> >
> >
> > Linear Road: A stream data management benchmark
> > -----------------------------------------------
> >
> > contribs
> >
> > novel sdms benchmark, some implementations
> >
> > inspiration
> >
> > variable tolling
> >
> > cars have gps, emit "position reports"
> >
> > tolls = f(speed, congestion, accidents, ...)
> >
> > linear road
> >
> > every car emits a "position report" every 30 sec
> >
> > set of 5 continuous, historical queries
> >
> > each w/ max response time constraint
> >
> > l-rating: number of expressways supported (1000 PRs/sec each)
> >
> > challenges/solution
> >
> > semantically valid input (streams should be meaningful
> > time-series)
> >
> > input generated by traffic simulator (MITSim)
> >
> > proper performance metrics (not "completion time" but "response
> > time")
> >
> > L-rating: supported load while meeting requirements
> >
> > many correct results possible (answer can depend on evolving
> > historical state)
> >
> > validation tools
> >
> > no query language standard (expressed in a high-level lang)
> >
> > caveat
> >
> > a stress test, not a real traffic model
> >
> > simplistic assumptions: global clock, no position interpolation,
> > etc.
> >
> > backdrop
> >
> > linear city: 10 horizontal expressways, each 100 miles long,
> > divided into 1-mile segments
> >
> > 3 lanes, 1 entrance and exit ramp
> >
> > input
> >
> > simulator produces 3 hrs of traffic data
> >
> > each car: one position report every 30 seconds
> >
> > data consists of: position reports (car, time, position data)
> > piggyback ...
> >
> >
> > toll alert query
> >
> > car crosses into new segment
> >
> > charge toll for seg #i. calculate toll for seg #i+1 based on
> > previous minute's stats: avg speed, # of cars, accidents nearby; 5
> > sec. response time
> >
> > accident alert query
> >
> > car emits 4 reports from same position -> "stopped"
> >
> > two cars stopped at same position -> "accident"
> >
> > alert all vehicles entering a segment ...
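The stopped/accident rule above can be sketched as a single pass over the position-report stream (my sketch only; the benchmark's validator is more involved, and this naively re-reports an ongoing accident on every further report):

```python
from collections import defaultdict, deque

def detect_accidents(position_reports):
    """position_reports: iterable of (car, time, position) in time
    order. A car emitting 4 consecutive reports from the same
    position is 'stopped'; two cars stopped at the same position
    is an 'accident'. Yields (time, position) accident events."""
    history = defaultdict(lambda: deque(maxlen=4))  # car -> last 4 positions
    stopped_at = defaultdict(set)                   # position -> stopped cars
    for car, time, pos in position_reports:
        h = history[car]
        if h and h[-1] != pos:
            stopped_at[h[-1]].discard(car)  # car moved on
        h.append(pos)
        if len(h) == 4 and len(set(h)) == 1:
            stopped_at[pos].add(car)
            if len(stopped_at[pos]) >= 2:
                yield (time, pos)

# two cars each emit four 30-second reports at position 10
reports = [(car, t * 30, 10) for t in range(4) for car in (1, 2)]
```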
> >
> > historical query
> >
> > account balance (5 sec response)
> >
> > what do i owe? more than 1 correct result (depends on when
> > answered)
> >
> > daily expenditure (10 sec response)
> >
> > what did i spend on xway x on day d? day d any day in last 10
> > weeks
> >
> > travel time/toll estimate (30 sec. response)
> >
> > predict travel from p1 to p2 on xway x, day d, time t. use last
> > 10 weeks for prediction
> >
> > implementations
> >
> > Aurora
> >
> > comments
> >
> > some stuff doesn't make sense, like needing two stationary cars
> > for a wreck; it is more of a stress test
> >
> >
> > Query languages and data models for database sequences and data
> > streams
> > ---------------------------------------------------------------
> > Carlo Zaniolo
> >
> > most dsms use sql; queries that span both streams and dbs will be
> > easier
> >
> > but sql suffers from major new problems; designed for persistent
> > data, not transient data
> >
> > for persistent data, far from ideal. areas not supported: data
> > mining, sequence queries
> >
> > this talk covers shortcomings of sql on data streams, and how to
> > overcome them
> >
> > blocking operators
> >
> > current dbms's make heavy use of blocking operators
> >
> > characterize blocking and non-blocking operators
> >
> > partial ordering
> >
> > let s be a sequence [t1, ..., tn]
> >
> > then, [t1, ..., tk] is the presequence of s of length k, denoted
> > Sk
> >
> > if, for some k, L = Sk, we say that L is a presequence of S
> >
> > presequence defines a partial order
> >
> > generalizes to subset when order and duplicates are immaterial
> >
> > the empty sequence is a presequence of every other sequence
> >
> > theorem: queries can be expressed via nonblocking computations iff
> > they are monotonic
> >
> > a query lang. L can express a given set of functions on its input;
> > to avoid blocking queries, only the monotonic functions expressible
> > by L should be allowed on data streams
> >
> > But are ALL of them expressible using the nb-operators of L?
> >
> > L contains nb-operators and blocking operators: only the former can
> > be used on data streams -- so, can those express all the monotonic
> > queries expressible in L?
> >
> > L nb-complete ...
> >
> > example
> >
> > bidstream(item, bidval, time)
> >
> > select item
> > from bidstream
> > group by item
> > having sum(bidval) > 100000;
> >
> > this is monotonic. it can be expressed in a language containing
> > suitable query ops, but not in sql-2, which is not nb-complete;
> > thus sql-2 is ill-suited for continuous queries on data streams
> >
> > i think he is saying sum is implemented as a blocking operator,
> > but it doesn't have to be implemented this way
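If so, the point can be illustrated with a nonblocking evaluation of the bidstream query (my sketch): emit an item as soon as its running sum crosses the threshold; since the answer only ever grows, nothing has to wait for end-of-stream.

```python
from collections import defaultdict

def hot_items(bidstream, threshold=100000):
    """Nonblocking evaluation of:
         SELECT item FROM bidstream GROUP BY item
         HAVING SUM(bidval) > 100000
    Emits each qualifying item as soon as its running sum crosses
    the threshold -- possible because the query is monotonic."""
    totals = defaultdict(int)
    emitted = set()
    for item, bidval, time in bidstream:
        totals[item] += bidval
        if totals[item] > threshold and item not in emitted:
            emitted.add(item)
            yield item
```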
> >
> > relational algebra
> >
> > nonmonotonic ra operators: set difference and division
> >
> > we are left with: select, project, join, and union. can these
> > express all FO monotonic queries?
> >
> > some interesting temporal queries: coalesce and until
> >
> > they are expressible in ra (by double negation)
> >
> > they are monotonic
> >
> > they cannot be expressed in nb-relational algebra
> >
> > "review"
> >
> > sql's lack of expressivity is a major problem for database-centric
> > apps
> >
> > these problems significantly more serious for data streams since:
> > only monotonic; etc.
> >
> > embedding sql in a prog. lang.
> >
> > ops not expressible in sql can be expressed in pl
> >
> > but cursors are 'pull-based' and not usable on streams
> >
> > cannot hold the current tuple and all that follow waiting for
> > the pl to request it by a getNext statement
> >
> > punctuation: source of the data inserts punctuations to denote the
> > end of certain kind of data, when the auction closes on item95, e.g.
> >
> > queries that find the total volume of bids for item95 can then be
> > answered, but there is little benefit for our example query
> > (because you still have to wait) ...
> >
> > windows as synopses. approximations, e.g., compute the continuous
> > sum.
> >
> > user defined aggregates as a solution
> >
> > supported by Obj-Rel systems, where they are defined in an external
> > PL
> >
> > Aurora supports a specific sublanguage
> >
> > can be defined in SQL itself; native extensibility (but ... )
> >
> > example
> >
> > aggregate myavg(next int) : Real
> > ( table state(tsum int, cnt int);
> >
> > initialize: ( ... insert into state values (next, 1) )
> >
> > iterate: ( update state set tsum = tsum + next, cnt = cnt + 1 ...)
> >
> > terminate: insert into return select tsum/cnt from state
> > )
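The initialize/iterate/terminate protocol of the myavg aggregate maps naturally onto a small class (my Python analog, not from the talk):

```python
class MyAvg:
    """Python analog of the myavg UDA: INITIALIZE sets up the
    state on the first tuple, ITERATE folds in each later tuple,
    TERMINATE produces the result. Blocking, because TERMINATE
    needs to have seen the whole input."""
    def initialize(self, first):
        self.tsum, self.cnt = first, 1

    def iterate(self, nxt):
        self.tsum += nxt
        self.cnt += 1

    def terminate(self):
        return self.tsum / self.cnt

def run_uda(uda, values):
    """Drive a UDA over a finite input, as a DBMS would."""
    it = iter(values)
    uda.initialize(next(it))
    for v in it:
        uda.iterate(v)
    return uda.terminate()
```

A nonblocking UDA would simply have no (or an empty) terminate step, emitting results from iterate instead.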
> >
> > theorem
> >
> > sql with uda's defined in sql is turing complete --- i.e., it can
> > express every possible db query
> >
> > this result holds if we allow both blocking and nb uda's
> >
> > but what about data streams where we can only use nb operators?
> >
> > there are query languages that are turing complete on stored
> > data but not on data streams ...
> >
> > this is called a "tumbling window" -- various kinds of windows
> > used with data streams
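A tumbling window can be sketched like this (my sketch; the window size is arbitrary): apply the blocking aggregate to fixed-size, non-overlapping chunks of the stream and reset between them.

```python
def tumbling_avg(stream, size):
    """Apply a blocking aggregate (average) over non-overlapping
    'tumbling' windows: emit one result per full window, then
    discard the window's state and start over. A trailing partial
    window produces no output."""
    window = []
    for x in stream:
        window.append(x)
        if len(window) == size:
            yield sum(window) / size
            window = []
```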
> >
> > nb-udas
> >
> > those with an empty, or missing, terminate clause
> >
> > theorem: sql with nb-udas is nb-complete: it can express all
> > monotonic queries
> >
> > nb-udas natively in sql: cornerstone of their expressive streams
> > language ESL, which supports time-series queries, data mining,
> > etc., etc., well
> >
> > conclusion
> >
> > db approach shows promise, but the challenges should not be
> > underestimated
> >
> > we provided a formal analysis of the language problems of
> > sql-based cqls
> >
> > proposed an effective solution: udas that make sql turing
> > complete, but nb-complete (on data streams)
> >
> > ESL is up and running on the stream mill system
> > (http://wis.cs.ucla.edu)
> >
> > here come the questions ...
> >
> > dave maier and jennifer widom go to the mike...
> >
> > as they walk there, dave says: "you go ahead; do you want me to
> > lift you up to the mike?" (the fun begins ... :-)
> >
> > widom: agrees with approach; disagrees that sql is a stream
> > query language: quote "I haven't seen anyone propose this, have
> > you?" ... "no one has proposed sql semantics, maybe syntax, but not
> > semantics"
> >
> > carlo: "then why use sql syntax?"
> >
> > widom: "interesting results, but wrong wrapping"
> >
> > carlo: "your comment shows some form of lack of thought"
> >
> > carlo: "if i was an nsf reviewer i would scratch it"
> >
> > widom: "oh, that was you!"
> >
> > widom: "does esl extend atlas?"
> >
> > ... didn't get answer ....
> >
> > widom: "ok, here comes mister punctuation"
> >
> > dave: "assuming sql would be implemented in a certain way
> > underneath; would it be possible to translate it into operators
> > implemented in a different way?", i.e., as non-blocking...
> >
> > carlo: "that is what we propose; they can do it"
> >
> > dave: "then you have the problem of translating into nb
> > operators; then I have to do it"
> >
> >
> >
> > _______________________________________________
> > seek-kr-sms mailing list
> > seek-kr-sms at ecoinformatics.org
> > http://www.ecoinformatics.org/mailman/listinfo/seek-kr-sms
>