[seek-kr-sms] more notes from wednesday
Shawn Bowers
bowers at sdsc.edu
Wed Sep 1 14:32:48 PDT 2004
Just a note ... bertram you might get a kick out of the last paper
shawn
---------------
Supporting ontology-based semantic matching in RDBMS
-----------------------------------------------------
NOTE: This was part of the industrial session.
Example: Restaurant guide application
"ONT_EXPAND" operator, returns a set of rows, where each row is
filled in as part of the ontology
Retrieves a subset of the ontology indirectly
ONT_RELATED, ONT_PATH, ONT_DISTANCE, ONT_INDEXTYPE
ONT_RELATED operator
takes term1, reltype, term2, ontoname; returns integer:
1 if term1 and term2 are related via reltype, otherwise 0
reltype can be isa, or boolean exprs. like isa or eqv
ONT_DISTANCE operator: takes a label, returns the length of the
shortest path
select * from served_food where ONT_RELATED( 'cuisine', 'isa',
'western', ...)
query ontology directly to find relationships
ont_expand: takes term1, reltype, term2, ontoname, returns a table
of <term1, property, term2, distance, path> tuples
nulls represent wildcards (ugh)
select * from table( ont_expand( null, 'isa', 'latin american',
'cuisine_ontology' ) )
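To make the operator concrete, here is a rough Python sketch (not Oracle's implementation; the toy cuisine terms and the edge table are invented) reading ont_expand as a reachability query over the isa hierarchy, one row per derived relationship, with None as the wildcard:

```python
from collections import deque

# Toy cuisine ontology, invented for illustration: child -> parents.
ISA = {
    "mexican": ["latin american"],
    "brazilian": ["latin american"],
    "latin american": ["western"],
}

def ont_expand(term1, reltype, term2):
    """Return <term1, property, term2, distance, path> rows; None = wildcard."""
    assert reltype == "isa"
    rows = []
    starts = [term1] if term1 else list(ISA)
    for start in starts:
        # BFS upward through the isa hierarchy from `start`.
        queue = deque([(start, 0, [start])])
        while queue:
            node, dist, path = queue.popleft()
            if dist > 0 and (term2 is None or node == term2):
                rows.append((start, "isa", node, dist, path))
            for parent in ISA.get(node, []):
                queue.append((parent, dist + 1, path + [parent]))
    return rows
```

With term1 left as the wildcard, this returns one row per term that is (transitively) a 'latin american' cuisine, which is what the query above asks for.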
implementation
simple rels handled by table lookup
transitive rels handled using oracle's CONNECT BY, which
allows traversing hierarchical data stored in tables
symmetric relationships handled as part of the above steps
more complex owl (DL and Lite) inferencing is handled by an
initial preprocessing phase done as part of loading the owl
ontology
NOTE: I was just reading in the owl api paper why this has
generally been considered a bad idea, mainly due to editing /
updating the ontology (which is of course inevitable)
speeding up these queries
an "extensible" indexing type. precompute transitive closures
looks like a regular create index statement, but add parameters
that you want to index (ontology=cuisine_ontology), e.g.
index table: (cuisine, row_id)
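A minimal sketch of the idea behind the index (toy data and function names invented, not Oracle's actual index structure): precompute the transitive closure once, so an ONT_RELATED-style check becomes a single lookup.

```python
# Toy isa hierarchy, invented for illustration: child -> direct parents.
ISA = {
    "mexican": {"latin american"},
    "latin american": {"western"},
}

def transitive_closure(edges):
    """Naive fixpoint: extend each term's ancestor set until nothing changes."""
    closure = {node: set(parents) for node, parents in edges.items()}
    changed = True
    while changed:
        changed = False
        for node, ancestors in closure.items():
            extra = set()
            for a in ancestors:
                extra |= closure.get(a, set())
            if not extra <= ancestors:
                ancestors |= extra
                changed = True
    return closure

CLOSURE = transitive_closure(ISA)    # precomputed at "index creation" time

def ont_related(term1, term2):
    # With the closure precomputed, this is just a table lookup.
    return 1 if term2 in CLOSURE.get(term1, set()) else 0
```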
performance results
oracle 9i, lots of terms (he went too fast through that slide...)
performance is linear for ont_expand
applications for semantic matching
homeland security application
a person x has rented a truck, and a person y has bought
fertilizers and x and y stay at the same address
select * from Activity_x, Activity_y
where x.action = 'rent' and y.action = 'buy' and ...
ont_related (x.object isa ...)
a life science example
... see paper
conclusions
ontologies must be stored in databases for efficient maintenance,
sharing, and scalability
equally important is the capability of ontology-based semantic
matching
future work: address ontology merging and supporting defined rules
NOTE: this is exactly the kind of integration GEON is doing ...
Similarity Search for Web Services
----------------------------------
Xin Luna Dong (UW)
This is "Woogle"
Web service search
first-generation web-service search engines do keyword search on
web-service descriptions (WSDL?)
binding point, web central, salcentral, web service of the day,
remote methods, etc.
they say this isn't enough
doesn't capture enough underlying semantics
does not accurately specify user's information needs
also, they don't return the individual services, only the one
description that covers all of them
need to "drill" down
how to improve this?
offer more flexibility via similarity based on semantics
(not just keywords)
example
four services, all take zip code (but different outputs)
some outputs are inputs to others
searching with woogle
returns list of operations with similar functionalities, a list
with similar inputs, a list with similar outputs, a list of those
that can be composed
elementary problems
operation matching: given a web-service operation, return a list
of similar operations
input/output matching: given the i/o of a web-service op, return a
list of web-service ops with similar i/os
can we apply previous work?
software comp. matching: require the knowledge of implementation;
we only know the interface
schema matching: similarity on diff. granularity; web-services are
more loosely related (NOTE: what? isn't wsdl just xml? why doesn't
this work? this doesn't make sense)
text-document matching: nope, this won't work either, they say: only
a couple of sentences, etc., and web services are structured (NOTE:
so why not the previous...)
NOTE: their reasoning here is a little hard to follow
Approach:
Part 1: Exploit structure (wsdl, op name + description, i/o
param names, etc.)
cluster i/o param names ...
Clustering parameter names
heuristic: parameter terms tend to express the same concept if
they occur together often
strategy: cluster parameter terms into concepts based on their
co-occurrence.
given terms p and q, similarity from p to q:
sim(p->q) = P(q|p)
ideal clustering: high cohesion and low correlation; cohesion
measures intra-cluster term simil. correlation measures the
inter-cluster term sim; cohesion/corr score = (avg(coh) /
avg(corr))
algorithm: a series of refinements of the classic agglomerative
clustering; basic agglomerative clustering: merge clusters ...
cohesion condition: each term in the result cluster is close to
most (half) of the other terms in the cluster. Refined algo: merge
clusters
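As a rough sketch of the idea (the toy operations, the 0.5 threshold, and all names below are invented, and this is basic agglomerative merging rather than their refined algorithm): estimate sim(p->q) = P(q|p) from co-occurrence counts, then merge clusters subject to the cohesion condition that each term is close to at least half of the others.

```python
from itertools import combinations

# Toy data, invented: each set is the parameter names of one operation.
OPERATIONS = [
    {"city", "state", "zip"},
    {"city", "state"},
    {"zip", "city"},
    {"temperature", "humidity"},
    {"temperature", "humidity", "wind"},
]

def cond_prob(p, q):
    """sim(p -> q) = P(q | p), estimated from co-occurrence counts."""
    with_p = [op for op in OPERATIONS if p in op]
    return sum(q in op for op in with_p) / len(with_p)

def sim(p, q):
    # Symmetrized so the merge test below treats p,q the same way.
    return min(cond_prob(p, q), cond_prob(q, p))

def cluster(terms, threshold=0.5):
    """Agglomerative merging with the cohesion condition: every term in a
    merged cluster must be close to at least half of the other terms."""
    clusters = [{t} for t in terms]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            cand = a | b
            if all(sum(sim(t, u) >= threshold for u in cand - {t})
                   >= len(cand - {t}) / 2 for t in cand):
                clusters.remove(a); clusters.remove(b); clusters.append(cand)
                merged = True
                break
    return clusters
```

On the toy data this separates the address-like parameters from the weather-like ones, which is the flavor of concept the paper is after.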
a bunch of examples ... they sound pretty good, but it's not
apparent how what they are doing actually solves these examples
(probably have to read the paper)
(WARNING: I am not a big fan of these kinds of "soft" approaches
for schema matching; so my comments may be a bit biased, but I
have found these approaches have big "wow" factors, so I am in the
minority a bit)
measure top-k precision
benchmark: 25 web-service ops; from several domains; with
diff. input/output, etc.
top-k precision for op matching is 85% (quoted as "pretty high").
ignoring structure or just text matching, results "not that good"
(WARNING: those statistically inclined should look away now)
measure precision/recall
8 web-services and 15 in/out
6 domains, diff. popularity, etc.
woogle (compared to only description or structure) is better
conclusions
defined primitives for web-service search
algorithms for sim. search on web-service ops
etc.
on-going work
template search
specify input and output and brief description of operations
(I want the input to be city state, output to be weather) -- isn't
this another "AI complete" problem? They claim it is pretty close
...
they claim what they have is better than google, since you can ask
questions like "what is the temperature in seattle" -- since
presumably they can compose and run the services
Data sharing through query translation in autonomous sources
------------------------------------------------------------
A. Kementsietsidis
A network of autonomous sources
each can have: its own data types (genes, proteins, articles); its
own schema; its own vocabulary (gene names, prot. names, etc.); its
own set of data-sharing sources (a subset of sources)
treat each source as an access
requirements: share data without a common schema and where schemas
haven't been reconciled; translate and propagate queries between
sources
use mapping tables (sigmod last year)
associates values between sources; can be multiple tables;
associate values within or across domains; recorded associations
are m:n in general; they support variables to express ...
want to use mapping tables in structured query translation
query lang: SPJ, no negation, selection is disjunction or
conjunctions of atoms
characterize translations: given q and q' and a mapping table m; we
define what it means for sound and complete translations
compute translations: propose efficient algorithms to compute
translations
with multiple mapping tables, decide which mapping tables to use
during translation. solution: let the query guide the selection
process
our algorithms manipulate queries and mapping tables. it proves
beneficial to have a uniform representation for both
constructs. solution: represent queries as T-queries (a tableau-like
rep. of queries that uses the same formalism as mapping tables)
reduce run-time execution costs
solution: reuse previous translations
seems pretty simple...
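A minimal sketch of the basic value-translation step (the mapping table, gene identifiers, and function name are all invented for illustration): a disjunctive selection on source A's vocabulary is rewritten into the corresponding selection on source B's vocabulary, with m:n associations fanning out as needed.

```python
# Toy mapping table: (A-value, B-value) pairs; associations are m:n.
MAPPING = [
    ("geneA1", "g-101"),
    ("geneA1", "g-102"),   # one A value maps to two B values
    ("geneA2", "g-103"),
]

def translate_selection(values):
    """Translate a disjunctive selection on source A's vocabulary into
    the corresponding selection on source B's vocabulary."""
    return sorted({b for a, b in MAPPING if a in values})

# q :  sigma_{gene IN ('geneA1')}            at source A
# q':  sigma_{gene IN ('g-101', 'g-102')}    at source B
```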
Linear Road: A stream data management benchmark
-----------------------------------------------
contribs
novel sdms benchmark, some implementations
inspiration
variable tolling
cars have gps, emit "position reports"
tolls = f(speed, congestion, accidents, ...)
linear road
every car emits a "position report" every 30 sec
set of 5 continuous, historical queries
each w/ max response time constraint
l-rating: number of expressways supported (1000 PRs/sec each)
challenges/solution
semantically valid input (streams should be meaningful
time-series)
input generated by traffic simulator (MITSim)
proper performance metrics (not "completion time" but "response
time")
L-rating: supported load while meeting requirements
many correct results possible (answer can depend on evolving
historical state)
validation tools
no query language standard (expressed in a high-level lang)
caveat
a stress test, not a real traffic model
simplistic assumptions: global clock, no position interpolation,
etc.
backdrop
linear city: 10 horizontal expressways, each with 100 mile-long
segments
3 lanes, 1 entrance and exit ramp
input
simulator produces 3 hrs of traffic data
each car: one position report every 30 seconds
data consists of: position reports (car, time, position data)
piggyback ...
toll alert query
car crosses into new segment
charge toll for seg #i; calculate toll for seg #i+1 based on
previous-minute stats: avg speed, # of cars, accidents nearby; 5
sec. response time
accident alert query
car emits 4 reports from same position -> "stopped"
two cars stopped at same position -> "accident"
alert all vehicles entering a segment ...
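The stopped/accident rules can be sketched directly over the report stream (a toy sketch; the tuple layout and `detect_accidents` name are invented, not from the benchmark):

```python
from collections import defaultdict

def detect_accidents(reports):
    """reports: iterable of (time, car, position) in stream order.
    A car is "stopped" after 4 consecutive reports from the same
    position; two cars stopped at one position means "accident"."""
    history = defaultdict(list)     # car -> positions reported so far
    stopped = defaultdict(set)      # position -> cars stopped there
    accidents = []
    for time, car, pos in reports:
        history[car].append(pos)
        if history[car][-4:] == [pos] * 4:   # 4 reports, same spot
            stopped[pos].add(car)
            if len(stopped[pos]) >= 2:
                accidents.append((time, pos))
    return accidents
```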
historical query
account balance (5 sec response)
what do i owe? more than 1 correct result (depends on when
answered)
daily expenditure (10 sec response)
what did i spend on xway x on day d? day d any day in last 10
weeks
travel time/toll estimate (30 sec. response)
predict travel from p1 to p2 on xway x, day d, time t. use last
10 weeks for prediction
implementations
Aurora
comments
some stuff doesn't make sense (like requiring two stationary cars
for a wreck); it is more of a stress test
Query languages and data models for database sequences and data
streams
---------------------------------------------------------------
Carlo Zaniolo
most dsms use sql; queries spanning both streams and dbs would be
easier. but sql suffers from major new problems: it was designed for
persistent data, not transient data
even for persistent data it is far from ideal. areas not supported:
data mining, sequence queries
this talk covers shortcomings of sql on data streams, and how to
overcome them
blocking operators
current dbms's make heavy use of blocking operators
characterize blocking and non-blocking operators
partial ordering
let s be a sequence [t1, ..., tn]
then [t1, ..., tk] is the presequence of s of length k, denoted
Sk
if for some k, L = Sk, we say that L is a presequence of S
presequence defines a partial order
generalizes to subset when order and duplicates are immaterial
the empty sequence is a subsequence of every other sequence
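A tiny sketch of the presequence relation just defined (function name invented):

```python
def is_presequence(l, s):
    """L is a presequence of S iff L equals some prefix S_k of S."""
    return list(s[:len(l)]) == list(l)
```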
theorem: queries can be expressed via nonblocking computations iff
they are monotonic
a query lang. L can express a given set of functions on its input;
to avoid blocking queries, only the monotonic functions expressible
by L should be allowed on data streams
but can ALL of them be expressed using the nb-operators of L?
L contains nb-operators and blocking operators: only the former can
be used on data streams -- so, can those express all the monotonic
queries expressible in L?
L nb-complete ...
example
bidstream(item, bidval, time)
select item
from bidstream
group by item
having sum(bidval) > 100000;
this is monotonic. it can be expressed in a language containing
suitable query ops, but not in sql-2, which is not nb-complete;
thus sql-2 is ill-suited for continuous queries on data streams
i think he is saying sum is implemented as a blocking operator,
but it doesn't have to be implemented this way
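To make that note concrete, here is a sketch (function name and toy stream invented) of the bidstream query computed nonblockingly: because the answer is monotonic (an item, once over the threshold, stays in the result), the sum can be maintained incrementally and each item emitted the moment its total crosses 100000, with no need to wait for end-of-stream.

```python
def over_threshold(bidstream, limit=100_000):
    """bidstream: iterable of (item, bidval, time) tuples."""
    totals, emitted = {}, []
    for item, bidval, _time in bidstream:
        totals[item] = totals.get(item, 0) + bidval
        if totals[item] > limit and item not in emitted:
            emitted.append(item)   # emit immediately; never retracted
    return emitted
```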
relational algebra
nonmonotonic ra operators: set difference and division
we are left with: select, project, join, and union. can these
express all FO monotonic queries?
some interesting temporal queries: coalesce and until
they are expressible in ra (by double negation)
they are monotonic
they cannot be expressed in nb-relational algebra
"review"
sql's lack of expressivity is a major problem for database-centric
apps
these problems significantly more serious for data streams since:
only monotonic; etc.
embedding sql in a prog. lang.
ops not expressible in sql can be expressed in pl
but cursors are 'pull-based' and not usable on streams
cannot hold the current tuple and all that follow waiting for
the pl to request it by a getNext statement
punctuation: source of the data inserts punctuations to denote the
end of certain kind of data, when the auction closes on item95, e.g.
queries finding the total volume of bids for item95 can then be
answered, but there is little benefit for our example query (because
you have to wait) ...
windows as synopses. approximations, e.g., compute the continuous
sum.
user defined aggregates as a solution
supported by Obj-Rel systems, where they are defined in an external
PL
Aurora supports a specific sublanguage
can be defined in SQL itself; native extensibility (but ... )
example
aggregate myavg(next int) : Real
( table state(tsum int, cnt int);
initialize: ( ... insert into state values (next, 1) );
iterate: ( update state set tsum = tsum + next, cnt = cnt + 1 ...);
terminate: insert into return select tsum/cnt from state
)
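For contrast, a sketch in Python (names invented) of what a nonblocking version of the same aggregate looks like: with no terminate step, results must be emitted from the iterate step, so the running average is produced after every input tuple instead of once at end-of-stream.

```python
def online_avg(stream):
    """Nonblocking running average: yields a result per input tuple."""
    tsum = cnt = 0
    for next_val in stream:
        tsum += next_val        # the ITERATE step
        cnt += 1
        yield tsum / cnt        # emit now: no TERMINATE needed
```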
theorem
sql with uda's defined in sql is turing complete --- i.e., it can
express every possible db query
this result holds if we allow both blocking and nb uda's
but what about data streams, where we can only use nb operators?
there are query languages that are turing complete on stored
data but not on data streams ...
this is called a "tumbling window" -- various kinds of windows
used with data streams
nb-udas
those with an empty, or missing, terminate clause
theorem: sql with nb-udas is nb-complete: it can express all
monotonic queries
nb-udas natively in sql: the cornerstone of their expressive stream
language ESL, which supports time-series queries, data mining, etc.
well
conclusion
db approach shows promises, but the challenges should not be
underestimated
we provided a formal analysis of the language problems of
sql-based cqls
proposed an effective solution: udas that make sql turing
complete, but nb-complete (on data streams)
ESL is up and running on the stream mill system
(http://wis.cs.ucla.edu)
here come the questions ...
dave maier and jennifer widom go to the mike...
as they walk there, dave says: "you go ahead; do you want me to
lift you up to the mike?" (the fun begins ... :-)
widom: agrees with approach; disagrees that sql is a stream
query language: quote "I haven't seen anyone propose this, have
you?" ... "no one has proposed sql semantics, maybe syntax, but not
semantics"
carlo: "then why use sql syntax?"
widom: "interesting results, but wrong wrapping"
carlo: "your comment shows some form of lack of thought"
carlo: "if i was an nsf reviewer i would scratch it"
widom: "oh, that was you!"
widom: "does esl extend atlas?"
... didn't get answer ....
widom: "ok, here comes mister punctuation"
dave: "assuming sql would be implemented in a certain way
underneath; would it be possible to translate it into operators
implemented in a different way?", i.e., as non-blocking...
carlo: "that is what we propose; they can do it"
dave: "then you have the problem of translating into nb
operators; then I have to do it"