[kepler-dev] Replacing DataCacheManager with CacheManager.

Bertram Ludaescher ludaesch at ucdavis.edu
Tue Dec 13 04:41:39 PST 2005


Hi Kevin:

In general, intermediate data results, including the result-set of a
query or the products of arbitrary actors, might very well have an
lsid -- and you even gave the reason for it ;-)

The second time around, when you run the same query and get a
different result, the two different assigned lsids let you "spot the
difference" (provided the lsids can be dereferenced to give you back
the results).

More generally, as part of supporting "data provenance", lsids can be
seen as the node-ids in a "data dependency graph" that keeps track of
which data product was derived using which actors, parameter settings,
and input products (each having their own lsids).
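
To make that concrete, here is a minimal sketch of what such a node
might carry -- the class and field names below are purely illustrative,
not anything that exists in Kepler:

    import java.util.List;
    import java.util.Map;

    // Hypothetical node in the data dependency graph; one entry per
    // derived data product.  All names are illustrative.
    class ProvenanceNode {
        String lsid;                    // lsid of the derived product
        String actorId;                 // actor that produced it
        Map<String, String> parameters; // parameter settings used
        List<String> inputLsids;        // lsids of the input products
    }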

Now the details of the "provenance framework" (PF) are not all worked
out yet. It seems clear that not every intermediate data product must
have an lsid at all times -- or, if it does, dereferencing might not
always yield the original data object (since keeping all those objects
around might be too costly).

I would expect the PF to have some configuration mechanisms. For
example, I might want to say whether channel so-and-so (or port
so-and-so) shall be "logged/recorded/provenanced" ;-) and then the
corresponding data will or won't actually be recorded at some selected
data repository.
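
As a sketch of what I mean (hypothetical names, assuming nothing about
the eventual PF API):

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical PF configuration: only channels/ports that were
    // explicitly selected get their data recorded.
    class ProvenanceConfig {
        private final Set<String> recordedPorts = new HashSet<String>();

        void record(String portName) {
            recordedPorts.add(portName);
        }

        boolean isRecorded(String portName) {
            return recordedPorts.contains(portName);
        }
    }

The PF would then consult isRecorded() on each channel before writing
anything to the selected data repository.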

In an ideal world, the PF would be an "add-on" module: if you don't
need it, you can turn it off, but it's there for those who need it for
data (/result) management...

Bertram



>>> On Fri, 09 Dec 2005 20:17:42 -0600
>>> Kevin Ruland <kruland at ku.edu> wrote: 
KR> 
KR> I like the CNN analogy.
KR> 
KR> The way we've talked about lsids (or maybe it's my misunderstanding
KR> about lsids) is that an lsid is attached to a unique thing.  I equate this
KR> unique thing to a collection of bytes.  And no matter what, if I ask the
KR> authority for the thing associated with an lsid, I will always get
KR> exactly the same thing back.
KR> 
KR> This works great when we talk about something like a specific revision
KR> of a specific chunk of code.  But when we try to associate an lsid (as
KR> the immutable identifier) with the results of a query, things start to
KR> break down.  The query itself could have an lsid associated with it, but
KR> the result of the query can be different depending on when you execute
KR> it -- particularly when a long period of time passes between executions
KR> of the query.
KR> 
KR> What this means to me is that if we do impose an lsid on a resultset -
KR> even if it's done at the server - it is a completely meaningless number
KR> from Kepler's perspective.  What good does this do Kepler?  The problem
KR> we're trying to solve is to have a user do something - like query the
KR> digir ecogrid server - and have the results usable more than once.  We
KR> need to be able to associate this action (query digir ecogrid) with some
KR> set of results (the resultset) and be able to do this mapping
KR> consistently, over and over, until the item in the cache expires.  If the
KR> lsid is computed based on the query (which makes retrieval easy) then we
KR> have no guarantee of uniqueness.  If the lsid is based on the resultset
KR> contents then we cannot perform the mapping without knowing the results
KR> beforehand.
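KR> 
KR> Just to illustrate the shape of that mapping (a toy sketch; every name
KR> in it is made up, and it deliberately ignores where the lsid would
KR> come from):
KR> 
KR>     import java.util.HashMap;
KR>     import java.util.Map;
KR> 
KR>     // Toy sketch: map the query text to its cached resultset until
KR>     // the entry expires.  Nothing here guarantees that two runs of
KR>     // the same query return the same bytes -- that is exactly the
KR>     // problem described above.
KR>     class QueryCache {
KR>         private static class Entry {
KR>             final byte[] resultset;
KR>             final long expiresAtMillis;
KR>             Entry(byte[] r, long e) { resultset = r; expiresAtMillis = e; }
KR>         }
KR> 
KR>         private final Map<String, Entry> entries =
KR>                 new HashMap<String, Entry>();
KR> 
KR>         void put(String query, byte[] resultset, long ttlMillis) {
KR>             entries.put(query, new Entry(resultset,
KR>                     System.currentTimeMillis() + ttlMillis));
KR>         }
KR> 
KR>         // Returns the cached resultset, or null if absent or expired.
KR>         byte[] get(String query) {
KR>             Entry e = entries.get(query);
KR>             if (e == null
KR>                     || System.currentTimeMillis() > e.expiresAtMillis) {
KR>                 entries.remove(query);
KR>                 return null;
KR>             }
KR>             return e.resultset;
KR>         }
KR>     }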
KR> 
KR> As Shawn pointed out, there are many other things which also need to
KR> have a local place to live for a short (or even long) period of time and
KR> do not necessarily have an lsid associated with them.
KR> 
KR> I understand that archival would be much easier if we could have guids
KR> associated with everything.  We would not have to worry about local name
KR> clashes when trying to unpackage things.  But that is an issue we have
KR> not even attempted to address, and I certainly don't think it will be
KR> impossible to manage.
KR> 
KR> Kevin
KR> 
KR> 
KR> Chad Berkley wrote:
KR> 
>> I'm not sure why we would need another ID when we have a (supposedly)
>> unique lsid to use.  I've been designing all of the objectmanager
>> classes around using lsids.  If we introduce yet another id to use
>> internally, I foresee major headaches.
>> 
>> On another, yet slightly related note:
>> I'm also dubious as to how the new cache manager is going to work with
>> data coming in from ecogrid.  I was working under the assumption
>> (based on decisions made at the June Kepler meeting) that all objects
>> coming into kepler would have an lsid.  Apparently the "reality on the
>> ground" (as CNN likes to put it) is much different.  Not only is the
>> object cache tied to lsids, but the SMS system is too.  If we hope to
>> use SMS to search the data store, the objects must have lsids.
>> 
>> We could generate local lsids for these data objects pretty easily,
>> but this will cause problems later if you try to transfer the data
>> (via a kar) to another machine or if you try to upload it to another
>> repository.
>> 
>> I don't really have a good solution for this.  I kind of think that,
>> since we've designed the object manager around lsids, we should force
>> external systems to play nice with kepler by providing lsids, either
>> natively or through some external filtering system.
>> 
>> chad
>> 
>> Kevin Ruland wrote:
>> 
>>> Hi.
>>> 
>>> I've found some more information which might prove useful.
>>> 
>>> hsqldb does support auto-increment IDENTITY columns.  We could use
>>> one as the primary key for the table and allow access to objects
>>> through that number.  So, when a new object is inserted, this integer
>>> can be returned for the caller to use in future queries.
>>> 
>>> Of course, this does not provide for persistence beyond the current
>>> session.
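>>> 
>>> (A minimal JDBC sketch of what that could look like with hsqldb; the
>>> table and column names are just for illustration.  CALL IDENTITY()
>>> is hsqldb's way of fetching the key generated by the last insert on
>>> the connection.)
>>> 
>>>     import java.sql.Connection;
>>>     import java.sql.DriverManager;
>>>     import java.sql.ResultSet;
>>>     import java.sql.Statement;
>>> 
>>>     // Sketch: insert a row, then fetch its auto-generated id.
>>>     public class IdentityDemo {
>>>         public static void main(String[] args) throws Exception {
>>>             Class.forName("org.hsqldb.jdbcDriver");
>>>             Connection c = DriverManager.getConnection(
>>>                     "jdbc:hsqldb:mem:demo", "sa", "");
>>>             Statement s = c.createStatement();
>>>             s.execute("CREATE TABLE demo"
>>>                     + " (id INTEGER IDENTITY, name VARCHAR)");
>>>             s.execute("INSERT INTO demo (name) VALUES ('some object')");
>>>             ResultSet rs = s.executeQuery("CALL IDENTITY()");
>>>             rs.next();
>>>             int id = rs.getInt(1);  // hand this back to the caller
>>>             System.out.println("inserted id = " + id);
>>>             c.close();
>>>         }
>>>     }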
>>> 
>>> If we can leverage the NAME column more, perhaps that could be used
>>> for the persistent key.  Basically that's what the DataCacheManager
>>> does now.  It uses a name like "EcoGrid Digir Query: <magic query
>>> string>" for the name of the object.
>>> 
>>> The current schema for the cachetable is:
>>> 
>>> name: varchar
>>> lsid: varchar
>>> date: varchar
>>> file: varchar
>>> 
>>> With no constraints.
>>> 
>>> I suggest we do this:
>>> 
>>> id: IDENTITY
>>> name: varchar
>>> lsid: varchar
>>> date: varchar
>>> file: varchar
>>> expiration: varchar (to be completed eventually)
>>> 
>>> Perhaps force lsid to be nullable and unique - because it seems
>>> that's what it should be.
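>>> 
>>> Roughly, as hsqldb DDL (again, just a sketch of the proposal above;
>>> the intent of the UNIQUE constraint is that lsid be unique when
>>> present):
>>> 
>>>     String ddl =
>>>           "CREATE TABLE cachetable ("
>>>         + "  id INTEGER IDENTITY,"
>>>         + "  name VARCHAR,"
>>>         + "  lsid VARCHAR,"
>>>         + "  date VARCHAR,"
>>>         + "  file VARCHAR,"
>>>         + "  expiration VARCHAR,"
>>>         + "  UNIQUE (lsid)"
>>>         + ")";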
>>> 
>>> Change some signatures:
>>> 
>>> int CacheManager.insertObject( CacheObject ) - returns the id of
>>> the inserted element.
>>> 
>>> CacheObject CacheManager.getObject( int ) - returns the cache object
>>> for the given id, or null if not found.
>>> 
>>> Vector<CacheObject> CacheManager.getObjectsByName( String ) - returns
>>> a vector of objects matching the given name string.
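>>> 
>>> In other words, sketched as a Java interface (CacheObject being the
>>> existing class; the interface name here is made up):
>>> 
>>>     import java.util.Vector;
>>> 
>>>     interface CacheManagerApi {
>>>         /** Inserts the object; returns the id of the new row. */
>>>         int insertObject(CacheObject obj);
>>> 
>>>         /** Returns the object for the given id, or null if not found. */
>>>         CacheObject getObject(int id);
>>> 
>>>         /** Returns all objects whose name matches the given string. */
>>>         Vector<CacheObject> getObjectsByName(String name);
>>>     }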
>>> 
>>> Also, there are a few places where SQL is inlined in the
>>> application.  This includes DDL statements as well as SIUD
>>> (select/insert/update/delete) SQL.  Perhaps we should consider
>>> pulling these things together into something which looks more like a
>>> data access pattern.  At the very least we should pull the
>>> application initialization code together, including initialization
>>> of the user.dir directory structure and of the clean database.
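>>> 
>>> Something along these lines, say -- one class owning the SQL strings
>>> so they aren't scattered through the application (all names invented
>>> for illustration):
>>> 
>>>     final class CacheSql {
>>>         static final String INSERT =
>>>               "INSERT INTO cachetable"
>>>             + " (name, lsid, date, file, expiration)"
>>>             + " VALUES (?, ?, ?, ?, ?)";
>>>         static final String SELECT_BY_ID =
>>>               "SELECT id, name, lsid, date, file, expiration"
>>>             + " FROM cachetable WHERE id = ?";
>>>         static final String DELETE_EXPIRED =
>>>               "DELETE FROM cachetable WHERE expiration < ?";
>>> 
>>>         private CacheSql() { }  // constants only
>>>     }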
>>> 
>>> Kevin
>>> 
>>> Kevin Ruland wrote:
>>>>
>>>> Hi.
>>>>
>>>> One of the tickets assigned to me is to implement the cache
>>>> expiration stuff so that, in particular, the ecogrid queries are
>>>> better behaved.  I was expecting to migrate the old ecogrid
>>>> mechanism from the DataCacheManager to the new CacheManager before
>>>> getting this to work.  However, I have some questions.
>>>>
>>>> The resultsets returned from the ecogrid queries do not have
>>>> anything resembling an lsid, which is the primary key into the
>>>> CacheManager.  We could hack together an lsid based on something
>>>> like the search criteria, but strictly speaking this is not an
>>>> lsid.  In addition, there is no real guarantee that the resultset
>>>> returned for the same query will always be the same resultset.  For
>>>> example, additional Digir providers may become available, or new
>>>> data may have been added to metacat, etc.
>>>>
>>>> I'm thinking we need some kind of internal lsid generator which can
>>>> return new lsids for the local application.  Either we'd have to
>>>> have the objects with these internal (localhost?) lsids always have
>>>> "session" lifespan, or we'll have to come up with a mechanism which
>>>> always returns the same lsid for any arbitrary input.  Maybe
>>>> something like:
>>>>
>>>> class LSIDGenerator {
>>>>
>>>>     static LSID generate( Object o );
>>>>
>>>> }
>>>>
>>>> Some kind of magic checksum is computed on o and used in the lsid.
>>>> So when somebody does an Ecogrid quick search, the contents of the
>>>> text box combined with the EcogridQueryEndpoint are passed into the
>>>> generate method.  Note: both these things are strings.
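>>>>
>>>> For instance, a sketch using an MD5 digest of the input string (the
>>>> "localhost" authority and "cache" namespace in the lsid are
>>>> placeholders):
>>>>
>>>>     import java.security.MessageDigest;
>>>>
>>>>     // Sketch: the same input string always yields the same lsid.
>>>>     class LSIDGenerator {
>>>>         static String generate(String input) throws Exception {
>>>>             MessageDigest md = MessageDigest.getInstance("MD5");
>>>>             byte[] digest = md.digest(input.getBytes("UTF-8"));
>>>>             StringBuffer hex = new StringBuffer();
>>>>             for (int i = 0; i < digest.length; i++) {
>>>>                 // two hex chars per byte, zero-padded
>>>>                 hex.append(Integer.toHexString(
>>>>                         (digest[i] & 0xff) | 0x100).substring(1));
>>>>             }
>>>>             return "urn:lsid:localhost:cache:" + hex;
>>>>         }
>>>>     }
>>>>
>>>> So for the quick search, generate( textBoxContents +
>>>> EcogridQueryEndpoint ) would be the lookup key.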
>>>>
>>>> A unique mapping from Object (or maybe the less general case of
>>>> String) -> LSID could then be used to look up the resultset (or
>>>> other object) from previously executed queries.
>>>>
>>>> I think we have the same problem when trying to use the CacheManager
>>>> as a repository for intermediate results generated as part of a
>>>> workflow.
>>>>
>>>> Kevin