[kepler-dev] Replacing DataCacheManager with CacheManager.

Kevin Ruland kruland at ku.edu
Fri Dec 9 18:17:42 PST 2005


I like the CNN analogy.

The way we've talked about lsids (or maybe it's my misunderstanding of
lsids), an lsid is attached to a unique thing.  I equate this unique
thing to a collection of bytes.  And no matter what, if I ask the
authority for the thing associated with an lsid, I will always get
exactly the same thing back.

This works great when we talk about something like a specific revision
of a specific chunk of code.  But when we try to associate an lsid (as
the immutable identifier) with the results of a query, things start to
break down.  The query itself could have an lsid associated with it,
but the result of the query can differ depending on when you execute
it, particularly when long periods of time pass between executions.

What this means to me is that if we do assign an lsid to a resultset -
even if it's done at the server - it is a completely meaningless number
from Kepler's perspective.  What good does this do Kepler?  The problem
we're trying to solve is to have a user do something - like query the
digir ecogrid server - and have the results usable more than once.  We
need to be able to associate this action (query digir ecogrid) with some
set of results (the resultset), and to be able to do this mapping
consistently, over and over, until the item in the cache expires.  If
the lsid is computed from the query (which makes retrieval easy) then we
have no guarantee of uniqueness.  If the lsid is based on the resultset
contents then we cannot perform the mapping without knowing the results
beforehand.
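The requirement above - map the same action to the same cached results
until the entry expires - can be sketched without lsids at all, keying
the cache on the query string itself.  A minimal, hypothetical sketch
(the class and method names are mine, not Kepler's):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: key the cache on the action (the query string)
// rather than on an lsid, and let entries expire after a fixed TTL.
class QueryResultCache {
    private static class Entry {
        final byte[] resultset;
        final Instant stored;
        Entry(byte[] r, Instant s) { resultset = r; stored = s; }
    }

    private final Map<String, Entry> byQuery = new HashMap<>();
    private final Duration ttl;

    QueryResultCache(Duration ttl) { this.ttl = ttl; }

    // Same query -> same cached resultset, until the entry expires.
    byte[] lookup(String query) {
        Entry e = byQuery.get(query);
        if (e == null) return null;
        if (Instant.now().isAfter(e.stored.plus(ttl))) {
            byQuery.remove(query);  // expired: caller must re-execute
            return null;
        }
        return e.resultset;
    }

    void store(String query, byte[] resultset) {
        byQuery.put(query, new Entry(resultset, Instant.now()));
    }
}
```

Whether the key is the raw query string or a checksum of it, the point
is the same: the identifier names the action, not the (possibly
changing) bytes that come back.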

As Shawn pointed out, there are many other things which also need to
have a local place to live for a short (or even long) period of time
and do not necessarily have an lsid associated with them.

I understand that archival would be much easier if we could have guids
associated with everything.  We would not have to worry about local name
clashes when trying to unpackage.  But that is an issue we have not even
attempted to address, and I certainly don't think it will be impossible
to manage.

Kevin


Chad Berkley wrote:

> I'm not sure why we would need another ID when we have a (supposedly)
> unique lsid to use.  I've been designing all of the objectmanager
> classes around using lsids.  If we introduce yet another id to use
> internally, I foresee major headaches.
>
> On another, yet slightly related note:
> I'm also dubious as to how the new cache manager is going to work with
> data coming in from ecogrid.  I was working under the assumption
> (based on decisions made at the june kepler meeting) that all objects
> coming into kepler would have an lsid.  Apparently the "reality on the
> ground" (as CNN likes to put it) is much different.  Not only is the
> object cache tied to lsids but the SMS system is too.  If we hope to
> use SMS to search the data store, the data objects must have lsids.
>
> We could generate local lsids for these data objects pretty easily,
> but this will cause problems later if you try to transfer the data
> (via a kar) to another machine or if you try to upload it to another
> repository.
>
> I don't really have a good solution for this.  I kind of think that,
> since we've designed the object manager around lsids, we should force
> external systems to play nice with kepler by providing lsids, either
> natively or through some external filtering system.
>
> chad
>
> Kevin Ruland wrote:
>
>> Hi.
>>
>> I've found some more information which might prove useful.
>>
>> hsqldb does support auto-increment IDENTITY columns.  We could use
>> one as the primary key for the table and allow access to objects
>> through that number.  So, when a new object is inserted, this integer
>> can be returned for the caller to use in future queries.
>>
>> Of course, this does not provide for persistence beyond the current
>> session.
>>
>> If we can leverage the NAME column more, perhaps that could be used
>> for the persistent key.  Basically that's what the DataCacheManager
>> does now.  It uses a name like "EcoGrid Digir Query: <magic query
>> string>" for the name of the object.
>>
>> The current schema for the cachetable is:
>>
>> name: varchar
>> lsid: varchar
>> date: varchar
>> file: varchar
>>
>> With no constraints.
>>
>> I suggest we do this:
>>
>> id: IDENTITY
>> name: varchar
>> lsid: varchar
>> date: varchar
>> file: varchar
>> expiration: varchar (to be completed eventually)
>>
>> Perhaps force lsid to be nullable and unique - because it seems
>> that's what it should be.
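One possible hsqldb rendering of that proposal (table and constraint
names are illustrative, as are the inserted values; `CALL IDENTITY()`
is hsqldb's way of fetching the last id generated in the session):

```sql
CREATE TABLE cachetable (
  id         INTEGER IDENTITY,   -- auto-increment primary key
  name       VARCHAR,
  lsid       VARCHAR,
  date       VARCHAR,
  file       VARCHAR,
  expiration VARCHAR             -- to be completed eventually
);

-- lsid may be null, but non-null values must be unique
ALTER TABLE cachetable ADD CONSTRAINT lsid_unique UNIQUE (lsid);

-- after an INSERT, fetch the generated id for the caller:
INSERT INTO cachetable (name, lsid, date, file)
  VALUES ('EcoGrid Digir Query: ...', NULL, '2005-12-09', 'cachedata/obj1');
CALL IDENTITY();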
>>
>> Change some signatures:
>>
>> int CacheManager.insertObject( CacheObject ) - returns the id of
>> the inserted element.
>>
>> CacheObject CacheManager.getObject( int ) - returns the cache object
>> for the given id, or null if not found.
>>
>> Vector<CacheObject> CacheManager.getObjectsByName( String ) - returns
>> a vector of objects matching the given name string.
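A minimal in-memory sketch of those three signatures.  CacheObject is
reduced to a name/payload pair here, and a map stands in for the hsqldb
table - the method names come from the proposal above, everything else
is illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

// Illustrative stand-in: CacheObject reduced to a name plus payload.
class CacheObject {
    final String name;
    final byte[] payload;
    CacheObject(String name, byte[] payload) {
        this.name = name;
        this.payload = payload;
    }
}

class CacheManager {
    private final Map<Integer, CacheObject> table = new HashMap<>();
    private int nextId = 0;  // stands in for the IDENTITY column

    // Returns the id of the inserted element.
    int insertObject(CacheObject o) {
        table.put(++nextId, o);
        return nextId;
    }

    // Returns the cache object for the given id, or null if not found.
    CacheObject getObject(int id) {
        return table.get(id);
    }

    // Returns a vector of objects matching the given name string.
    Vector<CacheObject> getObjectsByName(String name) {
        Vector<CacheObject> out = new Vector<>();
        for (CacheObject o : table.values())
            if (o.name.equals(name)) out.add(o);
        return out;
    }
}
```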
>>
>> Also, there are a few places where sql is inlined in the
>> application.  This includes ddl statements as well as
>> select/insert/update/delete sql.  Perhaps we should consider pulling
>> these things together into something which looks more like a data
>> access pattern.  I think at least we should have the application
>> initialization code pulled together, which would include
>> initialization of the user.dir directory structure and the clean
>> database.
>>
>> Kevin
>>
>> Kevin Ruland wrote:
>>
>>
>>> Hi.
>>>
>>> One of the tickets assigned to me is to implement the cache expiration
>>> stuff so, in particular, the ecogrid queries are better behaved.  I was
>>> expecting to try to migrate the old ecogrid mechanism from the
>>> DataCacheManager to the new CacheManager before getting this to
>>> work.  However, I have some questions.
>>>
>>> The resultsets returned from the ecogrid queries do not have anything
>>> resembling an lsid which is the primary key into the CacheManager.  We
>>> could hack together an lsid based on something like the search
>>> criteria,
>>> but strictly speaking this is not an lsid.  In addition, there is no
>>> real guarantee that the resultset returned for the same query will
>>> always be the same: additional Digir providers may become available,
>>> new data may have been added to metacat, etc.
>>>
>>> I'm thinking we need some kind of internal lsid generator which can
>>> return new lsids for the local application.  Either the objects with
>>> these internal (localhost?) lsids always have "session" lifespan, or
>>> we come up with a mechanism which always returns the same lsid for
>>> any arbitrary input.  Maybe something like:
>>>
>>> class LSIDGenerator {
>>>
>>> static LSID generate( Object o );
>>>
>>> }
>>>
>>> Some kind of magic checksum is computed on o and used in the lsid.  So
>>> when somebody does an Ecogrid quick search, the contents of the text
>>> box
>>> combined with the EcogridQueryEndpoint are passed into the generate
>>> method.  Note:  both these things are strings.
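A concrete version of that generate() idea for the two-string case.
The authority "localhost" and the namespace "cache" are placeholders,
and SHA-1 is just one possible checksum:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical sketch: derive a stable local lsid from the search text
// plus the endpoint, so the same inputs always map to the same id.
class LSIDGenerator {
    static String generate(String queryText, String endpoint) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(queryText.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0);  // separator so ("ab","c") != ("a","bc")
            md.update(endpoint.getBytes(StandardCharsets.UTF_8));
            String hex = new BigInteger(1, md.digest()).toString(16);
            return "urn:lsid:localhost:cache:" + hex;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // SHA-1 is always present
        }
    }
}
```

Note that this makes the id a pure function of the query, so it buys
easy retrieval but - as discussed above - says nothing about what bytes
the query returns on any given day.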
>>>
>>> A unique mapping from Object (or maybe the less general case String) ->
>>> LSID could then be used to lookup the resultset (or other object) from
>>> previously executed queries.
>>>
>>> I think we have the same problem when trying to use the CacheManager as
>>> a repository for intermediate results generated as part of a workflow.
>>>
>>> Kevin
>>>
>>> _______________________________________________
>>> Kepler-dev mailing list
>>> Kepler-dev at ecoinformatics.org
>>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
>>>
>>>
>>>
>>
>>
>


