[kepler-dev] Caching of Data in Kepler

Matt Jones jones at nceas.ucsb.edu
Tue Sep 14 09:23:34 PDT 2004


Hi Rod,

Thanks for putting this together.  Very nice.  At this point I think it 
would be more work to use the Monarch cache, so I would stick with yours, 
which is already set up to work in Kepler.  I think the cache needs to 
have (at least optionally) an automatic cache expiration policy that 
expires items from the cache when it reaches a disk space limit.  I 
wasn't sure from your email if you implemented that part, but it doesn't 
seem hard to add on given what you already implemented.
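
A size-triggered expiration policy could be sketched roughly as follows (class and method names here are hypothetical, not the actual CacheManager API): once the total cached bytes exceed a disk-space limit, least-recently-used entries are evicted first.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a size-triggered expiration policy (hypothetical names, not
// the actual CacheManager API). When the total cached bytes exceed the
// limit, least-recently-used entries are evicted first.
class SizeLimitedCache {
    private final long maxBytes;
    private long currentBytes = 0;
    // access-ordered map: iteration starts at the least-recently-used entry
    private final LinkedHashMap<String, byte[]> entries =
            new LinkedHashMap<>(16, 0.75f, true);

    SizeLimitedCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    synchronized void put(String id, byte[] data) {
        byte[] old = entries.put(id, data);
        if (old != null) currentBytes -= old.length;
        currentBytes += data.length;
        evictIfNeeded();
    }

    synchronized byte[] get(String id) {
        return entries.get(id);
    }

    synchronized int size() {
        return entries.size();
    }

    // Drop LRU entries until we are back under the disk-space limit.
    private void evictIfNeeded() {
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            currentBytes -= eldest.getValue().length;
            it.remove(); // in the real cache this would also delete the file
        }
    }
}
```

In the real cache the eviction step would delete the backing file as well as the in-memory entry, but the bookkeeping is the same.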

Also, I'd like to note that in some of our systems (e.g. Metacat) both 
the data objects (often binary) and metadata objects (often XML) have 
unique identifiers and will need to be cached.  So I would argue for 
having the cache support both types of objects (there's really no 
distinction anyway).  Yours probably does already, but I just wanted to 
make sure.

There will also be a need for local storage for data.  In Morpho we use 
the same storage manager to handle both the cached network and 
local-only files.  Do you think that would be possible here?  Basically, 
it's just a persistent part of the cache that is outside of the part of 
the cache that can be deleted (you don't want people to delete something 
thinking that it is just a cached item when in fact it's their only copy 
of the data).  The data that might be stored locally would be the 
outputs or intermediate products of any stage of a model (we would also 
be enabling these products to be stored on EcoGrid as well once 'put' is 
complete).
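
The split between expirable cached copies and persistent local-only items might look something like this (all names hypothetical): "delete all cached items" must never touch an entry that is the user's only copy of the data.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a store mixing expirable cached copies with persistent,
// local-only items (all names hypothetical). Deleting "cached" items
// must never remove the user's only copy of the data.
class CacheStore {
    private static class Entry {
        final byte[] data;
        final boolean localOnly; // true: user's only copy, exempt from deletion
        Entry(byte[] data, boolean localOnly) {
            this.data = data;
            this.localOnly = localOnly;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    // A cached copy of a network object; safe to expire or delete.
    void putCached(String id, byte[] data) {
        store.put(id, new Entry(data, false));
    }

    // A local-only item, e.g. an intermediate product of a model run.
    void putLocal(String id, byte[] data) {
        store.put(id, new Entry(data, true));
    }

    byte[] get(String id) {
        Entry e = store.get(id);
        return e == null ? null : e.data;
    }

    // "Delete all cached items" leaves the persistent part untouched.
    void clearCache() {
        store.values().removeIf(e -> !e.localOnly);
    }
}
```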

Recently I've been discussing with Shawn and Chad on IRC the need for a 
more coherent management strategy for storing actors and moml documents 
so that the search mechanism can reliably locate both the moml and its 
associated semantic metadata.  If the cache manager were generalized to 
an overall 'Storage Manager' that could store local data, metadata, and 
moml as well as cache network-based data, metadata, and moml then we 
would have the basis for our search mechanism.
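
As a rough sketch only (these names are illustrative, not an existing Kepler API), such a generalized store might expose one interface over data, metadata, and moml, so the search code has a single place to look:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative shape for a generalized "Storage Manager" (not an existing
// Kepler API): one interface over data, metadata, and moml objects.
interface ObjectStore {
    enum Kind { DATA, METADATA, MOML }

    void put(String id, Kind kind, byte[] content);

    byte[] get(String id);

    // Lets a search mechanism enumerate, e.g., all stored moml documents.
    List<String> idsOfKind(Kind kind);
}

// Minimal in-memory implementation, for illustration only.
class InMemoryObjectStore implements ObjectStore {
    private final Map<String, Kind> kinds = new HashMap<>();
    private final Map<String, byte[]> contents = new HashMap<>();

    public void put(String id, Kind kind, byte[] content) {
        kinds.put(id, kind);
        contents.put(id, content);
    }

    public byte[] get(String id) {
        return contents.get(id);
    }

    public List<String> idsOfKind(Kind kind) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Kind> e : kinds.entrySet())
            if (e.getValue() == kind) ids.add(e.getKey());
        return ids;
    }
}
```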

Finally, in Morpho we have the problem of trying to search XML documents 
that are on the local disk with XPath.  This requires creating a DOM 
representation of the XML document, which is expensive if done 
repeatedly.  So in Morpho we cache these DOM trees once they are loaded 
so that they need not be recreated after the XML has been parsed once. 
We might want to consider a similar mechanism for MoML to improve 
search performance -- we'll need to ask Chad about this some.  It will 
also possibly apply to EML and other XML doctypes when we start 
implementing the local search capabilities for data -- only network 
search is implemented now for data.
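
The DOM-caching idea could be sketched like this (illustrative names; Morpho's actual classes differ): parse each document into a DOM tree once, then hand back the cached tree for later XPath queries.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Sketch of a Morpho-style DOM cache (illustrative names; Morpho's actual
// classes differ): each XML document is parsed into a DOM tree once, and
// later XPath queries reuse the cached tree instead of re-parsing.
class DomCache {
    private final Map<String, Document> parsed = new HashMap<>();
    private final DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
    int parseCount = 0; // for illustration: number of actual parses

    Document getDocument(String id, String xml) {
        Document doc = parsed.get(id);
        if (doc == null) {
            try {
                parseCount++;
                doc = factory.newDocumentBuilder().parse(
                        new ByteArrayInputStream(
                                xml.getBytes(StandardCharsets.UTF_8)));
            } catch (Exception e) {
                throw new RuntimeException("could not parse " + id, e);
            }
            parsed.put(id, doc);
        }
        return doc;
    }
}
```

A real version would want an invalidation hook for when the underlying document changes, and possibly a size bound like the data cache's.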

Matt

Rod Spears wrote:
> Yesterday morning I put together a caching mechanism for the Ecogrid 
> DataSources.
> 
> It is a hybrid memory cache and file cache with threading. Here is how 
> it works:
> 
> The CacheManager maintains a list of cached items. The base class is 
> abstract, enabling the implementing classes to implement "how" the data 
> is obtained. The base class is responsible for threading, loading old 
> data and saving out new data.
> 
> When a request is made, the cache item is created on its own thread and 
> begins to download the data; in the meantime it marks itself as "busy." 
> When it finishes, it notifies any listeners that it is done and marks 
> itself "complete."
> 
> The cache manager serializes itself out as an XML file; each entry in 
> the cache is saved in a separate file, thus making it simple and flexible.
> 
> The items keep track of their creation date and I could easily add the 
> capability for them to automatically retrieve a newer version of their 
> contents. The impl I have now keeps track of the ecogrid info necessary 
> to retrieve the data.
> 
> So at the moment, when a DataSource needs its data it just asks for it; 
> the cache will get it and notify the DataSource when it is there. It is 
> all very transparent to the DS. The big difference is that it is more 
> asynchronous than before.
> 
> I also created a quick little "Data Cache Viewer" that displays the 
> entries in the cache "catalog".
> 
> Under the File menu item you can:
> * Refresh a selected cached item
> * Refresh all the items
> * Delete a single cache item
> * Delete all the cache items
> 
> I could easily add to the viewer a way to view the actual contents of a 
> cached item (if we need it). Some scientists may want that....
> 
> 
> 
> After I got this working with the EML200DataSource, Jing informs me that 
> Monarch has a "generic" memory cache and file cache. I haven't had the 
> time yet to review the impls. Now, we can go with this specific impl 
> that is tailored to our DataSource objects, or I could adapt it to use 
> Monarch's file cache for serializing the output.
> 
> Any thoughts? Or maybe we just use this for now and look at the issue 
> again after our Oct. deadline.
> 
> Also, it seems that we will want to cache some of the metadata 
> (table entity info) so a user could actually run "offline" if they 
> wanted to or needed to.
> 
> Rod
> 

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------
