[kepler-dev] Caching of Data in Kepler

Chad Berkley berkley at nceas.ucsb.edu
Tue Sep 14 09:44:53 PDT 2004


The MoML representation is loaded into memory whenever a MoML segment 
gets hit by the parser.  So, basically, the caching of the MoML is 
already done for us in the Ptolemy code.  You'll notice that when you 
start up Vergil (from Kepler) there is a small delay.  That's because 
I'm pre-reading the whole actor library so that the searches are fast.
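A minimal sketch of the idea, with hypothetical names (the real parsing is done by `ptolemy.moml.MoMLParser`): each MoML fragment is parsed once and the result is kept in a map, so later lookups and searches skip the parser entirely.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: parse each MoML fragment once and keep the
 *  result in memory so later searches hit the map, not the parser. */
class MomlCache {
    private final Map<String, String> parsed = new ConcurrentHashMap<>();

    /** Returns the cached representation, parsing only on first use. */
    public String get(String name, String momlSource) {
        return parsed.computeIfAbsent(name, k -> parse(momlSource));
    }

    // Stand-in for the real MoML parse step (ptolemy.moml.MoMLParser).
    private String parse(String moml) {
        return moml.trim();
    }
}
```

Pre-reading the actor library at startup would then amount to calling `get` for every known actor before the first search runs.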

chad

Matt Jones wrote:
> Hi Rod,
> 
> Thanks for putting this together.  Very nice.  At this point I think it 
> would be more work to use the Monarch cache, so I would stick with 
> yours, which is already set up to work in Kepler.  I think the cache 
> needs to have (at least optionally) an automatic expiration policy that 
> expires items from the cache when it reaches a disk space limit.  I 
> wasn't sure from your email whether you implemented that part, but it 
> doesn't seem hard to add given what you have already built.
> 
> Also, I'd like to note that in some of our systems (e.g. Metacat) both 
> the data objects (often binary) and the metadata objects (often XML) 
> have unique identifiers and will need to be cached.  So I would argue 
> for having the cache support both types of objects (there's really no 
> distinction anyway).  Yours probably does already, but I just wanted to 
> make sure.
> 
> There will also be a need for local data storage.  In Morpho we use 
> the same storage manager to handle both cached network files and 
> local-only files.  Do you think that would be possible here?  Basically, 
> it's just a persistent part of the cache that sits outside the part of 
> the cache that can be deleted (you don't want people to delete something 
> thinking it is just a cached item when in fact it's their only copy 
> of the data).  The data that might be stored locally would be the 
> outputs or intermediate products of any stage of a model (we would also 
> be enabling these products to be stored on EcoGrid once 'put' is 
> complete).
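The pinned "local-only" storage Matt describes could look roughly like this sketch (all names hypothetical): local entries are marked persistent, so a cache cleanup never removes a user's only copy of the data.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: one store for both cached copies and local-only
 *  data, where local entries are pinned and survive cache cleanup. */
class LocalStore {
    private final Map<String, byte[]> items = new HashMap<>();
    private final Set<String> pinned = new HashSet<>();

    /** Cached network data: expendable, safe to delete. */
    public void putCached(String id, byte[] data) { items.put(id, data); }

    /** Local-only data: pinned so cleanup never deletes the only copy. */
    public void putLocal(String id, byte[] data) {
        items.put(id, data);
        pinned.add(id);
    }

    /** Deletes only the expendable cached entries. */
    public void clearCache() {
        items.keySet().removeIf(id -> !pinned.contains(id));
    }

    public boolean contains(String id) { return items.containsKey(id); }
}
```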
> 
> Recently I've been discussing with Shawn and Chad on IRC the need for a 
> more coherent management strategy for storing actors and MoML documents 
> so that the search mechanism can reliably locate both the MoML and its 
> associated semantic metadata.  If the cache manager were generalized to 
> an overall 'Storage Manager' that could store local data, metadata, and 
> MoML, as well as cache network-based data, metadata, and MoML, then we 
> would have the basis for our search mechanism.
> 
> Finally, in Morpho we have the problem of trying to search XML documents 
> on the local disk with XPath.  This requires creating a DOM 
> representation of each XML document, which is expensive if done 
> repeatedly.  So in Morpho we cache these DOM trees once they are 
> loaded, so that they need not be recreated after the XML has been 
> parsed once.  We might want to consider a similar mechanism for MoML to 
> improve search performance -- we'll need to ask Chad about this.  It 
> will also likely apply to EML and other XML doctypes when we start 
> implementing the local search capabilities for data -- only network 
> search is implemented now for data.
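The Morpho-style DOM cache might be sketched like this (hypothetical `DomCache` class; the JAXP parsing calls are standard): the document is parsed into a DOM once, and the same tree is reused for every later XPath query.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Sketch of a Morpho-style DOM cache: parse each XML document into a
 *  DOM once, then reuse the tree for every subsequent XPath query. */
class DomCache {
    private final Map<String, Document> trees = new ConcurrentHashMap<>();

    public Document get(String docid, String xml) throws Exception {
        Document cached = trees.get(docid);
        if (cached != null) return cached;  // already parsed once
        Document dom = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));
        trees.put(docid, dom);
        return dom;
    }
}
```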
> 
> Matt
> 
> Rod Spears wrote:
> 
>> Yesterday morning I put together a caching mechanism for the EcoGrid 
>> DataSources.
>>
>> It is a hybrid memory cache and file cache with threading.  Here is 
>> how it works:
>>
>> The CacheManager maintains a list of cached items.  The base class is 
>> abstract, enabling the implementing classes to define "how" the data 
>> is obtained.  The base class is responsible for threading, loading old 
>> data, and saving out new data.
>>
>> When a request is made, the cache item is created on its own thread 
>> and begins to download the data; in the meantime it marks itself as 
>> "busy."  When it finishes, it notifies any listeners that it is done 
>> and marks itself "complete."
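The busy/complete life cycle described above could be sketched as follows (all names hypothetical, not Rod's actual classes): each item fetches its data on its own thread, and listeners are notified once it marks itself complete.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch of a threaded cache item: it marks itself BUSY
 *  while fetching, then notifies listeners and marks itself COMPLETE. */
abstract class CacheItem implements Runnable {
    public enum State { BUSY, COMPLETE }

    private volatile State state = State.BUSY;
    private volatile byte[] data;
    private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    /** Subclasses decide "how" the data is obtained. */
    protected abstract byte[] fetch() throws Exception;

    public void addListener(Runnable l) { listeners.add(l); }
    public State getState() { return state; }
    public byte[] getData() { return data; }

    @Override public void run() {
        try {
            data = fetch();          // download happens on this thread
        } catch (Exception e) {
            data = null;             // a real version would record errors
        }
        state = State.COMPLETE;
        for (Runnable l : listeners) l.run();  // "done" notification
    }
}
```

A DataSource would register a listener, start the item with `new Thread(item).start()`, and read the data once notified.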
>>
>> The cache manager serializes itself out as an XML file; each entry in 
>> the cache is saved in a separate file, which keeps things simple and 
>> flexible.
>>
>> The items keep track of their creation date, and I could easily add 
>> the capability for them to automatically retrieve a newer version of 
>> their contents.  The implementation I have now keeps track of the 
>> EcoGrid info necessary to retrieve the data.
>>
>> So at the moment, when a DataSource needs its data it just asks for 
>> it; the cache will then get it and notify the DataSource when it is 
>> there.  It is all very transparent to the DataSource.  The big 
>> difference is that it is more asynchronous than before.
>>
>> I also created a quick little "Data Cache Viewer" that displays the 
>> entries in the cache "catalog."
>>
>> Under the File menu item you can:
>> * Refresh a selected cached item
>> * Refresh all the items
>> * Delete a single cache item
>> * Delete all the cache items
>>
>> I could easily add to the viewer a way to view the actual contents of 
>> a cached item (if we need it). Some scientists may want that....
>>
>> After I got this working with the EML200DataSource, Jing informed me 
>> that Monarch has a "generic" memory cache and file cache.  I haven't 
>> had time yet to review those implementations.  Now, we can go with 
>> this specific implementation, which is tailored to our DataSource 
>> objects, or I could adapt it to use Monarch's file cache for 
>> serializing the output.
>>
>> Any thoughts? Or maybe we just use this for now and look at the issue 
>> again after our Oct. deadline.
>>
>> Also, it seems that we will want to cache some of the metadata 
>> (table entity info) so a user could actually run "offline" if they 
>> wanted or needed to.
>>
>> Rod
>>
> 


