[kepler-dev] versioned workflows

Matthew Jones jones at nceas.ucsb.edu
Tue Apr 15 10:34:53 PDT 2008


Hi Paul,

A few addendums to Bertram's response...

Bertram Ludaescher wrote:
> Hi Paul:
> 
> You're raising important issues (and ones that have come up repeatedly).
> 
> I'd like to mention only a few aspects, and just briefly for now:
> 
> First, in Kepler you can use Kepler Archive Files (KAR files) to create 
> self-contained versions of Kepler workflows. The use of such 
> self-contained archive files can give you a "snapshot" version of a 
> workflow (and in a sense "immunizes" you against evolving versions of 
> actors). Alternatively, you can choose to use the current version of 
> actors.

KAR files are the mechanism we chose to archive both actors and 
workflows.  We envisioned two types of KAR files, so-called 'package' 
KAR files and 'archive' KAR files.

Package KAR files are designed to encapsulate a workflow component and 
all dependencies for execution (such as jar files, native libraries, 
etc.) and give it an explicit version via an LSID.  This allows component 
developers to distribute specific versions of actors by multiple 
mechanisms, including by uploading to the Kepler Repository.  This is 
working now.

'Archive' KAR files are designed to encapsulate the versioned workflow 
and all actors and data needed to reproduce a workflow run.  Thus, an 
archive file would contain copies of all KAR files for components used 
(or references to those components) and copies of all data used (or 
references to those data).  If references are used, they must be 
persistently available, so at least initially we envisioned that Archive 
KAR files would represent a 'deep copy' of the workflow, to ensure that 
all executable components are available.  This functionality is not yet 
present in Kepler, although the infrastructure for it is in place; we 
just need to create the 'archive' function to wrap everything together. 
This is one of the top work items for the REAP project.
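
To make the copy-versus-reference distinction concrete, here is a rough 
sketch in Python.  It is not Kepler code: the ArchiveEntry class, its 
fields, and the example LSIDs and URL are all made up for illustration. 
It only shows how one might check whether an archive is a true 'deep 
copy' (every item embedded) or still depends on external references.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ArchiveEntry:
    """One item in a hypothetical 'archive' KAR manifest (not Kepler's real format)."""
    lsid: str                     # identifier of the workflow, actor, or data item
    kind: str                     # "workflow", "actor", or "data"
    embedded_path: Optional[str]  # path inside the archive when a copy is included
    reference_url: Optional[str]  # external location when only a reference is stored


def external_references(entries: List[ArchiveEntry]) -> List[ArchiveEntry]:
    """Return the entries that are not copied into the archive."""
    return [e for e in entries if e.embedded_path is None]


def is_deep_copy(entries: List[ArchiveEntry]) -> bool:
    """An archive is a deep copy when nothing depends on an external reference."""
    return not external_references(entries)


# Made-up example: the data item is referenced rather than embedded.
entries = [
    ArchiveEntry("urn:lsid:example.org:workflow:1:1", "workflow", "workflow.xml", None),
    ArchiveEntry("urn:lsid:example.org:actor:7:2", "actor", "actors/7.kar", None),
    ArchiveEntry("urn:lsid:example.org:data:3:1", "data", None,
                 "http://example.org/data/3"),
]

for entry in external_references(entries):
    print("must stay persistently available:", entry.lsid)
```

Any entry reported by external_references() is one whose target has to 
remain persistently available for the run to be reproducible.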

> Overall, as I see it, the problem contains the problem of software 
> configuration management as a special case and thus can be tricky to say 
> the least ...
Agreed...tricky to say the least.

> Also, as far as I recall, LSIDs are used for identifying actors and 
> workflows, but I'm not sure whether a versioning feature is used as well 
> (here is some earlier info on KAR files): 
> http://kepler-project.org/Wiki.jsp?page=KeplerObjectManager
We do support versioning, via LSIDs. The hard part in my mind is making 
sure that component authors are religious about versioning their actor 
changes.  Thus far, in the pre-1.0.0 release situation, we have allowed 
actor authors to change their actor signatures and implementations 
without revising the LSID.  Post 1.0.0 this would not be recommended, as 
it would break old workflows that incorporated early versions of that 
actor.  Thus, 1.0.0 represents a watershed in versioning.  Separating 
actors from the Kepler CORE will be very positive as it will allow 
actors to be versioned independently of the core release.
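
As an illustration of what versioning via LSIDs means in practice, here 
is a small Python sketch.  It assumes the general LSID layout 
urn:lsid:authority:namespace:object[:revision] and, as a further 
assumption, that revisions are plain integers.  The Lsid class and the 
example identifier are hypothetical and are not taken from the Kepler 
code base.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Lsid:
    """Parsed form of urn:lsid:authority:namespace:object[:revision]."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None

    @classmethod
    def parse(cls, text: str) -> "Lsid":
        parts = text.split(":")
        if (len(parts) not in (5, 6)
                or parts[0].lower() != "urn"
                or parts[1].lower() != "lsid"):
            raise ValueError(f"not an LSID: {text!r}")
        revision = parts[5] if len(parts) == 6 else None
        return cls(parts[2], parts[3], parts[4], revision)

    def same_object(self, other: "Lsid") -> bool:
        """True when both LSIDs name the same object, ignoring the revision."""
        return (self.authority, self.namespace, self.object_id) == (
            other.authority, other.namespace, other.object_id)

    def next_revision(self) -> "Lsid":
        """Return a new LSID for a changed actor (assumes integer revisions)."""
        if self.revision is None:
            raise ValueError("LSID has no revision to increment")
        return replace(self, revision=str(int(self.revision) + 1))

    def __str__(self) -> str:
        base = f"urn:lsid:{self.authority}:{self.namespace}:{self.object_id}"
        return base if self.revision is None else f"{base}:{self.revision}"


# Made-up identifier: an actor author changes the actor and issues a new revision.
old = Lsid.parse("urn:lsid:example.org:actor:42:1")
new = old.next_revision()
print(old, "->", new)
```

An old workflow that recorded the first LSID keeps pointing at the 
original actor, while new workflows can pick up the revised one.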

Note that the wiki page Bertram points to is a little old but still very 
relevant.  It says we haven't handled 'dynamic class loading', but 
in fact we have partially solved this problem now: jar files included 
in the KAR file are indeed loaded at runtime, so a complete actor with 
its jar dependencies can be archived in a KAR.  The major items left to 
finish are to 1) manage versioning conflicts when different actors need 
to load conflicting jar files, and 2) support native code dependencies, 
such as actors that wrap native simulation models or other C code that 
has library dependencies and platform-specific runtime dependencies. 
We have developed further requirements and plans for the REAP project in 
this area.
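
For item 1, the first step is simply detecting that a conflict exists. 
The Python sketch below is one way that could look.  It is not how 
Kepler does it: the find_conflicts function, the actor names, and the 
assumption that jar files are named name-version.jar are all 
hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumes jar files follow a "name-version.jar" naming convention.
JAR_NAME = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.]*)\.jar$")


def find_conflicts(actor_jars: Dict[str, List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Map each library needed at more than one version to {version: [actors]}."""
    by_library = defaultdict(lambda: defaultdict(set))
    for actor, jars in actor_jars.items():
        for jar in jars:
            match = JAR_NAME.match(jar)
            if match is None:
                continue  # name carries no recognizable version; skip it
            by_library[match["name"]][match["version"]].add(actor)
    return {
        library: {version: sorted(actors) for version, actors in versions.items()}
        for library, versions in by_library.items()
        if len(versions) > 1
    }


# Made-up actors and jar lists.
actor_jars = {
    "ActorA": ["jdom-1.0.jar"],
    "ActorB": ["jdom-1.1.jar", "log4j-1.2.jar"],
}
print(find_conflicts(actor_jars))
```

Detection is the easy part; actually loading both versions side by side 
would need something like separate class loaders per actor, which this 
sketch does not attempt.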

> 
> Conceptually, it might be helpful to distinguish between a static 
> snapshot/archival version of a workflow, where the goal might be 
> reproducibility, and an "evolving workflow" where the user's goal is to 
> (mostly) use the current versions of actors.
Yep.  Thus our different 'types' of KAR files.

> 
> The problem becomes even more interesting when considering that not only 
> workflows evolve, but also the data that is associated with particular 
> workflow runs. Sometimes data is implicitly referenced via remote 
> queries and services (say via a remote Blast).
> In the general case, the functionality of a workflow thus can depend 
> also on snapshots of external entities. When recording provenance 
> information, such dependencies can be captured and can, in principle, be 
> made part of an archive as well.
KAR files can contain data.  They can also contain references to data 
that is archived, such as data in the KNB archive.  Once Kepler 
'Archive' files are implemented, we should be able to save and load the 
data in a KAR file in order to reproduce a run.  Data have an LSID too, 
so the data can be versioned as well.

> 
> The areas of data provenance (~ data lineage and processing history) and 
> workflow evolution (aka workflow provenance) are active areas of 
> research and development, in Kepler, as well as in several other projects.
Very active indeed.

Paul -- if you have interest in working with us on this, we'd welcome 
your input.  Chad's been leading this effort for us on our end, and will 
be picking it back up in a couple of months for REAP.

Matt

> 
> So much for now...
> 
> Bertram
> 
> On Tue, Apr 15, 2008 at 5:50 AM, Paul Allen <pea1 at cornell.edu 
> <mailto:pea1 at cornell.edu>> wrote:
> 
>     Hello all,
> 
>     I'm wondering if there have been any thoughts about the versioning of
>     workflows that reside in a repository. The idea would be to make sure
>     that, if a workflow from a repository is referenced externally, it will
>     always work in a manner similar (and produce similar output) as when it
>     was referenced. I think that this is important if people are sharing
>     workflows, yet those workflows continue to be improved or updated.
> 
>     I'm not sure if versioning workflows implies that actors are also
>     versioned.
> 
>     Has anybody thought about this?
> 
>     Thanks,
>     -Paul
> 
> 
> 
> 
>     _______________________________________________
>     Kepler-dev mailing list
>     Kepler-dev at ecoinformatics.org <mailto:Kepler-dev at ecoinformatics.org>
>     http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
> 
> 
> 

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matthew B. Jones
Director of Informatics Research and Development
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara
jones at nceas.ucsb.edu                       Ph: 1-907-523-1960
http://www.nceas.ucsb.edu/ecoinfo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


