[kepler-dev] KAR use-cases, requirements, implementation discussion

Mon Nov 7 07:31:35 PST 2005

Hi all.

After the conversation with Matt & Chad, and further discussions with
Chad, I believe we've come up with something.  I committed to writing up
my understanding of these conversations with the intent of
asymptotically coming to a decision as time increases without bound.

Use Cases:

UC-1)  Facilitate transport of workflows to grid/distributed/server/p2p
systems.  Scientist builds workflow on local system.  S then does some
clicking on UI to have workflow execute on remote system.  The local
system determines the components (actors, directors, libraries)
necessary to execute the workflow.  This is transferred to the remote
system.  If the remote system does not have all the components, it will
request copies of the components (from repository or from local system)
and make them available to the process which will execute the workflow.

UC-2)  Preserve an analysis to allow replication, examination and
further experimentation.  Scientist builds workflow which does something
magical.  S does some clicking on the UI and the entire workflow,
datasets, dependencies, initial configuration, etc., is preserved in a
file.  This file can then be loaded on another computer and the analysis
reexecuted.

Use Case 1 is more important in the short term.

Before dissecting into requirements, I need to point out that under
strict interpretation of the use cases, a monolithic Kepler system can
fulfill both use cases.  For the first use case, all the executable
codes (class files, libraries, dlls/so) would be distributed to the
local workstations and the server systems.  The only information which
needs to be communicated is the description of the workflow and its
configuration (moml text description) and the dataset information
(either the contents or the way to retrieve the contents).  For the
second use case, the only information which needs to be archived is the
dataset contents and workflow configuration.

My understanding of Matt's vision is this solution is inadequate because
it does not allow dynamic discovery and utilization of user-developed
binary (compiled java source) actors.  This means we really have an
uncaptured use case which might read as follows:

UC-3)  Allow the development and distribution of components
(actors/directors) which can be released on a schedule independently
from Kepler itself.  Scientist/developer determines that in order to
perform a certain step in a workflow, new binary code is required.  S/d
develops a Kepler actor using the Java language (and perhaps jni stubs
into a native library).  This actor can then be distributed to another
Kepler system which will be able to utilize it in workflows.

Now the interesting requirements begin to fall out.

Functional Requirements:

FR-1) Mechanism to package resources required to implement a component
in kepler system.
FR-1a) must be able to contain java class files
FR-1b) must be able to contain native binary executable files
FR-1c) must be able to contain native library files
FR-1d) must be able to contain MoML and other XML based text
FR-1e) must be able to contain data in binary and ascii formats
including zipped data.

FR-2) Must describe the contained components so they can be utilized in
a Kepler system.
FR-2a) each component must have a unique LSID identifier which is tied
to the specific implementation of the component.
FR-2b) must contain an OWL document with semantic ordering for the
contained objects
FR-2c) each component must list its dependencies in terms of LSIDs.

FR-3) Kepler must be able to utilize the components contained within the
package.

FR-4) Kepler must be able to detect missing dependencies when loading a
packaged resource.
FR-4a) Kepler must alert the user to missing dependencies and provide
the user with the alternative to not use the resource or attempt to
discover the missing dependencies.
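
To make FR-2c and FR-4 a little more concrete, here is a minimal sketch
of the missing-dependency check.  It assumes each installed component
has already been reduced to its LSID plus the list of LSIDs it depends
on; the class name KarDependencyCheck and the example LSIDs are made up
for illustration and are not part of Kepler.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: reports dependency LSIDs that no installed component provides. */
final class KarDependencyCheck {

    /**
     * @param declared maps each installed component's LSID to the LSIDs it depends on
     * @return the dependency LSIDs that are not themselves installed, in encounter order
     */
    static Set<String> findMissing(Map<String, List<String>> declared) {
        Set<String> missing = new LinkedHashSet<>();
        for (List<String> deps : declared.values()) {
            for (String dep : deps) {
                if (!declared.containsKey(dep)) {
                    missing.add(dep);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The LSIDs below are invented for this example.
        Map<String, List<String>> installed = Map.of(
                "urn:lsid:example.org:actor:1:1",
                List.of("urn:lsid:example.org:jar:7:1"));
        System.out.println(findMissing(installed));
    }
}
```

A non-empty result is the point at which FR-4a would kick in: alert the
user and offer to skip the resource or go looking for the missing
pieces.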

Discussion:

It has already been decided these packages are to be called KAR (Kepler
ARchives) files and have an extension of .kar.  There is really no need
to open this up for discussion.

FR-1 is pretty easy to satisfy using any number of container mechanisms.
Since we play in java-land, jar is the most natural.  Again, this
decision was already made and there is no need to readdress it.  The
benefits of jars are:  relatively decent support in the Java language
through java.util.jar.JarFile, java.util.jar.JarEntry and
java.util.jar.Manifest; and a MANIFEST, accessible programmatically,
which provides a place to put attribute/value type metadata.
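
As a rough illustration of the manifest-as-metadata idea, the sketch
below writes a tiny archive whose manifest carries an LSID and then
reads it back with the java.util.jar classes.  The attribute name
"KAR-LSID" and the class name KarManifestDemo are assumptions made for
the example; we have not agreed on any manifest keys.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

final class KarManifestDemo {
    // Hypothetical attribute name; no manifest keys have been decided yet.
    static final String LSID_ATTR = "KAR-LSID";

    /** Writes an otherwise empty archive whose manifest carries the given LSID. */
    static void writeKar(Path kar, String lsid) throws IOException {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise the attributes are not written out.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.putValue(LSID_ATTR, lsid);
        try (OutputStream out = Files.newOutputStream(kar);
             JarOutputStream jar = new JarOutputStream(out, manifest)) {
            // Class files, MoML, etc. would be added here with putNextEntry().
        }
    }

    /** Returns the LSID recorded in the manifest, or null if there is none. */
    static String readLsid(Path kar) throws IOException {
        try (JarFile jar = new JarFile(kar.toFile())) {
            Manifest manifest = jar.getManifest();
            return manifest == null ? null : manifest.getMainAttributes().getValue(LSID_ATTR);
        }
    }

    public static void main(String[] args) throws IOException {
        Path kar = Files.createTempFile("demo", ".kar");
        writeKar(kar, "urn:lsid:example.org:actor:1:1"); // invented LSID
        System.out.println(readLsid(kar));
        Files.delete(kar);
    }
}
```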

My understanding of Kepler is there are 2 or 3 different types of
components which are being targeted for this functionality:  Actors,
Directors, and Workflows.  I say "or 3" because I am unsure if a
Workflow is fundamentally different from a Composite Actor (which is an
Actor).  Since I don't understand this, I'll leave it to somebody else
to fill in the details.  Also, I believe that an additional category of
components could benefit from this facility:  Cached data object
representations.  One could argue that the cached data object is
actually a helper object for a specific type of actor -- but the data
caching mechanism still needs to be aware of

It would be useful to look at a few examples of Actors which already
have been developed and use this as a guide.  One of the ontology
specialists might come up with a better categorization of the different
possible combinations of sources which comprise an Actor.  This list is
simply a way to highlight that the programming model for actors can be
quite varied.

AT-1)  Actor is implemented in a single java class.  One example is the
Ptolemy Constant actor which is implemented in ptolemy.actor.lib.Const.java

AT-2)  Actor is implemented in multiple classes including inner classes
and non-public non-inner classes defined in the same translation unit. 
One example is the Ptolemy AbsoluteValue actor which is implemented in
ptolemy.actor.lib.AbsoluteValue.java.  It has two classes
ptolemy.actor.lib.AbsoluteValue.class and
ptolemy.actor.lib.AbsoluteValue$FunctionTerm.class.

AT-3) Actor is implemented in multiple java source files.  One java
source file contains the Actor implementation (and perhaps inner classes
or non-public non-inner classes) and the remaining java source files are
"utility" classes used by the Actor.  One example may be the Geon
actor: GetPoint which is implemented in org.geon.GetPoint.java and uses
the helper class org.geon.RockSample.java.  This actor might not be the
best example of this case.

AT-4) Actor is implemented entirely in MoML.  One example is the "Anyof
Parameter" actor which is implemented in
org.resurgence.moml.AnyofParameter.xml.  It is not strictly true that
the actor is implemented entirely in MoML, because it is actually a
specification of a ptolemy.actor.TypedCompositeActor.

AT-5) Actor requires JNI stubs and a native library.  Examples are the
two GDAL actors:  GDALTranslateActor and GDALWarpActor.  Both of these
actors use the JNI stub class
org.ecoinformatics.seek.gis.gdal.GDALJniGlue.class, its corresponding
stub implementation gdalactor.dll and the underlying gdal library
gdal12.dll.

AT-6) Actor requires multiple 3rd party jars which are not part of the
base kepler system.  We can argue what it means for a jar to not be part
of the base kepler system.  I've been using the definition that it is
not required to compile & execute Kepler without any actors.  One
example of this kind of actor could be the Email actor from
org.sdm.spa.Email.java which uses Sun's mail.jar.

Solution #1:  Attempt to use default ClassLoader and other trickery.

It should be noted that if a KAR file follows standard jar file
conventions (classes are packaged directly in the jar with no
intermediate jars), the JVM will be able to load classes directly from
KAR files.  This leads to one possible KAR organization:

AbsoluteValueActor.kar:
META-INF/MANIFEST.MF - supplemental metadata attributes - lsid, moml
filename in jar, etc.
/ptolemy/actor/lib/AbsoluteValue.class
/ptolemy/actor/lib/AbsoluteValue$FunctionTerm.class
/AbsoluteValueActorMoML.xml

If this kar file is in the CLASSPATH on application startup, the default
JVM will have no difficulty finding the class implementation files
through the standard class loader.  In addition, the actor scanning
process could be written to spin through all the entries in the
classpath and extract the supplemental information from the Manifest and
the moml file.  This technique could easily be used for AT-1, AT-2, AT-3
and AT-4 type actors.
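
The "spin through the classpath" step could start out as small as the
sketch below, which just picks the .kar entries out of a classpath
string.  The class name KarClasspathScan is hypothetical; each entry it
returns could then be opened with java.util.jar.JarFile to read the
manifest and the moml file.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class KarClasspathScan {

    /** Returns the entries of a classpath string whose names end in ".kar". */
    static List<String> karEntries(String classpath) {
        List<String> kars = new ArrayList<>();
        // File.pathSeparator is ":" on Unix and ";" on Windows.
        for (String entry : classpath.split(File.pathSeparator)) {
            if (entry.toLowerCase(Locale.ROOT).endsWith(".kar")) {
                kars.add(entry);
            }
        }
        return kars;
    }

    public static void main(String[] args) {
        // java.class.path holds the classpath the JVM was started with.
        for (String kar : karEntries(System.getProperty("java.class.path"))) {
            System.out.println(kar);
        }
    }
}
```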

Actors which require native libraries would have to have an additional
extraction step which pulls the dll/so's out of the kar file and stores
them in a predetermined location (on LD_LIBRARY_PATH or PATH).
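
A possible shape for that extraction step is sketched below.  It is
only an illustration: the class name KarNativeExtractor is made up, the
target directory is whatever "predetermined location" we settle on, and
the match on ".dll"/".so" is a deliberately naive, case-sensitive
assumption about how native files would be recognised inside a kar.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

final class KarNativeExtractor {

    /** Copies every .dll/.so entry of the archive into targetDir and returns the written paths. */
    static List<Path> extractNatives(Path kar, Path targetDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Files.createDirectories(targetDir);
        try (JarFile jar = new JarFile(kar.toFile())) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                String name = entry.getName();
                if (entry.isDirectory() || !(name.endsWith(".dll") || name.endsWith(".so"))) {
                    continue;
                }
                // Keep only the file name so an entry path cannot escape targetDir.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                Path out = targetDir.resolve(fileName);
                try (InputStream in = jar.getInputStream(entry)) {
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
                written.add(out);
            }
        }
        return written;
    }
}
```

Note that REPLACE_EXISTING silently overwrites a library of the same
name that came from a different kar, which is the same isolation
problem listed under the disadvantages of this solution.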

If an actor requires an additional 3rd party jar (which we must assume
does not conflict with *any* other jar required by kepler-base or other
actor), it could be possible to enhance the 3rd party jar by adding
entries to the manifest and not nest it in another jar.  For example,
the wsdl4j.jar file (required by org.sdm.spa.WebService and
org.sdm.spa.WebServiceStub) could be modified thusly:

  - rename to wsdl4j.kar (to indicate that it is no longer the same file
distributed by ibm.)
  - add lsid and other metadata information to the META-INF/MANIFEST.MF

The org.sdm.spa.WebService actor kar file would explicitly state that it
depends on the lsid assigned to wsdl4j.kar.  When the Kepler system
starts up, it would ensure that some kar file in the classpath has this
lsid.

Advantages:

1) Extremely simple.
2) Achievable in current time frame.
3) Pretty simple uninstall process - remove the kar file.

Disadvantages:

1) This solution would require a startup script to dynamically generate
the classpath at runtime (which is not that hard).
2) If a dependency is not available at startup, it cannot be added
during the current runtime.  The default java classloader does not allow
one to "add" jars to the classpath after startup.
3) Does not provide any isolation of dependencies - if two actors
require different versions of the same library (or class implementation)
whichever is listed first in the classpath wins.  Controlling this will
be difficult.
4) Extra burden on actor developers  -  manually list dependencies which
are typically automatically handled through the java build.  We can
probably provide some custom ant tasks to aid in generating the
dependency list and creating the MANIFEST.
5) Management difficulties for 3rd party jars.  I don't see how this set
up would be any better than the current state of chaos in
kepler/lib/jars.  Two different development groups might inadvertently
assign different lsids to the same 3rd party jar.
6) Illusion of independence.  The previous disadvantage might manifest
itself differently.  Actor developers might believe they can control the
version of the 3rd party jar used by having different lsids for
different versions.

Solution #2:  Use a single custom classloader for the entire application.

Write a ClassLoader implementation which enhances the java class loading
process by allowing the classpath to be changed at runtime.  We could
leverage this to allow kar files to contain jar files in two different
ways:  Either have the classloader look in kar files for classes in jar
files (and not explode the contents), or explode the contents of the kar
files into some working space (.kepler/kar/classes, .kepler/kar/jars?)
and have the classloader dynamically look for classes in these jars when
asked for them.
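
For what it's worth, the "classpath that can change at runtime" part
does not need much code if we build on java.net.URLClassLoader, whose
addURL() method is protected.  The sketch below simply exposes it; the
class name GrowableClassLoader and the method name addArchive are
hypothetical, and this covers only the exploded-jar alternative, not
looking inside nested jars in place.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

/** Hypothetical loader whose search path can grow after startup. */
final class GrowableClassLoader extends URLClassLoader {

    GrowableClassLoader(ClassLoader parent) {
        // Start with an empty search path; archives are added later.
        super(new URL[0], parent);
    }

    /** Appends a jar (or kar) to the search path by exposing the protected addURL(). */
    void addArchive(Path archive) throws MalformedURLException {
        addURL(archive.toUri().toURL());
    }
}
```

Kepler would then have to load actor classes through this loader (for
example with Class.forName(name, true, loader)) rather than through the
default application classloader.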

We now have different alternatives for handling 3rd party jars and even
for the simple actor cases (AT-1 -- AT-3).  A 3rd party jar could be
packaged this way:

wsdl4j.kar:
/META-INF/MANIFEST.MF = lsid dependencies, yada yada.
/wsdl4j.jar - the IBM produced wsdl4j.jar file.

When Kepler starts up, it could detect that the wsdl4j.kar file has a
nested jar file.  This jar file could be extracted to .kepler/kar/jars
and the appropriate jar name added to the classpath of the custom
classloader.

Similarly, we can bundle actor implementation class files in a nested
jar file.

Advantages:

1) Still fairly simple.  Although not as simple as above.
2) Does not require a script generated classpath.
3) Unresolved dependencies could be added at runtime through the custom
classloader.

Disadvantages:

1) More complex uninstall process.  Probably requires Kepler UI or
supplemental UI tool to uninstall.
2) -- 5)  Same as Disadvantages 3-6 in Solution #1.
6) If two different kar files contain the same named jar file (perhaps
different versions of same jar) we need to be able to alert the user to
possible overwriting of files in .kepler/kar/jars or otherwise prevent
possible JVM corruption.
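
One cheap way to catch the overwrite case in disadvantage 6 is to
compare contents before extracting a nested jar.  The following is a
minimal sketch under the assumption that nested jars are small enough
to hold in memory; the class name JarConflictCheck is invented for the
example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class JarConflictCheck {

    /**
     * Returns true when target already exists and its bytes differ from the
     * incoming jar, i.e. extracting would replace a different file.
     */
    static boolean wouldClobber(Path target, byte[] incoming) throws IOException {
        if (!Files.exists(target)) {
            return false; // nothing to overwrite
        }
        return !Arrays.equals(Files.readAllBytes(target), incoming);
    }
}
```

If this returns true, the extraction step could stop and alert the user
instead of silently replacing the jar in .kepler/kar/jars.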

Solution #3:  KSR Wacky Idea.

This solution has greater documentation which I can provide if needed.

Use the kar file as a "packaging" mechanism controlled by the end
developer.  Components developed by an organization/developer can be
bundled together.  Have the developer manage consistency within the kar
file itself.

Have Kepler use a different custom classloader for each kar file
contained in the system.  Create an "Actor factory" (ie lsid
resolver/ActorMetadata) which can instantiate actors within the custom
classloader.
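
To give a feel for the moving parts, here is a minimal sketch of the
per-kar classloader plus factory idea.  Everything named here
(KarActorFactory, registerKar, instantiate) is hypothetical.  It also
assumes a no-argument constructor purely to keep the example short;
real Ptolemy actors generally expect constructor arguments such as a
container and a name.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry: one classloader per kar, looked up by component LSID. */
final class KarActorFactory {
    private final Map<String, ClassLoader> loaderByLsid = new HashMap<>();
    private final Map<String, String> classByLsid = new HashMap<>();

    /** Creates one loader for the kar and maps each of its LSIDs to that loader. */
    void registerKar(URL karUrl, Map<String, String> classNameByLsid, ClassLoader parent) {
        ClassLoader loader = new URLClassLoader(new URL[] {karUrl}, parent);
        for (Map.Entry<String, String> e : classNameByLsid.entrySet()) {
            loaderByLsid.put(e.getKey(), loader);
            classByLsid.put(e.getKey(), e.getValue());
        }
    }

    /** Instantiates the class registered for the LSID inside its kar's own loader. */
    Object instantiate(String lsid) throws ReflectiveOperationException {
        ClassLoader loader = loaderByLsid.get(lsid);
        if (loader == null) {
            throw new IllegalArgumentException("Unknown LSID: " + lsid);
        }
        Class<?> type = Class.forName(classByLsid.get(lsid), true, loader);
        // Assumption for the sketch only: a no-argument constructor exists.
        return type.getDeclaredConstructor().newInstance();
    }
}
```

Because URLClassLoader asks its parent first, classes from kepler-base
stay shared, while anything found only inside a kar is defined by that
kar's loader and so is isolated from other kars.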

Allow the archive kar file format to contain complete kars that it
depends on.  If a workflow which is being archived uses one actor from a
collection, the entire kar file is included in the archive kar.  On
startup, Kepler could detect nested kars and automatically extract them
to the local kar repository and process them through normal means.

Native libraries that are not loaded through System.loadLibrary() (such
as gdal12.dll) will have to be handled through a mechanism like that
described in Solution #1.  Native libraries loaded through
System.loadLibrary() do not have this restriction because that call
asks the custom classloader (via ClassLoader.findLibrary()) where to
find the binary code -- although they will still have to be extracted
from the kar.

Advantages:

1) Places the burden of internal consistency on the actor developer.
2) Does not require the actor developer to list dependencies to
"standard" things like 3rd party jars.
3) Provides for different versions of the same actor (provided they have
different lsids).
4) Provides for different versions of 3rd party jars to coexist in the
same Kepler instance.

Disadvantages:

1) Complexity.  Definitely more complex but not without precedent.
2) Time line.  Can it be done in a month?
3) Duplication of 3rd party dependencies.  If two different kars contain
the exact same 3rd party jar, then in theory both copies could be loaded
into memory simultaneously.  This would only happen if actors from both
kars are loaded into the same workflow.