[kepler-dev] KAR use-cases, requirements, implementation discussion

Matt Jones jones at nceas.ucsb.edu
Mon Nov 7 16:26:09 PST 2005


Thanks Kevin.  This is good stuff.  Can you work with Chad to get this
captured more permanently on the wiki, and integrate it into or replace
the existing Object Manager and KAR descriptions that are there?  Thanks.

Matt

Kevin Ruland wrote:
> Hi all.
> 
> After the conversation with Matt & Chad, and further discussions with
> Chad, I believe we've come up with something.  I committed to writing up
> my understanding of these conversations with the intent of
> asymptotically coming to a decision as time increases without bound.
> 
> Use Cases:
> 
> UC-1)  Facilitate transport of workflows to grid/distributed/server/p2p
> systems.  Scientist builds a workflow on a local system.  S then does
> some clicking in the UI to have the workflow execute on a remote system.
> The local system determines the components (actors, directors,
> libraries) necessary to execute the workflow.  These are transferred to
> the remote system.  If the remote system does not have all the
> components, it will request copies of the components (from a repository
> or from the local system) and make them available to the process which
> will execute the workflow.
> 
> UC-2)  Preserve an analysis to allow replication, examination and
> further experimentation.  Scientist builds a workflow which does something
> magical.  S does some clicking in the UI and the entire workflow,
> datasets, dependencies, initial configuration, etc., is preserved in a
> file.  This file can then be loaded on another computer and the analysis
> re-executed.
> 
> Use Case 1 is more important in the short term.
> 
> Before dissecting this into requirements, I need to point out that under a
> strict interpretation of the use cases, a monolithic Kepler system can
> fulfill both use cases.  For the first use case, all the executable
> code (class files, libraries, dlls/so's) would be distributed to the
> local workstations and the server systems.  The only information which
> needs to be communicated is the description of the workflow and its
> configuration (the MoML text description) and the dataset information
> (either the contents or the way to retrieve the contents).  For the
> second use case, the only information which needs to be archived is the
> dataset contents and workflow configuration.
> 
> My understanding of Matt's vision is that this solution is inadequate
> because it does not allow dynamic discovery and utilization of
> user-developed binary (compiled Java source) actors.  This means we
> really have an uncaptured use case which might read as follows:
> 
> UC-3)  Allow the development and distribution of components
> (actors/directors) which can be released on a schedule independently
> from Kepler itself.  Scientist/developer determines that in order to
> perform a certain step in a workflow, new binary code is required.  S/d
> develops a Kepler actor using the Java language (and perhaps jni stubs
> into a native library).  This actor can then be distributed to another
> Kepler system which will be able to utilize it in workflows.
> 
> Now the interesting requirements begin to fall out.
> 
> Functional Requirements:
> 
> FR-1) Mechanism to package the resources required to implement a component
> in a Kepler system.
> FR-1a) must be able to contain java class files
> FR-1b) must be able to contain native binary executable files
> FR-1c) must be able to contain native library files
> FR-1d) must be able to contain MoML and other XML-based text
> FR-1e) must be able to contain data in binary and ASCII formats,
> including zipped data.
> 
> FR-2) Must describe the contained components so they can be utilized in
> a Kepler system.
> FR-2a) each component must have a unique LSID identifier which is tied
> to the specific implementation of the component.
> FR-2b) must contain an OWL document with semantic ordering for the
> contained objects
> FR-2c) each component must list its dependencies in terms of LSIDs.
> 
> FR-3) Kepler must be able to utilize the components contained within the
> package.
> 
> FR-4) Kepler must be able to detect missing dependencies when loading a
> packaged resource.
> FR-4a) Kepler must alert the user to missing dependencies and provide
> the user with the alternative to not use the resource or attempt to
> discover the missing dependencies.
> 
> Discussion:
> 
> It has already been decided these packages are to be called KAR (Kepler
> ARchive) files and have an extension of .kar.  There is really no need
> to open this up for discussion.
> 
> FR-1 is pretty easy to satisfy using any number of container mechanisms.
> Since we play in Java-land, jar is the most natural.  Again, this
> decision was already made and there is no need to readdress it.  The
> benefits of jars are:  relatively decent support in the Java language
> through java.util.jar.JarFile, java.util.jar.JarEntry, and
> java.util.jar.Manifest; and programmatic support for a MANIFEST, which
> provides a place to put attribute/value metadata.
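> 
> Purely as an illustration, building such a kar programmatically might
> look roughly like this (the "lsid" and "MoML-File" attribute names are
> placeholders I made up, not anything we've agreed on):
> 
>   import java.io.FileOutputStream;
>   import java.io.IOException;
>   import java.util.jar.Attributes;
>   import java.util.jar.JarEntry;
>   import java.util.jar.JarOutputStream;
>   import java.util.jar.Manifest;
> 
>   // Sketch only: write a kar whose MANIFEST.MF carries hypothetical
>   // "lsid" and "MoML-File" attributes.  Class files, MoML, etc.
>   // would be added the same way as the placeholder entry below.
>   public class KarWriter {
>       public static void write(String karPath) throws IOException {
>           Manifest manifest = new Manifest();
>           Attributes attrs = manifest.getMainAttributes();
>           attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
>           attrs.putValue("lsid", "urn:lsid:example.org:kar:1:1");
>           attrs.putValue("MoML-File", "SomeActorMoML.xml");
>           JarOutputStream out =
>               new JarOutputStream(new FileOutputStream(karPath), manifest);
>           out.putNextEntry(new JarEntry("SomeActorMoML.xml"));
>           // ... write the MoML bytes here ...
>           out.closeEntry();
>           out.close();
>       }
>   }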
> 
> My understanding of Kepler is that there are 2 or 3 different types of
> components which are being targeted for this functionality:  Actors,
> Directors, and Workflows.  I say "or 3" because I am unsure whether a
> Workflow is fundamentally different from a Composite Actor (which is an
> Actor).  Since I don't understand this, I'll leave it to somebody else
> to fill in the details.  Also, I believe that an additional category of
> components could benefit from this facility:  cached data object
> representations.  One could argue that the cached data object is
> actually a helper object for a specific type of actor -- but the data
> caching mechanism still needs to be aware of them.
> 
> It would be useful to look at a few examples of Actors which already
> have been developed and use this as a guide.  One of the ontology
> specialists might come up with a better categorization of the different
> possible combinations of sources which comprise an Actor.  This list is
> simply a way to highlight that the programming model for actors can be
> quite varied.
> 
> AT-1)  Actor is implemented in a single java class.  One example is the
> Ptolemy Constant actor which is implemented in ptolemy.actor.lib.Const.java
> 
> AT-2)  Actor is implemented in multiple classes, including inner classes
> and non-public non-inner classes defined in the same translation unit.
> One example is the Ptolemy AbsoluteValue actor which is implemented in
> ptolemy.actor.lib.AbsoluteValue.java.  It has two classes
> ptolemy.actor.lib.AbsoluteValue.class and
> ptolemy.actor.lib.AbsoluteValue$FunctionTerm.class.
> 
> AT-3) Actor is implemented in multiple java source files.  One java
> source file contains the Actor implementation (and perhaps inner classes
> or non-public non-inner classes) and the remaining java source files are
> "utility" classes used by the Actor.  One example may be the Geon
> actor: GetPoint which is implemented in org.geon.GetPoint.java and uses
> the helper class org.geon.RockSample.java.  This actor might not be the
> best example of this case.
> 
> AT-4) Actor is implemented entirely in MoML.  One example is the "Anyof
> Parameter" actor which is implemented in
> org.resurgence.moml.AnyofParameter.xml.  It is not strictly true that
> the actor is implemented entirely in MoML, because it is actually a
> specification of a ptolemy.actor.TypedCompositeActor.
> 
> AT-5) Actor requires JNI stubs and a native library.  Examples are the
> two GDAL actors:  GDALTranslateActor and GDALWarpActor.  Both of these
> actors use the JNI stub class
> org.ecoinformatics.seek.gis.gdal.GDALJniGlue.class, its corresponding
> stub implementation gdalactor.dll and the underlying gdal library
> gdal12.dll.
> 
> AT-6) Actor requires multiple 3rd party jars which are not part of the
> base kepler system.  We can argue what it means for a jar to not be part
> of the base kepler system.  I've been using the definition that it is
> not required to compile & execute Kepler without any actors.  One
> example of this kind of actor could be the Email actor from
> org.sdm.spa.Email.java which uses Sun's mail.jar.
> 
> Solution #1:  Attempt to use default ClassLoader and other trickery.
> 
> It should be noted that if a KAR file follows standard jar file
> conventions (classes are packaged directly in the jar with no
> intermediate jars), the JVM will be able to load classes directly from
> KAR files.  This leads to one possible KAR organization:
> 
> AbsoluteValueActor.kar:
> META-INF/MANIFEST.MF - supplemental metadata attributes - lsid, MoML
> filename in jar, etc.
> /ptolemy/actor/lib/AbsoluteValue.class
> /ptolemy/actor/lib/AbsoluteValue$FunctionTerm.class
> /AbsoluteValueActorMoML.xml
> 
> If this kar file is in the CLASSPATH at application startup, the default
> JVM will have no difficulty finding the class implementation files
> through the standard class loader.  In addition, the actor scanning
> process could be written to spin through all the entries in the
> classpath and extract the supplemental information from the Manifest and
> the MoML file.  This technique could easily be used for AT-1, AT-2, AT-3
> and AT-4 type actors.
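> 
> A rough sketch of that scanning step (using the same made-up manifest
> attribute names as in the packaging sketch above):
> 
>   import java.io.File;
>   import java.io.IOException;
>   import java.util.StringTokenizer;
>   import java.util.jar.Attributes;
>   import java.util.jar.JarFile;
>   import java.util.jar.Manifest;
> 
>   // Sketch only: walk the classpath, find .kar entries, and report
>   // the hypothetical "lsid" and "MoML-File" manifest attributes.
>   public class KarScanner {
>       public static void scan() throws IOException {
>           String cp = System.getProperty("java.class.path");
>           StringTokenizer tok = new StringTokenizer(cp, File.pathSeparator);
>           while (tok.hasMoreTokens()) {
>               String entry = tok.nextToken();
>               if (!entry.endsWith(".kar")) {
>                   continue;
>               }
>               JarFile kar = new JarFile(entry);
>               Manifest mf = kar.getManifest();
>               if (mf != null) {
>                   Attributes attrs = mf.getMainAttributes();
>                   System.out.println(entry + ": lsid=" + attrs.getValue("lsid")
>                           + ", moml=" + attrs.getValue("MoML-File"));
>               }
>               kar.close();
>           }
>       }
>   }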
> 
> Actors which require native libraries would need an additional
> extraction step which pulls the dll/so's out of the kar file and stores
> them in a predetermined location (on LD_LIBRARY_PATH or PATH).
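> 
> A minimal sketch of that extraction step (the target directory is
> whatever predetermined native library location we settle on):
> 
>   import java.io.*;
>   import java.util.Enumeration;
>   import java.util.jar.JarEntry;
>   import java.util.jar.JarFile;
> 
>   // Sketch only: copy any dll/so entries out of a kar into a
>   // predetermined native library directory already on PATH or
>   // LD_LIBRARY_PATH.
>   public class NativeLibExtractor {
>       public static void extract(File karFile, File targetDir) throws IOException {
>           JarFile kar = new JarFile(karFile);
>           for (Enumeration e = kar.entries(); e.hasMoreElements();) {
>               JarEntry entry = (JarEntry) e.nextElement();
>               String name = entry.getName();
>               if (!name.endsWith(".dll") && !name.endsWith(".so")) {
>                   continue;
>               }
>               InputStream in = kar.getInputStream(entry);
>               OutputStream out = new FileOutputStream(
>                       new File(targetDir, new File(name).getName()));
>               byte[] buf = new byte[4096];
>               int n;
>               while ((n = in.read(buf)) != -1) {
>                   out.write(buf, 0, n);
>               }
>               out.close();
>               in.close();
>           }
>           kar.close();
>       }
>   }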
> 
> If an actor requires an additional 3rd party jar (which we must assume
> does not conflict with *any* other jar required by kepler-base or another
> actor), it would be possible to enhance the 3rd party jar by adding
> entries to its manifest rather than nesting it in another jar.  For example,
> the wsdl4j.jar file (required by org.sdm.spa.WebService and
> org.sdm.spa.WebServiceStub) could be modified as follows:
> 
>   - rename it to wsdl4j.kar (to indicate that it is no longer the same file
> distributed by IBM)
>   - add lsid and other metadata information to the META-INF/MANIFEST.MF
> 
> The org.sdm.spa.WebService actor kar file would explicitly state that it
> depends on the lsid assigned to wsdl4j.kar.  When the Kepler system
> starts up, it would ensure that some kar file in the classpath has this
> lsid.
> 
> Advantages:
> 
> 1) Extremely simple.
> 2) Achievable in current time frame.
> 3) Pretty simple uninstall process - remove the kar file.
> 
> Disadvantages:
> 
> 1) This solution would require a startup script to dynamically generate
> the classpath at runtime (which is not that hard).
> 2) If a dependency is not available at startup, it cannot be added
> during the current runtime.  The default java classloader does not allow
> one to "add" jars to the classpath after startup.
> 3) Does not provide any isolation of dependencies - if two actors
> require different versions of the same library (or class implementation),
> whichever is listed first in the classpath wins.  Controlling this will
> be difficult.
> 4) Extra burden on actor developers - manually listing dependencies which
> are typically handled automatically by the java build.  We can
> probably provide some custom ant tasks to aid in generating the
> dependency list and creating the MANIFEST.
> 5) Management difficulties for 3rd party jars.  I don't see how this
> setup would be any better than the current state of chaos in
> kepler/lib/jars.  Two different development groups might inadvertently
> assign different lsids to the same 3rd party jar.
> 6) Illusion of independence.  The previous disadvantage might manifest
> itself differently.  Actor developers might believe they can control the
> version of the 3rd party jar used by assigning different lsids to
> different versions.
> 
> Solution #2:  Use a single custom classloader for the entire application.
> 
> Write a ClassLoader implementation which enhances the java class loading
> process by allowing the classpath to be changed at runtime.  We could
> leverage this to allow kar files to contain jar files in two different
> ways:  Either have the classloader look in kar files for classes in jar
> files (and not explode the contents), or explode the contents of the kar
> files into some working space (.kepler/kar/classes, .kepler/kar/jars?)
> and have the classloader dynamically look for classes in these jars when
> asked for them.
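> 
> The core of such a classloader is small; nothing here is
> Kepler-specific, it just exposes URLClassLoader's protected addURL():
> 
>   import java.net.URL;
>   import java.net.URLClassLoader;
> 
>   // Sketch only: a single application-wide loader whose search path
>   // can grow at runtime.  Jars extracted from kar files (e.g. into
>   // .kepler/kar/jars) would be registered here instead of being put
>   // on the startup classpath.
>   public class KeplerClassLoader extends URLClassLoader {
>       public KeplerClassLoader(ClassLoader parent) {
>           super(new URL[0], parent);
>       }
> 
>       // URLClassLoader.addURL() is protected; expose it so the kar
>       // handling code can register newly extracted jars.
>       public void addJar(URL jarUrl) {
>           addURL(jarUrl);
>       }
>   }
> 
> The catch is that actor classes would then have to be resolved through
> this loader (e.g. Class.forName(name, true, loader)) rather than
> through the default application classloader.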
> 
> We now have different alternatives for handling 3rd party jars and even
> for the simple actor cases (AT-1 -- AT-3).  A 3rd party jar could be
> packaged this way:
> 
> wsdl4j.kar:
> /META-INF/MANIFEST.MF - lsid, dependencies, yada yada.
> /wsdl4j.jar - the IBM-produced wsdl4j.jar file.
> 
> When Kepler starts up, it could detect that the wsdl4j.kar file has a
> nested jar file.  This jar file could be extracted to .kepler/kar/jars
> and the appropriate jar name added to the classpath of the custom
> classloader.
> 
> Similarly, we can bundle actor implementation class files in a nested
> jar file.
> 
> Advantages:
> 
> 1) Still fairly simple.  Although not as simple as above.
> 2) Does not require a script generated classpath.
> 3) Unresolved dependencies could be added at runtime through the custom
> classloader.
> 
> Disadvantages:
> 
> 1) More complex uninstall process.  Probably requires Kepler UI or
> supplemental UI tool to uninstall.
> 2) -- 5)  Same as Disadvantages 3-6 in Solution #1.
> 6) If two different kar files contain the same named jar file (perhaps
> different versions of same jar) we need to be able to alert the user to
> possible overwriting of files in .kepler/kar/jars or otherwise prevent
> possible JVM corruption.
> 
> Solution #3:  KSR Wacky Idea.
> 
> This solution has greater documentation which I can provide if needed.
> 
> Use the kar file as a "packaging" mechanism controlled by the end
> developer.  Components developed by an organization/developer can be
> bundled together.  Have the developer manage consistency within the kar
> file itself.
> 
> Have Kepler use a different custom classloader for each kar file
> contained in the system.  Create an "Actor factory" (i.e. lsid
> resolver/ActorMetadata) which can instantiate actors within the custom
> classloader.
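> 
> A very rough sketch of such a factory, assuming the classes are
> packaged directly in the kar (not in a nested jar); the class name
> would really come from the kar's metadata, and a real Ptolemy actor
> would be constructed through its (container, name) constructor rather
> than newInstance():
> 
>   import java.io.File;
>   import java.net.URL;
>   import java.net.URLClassLoader;
> 
>   // Sketch only: one classloader per kar, with the actor class
>   // instantiated inside that loader.
>   public class KarActorFactory {
>       public static Object instantiate(File karFile, String actorClassName)
>               throws Exception {
>           URL[] urls = new URL[] { karFile.toURL() };
>           ClassLoader loader = new URLClassLoader(
>                   urls, KarActorFactory.class.getClassLoader());
>           Class actorClass = Class.forName(actorClassName, true, loader);
>           return actorClass.newInstance();
>       }
>   }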
> 
> Allow the archive kar file format to contain complete kars that it
> depends on.  If a workflow being archived uses one actor from a
> collection, the entire kar file for that collection is included in the
> archive kar.  On startup, Kepler could detect nested kars and
> automatically extract them to the local kar repository and process them
> through normal means.
> 
> Native libraries (such as gdal12.dll, but not those loaded through
> System.load()) will have to be handled through a mechanism like the one
> described in Solution #1.  Native libraries loaded through
> System.load() do not have this restriction because they will utilize
> the custom classloader to find the binary code -- although they will
> still have to be extracted from the kar.
> 
> Advantages:
> 
> 1) Places the burden of internal consistency on the actor developer.
> 2) Does not require the actor developer to list dependencies on
> "standard" things like 3rd party jars.
> 3) Provides for different versions of the same actor (provided they have
> different lsids).
> 4) Provides for different versions of 3rd party jars to coexist in the
> same Kepler instance.
> 
> Disadvantages:
> 
> 1) Complexity.  Definitely more complex, but not without precedent.
> 2) Time line.  Can it be done in a month?
> 3) Duplication of 3rd party dependencies.  If two different kars contain
> the exact same 3rd party jar, then in theory both copies could be loaded
> into memory simultaneously.  This would only happen if actors from both
> kars are loaded into the same workflow.
> 
> _______________________________________________
> Kepler-dev mailing list
> Kepler-dev at ecoinformatics.org
> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
> 

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matt Jones
jones at nceas.ucsb.edu                         Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara     http://www.nceas.ucsb.edu/ecoinformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

