[seek-kr-sms] Re: Transformation Steps (Bug 1070)

Bertram Ludaescher ludaesch at sdsc.edu
Sun Nov 9 06:19:37 PST 2003


Hi Rod et al:

I'm also interspersing some comments below -- after this, I guess the
chopped up discussion will be really hard to read ;-)

As a follow-up, maybe we can have a joint phone call later this coming 
week. 

>>>>> "SB" == Shawn Bowers <bowers at sdsc.edu> writes:
SB> 
SB> Hi, 
SB> 
SB> I had some comments as well, which I intersperse below.
SB> 
SB> - Shawn
SB> 
SB> 
SB> On Thu, 6 Nov 2003, Matt Jones wrote:
SB> 
>> Hey Rod,
>> 
>> You raise some interesting points about transformation.  We really 
>> haven't talked through the implementation strategies very well.  I agree 
>> that implementation can impose some major constraints on design :)  So 
>> it's about time that we considered it.  I forwarded this to seek-kr-sms 
>> so that Bertram and Shawn and Rich could benefit from the conversation 
>> as well.  The bug describing the need for a transformation system 
>> (http://bugzilla.ecoinformatics.org/show_bug.cgi?id=1070) is really just 
>> a placeholder for what we need -- a thorough design proposal for a 
>> transformation system.  My assumption is that Bertram, Shawn, and the 

Indeed that thorough design proposal is something that Shawn and I are 
working on.

>> SMS group are mainly responsible for that design and implementation, so 
>> I have reassigned the bug to Shawn.
>> 
>> My comments inline...
>> 
>> Rod Spears wrote:
>> > Matt,
>> > 
>> > I have been doing a lot of thinking (and a little reading) about SM. It 
>> > appears there is all this research and theory on how to match up 
>> > certain "nodes" between one or more ontologies, and there are special 
>> > algorithms for doing this kind of "matching", which I am assuming is a 
>> > quick summary of Bertram's research/field of study. There also seem to 
>> > be some software systems already working that do this. 
>> > Is this correct? Has Bertram written software that works? Or is it still 
>> > being developed?
>> 
>> There is a lot of software that does reasoning.  Much of it is 
>> proprietary.  I don't think Bertram has written these engines, but I 
>> could easily be wrong.  Currently he seems to prefer systems like Prolog 
>> for developing prototypes -- I'm not sure if this will scale to our 
>> application, but we'll see.

correct. We have done some prototyping in Prolog of the unit
conversion system. That has yet to be integrated into Ptolemy (or
Kepler, as we are about to call it ;-)

On the efficiency/scalability issue: I wouldn't be too worried just
yet. If we use Prolog carefully and for the right tasks, I don't think 
it'll be worse than Java. In fact, it's often considerably faster (not
surprisingly, since Prolog was designed to do certain things well):
We have conducted some reasoning examples for query containment,
comparing the Prolog code described in the following report
	http://www.sdsc.edu/~ludaesch/Paper/birn-di-tn-2003-01.pdf
(available from: http://kbi.sdsc.edu/BIRN/birn_pubs.html )
with a hand-crafted solution in Java. The Prolog code was 2-3 times
faster than the Java version (at least for the class of problems we
tested; this may be different for other classes).

While this query containment is not related to executing data
transformations, it is, however, related to *reasoning* about them, as
well as to type checking. 

Also: we may resort to incorporating other reasoning components such
as FaCT when doing subsumption checking for OWL ontologies.

(Aside: I've discussed with Shawn the inclusion of a generic "Prolog
actor" in Ptolemy. It would work similarly to a generic "command line
actor": it would take a Prolog program and a goal as *parameters*, and
could be used to implement querying and reasoning tasks, e.g., over RDF
"ontologies"... object bases, rather.)

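(If it helps to make that aside concrete: below is a very rough, purely
hypothetical sketch in plain Java of the core of such an actor -- it just
shells out to a command-line Prolog system and collects whatever it prints.
This is not actual Ptolemy code; the class name and the command-line flags
are made up and system-dependent.)

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical core of a generic "Prolog actor": the Prolog program and
    // the goal are the actor's *parameters*; whatever the Prolog process
    // prints becomes the actor's output.
    public class PrologRunner {
        private String prologCmd;   // path to some command-line Prolog system
        private String programFile; // parameter: the Prolog program (a file)
        private String goal;        // parameter: the goal to evaluate

        public PrologRunner(String prologCmd, String programFile, String goal) {
            this.prologCmd = prologCmd;
            this.programFile = programFile;
            this.goal = goal;
        }

        public List run() throws Exception {
            // Flags differ between Prolog systems; "-q -g <goal> <file>" is
            // only a placeholder here.
            String[] cmd = { prologCmd, "-q", "-g", goal, programFile };
            Process p = Runtime.getRuntime().exec(cmd);
            BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()));
            List lines = new ArrayList();
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
            p.waitFor();
            return lines;
        }
    }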

SB> You might want to take a look at the slides that we presented in Santa 
SB> Barbara a few weeks ago.  (Currently, we don't have a write-up of the 
SB> ideas presented there, but we are working towards a few papers that 
SB> describe these ideas in more detail.)  

correct. There will be a research version (paper) from which a
technical report/design document will be derived. The goal of the
paper is to understand and resolve some of the underlying technical
issues; the design document will then spell out the specific
"recommendations". 


SB> You can get the slides at:
SB> 
SB> http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/seek/projects/kr-sms/presentations/SantaBarbaraOct2003/semtypes-oct23.ppt
SB> 
SB> Basically, there are two issues of concern.  One is structural 
SB> transformation, e.g., converting datasets and other information passed 
SB> among components so that we can run a "scientific pipeline."  These 
SB> structural transformations can range from fairly simple to really complex.
SB> We want to, when possible, perform automatic transformation so that a user 
SB> can simply "hook up" components to form a workflow or ecological model without 
SB> having to worry about all of the transformation details (which I am under 
SB> the impression is the major bottleneck for ecologists).

Rod et al: note the word "all". In practice I still expect a number of 
"manual" data transformation & massaging steps. For those we are actually
close to having a number of different data transformation actors
in Ptolemy, e.g., for XSLT (Ilkay from SciDAC is working on those; cc-ed).

SB> We believe that ontological information can help us get towards automatic 
SB> transformation.  By "attaching" semantic concepts to substructure, we have 
SB> the opportunity of "matching" substructure for more complex 
SB> transformation.  These ideas are briefly presented in the slides (e.g., 
SB> see slides 41--47).  
SB> 
SB> One issue in doing this, is determining whether the ontological concepts
SB> are compatible, which as Matt mentions, is a fairly straightforward
SB> computation.  

The principle is straightforward. The computational complexity is often
very high, though, and we have to balance the expressiveness of concept
definitions against the complexity of checking concept subsumption.

Also: we still have to see how good people will be at "semantically
typing" their actors... 

SB> There are some issues of efficiency and some very
SB> theoretical issues in terms of languages (for ontologies), decidability
SB> (of checking compatibility), and so on.  But, for the simple cases all of
SB> this is worked out and there are off-the-shelf components that exist.
SB> 
>> 
>> > I haven't done enough reading on SM to know if once it "matches" two 
>> > "nodes" whether it has any ability or knowledge on how to translate from 
>> > one to the other (Bug 1070). Is that aspect of the problem addressed in 
>> > any of the papers? It is mentioned in Bug 1070 as item #1:
>> > 
>> > "/1) use the SMS to locate candidate transformation steps T1..TN based 
>> > on type signature and ontologies/"
>> > 
>> > To me this means SMS is capable of conceptually getting from T1 to TN 
>> > but does it imply that there are the necessary conversion implementations to 
>> > get there?

I expect that at least unit conversions and some data format
conversions will be automatically insertable. For example we may have
a table of the form
	FROM	TO	ACTOR
	f1	f2	f1-to-f2

stating that there is a format conversion actor f1-to-f2 for
translating the binary/proprietary format f1 to f2.
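
In code, such a table could be little more than a map keyed on the
(from, to) pair -- a throwaway sketch (all names made up), just to
illustrate the idea:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a FROM/TO/ACTOR lookup table: given a source and a target
    // format, return the name of a conversion actor (or null if none known).
    public class FormatConversionTable {
        private Map table = new HashMap(); // key "from->to" maps to actor name

        public void register(String from, String to, String actorName) {
            table.put(from + "->" + to, actorName);
        }

        public String lookup(String from, String to) {
            return (String) table.get(from + "->" + to);
        }
    }

    // e.g.  table.register("f1", "f2", "f1-to-f2");
    //       table.lookup("f1", "f2")   returns "f1-to-f2"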

For structural transformations, Shawn has already mentioned the
semantics-guided structure transformer, but this is still research.
We hope to have some things worked out over the next few weeks.

For semantic transformations, it's even less clear how that may work
(or whether it's even meaningful or desirable to do "semantic
conversions"). 

My guess is that we want to do automatic conversions/transformations
mostly for format conversions and simple restructuring, while at the
level of semantic types, we may confine ourselves to just checking
type compatibility (so the system can suggest actors and data sets
that match a given type), and leave everything else to the user.
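
For the "suggest actors and data sets that match a given type" part, a
first cut could be a plain is-a/subsumption check over the concept
hierarchy (the real thing would of course use a proper reasoner, e.g. for
OWL). Again, only an illustrative toy, not a design:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    // Toy subsumption check over an explicit is-a hierarchy: a produced
    // concept matches a required concept if it is the same concept or a
    // (transitive) subconcept of it.
    public class ConceptHierarchy {
        private Map parents = new HashMap(); // concept -> Set of direct superconcepts

        public void addIsA(String sub, String sup) {
            Set s = (Set) parents.get(sub);
            if (s == null) { s = new HashSet(); parents.put(sub, s); }
            s.add(sup);
        }

        // Does 'sup' subsume 'sub' (assuming an acyclic hierarchy)?
        public boolean subsumes(String sup, String sub) {
            if (sup.equals(sub)) return true;
            Set direct = (Set) parents.get(sub);
            if (direct == null) return false;
            for (Iterator i = direct.iterator(); i.hasNext();) {
                if (subsumes(sup, (String) i.next())) return true;
            }
            return false;
        }
    }

    // e.g.  h.addIsA("VolumetricDensity", "Density");
    //       h.subsumes("Density", "VolumetricDensity")   is true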

>> Well, first of all, T1...TN were meant to indicate a series of 
>> transformations needed to transform some output (e.g., of Step S1) to 
>> some input (e.g., of Step S2).  So it could really be represented by one 
>> transformation step, but the multiple were indicated to show that there 
>> might be several distinct phases in the transformation (e.g., first 
>> convert the units, then scale the values).
>> 
>> In terms of implementation, it seems to me that we could use any system 
>> that can handle the calculation, and we need not limit ourselves to just 
>> one.  The transformation step gets inserted in the workflow as just 
>> another step, and so the system (in this case Ptolemy) will take care of 
>> marshalling values into the right format to deliver them from step to 
>> step.  So, for example, we could write a SAS step that does some 
>> standard statistical transformations (such as normalizing data), and 
>> some Java steps for another series of transforms, and some Matlab steps 
>> for matrix operations (e.g., identity transform).  Then, when a user 
>> tries to link two steps, the reasoning engine can determine which of the 
>> transformations needs to be applied.
>> 
>> Let's refer to the conceptualized set of operations needed to get from an 
>> output to an input as the "transformation plan".  This is generated by 
>> the reasoning engine.  There is still a need for an "execution plan" 
>> which is an exact series of steps to be executed in order to accomplish 
>> the transformation plan.  Presumably there are multiple potential 
>> execution plans for every transformation plan (e.g., transformation 
>> steps can be implemented in multiple languages).  So choosing a 
>> particular execution plan isn't trivial either, and it involves both 
>> satisfying the transformation plan and optimizing for efficient execution.
SB> 
SB> I agree here with what Matt says.
SB> 
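
Me too. Just to make the two levels concrete, here is a rough sketch (all
class and field names are made up, purely for illustration) of how a
"transformation plan" and one of its possible "execution plans" could be
represented as data structures:

    import java.util.ArrayList;
    import java.util.List;

    // A transformation plan is an ordered list of *abstract* steps,
    // e.g. "convert units: m -> ft", then "normalize values".
    class TransformationPlan {
        List steps = new ArrayList(); // list of String step descriptions
    }

    // An execution plan binds each abstract step to a concrete realization,
    // e.g. a Matlab expression, a SAS step, or a Java actor.
    class ExecutionStep {
        String abstractStep; // the plan step this realizes
        String language;     // e.g. "Matlab", "SAS", "Java"
        String code;         // expression / script / actor class name

        ExecutionStep(String abstractStep, String language, String code) {
            this.abstractStep = abstractStep;
            this.language = language;
            this.code = code;
        }
    }

    class ExecutionPlan {
        List steps = new ArrayList(); // list of ExecutionStep
    }

Choosing among the possible execution plans for a given transformation
plan is then the optimization problem Matt mentions.
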
>> 
>> > ------------------------------------------------------------------------
>> > 
>> > Item #2 - /"determine how to generate transformation steps automatically 
>> > for simple transforms such as unit conversions"/
>> > 
>> > This seems straightforward; it could just be a service with a bunch of 
>> > mappings from one to the other.
>> > 
>> > But it raises the question: once we know the mapping, how do we actually 
>> > apply it?
>> > 
>> > Meaning the service has the knowledge that T1 can be "easily" mapped to 
>> > T2, but how do you get the implementation of that mapping to a place where 
>> > it can be done efficiently? (sort of like item #1)
>> > 
>> > Who does the translation of the value? (I assume the SMS module?)
>> 
>> This is basically what I was discussing.  Let's take the simple example 
>> of unit conversions.  EML has a unit dictionary, which is easily 
>> translatable into an ontology with quantified relations among the units 
>> (e.g., the formula for converting between two compatible units in the 
>> dictionary is known or can be derived).  The SMS reasoner would first 
>> determine if two units are convertible (e.g., both are 
>> VolumetricDensity), and then could write a transformation step to do the 
>> conversion.  Writing the transformation step could be as simple as 
>> writing a Matlab expression for the Matlab actor in Ptolemy.  Or it 
>> might be generating and compiling some custom code.  Either way, a 
>> transformation step is generated and inserted into the workflow for the 
>> user.
SB> 
SB> I agree with Matt here too.  I would add a few observations.  First, we
SB> envision a library, or repository, of common transformations and knowledge
SB> of the circumstances under which the transformations can be applied (much of this
SB> should be automatic -- i.e., determining when a transformation *could* be
SB> applied).  Note that transformation becomes interesting when the
SB> transformation cannot be applied in all cases, only in certain situations.
SB> 
SB> Unit conversion is a good example. In particular, two units may have
SB> dimensions that could result in a functional transformation. However, just
SB> because dimensions are "compatible" via a transformation does not
SB> necessarily mean the conversion is valid in all situations. For example,
SB> we can perform a reciprocal operation to convert from hertz to seconds;
SB> however, performing such an operation in all cases isn't necessarily
SB> appropriate.  

Or we can have Hertz and Becquerel, both of which have the dimension
1/T, yet do not match. So unit convertibility gives us necessary (but
not sufficient) conditions for "linkability".
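
As a toy illustration of that point (nothing more): if dimensions are
encoded as exponent vectors over the base dimensions, equality of the
vectors is the necessary condition, and the kind of quantity is the extra
bit of information that rules out Hertz <-> Becquerel. All names below are
made up:

    import java.util.Arrays;

    // A unit carries an exponent vector over the base dimensions
    // [length, mass, time, ...] plus the kind of quantity it measures.
    class Unit {
        String name;
        String kind;     // e.g. "frequency" vs. "radioactivity"
        int[] exponents; // e.g. 1/T  is  { 0, 0, -1 }

        Unit(String name, String kind, int[] exponents) {
            this.name = name; this.kind = kind; this.exponents = exponents;
        }

        // Necessary condition: same dimension.
        boolean sameDimension(Unit other) {
            return Arrays.equals(this.exponents, other.exponents);
        }

        // Closer to sufficient: same dimension *and* same kind of quantity.
        boolean linkable(Unit other) {
            return sameDimension(other) && this.kind.equals(other.kind);
        }
    }

    // Hertz and Becquerel both have dimension 1/T but different kinds, so
    // sameDimension() holds while linkable() does not.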

SB> I think of these cases as type casting in programming
SB> languages. Languages such as Java only allow value substitution when going
SB> from subclasses to superclasses. If I want to go in the other direction, I
SB> have to explicitly "cast" the value down to the subclass.  Similarly,
SB> there may be a number of transformations that are analogous to
SB> "downcasting," where we need the user's permission (or additional
SB> reasoning) to perform the transformation. In general, I think these
SB> non-straightforward transformations are the more interesting ones,
SB> since they are probably more useful in practice. (Everyone knows how to do
SB> the simple unit conversions; however, conversions that rely on laws of
SB> physics or dimensional analysis are probably not as obvious and take time
SB> for a scientist to figure out, making transformation a time-consuming
SB> process.)

correct. We may also do some "dependency chasing" of semantic types
(including physical quantities) based on dependencies in mathematical
equations involving quantities (aka "parameters"). But again, that's a
bit futuristic/research. For the SM *system* we should start with some
simple stuff (such as unit conversion) first, so we can do the
research in parallel.
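
And the simple stuff really can start simple: even a little linear
converter like the one sketched below (factor/offset pairs would come from
the unit dictionary; everything here is made up for illustration) already
covers a good share of the common cases:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a linear unit converter: to = from * factor + offset.
    // Only registered pairs are convertible; anything else is rejected.
    class UnitConverter {
        private Map table = new HashMap(); // "from->to" maps to double[]{factor, offset}

        void register(String from, String to, double factor, double offset) {
            table.put(from + "->" + to, new double[] { factor, offset });
        }

        boolean convertible(String from, String to) {
            return table.containsKey(from + "->" + to);
        }

        double convert(String from, String to, double value) {
            double[] fo = (double[]) table.get(from + "->" + to);
            if (fo == null) {
                throw new IllegalArgumentException("no conversion " + from + " -> " + to);
            }
            return value * fo[0] + fo[1];
        }
    }

    // e.g.  c.register("foot", "meter", 0.3048, 0.0);
    //       c.register("fahrenheit", "celsius", 5.0 / 9.0, -32.0 * 5.0 / 9.0);
    //       c.convert("foot", "meter", 10.0)   gives 3.048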

SB> 
SB> We give a very simple example of this type of conversion on slides 
SB> 17--20. Note that the example is for considering types that allow null 
SB> values, not units.
SB> 
SB> 
>> 
>> > ------------------------------------------------------------------------
>> > 
>> > Item #3 - /"create a simple GUI for creating transformation steps that 
>> > map between two existing steps"/
>> > 
>> > The idea here is that a user can provide a certain level of "missing" 
>> > knowledge that T1 can be converted to T2, which can be converted to T3. Well, 
>> > first, it seems that if we have a bunch of mappings, a 
>> > lookup algorithm could just as easily do a bunch of lookups to get from T1 
>> > to T3. So to me it seems that this is really a tool for taking some new 
>> > specialized "value" in some unknown domain and getting it converted to a 
>> > known "domain" so the automatic mapping can take place. Does this sound 
>> > correct?
>> 
>> Sounds right.
SB> 
SB> The verbiage in the item is a bit ambiguous -- I am assuming "existing
SB> step" means, e.g., a Ptolemy actor or some computation in an ecological
SB> model. Anyway, I don't have a good feeling for whether a simple lookup
SB> algorithm is all that is needed. In general, it seems like it could be 
SB> more complex. I am not really sure...
SB> 
SB> 
>> 
>> > If that is true or if that isn't the intent of item #3, certainly what I 
>> > have described needs to happen.
>> > 
>> > So assuming there is a domain of values that currently doesn't have a 
>> > conversion to a known domain, how do we get that implementation into the 
>> > system? Who would provide the implementation? Maybe the GUI tool 
>> > referenced above enables the user to describe "how" the value could be 
>> > converted into a known domain value. The tool is then capable of 
>> > generating the implementation, compiling it, and registering it. Hmmm, I 
>> > can see how this can be done easily for a scripting language or Java, but C 
>> > or C++ would be more problematic as a general cross-platform solution.
>> 
>> Many transformations will be a combination of casting, simple 
>> conversions (such as unit conversions), and schema rearrangement 
>> (database operations).  I am hoping that the user won't have to write 
>> too many transform steps by hand.  We should talk about this further.
SB> 
SB> I really think we want to use a repository of existing, known,
SB> conversions.  What you describe sounds very ad hoc: either the desired
SB> conversion is not available, or the user has bypassed the system to
SB> create their own conversion.  An interesting question that you bring up is
SB> what should happen when the system does not have the ability to perform a
SB> conversion, but the user knows the items can be converted.  How should
SB> this case be handled? A scripting language (such as Prolog ;-) would
SB> handle what you describe above better than compilation and all of that.

The idea of the conversion repository is a good one. We need to find
its schema (and scope) first. Clearly the above FROM/TO/ACTOR schema
is a start and works, e.g., for simple things such as binary formats
as well as for units (FromUnit/ToUnit/ConversionFormula).
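
I.e., something like the following strawman record could serve both cases;
the "how" field names a conversion actor for formats, or a conversion
formula for units (all names are made up, and the optional applicability
condition picks up Shawn's point about when a conversion is valid):

    // Strawman entry of a conversion repository.
    class ConversionEntry {
        String from;        // e.g. "f1"        or  "foot"
        String to;          // e.g. "f2"        or  "meter"
        String how;         // e.g. "f1-to-f2"  or  "x * 0.3048"
        String appliesWhen; // optional condition describing when the
                            // conversion is valid (null means "always")

        ConversionEntry(String from, String to, String how, String appliesWhen) {
            this.from = from; this.to = to;
            this.how = how; this.appliesWhen = appliesWhen;
        }
    }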

>> > ------------------------------------------------------------------------
>> > Item #4 - /"determine the pros and cons of having transformation steps 
>> > be directly associated with links (e.g., a link property) rather than 
>> > simply introducing new transform steps that do the same tasks directly 
>> > into the pipeline"/
>> > 
>> > I don't understand what is meant by "links"
>> 
>> Links are the edges in the workflow graph.  In terms of modeling the 
>> workflows, one could consider the link (edge) as a real object that can 
>> "do" computation itself -- i.e., a link could be a step.  Alternatively, 
>> the new transformation calculations can be inserted in the graph as new 
>> steps.  I think it is more or less a UI issue, but there may be some 
>> reasoning implications of doing it one way or another.  I prefer the 
>> latter.  Jenny Wang preferred the former, or at least she did a year ago 
>> at the San Diego meeting.  Here's an illustration of the two:
>> 
>>                                    T1         T2
>> Transformations are links:    S1 ------> S2 ------> S3
>> 
>> Transformations are steps:    S1 --> T1 --> S2 --> T2 --> S3
SB> 
SB> I actually prefer transformations as explicit steps, at least for
SB> ecologists/scientists, since they deal with these transformations all of
SB> the time (so it wouldn't scare people off to have them represented
SB> explicitly).  The only caveat is that for totally trivial
SB> transformations (like converting feet to meters), it might be a bit of
SB> overkill to treat them as separate steps.  Instead, these could just be
SB> built into Ptolemy directly as special services that occur behind the
SB> scenes.  A separate issue in the debate is: if we consider transformations
SB> as links, how can a user "debug" a malfunctioning workflow?

There are several ways to deal with "trivial transformations":
1. leave the transformation actor explicit
2. hide the transformation actor inside a composite actor (so T1 is grouped
with S1 or S2 above to form S1' or S2', respectively)
3. color-code/annotate the edge from S1 to S2.

I would say, let's go with (1) first and have an explicit T1 between
S1 and S2. Also, it makes it easier to show what SMS does (e.g., using a
pop-up message for T1 such as "brought to you by SMS" ;-).
Of course, in an ideal world (we don't have it, sorry) SMS would manifest
its existence by the user being unaware of it...


>> > ------------------------------------------------------------------------
>> > I think there are some interesting requirements for the translation 
>> > implementation:
>> > a) Node domain mapper module
>> > b) Tool to provide "new domain" to "existing domain" mappings AND 
>> > implementations
>> > c) Cross-platform
>> > d) Fairly efficient at runtime
>> > e) Dynamically extensible (see item b)
>> > 
>> > Although we always hate to let the implementation cloud our thinking 
>> > about design, the translation system may be better served by selecting 
>> > an implementation language up front, and it seems that a scripting 
>> > language may not be best.

Obviously the main implementation language is Java (chosen by
Ptolemy-II). For the various pieces of SMS, let's wait and see, since
we don't even have the details of its scope and workings. We have to
have that first. I also view Ptolemy (or rather Kepler) as "visual
plumbing" for gluing together different applications executing on
different platforms, at remote sites, etc. So some of its own components 
may very well be non-local/non-Java.

>> I think we should start with one, but not limit ourselves to one.  We 
>> already have a couple available in Ptolemy (the Ptolemy expression 
>> language, and Matlab expressions).  We can also write new actors in 
>> Ptolemy that support expression languages or more complex code.  Hey, we 
>> can even have an actor that dynamically writes, compiles, and executes 
>> Java or C code if we want (security implications notwithstanding).
>> 
>> > I could envision a translation system that was implemented in Java where 
>> > all the mappings were individual classes. Certainly there could be a 
>> > common interface and/or even XMLSchema to describe a mapping class. Java 
>> > would also enable us to use introspection of any given mapping class to 
>> > determine what it does and how to register it.  It would be platform 
>> > independent and dynamically scalable.
>> > 
SB> 
SB> Java is a nice language. But I don't really understand what you mean in 
SB> terms of introspection.  Java provides a reflection capability, e.g., I 
SB> can see what class an object is an instance of, and what methods an 
SB> object supports, but reflection doesn't tell me what the object computes. 
SB> I think that it is a semantic issue, and it brings up a very good 
SB> point: how do we capture "what" a transformation does, so that we can 
SB> automatically apply it? Is this information buried in the "SMS Module", 
SB> or is there a declarative and extensible way to describe it?  
SB> In many ways, this is what we want to capture in semantic types. 
SB> 
SB> 
SB> I would be interested in hearing more of your ideas / intuitions about the 
SB> problem in general.  Thanks for starting the thread!
SB> 
SB> Shawn
SB> 
SB> 
SB> 
>> 
>> Sure.
>> 
>> > 
>> > So anyway, I hope these thoughts are helpful.
>> > Rod
>> > 
>> 
>> 
>> They certainly were.  I think you and I are similar in that we want to 
>> build a functional implementation.  So far, the SMS work has been 
>> focused on fairly theoretical issues.  Grounding it in implementation 
>> now I think is very appropriate :-0

Matt, we hear you! I think we're seeing some implementation efforts
starting as we speak (e.g., a unit conversion actor and an RDF querying
actor, possibly both built into Ptolemy using a generic Prolog actor;
some of the semantic type checking will probably be implemented at the 
director level).


>> 
>> Matt
>> 
SB> 


