[seek-dev] data typing in ptolemy

Thu Oct 9 10:46:13 PDT 2003

Hi,

Matt and I had a conversation on IRC the other day that we thought might
be of interest to those on these lists.

Basically, I am now dealing with typing issues within ptolemy.  Problems
arise when you get missing values in the data.  Ptolemy's type hierarchy
does not allow missing values in data tokens, so Matt and I were
talking about extending the ptolemy typing system to allow missing
values.  It occurred to us that the typing system will also need to be
extended to allow for semantic typing in the future.

The type class hierarchy currently looks like the following:

               Token
                 |
      --------------------------
      |                        |
ScalarToken             AbstractConvertibleToken
        |                               |
---------------...*              -----------------
  |            |                 |               |
DoubleToken  IntToken         BooleanToken  StringToken

*Note that ScalarToken also includes LongToken and ComplexToken.

In addition to this Token hierarchy (Tokens are the means by which you 
pass data between actors over ports) there is also a port typing 
hierarchy implemented in the class BaseType.  BaseType is the means by 
which you actually specify a port's type.  It looks like this:

                               BaseType
                                  |
      ---------------------------------------------------------....
      |             |             |          |          |
BooleanType   ComplexType   GeneralType   IntType   DoubleType ....*

* BaseType also includes EventType, LongType, NumericalType, ObjectType,
ScalarType, StringType, UnknownType, and UnsignedByteType.

Basically, in order to extend this typing system, we must extend both of
these hierarchies, since Tokens are the means by which data is transferred
between ports and BaseTypes are the means by which you allow (or
disallow) a port to accept different types of data.
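
A minimal sketch of how the two hierarchies pair up, written with
stand-in classes rather than the real Ptolemy ones.  Every name here
(PortType, SketchToken, SketchPort, and so on) is hypothetical, and the
check is a plain equality test that ignores any type conversion ptolemy
itself may perform:

```java
// Hypothetical stand-ins for illustration only; NOT the Ptolemy API.

// Plays the role of BaseType: the label a port is declared with.
enum PortType { BOOLEAN, INT, DOUBLE, STRING }

// Plays the role of Token: the object that carries data between ports.
abstract class SketchToken {
    abstract PortType type();
}

class SketchDoubleToken extends SketchToken {
    final double value;
    SketchDoubleToken(double value) { this.value = value; }
    @Override PortType type() { return PortType.DOUBLE; }
}

class SketchStringToken extends SketchToken {
    final String value;
    SketchStringToken(String value) { this.value = value; }
    @Override PortType type() { return PortType.STRING; }
}

// A port is declared with one type and only accepts matching tokens.
class SketchPort {
    private final PortType declared;
    SketchPort(PortType declared) { this.declared = declared; }

    // Simplified check: exact match only (an assumption of this sketch).
    boolean accepts(SketchToken token) {
        return token.type() == declared;
    }
}

class TypingSketch {
    public static void main(String[] args) {
        SketchPort port = new SketchPort(PortType.DOUBLE);
        System.out.println(port.accepts(new SketchDoubleToken(1.5)));
        System.out.println(port.accepts(new SketchStringToken("x")));
    }
}
```

In this model a new kind of token is useless until there is also a port
type that admits it, which is why both hierarchies have to grow together.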

Extending the hierarchy
-----------------------
There are two different ways that I see to extend the hierarchy.  The
first is to extend the base class Token with our own tree of token types
extending from the root of the tree.  This will probably allow us the
most flexibility in implementing types the way we need to.  However, the
main drawback I see to doing this is that we would not be able to use
most existing actors, because their ports are typed according to the
current hierarchy.  I think that one fact pretty much eliminates this
approach from the options.

The second approach I see is to extend each of the leaf token types.
For example, extend DoubleToken to ExtendedDoubleToken and add our
additional functionality there.  This keeps our type system within the
bounds of the current ptolemy hierarchy but limits our flexibility in
extension.  We are basically limited to the hierarchy that already
exists.  It is still unclear to me what the effects of doing this will
be on existing actors.  For instance, if we extend DoubleToken to allow
missing values, and an actor with a port of BaseType.DoubleType gets an
ExtendedDoubleToken, it would still need to be able to handle whatever
value we assign as a missing value code.  This is problematic, because
we are then restricted to using an actual double value as a missing
value code (e.g. -999.999), which we've always maintained was bad data
practice.  This could also cause problems because the actor cannot
differentiate -999.999 from a normal value and will operate on it
normally.
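
To make the concern concrete, here is a small sketch using stand-in
classes (PlainDoubleToken and MissingAwareDoubleToken are hypothetical
names, not ptolemy's DoubleToken or a proposed ExtendedDoubleToken API).
It assumes the subclass carries a missing flag but still has to return
some double, here the -999.999 code, to callers that only know the base
class:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a leaf token type; NOT Ptolemy's DoubleToken.
class PlainDoubleToken {
    private final double value;
    PlainDoubleToken(double value) { this.value = value; }
    double doubleValue() { return value; }
}

// Hypothetical extension that can mark a value as missing.
class MissingAwareDoubleToken extends PlainDoubleToken {
    // Sentinel code taken from the discussion above.
    static final double MISSING_CODE = -999.999;
    private final boolean missing;

    private MissingAwareDoubleToken(double value, boolean missing) {
        super(value);
        this.missing = missing;
    }
    static MissingAwareDoubleToken of(double value) {
        return new MissingAwareDoubleToken(value, false);
    }
    static MissingAwareDoubleToken missing() {
        return new MissingAwareDoubleToken(MISSING_CODE, true);
    }
    boolean isMissing() { return missing; }
}

class SentinelSketch {
    // An "existing actor": it only knows the base class, so the
    // sentinel is averaged in like any other number.
    static double legacyMean(List<? extends PlainDoubleToken> tokens) {
        double sum = 0;
        for (PlainDoubleToken t : tokens) sum += t.doubleValue();
        return sum / tokens.size();
    }

    // An actor rewritten to check the flag; returns NaN if every
    // value is missing.
    static double awareMean(List<? extends PlainDoubleToken> tokens) {
        double sum = 0;
        int count = 0;
        for (PlainDoubleToken t : tokens) {
            if (t instanceof MissingAwareDoubleToken
                    && ((MissingAwareDoubleToken) t).isMissing()) continue;
            sum += t.doubleValue();
            count++;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        List<PlainDoubleToken> data = Arrays.asList(
                MissingAwareDoubleToken.of(10.0),
                MissingAwareDoubleToken.missing(),
                MissingAwareDoubleToken.of(20.0));
        System.out.println("legacy actor mean: " + legacyMean(data));
        System.out.println("aware actor mean:  " + awareMean(data));
    }
}
```

With the values 10 and 20 plus one missing entry, the flag-aware mean is
15.0 while the legacy mean comes out negative, and nothing in the legacy
actor signals that anything went wrong.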

This same problem comes up (but to a lesser extent) when you think about 
extending this system for semantics.  What does an existing actor do 
with semantic information stored in the token?  It can ignore it, but 
that may be detrimental to the analysis.

Another possible option
-----------------------
The other possible solution to the missing value problem is to simply 
not send any data over the port when a missing value is encountered.  I 
have modified the EML ingestion actor to dynamically create one typed 
port for each attribute in the data package.  These ports can then be 
hooked up to other actors.  The data is sent asynchronously and depends
on the receiving ports to queue the data until all the input data is 
present to run the analysis.

If I simply do not send a token when a missing value comes up, I foresee
major timing problems.  For instance, suppose port A and port B are
mapped to input ports X and Y (respectively) of a plotter.  Port A sends
a token to X, then B gets a missing value and sends nothing.  The
plotter is then waiting for its second input.  The next record is
iterated into, and port A sends another token to X.  This causes an
exception.  The other scenario is that, on the second iteration, A is a
missing value but B is not.  Then we are plotting two values from
different records when Y receives data from B in the second record.
This would be a nightmare to deal with given the current directors.
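
The misalignment can be reproduced without ptolemy at all.  The sketch
below is a simplified model, not director code: it assumes each input
port is a plain FIFO queue and that the consumer fires whenever both
queues hold a token.  The class and method names are made up for the
illustration, and null stands for a missing value that is never sent.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class SkipMissingSketch {
    // Returns the (x, y) pairs a two-input consumer would see when the
    // source skips missing values (null) instead of sending a token.
    static List<double[]> pairsSeenByConsumer(Double[][] records) {
        Deque<Double> x = new ArrayDeque<>();  // queue on input port X
        Deque<Double> y = new ArrayDeque<>();  // queue on input port Y
        List<double[]> pairs = new ArrayList<>();
        for (Double[] record : records) {
            // Port A feeds X, port B feeds Y; nothing is sent for null.
            if (record[0] != null) x.add(record[0]);
            if (record[1] != null) y.add(record[1]);
            // The consumer fires as soon as both inputs have a token.
            while (!x.isEmpty() && !y.isEmpty()) {
                pairs.add(new double[] { x.remove(), y.remove() });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Double[][] records = {
            { 1.0, 10.0 },   // record 1: both values present
            { 2.0, null },   // record 2: B is missing, nothing sent to Y
            { 3.0, 30.0 },   // record 3: both values present
        };
        for (double[] pair : pairsSeenByConsumer(records)) {
            System.out.println(Arrays.toString(pair));
        }
    }
}
```

Here the second pair the consumer sees is (2.0, 30.0): the X value from
record 2 matched with the Y value from record 3, and the X value from
record 3 is left sitting in the queue.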

So, does anyone see something that I'm missing here?  What are the needs
of the semantic typing going to be as far as ptolemy goes?  Anyone have
a better solution than the three that I've laid out?  This is a complex
issue that I need to deal with before I can continue moving forward with
AMS.  I don't want to do anything that will hinder the future semantic
extensions of ptolemy, and this is just too much of a basic
infrastructure item to try to hack.  If anyone wants to have an IRC chat
about this, I'm on #seek.

chad

-- 
-----------------------
Chad Berkley
National Center for
Ecological Analysis
and Synthesis (NCEAS)
berkley at nceas.ucsb.edu
-----------------------