[kepler-dev] Bug 2240: add support for null values to data

Dan Higgins higgins at nceas.ucsb.edu
Wed Feb 1 13:58:57 PST 2006


Hi All,
I would like to add some thoughts to this issue of null/nil tokens in 
Kepler/Ptolemy.

Consider first some comments from the R system about what it calls 'not 
available' or 'missing value' elements:

from R documentation ------
"In some cases the components of a vector may not be completely known. 
When an element or value is “not available” or a “missing value” in the
statistical sense, a place within a vector may be reserved for it by
assigning it the special value NA. In general any operation on an NA
becomes an NA. The motivation for this rule is simply that if the
specification of an operation is incomplete, the result cannot be known
and hence is not available.
The function is.na(x) gives a logical vector of the same size as x with
value TRUE if and only if the corresponding element in x is NA.
 > z <- c(1:3,NA); ind <- is.na(z)
Notice that the logical expression x == NA is quite different from
is.na(x) since NA is not really a value but a marker for a quantity that
is not available. Thus x == NA is a vector of the same length as x all
of whose values are NA as the logical expression itself is incomplete
and hence undecidable.
Note that there is a second kind of “missing” values which are produced
by numerical computation, the so-called Not a Number, NaN, values.
Examples are
 > 0/0
 > Inf - Inf
which both give NaN since the result cannot be defined sensibly.
In summary, is.na(xx) is TRUE both for NA and NaN values. To
differentiate these, is.nan(xx) is only TRUE for NaNs."
----

Note that R typically works with vectors (arrays) of values. NAs are 
often kept as part of these vectors so that sizes do not change. The NAs 
can be explicitly removed from a vector when required with a statement like

 > y <- x[!is.na(x)]

which creates a new vector y which will contain the non-missing values 
of x, in the same order.

So I suggest that in Kepler/Ptolemy we need to consider nil/null values 
in arrays as well as the stand-alone nil/null token. For example, if we 
are working with a data table that has some NA values in some columns we 
may be working with individual cells, table rows, or table columns. If 
Kepler automatically drops null/nil values, this may cause problems 
relating rows or columns. And columns are usually the same structural 
type, so we may well be working with an ArrayToken to handle the whole 
column. All tokens in a token array need to be of the same type, so how 
do we handle nilTokens in an array of, say, DoubleTokens?

So it seems to me that we really do not want a NilToken; we really want 
tokens of any type that have a nil/null property. [Christopher has 
already added this to the Token class, so this should apply to any of 
the classes derived from Token.] This is the approach that R has taken 
to handle NA values: simply set one double/float, integer, or character 
value to represent an NA value. (See my email of 12/22/2005.) One can 
then create an actor to do things like strip NAs from an array. And 
existing actors that do not consider nils will continue to work without 
change.
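
To make the idea concrete, here is a minimal Java sketch of a typed
value that carries a nil flag, plus a helper that strips nils from a
column on request (the analogue of R's x[!is.na(x)]). The names
NilTokenSketch, DoubleCell and stripNil are hypothetical and are not
Ptolemy II or Kepler classes; this only illustrates the shape of the
proposal, not how Token actually implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; these are not Ptolemy II classes. */
final class NilTokenSketch {

    /** A double-valued cell that can also be marked nil (missing). */
    static final class DoubleCell {
        private final double value;
        private final boolean nil;

        private DoubleCell(double value, boolean nil) {
            this.value = value;
            this.nil = nil;
        }

        static DoubleCell of(double value) {
            return new DoubleCell(value, false);
        }

        /** A missing entry that still has the column's type. */
        static DoubleCell missing() {
            return new DoubleCell(Double.NaN, true);
        }

        boolean isNil() {
            return nil;
        }

        double doubleValue() {
            return value;
        }
    }

    /** Keep the non-nil cells, in their original order. */
    static List<DoubleCell> stripNil(List<DoubleCell> column) {
        List<DoubleCell> kept = new ArrayList<>();
        for (DoubleCell cell : column) {
            if (!cell.isNil()) {
                kept.add(cell);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<DoubleCell> column = Arrays.asList(
                DoubleCell.of(1.5), DoubleCell.missing(), DoubleCell.of(3.0));
        // The column keeps its length while the nil cell is in place.
        System.out.println(column.size());
        // Nils are dropped only when explicitly requested.
        System.out.println(stripNil(column).size());
    }
}
```

Because the nil cell has the same type as its neighbours, the column
stays homogeneous and row positions are preserved until someone
explicitly asks for the nils to be removed.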

So a column in a table with type 'double' that has one missing value 
would have one DoubleToken with a property of nil/null. But we would 
also like any mathematical operation applied to the token to give a 
result that is also nil/null. This would imply that the various operator 
methods of Tokens (e.g. _add or _multiply) should check the nil property 
and respond appropriately. And then there is the problem of an actor 
that just gets the value of a Token and does not check the nil property. 
This might be handled by setting the value to some predefined value that 
can be recognized (e.g. setting a float/double nil value to NaN or some 
very large number; NaN is very convenient since primitive mathematical 
operations on NaN return NaN.)
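
A rough Java sketch of both halves of this paragraph: operator methods
that check the nil flag, and NaN as the raw value so that code which
ignores the flag still gets a "poisoned" number. NilArithmeticSketch
and DoubleCell are hypothetical names, and the add/multiply methods
here only stand in for the real _add/_multiply methods, whose
signatures I am not reproducing.

```java
/** Hypothetical sketch; not the Ptolemy II Token API. */
final class NilArithmeticSketch {

    static final class DoubleCell {
        /** Shared nil instance; its raw value is NaN on purpose. */
        static final DoubleCell NIL = new DoubleCell(Double.NaN, true);

        private final double value;
        private final boolean nil;

        private DoubleCell(double value, boolean nil) {
            this.value = value;
            this.nil = nil;
        }

        static DoubleCell of(double value) {
            return new DoubleCell(value, false);
        }

        boolean isNil() {
            return nil;
        }

        double doubleValue() {
            return value;
        }

        /** Any operation with a nil operand yields nil. */
        DoubleCell add(DoubleCell other) {
            return (nil || other.nil) ? NIL : of(value + other.value);
        }

        DoubleCell multiply(DoubleCell other) {
            return (nil || other.nil) ? NIL : of(value * other.value);
        }
    }

    public static void main(String[] args) {
        DoubleCell two = DoubleCell.of(2.0);
        System.out.println(two.add(DoubleCell.of(3.0)).doubleValue());
        System.out.println(two.multiply(DoubleCell.NIL).isNil());

        // An actor that ignores isNil() and uses the raw value still
        // ends up with NaN, because NaN propagates through arithmetic.
        double raw = DoubleCell.NIL.doubleValue() * 10.0 + 1.0;
        System.out.println(Double.isNaN(raw));
    }
}
```

Keeping the flag separate from the value means a NaN produced by a
computation such as 0.0/0.0 is not mistaken for a missing value, which
mirrors R's distinction between NA and NaN. The NaN fallback only helps
for float/double; integer types have no NaN, so they would need the
flag or some other reserved value.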

Note that some of the Java GIS actors that are already in Kepler must 
handle missing data values in spatial rasters. In that case, I simply 
defined any double value over a very large threshold to represent a 
'missing value'. This is a custom case of the nil/null problem and it 
works with the current version of Kepler because the rasters are passed 
by reference to filenames, rather than through tokens directly 
containing data.
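
For comparison, the raster convention amounts to something like the
following Java sketch. The cutoff 1.0e30 and the names
RasterMissingSketch, isMissing and meanOfValid are made up for
illustration; they are not the values or methods used in the actual
GIS actors.

```java
/** Hypothetical sketch of a threshold-based missing-value convention. */
final class RasterMissingSketch {

    /** Assumed cutoff; any cell value above it is treated as missing. */
    static final double MISSING_THRESHOLD = 1.0e30;

    static boolean isMissing(double cell) {
        return cell > MISSING_THRESHOLD;
    }

    /** Mean of the valid cells; NaN if every cell is missing. */
    static double meanOfValid(double[] cells) {
        double sum = 0.0;
        int count = 0;
        for (double cell : cells) {
            if (!isMissing(cell)) {
                sum += cell;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        double[] row = {1.0, 3.0, 1.0e38};
        // The last cell is above the cutoff and is skipped.
        System.out.println(meanOfValid(row));
    }
}
```

Unlike NaN, a large sentinel does not propagate on its own: every
routine that touches the raster has to call the check explicitly, which
is why this only works as a private convention among cooperating
actors.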

So much for my 2 cents ;)

Dan

----

Christopher Brooks wrote:

>One of the things that will come up next week is the null value issue.
>
>It might be interesting to get the conversation going a little early.
>
>
>Below are two pieces of email from December.
>
>Below are Professor Arne Huseby's comments about missing values:
>
>  
>
>>I actually implemented support for missing values in my simulation 
>>program, Riscue, earlier this fall. Normally, when I run an ordinary 
>>simulation, I would never encounter missing data. However, I also use 
>>Riscue to process and analyze real statistical data, and such data 
>>often come with a lot of missing values.
>>
>>When I load a table of statistical data into Riscue, Riscue generates 
>>a set of "data nodes", one for each column in the table. I can then 
>>do a lot of descriptive statistics using these data nodes, e.g., 
>>cumulative distribution plots, histograms, scatter plots, regression 
>>analysis, correlation analysis, etc. I can also do postprocessing of 
>>the data, i.e., transformations, merging, filtering, etc.
>>
>>In addition to using data nodes as input to such things, data nodes 
>>can also be used as elements in a model. When I run a simulation on a 
>>model containing data nodes, Riscue will sample random values from 
>>the values contained in each of the data nodes. In statistical 
>>terminology this is usually called "resampling".
>>
>>Thus, since I now support data nodes with missing values, I also need 
>>to deal with these missing values in the simulations. Say, for 
>>example, that I want to simulate a very simple model with only two nodes, A and 
>>B, where A is a data node, and B is a node taking the sampled value 
>>from A, and transforming this into some other value, using some 
>>function f. Thus, when a "good" value, say x, is sampled from A, then 
>>B gets the value f(x). If on the other hand, a missing value is 
>>sampled from A, there is a question how this should be treated in B. 
>>At least in my applications, the natural thing would be to say that B 
>>gets a missing value too, i.e., that the missing value state is 
>>propagated through the model.
>>
>>So, how did I implement support for this?
>>
>>One obvious way of dealing with this would be to implement support 
>>for missing values throughout my entire library of mathematical 
>>functions. With about 400 such classes, this would be a daunting 
>>task! Moreover, while a few functions may be able to handle missing 
>>values in some meaningful way, most of these functions would not. So 
>>it really didn't make sense to go through all the trouble of 
>>implementing support for something that in 99.99% of the cases would 
>>not be meaningful after all.
>>
>>In Riscue each node owns a formula object, which is a tree composed 
>>of function objects from my function library. So, in order to 
>>calculate its value, which is done using a method called 
>>calcObjectValue(), the node asks its formula to do this, at the same 
>>time passing a list of input edges to the formula. The formula then 
>>uses the values obtained from the input edges in its calculations and 
>>responds back to the node with a value. The edges can also have their 
>>own formulas, and transform values obtained from their respective 
>>input nodes in a similar fashion.
>>
>>This design actually provides a very easy way to support missing 
>>values. Whenever a request for a value is sent from a formula to some 
>>object (i.e., to an edge or a node), using a method called 
>>getObjectValue(), two things can happen:
>>
>>1) The object has a valid value which is passed back to the formula.
>>2) The object does not have a valid value, and throws a MissingValueException.
>>
>>This exception is NOT handled by the formula object, so this way I 
>>avoided having to implement support for this throughout my function 
>>library. Instead this is handled by the owner of the formula, i.e., 
>>either a node or an edge. Handling this exception is trivial since 
>>this simply means assigning a missing value (encoded in some suitable 
>>fashion) to the object instead of a valid value.
>>
>>So the only (well almost...) changes I needed to make in my code, were:
>>
>>(i) Modifying the getObjectValue()-method so that missing values 
>>result in a MissingValueException being thrown.
>>(ii) Adding a new "catch block" in the calcObjectValue()-method for 
>>the nodes and edges, handling the MissingValueException.
>>
>>I don't know if this solution has any relevance to your problem, but 
>>maybe this will trigger some ideas.
>>
>>Arne 
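
A minimal Java sketch of the pattern Arne describes above, as I read
it. The method names getObjectValue() and calcObjectValue() and the
exception name come from his message; everything else (the Node class,
using null to encode "missing", passing the formula as a
DoubleUnaryOperator, a single input node) is my own assumption and is
not taken from Riscue.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch of exception-based missing-value propagation. */
final class MissingValueSketch {

    static final class MissingValueException extends RuntimeException {
        MissingValueException() {
            super("missing value");
        }
    }

    /** A node holding either a valid value or a missing marker. */
    static final class Node {
        private Double value; // null encodes "missing" (an assumption)

        Node(Double value) {
            this.value = value;
        }

        boolean isMissing() {
            return value == null;
        }

        /** Change (i): asking a missing node for its value throws. */
        double getObjectValue() {
            if (value == null) {
                throw new MissingValueException();
            }
            return value;
        }

        /**
         * Change (ii): the owner of the formula catches the exception.
         * The formula itself contains no missing-value logic at all.
         */
        void calcObjectValue(DoubleUnaryOperator formula, Node input) {
            try {
                value = formula.applyAsDouble(input.getObjectValue());
            } catch (MissingValueException e) {
                value = null; // propagate the missing state
            }
        }
    }

    public static void main(String[] args) {
        Node b = new Node(0.0);

        b.calcObjectValue(Math::sqrt, new Node(4.0));
        System.out.println(b.getObjectValue());

        b.calcObjectValue(Math::sqrt, new Node(null));
        System.out.println(b.isMissing());
    }
}
```

The attraction is that the function library (Math::sqrt here) stays
untouched; only the value accessor and the one place that invokes a
formula need to know about missing values.
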
>>    
>>
>
>On 12/13/05, I (cxh) wrote:
>  
>
>>Ok, I hacked up the following:
>>
>>* Token.java now has nil() and isNil() methods
>>  I went with nil because null is a Java keyword.
>>
>>  The term "missing" is rather appealing since the Token.toString()
>>  method usually prints out "present".  So, it could be modified to
>>  print out "missing".  However, I feel that someone is likely to have
>>  a parameter named "missing" somewhere.  nil seems safer.
>>
>>  This is not cast in stone, comments are welcome.
>>
>>  These methods have tests. 
>>  
>>* data/expr/Constants.java now defines a "nil" constant which
>>  is a Token that has the nil() method:
>>        ptolemy.data.Token nil = new ptolemy.data.Token();
>>        nil.nil();
>>        _table.put("nil", nil);
>>  Thus, one can now create expressions that have nil in them.
>>
>>  The "nil" constant has a test in data/expr/test/PtParser.tcl
>>
>>* actor/lib/RemoveNilTokens.java:
>>  A new actor that reads its input and discards any nil tokens
>>  in the fire() method.  It might be better to do this in prefire().
>>  This actor is available in the "More Libraries" -> "Esoteric" section.
>>
>>  Note that this actor should not be used in SDF.
>>  
>>  No tests yet.
>>
>>* domains/pn/demo/RemoveNilTokens/RemoveNilTokens.xml
>>  A model that uses RemoveNilTokens.
>>  I think I'm terminating the PN process poorly.  I get 7 outputs
>>  instead of 5.  I could use some help here.
>>  Also, I have to explicitly set the type of output of the RemoveNilTokens
>>  actor.
>>
>>
>>The model looks like:
>>
>>          Bool Switch
>>Ramp----> | 
>>          |----------------> RemoveNilTokens ---------> Display
>>Const     |                                      |
>>that ---> |                                       --> "Code that stops PN"
>>produces  _
>>nil       ^
>>          |
>>          |
>>Bernoulli -
>>
>>
>>So, now we have a straw man in PN to try out.  
>>
>>Questions:
>>1) Would something like this meet the needs of the Kepler group?
>>2) Should RemoveNilTokens do something in prefire()?
>>   I'm not sure if I can get access to the Token and call Token.isNil()
>>   in prefire().
>>3) Is "nil" an ok name?
>>4) How do I get PN to terminate properly after I see 5 non-nil tokens?
>>
>>_Christopher
>>
>>Edward wrote:
>>--------
>>
>>
>>    At 08:16 AM 12/13/2005 -0800, Shawn Bowers wrote:
>>    >One option for SDF would be to literally "propagate" the missing-valued
>>    >tokens, instead of "dropping" them (as in PN).   Then, it seems the
>>    >consumption/production rates at least wouldn't be violated.  Doing this
>>    >correctly might be a bit tricky, but could probably perform this based on
>>    >the token production/consumption rates --
>>    
>>    In SDF, before any actor is fired, its prefire() method is called.
>>    For all our actors that require input data, prefire() will
>>    return false if there is no input token, and the actor will not
>>    be fired. Consequently, its outputs will also have no token...
>>    
>>    So SDF already does this propagation...
>>    
>>    Edward
>>    
>>    
>>    
>>    ------------
>>    Edward A. Lee
>>    Professor, Chair of the EE Division, Associate Chair of EECS
>>    231 Cory Hall, UC Berkeley, Berkeley, CA 94720
>>    phone: 510-642-0253 or 510-642-0455, fax: 510-642-2845
>>    eal at eecs.Berkeley.EDU, http://ptolemy.eecs.berkeley.edu/~eal  
>>    
>>    _______________________________________________
>>    Ptolemy maillist  -  Ptolemy at chess.eecs.berkeley.edu
>>    http://chess.eecs.berkeley.edu/ptolemy/listinfo/ptolemy
>>--------
>>_______________________________________________
>>Kepler-dev mailing list
>>Kepler-dev at ecoinformatics.org
>>http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
>>    
>>


-- 
*******************************************************************
Dan Higgins                                  higgins at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Ph: 805-893-5127
National Center for Ecological Analysis and Synthesis (NCEAS) Marine Science Building - Room 3405
Santa Barbara, CA 93195
*******************************************************************




