[kepler-dev] [Bug 2240] - add support for null values to data passing among ports

bugzilla-daemon@ecoinformatics.org bugzilla-daemon at ecoinformatics.org
Tue Feb 7 11:13:40 PST 2006


------- Additional Comments From cxh at eecs.berkeley.edu  2006-02-07 11:13 -------
Dan wrote:

I would like to add some thoughts to this issue of null/nil tokens in

Consider first some comments from the R system about what it calls 'not
available' or 'missing value' elements:

from R documentation ------

"In some cases the components of a vector may not be completely known.
When an element or value is “not available” or a “missing
value” in the statistical sense, a place within a vector may be
reserved for it by assigning it the special value NA. In general any
operation on an NA becomes an NA. The motivation for this rule is
simply that if the specification of an operation is incomplete, the
result cannot be known and hence is not available.  The function
is.na(x) gives a logical vector of the same size as x with value TRUE
if and only if the corresponding element in x is NA.
 > z <- c(1:3,NA); ind <- is.na(z)

Notice that the logical expression x == NA is quite different from
is.na(x) since NA is not really a value but a marker for a quantity
that is not available. Thus x == NA is a vector of the same length as
x all of whose values are NA as the logical expression itself is
incomplete and hence undecidable.

Note that there is a second kind of “missing” values which are
produced by numerical computation, the so-called Not a Number, NaN,
values. Examples are

 > 0/0
 > Inf - Inf
which both give NaN since the result cannot be defined sensibly.
In summary, is.na(xx) is TRUE both for NA and NaN values. To
differentiate these, is.nan(xx) is only TRUE for NaNs."

Note that R typically works with vectors (arrays) of values. NAs are
often kept as part of these vectors so that sizes do not change. The NAs
can be explicitly removed from a vector when required with a statement like

 > y <- x[!is.na(x)]

which creates a new vector y which will contain the non-missing values
of x, in the same order.

So I suggest that in Kepler/Ptolemy we need to consider nil/null values
in arrays as well as the stand-alone nil/null token. For example, if we
are working with a data table that has some NA values in some columns we
may be working with individual cells, table rows, or table columns. If
Kepler automatically drops null/nil values, this may cause problems
relating rows or columns. And columns are usually the same structural
type, so we may well be working with a TokenArray to handle the whole
column. All tokens in a token array need to be of the same type, so how
do we handle nilTokens in an array of, say, DoubleTokens?

So it seems to me that we really do not want a NilToken; we really want
tokens of any type that have a nil/null property [Christopher has
already added this to the Token class, so this should apply to any of
the classes derived from Token.] This is the approach that R has taken
to handle NA values: simply set one double/float, integer, or character
value to represent an NA value. (See my email of 12/22/2005.) One can
then create an actor to do things like strip NAs from an array. And
existing actors that do not consider nils will continue to work without
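
To make the idea concrete, here is a minimal Java sketch of a typed value
that carries a nil flag, plus a helper that strips nils from a column in the
same spirit as x[!is.na(x)] in R. The class NilableDouble and its methods
are hypothetical names invented for illustration; this is not the Ptolemy II
Token API, only an outline of the approach under that assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a typed token with a nil property.
// NOT the Ptolemy II Token API; all names here are illustrative.
final class NilableDouble {
    private final double value;
    private final boolean nil;

    private NilableDouble(double value, boolean nil) {
        this.value = value;
        this.nil = nil;
    }

    static NilableDouble of(double value) {
        return new NilableDouble(value, false);
    }

    // A nil element is still "a double", so it fits in a column of doubles.
    static NilableDouble nilValue() {
        return new NilableDouble(Double.NaN, true);
    }

    boolean isNil() {
        return nil;
    }

    double doubleValue() {
        return value;
    }

    // Keep only the non-nil elements, preserving their original order.
    static List<NilableDouble> stripNils(List<NilableDouble> column) {
        List<NilableDouble> kept = new ArrayList<>();
        for (NilableDouble element : column) {
            if (!element.isNil()) {
                kept.add(element);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<NilableDouble> column =
                Arrays.asList(of(1.5), nilValue(), of(3.0));
        System.out.println(column.size());            // nil keeps its slot
        System.out.println(stripNils(column).size()); // nils removed on request
    }
}
```

Because the nil element stays in the column until something explicitly
strips it, row and column positions remain aligned across a table.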

So a column in a table with type 'double' that has one missing value
would have one DoubleToken with a property of nil/null. But we would
also like any mathematical operation applied to the token to give a
result that is also nil/null. This would imply that the various operator
methods of Tokens (e.g. _add or _multiply) should check the nil property
and respond appropriately. And then there is the problem of an actor
that just gets the value of a Token and does not check the nil property.
This might be handled by setting the value to some predefined value that
can be recognized (e.g. setting a float/double nil value to NaN or some
very large number; NaN is very convenient since primitive mathematical
operations on NaN return NaN.)
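
A rough Java sketch of both halves of this idea follows: operator methods
that check the nil property, and a NaN payload as a fallback for code that
never checks it. NilAwareDouble, NIL, add and multiply are hypothetical
names, not the actual Ptolemy II methods (such as _add or _multiply), so
treat this as an outline under that assumption rather than the real
implementation.

```java
// Hypothetical sketch; NOT the Ptolemy II DoubleToken API.
final class NilAwareDouble {
    // The nil value stores NaN so that flag-unaware code still propagates it.
    static final NilAwareDouble NIL = new NilAwareDouble(Double.NaN, true);

    private final double value;
    private final boolean nil;

    private NilAwareDouble(double value, boolean nil) {
        this.value = value;
        this.nil = nil;
    }

    static NilAwareDouble of(double value) {
        return new NilAwareDouble(value, false);
    }

    boolean isNil() {
        return nil;
    }

    double doubleValue() {
        return value;
    }

    // Any operation with a nil operand yields nil, like NA in R.
    NilAwareDouble add(NilAwareDouble other) {
        if (nil || other.nil) {
            return NIL;
        }
        return of(value + other.value);
    }

    NilAwareDouble multiply(NilAwareDouble other) {
        if (nil || other.nil) {
            return NIL;
        }
        return of(value * other.value);
    }

    public static void main(String[] args) {
        // Operator methods check the nil property.
        System.out.println(of(2.0).add(NIL).isNil());

        // An actor that only reads the raw value still ends up with NaN.
        double raw = NIL.doubleValue() * 10.0 + 1.0;
        System.out.println(Double.isNaN(raw));

        // NaN never compares equal to itself, so test with Double.isNaN.
        System.out.println(raw == raw);
    }
}
```

The NaN fallback only exists for floating-point types. Integer or string
tokens have no comparable built-in marker, so for those the nil property
would have to be checked explicitly.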

Note that some of the Java GIS actors that are already in Kepler must
handle missing data values in spatial rasters. In that case, I simply
defined any double value over a very large threshold to represent a
'missing value'. This is a custom case of the nil/null problem and it
works with the current version of Kepler because the rasters are passed
by reference to filenames, rather than through tokens directly
containing data.
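
For comparison, the threshold convention can be sketched in a few lines of
Java. The cutoff value and the names RasterMissingValue, MISSING_THRESHOLD,
isMissing and meanIgnoringMissing are assumptions made for this example;
the actual threshold used by the GIS actors is not stated here.

```java
// Hypothetical sketch of a "very large value means missing" convention.
final class RasterMissingValue {
    // Assumed cutoff for illustration only; the real value is not given.
    static final double MISSING_THRESHOLD = 1.0e30;

    static boolean isMissing(double cell) {
        return cell > MISSING_THRESHOLD;
    }

    // Average the cells that hold data; NaN if every cell is missing.
    static double meanIgnoringMissing(double[] cells) {
        double sum = 0.0;
        int count = 0;
        for (double cell : cells) {
            if (!isMissing(cell)) {
                sum += cell;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        double[] row = {1.0, 3.0, 1.0e38};
        System.out.println(meanIgnoringMissing(row));
    }
}
```

Every consumer of the raster has to know the convention, which is the
weakness a nil property on the token itself would avoid.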

So much for my 2 cents ;)

