[kepler-dev] Bug 2240: add support for null values to data
Christopher Brooks
cxh at eecs.berkeley.edu
Wed Dec 14 11:50:12 PST 2005
Below are Professor Arne Huseby's comments about missing values:
> I actually implemented support for missing values in my simulation
> program, Riscue, earlier this fall. Normally, when I run an ordinary
> simulation, I would never encounter missing data. However, I also use
> Riscue to process and analyze real statistical data, and such data
> often come with a lot of missing values.
>
> When I load a table of statistical data into Riscue, Riscue generates
> a set of "data nodes", one for each column in the table. I can then
> do a lot of descriptive statistics using these data nodes, e.g.,
> cumulative distribution plots, histograms, scatter plots, regression
> analysis, correlation analysis, etc. I can also do postprocessing of
> the data, i.e., transformations, merging, filtering, etc.
>
> In addition to using data nodes as input to such things, data nodes
> can also be used as elements in a model. When I run a simulation on a
> model containing data nodes, Riscue will sample random values from
> the values contained in each of the data nodes. In statistical
> terminology this is usually called "resampling".
>
> Thus, since I now support data nodes with missing values, I also need
> to deal with these missing values in the simulations. Say, for example,
> that I want to simulate a very simple model with only two nodes, A and
> B, where A is a data node, and B is a node taking the sampled value
> from A, and transforming this into some other value, using some
> function f. Thus, when a "good" value, say x, is sampled from A, then
> B gets the value f(x). If on the other hand, a missing value is
> sampled from A, there is a question how this should be treated in B.
> At least in my applications, the natural thing would be to say that B
> gets a missing value too, i.e., that the missing value state is
> propagated through the model.
>
> So, how did I implement support for this?
>
> One obvious way of dealing with this would be to implement support
> for missing values throughout my entire library of mathematical
> functions. With about 400 such classes, this would be a daunting
> task! Moreover, while a few functions may be able to handle missing
> values in some meaningful way, most of these functions would not. So
> it really didn't make sense to go through all the trouble of
> implementing support for something that in 99.99% of the cases would
> not be meaningful after all.
>
> In Riscue each node owns a formula object, which is a tree composed
> of function objects from my function library. So, in order to
> calculate its value, which is done using a method called
> calcObjectValue(), the node asks its formula to do this, at the same
> time passing a list of input edges to the formula. The formula then
> uses the values obtained from the input edges in its calculations and
> responds back to the node with a value. The edges can also have their
> own formulas, and transform values obtained from their respective
> input nodes in a similar fashion.
>
> This design actually provides a very easy way to support missing
> values. Whenever a request for a value is sent from a formula to some
> object (i.e., to an edge or a node), using a method called
> getObjectValue(), two things can happen:
>
> 1) The object has a valid value which is passed back to the formula.
> 2) The object does not have a valid value, and throws a MissingValueException.
>
> This exception is NOT handled by the formula object, so this way I
> avoided having to implement support for this throughout my function
> library. Instead this is handled by the owner of the formula, i.e.,
> either a node or an edge. Handling this exception is trivial since
> this simply means assigning a missing value (encoded in some suitable
> fashion) to the object instead of a valid value.
>
> So the only changes (well, almost...) I needed to make in my code were:
>
> (i) Modifying the getObjectValue()-method so that missing values
> result in a MissingValueException being thrown.
> (ii) Adding a new "catch block" in the calcObjectValue()-method for
> the nodes and edges, handling the MissingValueException.
>
> I don't know if this solution has any relevance to your problem, but
> maybe this will trigger some ideas.
>
> Arne
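
To make the idea above concrete, here is a minimal Java sketch of how I read Arne's description. It is not Riscue's code. Only the names getObjectValue(), calcObjectValue() and MissingValueException come from his message; Node, Formula, the factory methods data(), missingData() and computed(), isMissing(), and the choice to encode "missing" as a boolean flag are all assumptions made for illustration. Edges are left out to keep it short.

```java
import java.util.Arrays;
import java.util.List;

final class MissingValueSketch {

    /** Thrown when a value is requested from an object that has no valid value. */
    static final class MissingValueException extends RuntimeException {
        MissingValueException(String message) { super(message); }
    }

    /** A formula computes a value from its inputs; it never handles missing values itself. */
    interface Formula {
        double evaluate(List<Node> inputs);
    }

    /** Hypothetical node: either holds data directly or owns a formula over input nodes. */
    static final class Node {
        private final Formula formula;   // null for a pure data node
        private final List<Node> inputs;
        private double value;
        private boolean missing;         // assumption: missing state encoded as a flag

        private Node(Formula formula, List<Node> inputs, double value, boolean missing) {
            this.formula = formula;
            this.inputs = inputs;
            this.value = value;
            this.missing = missing;
        }

        static Node data(double value) {
            return new Node(null, Arrays.<Node>asList(), value, false);
        }

        static Node missingData() {
            return new Node(null, Arrays.<Node>asList(), Double.NaN, true);
        }

        static Node computed(Formula formula, Node... inputs) {
            // Starts out missing until calcObjectValue() has been called.
            return new Node(formula, Arrays.asList(inputs), Double.NaN, true);
        }

        /** Change (i): a missing value results in an exception instead of a number. */
        double getObjectValue() {
            if (missing) {
                throw new MissingValueException("object has no valid value");
            }
            return value;
        }

        /** Change (ii): the owner of the formula catches the exception and becomes missing. */
        void calcObjectValue() {
            if (formula == null) {
                return; // data node: nothing to calculate
            }
            try {
                value = formula.evaluate(inputs);
                missing = false;
            } catch (MissingValueException e) {
                value = Double.NaN;
                missing = true;
            }
        }

        boolean isMissing() { return missing; }
    }

    public static void main(String[] args) {
        // f plays the role of the function applied in node B; it has no missing-value logic.
        Formula f = in -> 2.0 * in.get(0).getObjectValue() + 1.0;

        Node b1 = Node.computed(f, Node.data(3.0));
        b1.calcObjectValue();
        System.out.println("B from a good sample is missing: " + b1.isMissing());

        Node b2 = Node.computed(f, Node.missingData());
        b2.calcObjectValue();
        System.out.println("B from a missing sample is missing: " + b2.isMissing());
    }
}
```

The point of the sketch is that the formula f contains no missing-value handling at all. The exception passes straight through it and is caught once, in the owner's calcObjectValue(), which is what lets the function library stay untouched.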
BTW -
http://www.riscue.org/ says:
"Riscue is a software application for doing probabilistic risk
analysis. Key application areas are:
o Cost and Schedule Risks
o Hazard analysis
o Reliability analysis
o Financial risks
o Insurance
o Total Value Chain Analysis
o Oil reservoir and production profile risk
"
_Christopher