[kepler-dev] Bug 2240: add support for null values to data

Christopher Brooks cxh at eecs.berkeley.edu
Wed Dec 14 11:50:12 PST 2005


Below are Professor Arne Huseby's comments about missing values:

> I actually implemented support for missing values in my simulation 
> program, Riscue, earlier this fall. Normally, when I run an ordinary 
> simulation, I would never encounter missing data. However, I also use 
> Riscue to process and analyze real statistical data, and such data 
> often come with a lot of missing values.
> 
> When I load a table of statistical data into Riscue, Riscue generates 
> a set of "data nodes", one for each column in the table. I can then 
> do a lot of descriptive statistics using these data nodes, e.g., 
> cumulative distribution plots, histograms, scatter plots, regression 
> analysis, correlation analysis, etc. I can also do postprocessing of 
> the data, e.g., transformations, merging, and filtering.
> 
> In addition to using data nodes as input to such things, data nodes 
> can also be used as elements in a model. When I run a simulation on a 
> model containing data nodes, Riscue will sample random values from 
> the values contained in each of the data nodes. In statistical 
> terminology this is usually called "resampling".
> 
> Thus, since I now support data nodes with missing values, I also need 
> to deal with these missing values in the simulations. Say, for example,
> that I want to simulate a very simple model with only two nodes, A and 
> B, where A is a data node, and B is a node taking the sampled value 
> from A, and transforming this into some other value, using some 
> function f. Thus, when a "good" value, say x, is sampled from A, then 
> B gets the value f(x). If, on the other hand, a missing value is 
> sampled from A, the question is how this should be treated in B. 
> At least in my applications, the natural thing would be to say that B 
> gets a missing value too, i.e., that the missing value state is 
> propagated through the model.
> 
> So, how did I implement support for this?
> 
> One obvious way of dealing with this would be to implement support 
> for missing values throughout my entire library of mathematical 
> functions. With about 400 such classes, this would be a daunting 
> task! Moreover, while a few functions may be able to handle missing 
> values in some meaningful way, most of these functions would not. So 
> it really didn't make sense to go through all the trouble of 
> implementing support for something that in 99.99% of the cases would 
> not be meaningful after all.
> 
> In Riscue each node owns a formula object, which is a tree composed 
> of function objects from my function library. So, in order to 
> calculate its value, which is done using a method called 
> calcObjectValue(), the node asks its formula to do this, at the same 
> time passing a list of input edges to the formula. The formula then 
> uses the values obtained from the input edges in its calculations and 
> responds back to the node with a value. The edges can also have their 
> own formulas, and transform values obtained from their respective 
> input nodes in a similar fashion.
> 
> This design actually provides a very easy way to support missing 
> values. Whenever a request for a value is sent from a formula to some 
> object (i.e., to an edge or a node), using a method called 
> getObjectValue(), two things can happen:
> 
> 1) The object has a valid value which is passed back to the formula.
> 2) The object does not have a valid value, and throws a MissingValueException.
> 
> This exception is NOT handled by the formula object, so this way I 
> avoided having to implement support for this throughout my function 
> library. Instead this is handled by the owner of the formula, i.e., 
> either a node or an edge. Handling this exception is trivial since 
> this simply means assigning a missing value (encoded in some suitable 
> fashion) to the object instead of a valid value.
> 
> So the only (well, almost...) changes I needed to make in my code were:
> 
> (i) Modifying the getObjectValue() method so that missing values 
> result in a MissingValueException being thrown.
> (ii) Adding a new "catch block" in the calcObjectValue() method for 
> the nodes and edges, handling the MissingValueException.
> 
> I don't know if this solution has any relevance to your problem, but 
> maybe this will trigger some ideas.
> 
> Arne 
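
To make the scheme above concrete, here is a minimal Java sketch of how I
read it. This is not Riscue's code: only the names MissingValueException,
getObjectValue() and calcObjectValue() come from Arne's mail. The Node and
Formula types, the boolean "missing" flag, and the helper methods are
hypothetical, and edges are left out for brevity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Demo of the two-node model from the mail: A is a data node, B = f(A).
class MissingValueDemo {
    public static void main(String[] args) {
        Node a = Node.data();
        // f(x) = 2x + 1; the formula contains no missing-value handling.
        Node b = new Node(inputs -> 2.0 * inputs.get(0).getObjectValue() + 1.0,
                          Arrays.asList(a));

        a.setValue(3.0);          // a "good" value is sampled from A
        b.calcObjectValue();
        System.out.println(b.isMissing() ? "missing" : String.valueOf(b.getObjectValue()));

        a.setMissing();           // a missing value is sampled from A
        b.calcObjectValue();
        System.out.println(b.isMissing() ? "missing" : String.valueOf(b.getObjectValue()));
    }
}

// Thrown when a value is requested from an object that has no valid value.
class MissingValueException extends RuntimeException {
    MissingValueException(String message) { super(message); }
}

// Hypothetical stand-in for a formula tree built from function objects.
interface Formula {
    double evaluate(List<Node> inputs);
}

// Hypothetical node that owns a formula and a list of inputs.
class Node {
    private final Formula formula;   // null for a plain data node
    private final List<Node> inputs;
    private double value;
    // Assumption: a flag encodes "missing"; the mail only says the missing
    // value is "encoded in some suitable fashion".
    private boolean missing = true;

    Node(Formula formula, List<Node> inputs) {
        this.formula = formula;
        this.inputs = inputs;
    }

    static Node data() { return new Node(null, Collections.<Node>emptyList()); }

    void setValue(double v) { value = v; missing = false; }
    void setMissing() { missing = true; }
    boolean isMissing() { return missing; }

    // Change (i): a missing value results in an exception being thrown.
    double getObjectValue() {
        if (missing) {
            throw new MissingValueException("object has no valid value");
        }
        return value;
    }

    // Change (ii): the owner of the formula, not the formula itself,
    // catches the exception and marks its own value as missing.
    void calcObjectValue() {
        if (formula == null) {
            return;               // data node: value is set by sampling
        }
        try {
            setValue(formula.evaluate(inputs));
        } catch (MissingValueException e) {
            setMissing();
        }
    }
}
```

Running MissingValueDemo prints "7.0" and then "missing": the second
calcObjectValue() call catches the exception raised inside the formula, so
the missing state propagates from A to B without the formula knowing
anything about it.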

BTW - 
http://www.riscue.org/ says:
    "Riscue is a software application for doing probabilistic risk
    analysis. Key application areas are:

          o Cost and Schedule Risks
          o Hazard analysis
          o Reliability analysis
          o Financial risks
          o Insurance
          o Total Value Chain Analysis
          o Oil reservoir and production profile risk 
    "

_Christopher


