[kepler-dev] Bad dataset from workshop. Re: wf
Kevin Ruland
kruland at ku.edu
Fri Jan 13 08:54:26 PST 2006
I do recall seeing the code to parse fixed format files. I don't know
offhand how this code gets selected to execute, but in this case it
is not.
I don't understand how there can be a variable number of delimiters
between tokens. In a delimited file, two adjacent delimiters would be
interpreted as a "missing value" in that particular column. This
particular datafile does define specific values for missing data (-3.0
IIRC). We could, in theory, use this information to infer that the
table is delimited by a regular expression. This would not be reliable
unless there is some documentation for the metadata definition which
defines this.
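A quick illustration of the difference (a Python sketch; the sample line is taken from the dataset quoted below): a strict split on the single declared 0x20 delimiter produces empty tokens wherever delimiters are adjacent, each of which would have to be read as a missing value, while a regex split over whitespace runs recovers the intended values.

```python
import re

# One data line from w6-pcp.txt (padding as seen in the hexdump below).
line = "     6  1963   6   67.500   0.300"

# Strict split on the single declared 0x20 delimiter: adjacent
# delimiters produce empty tokens, i.e. apparent "missing values".
naive = line.split(" ")

# Regex split on whitespace runs recovers the actual values.
tokens = re.split(r"\s+", line.strip())
```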
The stack trace that Chad, Jing, and now I are getting (a String index
problem deep in hsqldb) appears to be a bug in the delimiter parsing
routines in hsqldb. No - correction: hsqldb assumes that a delimiter
in hex is specified in exactly 4 bytes. We are passing it '\\u20' when
it expects '\\u0020'. That is a bug I can fix :)
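The fix amounts to padding the escape out to four hex digits before handing it to hsqldb. A minimal sketch (the helper name is hypothetical, not an existing Kepler or hsqldb API):

```python
def normalize_unicode_escape(delim: str) -> str:
    """Pad a short \\uXX escape to the 4-hex-digit \\uXXXX form
    that hsqldb's delimiter parser expects."""
    if delim.startswith("\\u"):
        return "\\u" + delim[2:].zfill(4)
    return delim
```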
For error recovery, I think we first need to address error generation
and reporting. We need to somehow track down the exceptions coming
from the code and determine what a reasonable message would be and
present that back to the user.
Kevin
Matt Jones wrote:
> EML can indeed describe fixed-format files. And I had thought that
> Jing had included support for those file formats in the
> EML200DataSource actor. I'm not sure how he had set up the
> interaction with hsql but my impression was that it worked. Maybe he
> converts fixed format to delimited format before passing it to hsql.
> Jing can shed more light here. In any case, the metadata provider
> would need to provide accurate metadata stating that it's a fixed
> format file, if indeed it is.
>
> In this particular case, the format may not be fixed -- rather, it may
> be delimited with a variable number of delimiter tokens between data
> values. I'd have to see more of the file to figure that out. There
> are fields in EML to describe this case.
>
> In general it seems that hsql's import routines are too limited to
> support the variety of data formats that EML describes, so we will
> probably need to do some translation in EML200DataSource before
> passing the data to hsql. Again, Jing should be able to clarify here.
>
> Overall the EML data import has proven to be fragile, which we need to
> fix. There are several consistent problems:
>
> 1) EML200DataSource doesn't use all available metadata in the EML
> 2) The metadata describing the data schema is wrong wrt the
> data structure (e.g., too many or too few columns, wrong
> delimiters, etc)
> 3) the data has missing values and therefore is not supported
> by EML200DS (and won't be till we support null tokens)
> 4) Select values in the data are illegal -- for example, columns
> described as a float contain a value like "NA" without a statement
> in EML that "NA" is used as a missing value code
> 5) The metadata says the data is available at a given URL, but when
> we try to download the data it is either not there, restricted
> access, or not actually a data file (usually a web page)
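Problems 3 and 4 in the list above suggest a pre-load validation pass. A minimal sketch, assuming we are given a numeric column's values as strings plus whatever missing-value codes the EML declares (the function name is illustrative, not an existing EML200DataSource API):

```python
def find_illegal_values(values, missing_codes=()):
    """Return the values in a numeric column that neither parse as a
    number nor match a declared missing-value code."""
    bad = []
    for v in values:
        if v in missing_codes:
            continue
        try:
            float(v)
        except ValueError:
            bad.append(v)
    return bad
```

For instance, a column containing "NA" is flagged unless "NA" is passed in as a declared missing-value code.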
>
> We need to systematically go through these issues and make this import
> robust. The only thing we can't really protect against is incorrect
> metadata. In that case, it would be nice if we gathered some
> information about what is wrong and possibly ask the user about the
> problem in order to adapt. For example, consider the following two
> questions we might be able to pose:
>
> "The metadata indicates that records are separated by newlines, but we
> have detected carriage returns in the data stream. Should we switch
> to using carriage returns as the record separator?"
>
> 'The "bodylength" column should contain numeric values, but we have
> detected repeated use of the value "N/A" in the data. Should this be
> considered a code for a missing value?'
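The first of those questions could be driven by a simple detector run over a sample of the downloaded bytes. A sketch (ordering matters: the two-byte delimiters must be tested before their one-byte components):

```python
def detect_record_delimiter(sample: bytes) -> bytes:
    """Guess the record delimiter actually present in a data sample."""
    for candidate in (b"\r\n", b"\n\r", b"\r", b"\n"):
        if candidate in sample:
            return candidate
    return b"\n"  # default when no terminator is seen at all
```

Comparing the result against the delimiter declared in the metadata would trigger the question above; for w6-pcp.txt, the metadata says \n\r but the detector would report plain \n.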
>
> There are probably a handful of common problems like this that we
> might be able to anticipate. Interestingly, these issues really can't
> be detected until the data is retrieved, which is especially a problem
> if we delay download of the data until the 'run' has started.
>
> Matt
>
> Kevin Ruland wrote:
>
>> Chad, et al.
>>
>> I looked at this. It appears the problem is the metadata for this
>> dataset does not accurately describe its structure.
>>
>> The metadata states:
>>
>> <physical scope="document">
>> <objectName>w6-pcp.txt</objectName>
>> <characterEncoding>ASCII</characterEncoding>
>> <dataFormat>
>> <textFormat>
>> <recordDelimiter>\n\r</recordDelimiter>
>>
>> <attributeOrientation>column</attributeOrientation>
>> <simpleDelimited>
>> <fieldDelimiter>0x20</fieldDelimiter>
>> </simpleDelimited>
>> </textFormat>
>> </dataFormat>
>> <distribution scope="document">
>> <online>
>> <url
>> function="download">http://www.hubbardbrook.org/research/data/atmos/pcp_chem/w6-pcp.txt</url>
>>
>> </online>
>> </distribution>
>> </physical>
>>
>> Note the following:
>>
>> The recordDelimiter is \n\r.
>> The fieldDelimiter is 0x20 (space).
>> It does not have a node physical/dataFormat/textFormat/numHeaderLines
>> which we interpret as meaning there are no header records.
>>
>> Here are the first two records of the dataset:
>>
>> ws year mo precip Ca Mg K Na Al NH4 pH SO4 NO3 Cl PO4 Sio2
>> 6 1963 6 67.500 0.300 0.070 0.100 0.070 -3.000 -3.000 -3.00 -3.000 -3.000 -3.000 -3.0000 -3.000
>>
>> And it's hexdump:
>>
>> 0000000: 2020 2020 7773 2020 7965 6172 2020 6d6f ws year mo
>> 0000010: 2020 2020 7072 6563 6970 2020 2020 2020 precip
>> 0000020: 4361 2020 2020 2020 4d67 2020 2020 2020 Ca Mg
>> 0000030: 204b 2020 2020 2020 4e61 2020 2020 2020 K Na
>> 0000040: 416c 2020 2020 204e 4834 2020 2020 2020 Al NH4
>> 0000050: 7048 2020 2020 2053 4f34 2020 2020 204e pH SO4 N
>> 0000060: 4f33 2020 2020 2020 436c 2020 2020 2050 O3 Cl P
>> 0000070: 4f34 2020 2020 5369 6f32 2020 2020 2020 O4 Sio2
>> 0000080: 2020 2020 2020 2020 2020 2020 2020 2020
>> 0000090: 2020 2020 0909 0920 2020 2009 0a20 2020 ... ..
>> 00000a0: 2020 3620 2031 3936 3320 2020 3620 2020 6 1963 6
>> 00000b0: 2036 372e 3530 3020 2020 302e 3330 3020 67.500 0.300
>> 00000c0: 2020 302e 3037 3020 2020 302e 3130 3020 0.070 0.100
>> 00000d0: 2020 302e 3037 3020 202d 332e 3030 3020 0.070 -3.000
>> 00000e0: 202d 332e 3030 3020 2020 2d33 2e30 3020 -3.000 -3.00
>> 00000f0: 202d 332e 3030 3020 202d 332e 3030 3020 -3.000 -3.000
>> 0000100: 202d 332e 3030 3020 2d33 2e30 3030 3020 -3.000 -3.0000
>> 0000110: 202d 332e 3030 300a 2020 2020 2036 2020 -3.000. 6
>> The top record is the header. The metadata did not state it had a
>> header.
>>
>> Note there are 4 spaces before 'ws' then 2 spaces before 'year'. There
>> is a bunch of white space padding (0x20 and 0x09 tabs). Also the line
>> terminator for this record is 0x0a (nl) whereas the metadata states the
>> line terminator is \n\r which should be (0x0a 0x0d).
>>
>> The second record is the first record of data. Its structure is: 5
>> spaces then '6'
>>
>> We are using HSQL's Text Table functionality. This essentially binds a
>> DDL definition in the database to an external data file (the text
>> file). The functionality provided by hsql is not flexible enough to
>> handle regular expressions for the column separators (which would be
>> required in this instance). It is also not possible to parse a fixed
>> format text file (which this particular file appears to be).
>>
>> Essentially there are two problems with this dataset: the metadata
>> does not adequately describe the data format, and the hsql
>> functionality does not support this type of text file structure.
>>
>> I'm not exactly certain what we can do here. I don't know if EML can
>> describe a fixed format file. And even if it did, we'd have to do some
>> work to parse the fixed format file and insert it into the database table.
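One workable translation step, whether the file is treated as fixed-format or regex-delimited, would be to rewrite it as a genuinely simpleDelimited file before handing it to hsql's Text Table. A sketch, assuming whitespace-run padding as in w6-pcp.txt:

```python
import re

def to_simple_delimited(lines, delimiter=","):
    """Collapse runs of space/tab padding into a single delimiter so
    the result can be loaded by hsql's Text Table as-is."""
    return [delimiter.join(re.split(r"[ \t]+", line.strip()))
            for line in lines]
```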
>>
>> Kevin
>>