[kepler-dev] Bad dataset from workshop. Re: wf

Jing Tao tao at nceas.ucsb.edu
Fri Jan 13 10:50:43 PST 2006


Hi, Kevin:

In the EML data source actor, getting the data takes two steps: first 
run the hsqldb query, then run the simple format (or fixed format) 
parser.

resultset = run(hsqldb query);

if (resultset == null)
{
   run simple format parser (or fixed format parser);
}

The problem that can prevent the fixed format parser from running is 
that the resultset is not null (but is empty), or that the metadata is 
not correct. I will take a look.
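A minimal sketch of the fix implied above, falling back to the text parsers when the resultset is empty as well as when it is null. The class and method names here are hypothetical, not the actual EML200DataSource code:

```java
import java.util.Collections;
import java.util.List;

public class FallbackSketch {
    /** Hypothetical helper: a null OR empty resultset from the hsqldb
     *  text-table query should trigger the simple/fixed format parser,
     *  not only a null one. */
    static String loadData(List<String[]> resultset) {
        if (resultset == null || resultset.isEmpty()) {
            return "parsed by simple/fixed format parser";
        }
        return "loaded from hsqldb resultset";
    }

    public static void main(String[] args) {
        System.out.println(loadData(null));                      // falls back
        System.out.println(loadData(Collections.emptyList()));   // also falls back
    }
}
```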

Jing

Jing Tao
National Center for Ecological
Analysis and Synthesis (NCEAS)
735 State St. Suite 204
Santa Barbara, CA 93101

On Fri, 13 Jan 2006, Kevin Ruland wrote:

> Date: Fri, 13 Jan 2006 10:54:26 -0600
> From: Kevin Ruland <kruland at ku.edu>
> To: Matt Jones <jones at nceas.ucsb.edu>
> Cc: Kepler-Dev <kepler-dev at ecoinformatics.org>
> Subject: Re: [kepler-dev] Bad dataset from workshop.  Re: wf
> 
>
> I do recall seeing the code to parse fixed format files.  I don't know
> offhand how this code gets selected to execute.  But in this case, it
> is not.
>
> I don't understand how there can be a variable number of delimiters
> between tokens.  In a delimited file, two adjacent delimiters would be
> interpreted as a "missing value" in that particular column.  This
> particular datafile does define specific values for missing data (-3.0
> IIRC).  We could, in theory, use this information to infer that the
> table is delimited by a regular expression.  This would not be reliable
> unless there is some documentation for the metadata definition which
> defines this.
>
> The stack trace that chad, jing, and now I are getting ( String index
> problem deep in hsqldb ) appears to be a bug in the delimiter parsing
> routines in hsqldb.  No - correction.  hsqldb assumes that a delimiter
> in hex is specified in exactly 4 bytes.  We are passing it '\\u20'  when
> it expects '\\u0020'.  That is a bug I can fix :)
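The fix Kevin describes could be sketched as a small normalization helper (the helper name is hypothetical; hsqldb's text-table separator syntax does expect four hex digits after \u):

```java
public class DelimiterFix {
    /** Hypothetical helper: pad a single-character field delimiter
     *  into the \uXXXX form hsqldb expects -- exactly four hex
     *  digits -- so 0x20 becomes \u0020 rather than \u20. */
    static String toHsqldbHexDelimiter(char c) {
        return String.format("\\u%04x", (int) c);
    }

    public static void main(String[] args) {
        System.out.println(toHsqldbHexDelimiter(' '));  // \u0020, not \u20
    }
}
```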
>
> For error recovery, I think we first need to address error generation
> and reporting.  We need to somehow track down the exceptions coming
> from the code, determine what a reasonable message would be, and
> present that back to the user.
>
> Kevin
>
> Matt Jones wrote:
>
>> EML can indeed describe fixed-format files.  And I had thought that
>> Jing had included support for those file formats in the
>> EML200DataSource actor.  I'm not sure how he had set up the
>> interaction with hsql but my impression was that it worked. Maybe he
>> converts fixed format to delimited format before passing it to hsql.
>> Jing can shed more light here.  In any case, the metadata provider
>> would need to provide accurate metadata stating that it's a fixed
>> format file, if indeed it is.
>>
>> In this particular case, the format may not be fixed -- rather, it may
>> be delimited with a variable number of delimiter tokens between data
>> values.  I'd have to see more of the file to figure that out.  There
>> are fields in EML to describe this case.
>>
>> In general it seems that hsql's import routines are too limited to
>> support the variety of data formats that EML describes, so we will
>> probably need to do some translation in EML200DataSource before
>> passing the data to hsql.  Again, Jing should be able to clarify here.
>>
>> Overall the EML data import has proven to be fragile, which we need to
>> fix.  There are several consistent problems:
>>
>> 1) EML200DataSource doesn't use all available metadata in the EML
>> 2) The metadata describing the data schema is wrong wrt the
>>    data structure (e.g., too many or too few columns, wrong
>>    delimiters, etc)
>> 3) the data has missing values and therefore is not supported
>>    by EML200DS (and won't be till we support null tokens)
>> 4) Select values in the data are illegal -- for example, columns
>>    described as float contain a value like "NA" without a statement
>>    in EML that "NA" is used as a missing value code
>> 5) The metadata says the data is available at a given URL, but when
>>    we try to download the data it is either not there, restricted
>>    access, or not actually a data file (usually a web page)
>>
>> We need to systematically go through these issues and make this import
>> robust.  The only thing we can't really protect against is incorrect
>> metadata.  In that case, it would be nice if we gathered some
>> information about what is wrong and possibly ask the user about the
>> problem in order to adapt.  For example, consider the following two
>> questions we might be able to pose:
>>
>> "The metadata indicates that records are separated by newlines, but we
>> have detected carriage returns in the data stream.  Should we switch
>> to using carriage returns as the record separator?"
>>
>> 'The "bodylength" column should contain numeric values, but we have
>> detected repeated use of the value "N/A" in the data.  Should this be
>> considered a code for a missing value?'
>>
>> There are probably a handful of common problems like this that we
>> might be able to anticipate.  Interestingly, these issues really can't
>> be detected until the data is retrieved, which is especially a problem
>> if we delay download of the data until the 'run' has started.
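One of the heuristics described above -- spotting repeated use of the same non-numeric token in a column declared numeric -- could be sketched like this (all names are hypothetical, not Kepler code; the user would still confirm the suggestion):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class MissingValueHeuristic {
    /** If the same non-numeric token appears at least twice in a
     *  column the metadata declares numeric, suggest it as a
     *  candidate missing-value code. */
    static Optional<String> suspectedMissingCode(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column) {
            try {
                Double.parseDouble(v);   // valid number: not a candidate
            } catch (NumberFormatException e) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= 2)   // "repeated use"
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> col = Arrays.asList("1.2", "N/A", "3.4", "N/A");
        System.out.println(suspectedMissingCode(col)); // Optional[N/A]
    }
}
```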
>>
>> Matt
>>
>> Kevin Ruland wrote:
>>
>>> Chad, et al.
>>>
>>> I looked at this.  It appears the problem is the metadata for this
>>> dataset does not accurately describe its structure.
>>>
>>> The metadata states:
>>>
>>>             <physical scope="document">
>>>                 <objectName>w6-pcp.txt</objectName>
>>>                 <characterEncoding>ASCII</characterEncoding>
>>>                 <dataFormat>
>>>                     <textFormat>
>>>                         <recordDelimiter>\n\r</recordDelimiter>
>>>
>>> <attributeOrientation>column</attributeOrientation>
>>>                         <simpleDelimited>
>>>                             <fieldDelimiter>0x20</fieldDelimiter>
>>>                         </simpleDelimited>
>>>                     </textFormat>
>>>                 </dataFormat>
>>>                 <distribution scope="document">
>>>                     <online>
>>>                         <url
>>> function="download">http://www.hubbardbrook.org/research/data/atmos/pcp_chem/w6-pcp.txt</url>
>>>
>>>                     </online>
>>>                 </distribution>
>>>             </physical>
>>>
>>> Note the following:
>>>
>>> The recordDelimiter is \n\r.
>>> The fieldDelimiter is 0x20 (space).
>>> It does not have a node physical/dataFormat/textFormat/numHeaderLines
>>> which we interpret as meaning there are no header records.
>>>
>>> Here are the first two records of the dataset:
>>>
>>>     ws  year  mo    precip      Ca      Mg       K      Na      Al
>>> NH4      pH     SO4     NO3      Cl     PO4
>>> Sio2                                                  6  1963   6
>>> 67.500   0.300   0.070   0.100   0.070  -3.000 -3.000   -3.00
>>> -3.000  -3.000  -3.000 -3.0000  -3.000
>>>
>>> And it's hexdump:
>>>
>>> 0000000: 2020 2020 7773 2020 7965 6172 2020 6d6f      ws  year  mo
>>> 0000010: 2020 2020 7072 6563 6970 2020 2020 2020      precip
>>> 0000020: 4361 2020 2020 2020 4d67 2020 2020 2020  Ca      Mg
>>> 0000030: 204b 2020 2020 2020 4e61 2020 2020 2020   K      Na
>>> 0000040: 416c 2020 2020 204e 4834 2020 2020 2020  Al     NH4
>>> 0000050: 7048 2020 2020 2053 4f34 2020 2020 204e  pH     SO4     N
>>> 0000060: 4f33 2020 2020 2020 436c 2020 2020 2050  O3      Cl     P
>>> 0000070: 4f34 2020 2020 5369 6f32 2020 2020 2020  O4    Sio2
>>> 0000080: 2020 2020 2020 2020 2020 2020 2020 2020
>>> 0000090: 2020 2020 0909 0920 2020 2009 0a20 2020      ...    ..
>>> 00000a0: 2020 3620 2031 3936 3320 2020 3620 2020    6  1963   6
>>> 00000b0: 2036 372e 3530 3020 2020 302e 3330 3020   67.500   0.300
>>> 00000c0: 2020 302e 3037 3020 2020 302e 3130 3020    0.070   0.100
>>> 00000d0: 2020 302e 3037 3020 202d 332e 3030 3020    0.070  -3.000
>>> 00000e0: 202d 332e 3030 3020 2020 2d33 2e30 3020   -3.000   -3.00
>>> 00000f0: 202d 332e 3030 3020 202d 332e 3030 3020   -3.000  -3.000
>>> 0000100: 202d 332e 3030 3020 2d33 2e30 3030 3020   -3.000 -3.0000
>>> 0000110: 202d 332e 3030 300a 2020 2020 2036 2020   -3.000.     6
>>> The top record is the header.  The metadata did not state it had a
>>> header.
>>>
>>> Note there are 4 spaces before 'ws' then 2 spaces before 'year'.  There
>>> is a bunch of white space padding (0x20 and 0x09 tabs).  Also the line
>>> terminator for this record is 0x0a (nl) whereas the metadata states the
>>> line terminator is \n\r which should be (0x0a 0x0d).
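The mismatch noted above could be detected mechanically by comparing the declared record delimiter against what actually terminates the first record (a hypothetical helper, not Kepler code):

```java
public class RecordDelimiterCheck {
    /** Return the first line terminator actually found in the raw
     *  bytes, as an escaped string, so it can be compared against
     *  the delimiter declared in the EML (\n\r in this dataset). */
    static String detectTerminator(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                return (i + 1 < data.length && data[i + 1] == '\r') ? "\\n\\r" : "\\n";
            }
            if (data[i] == '\r') {
                return (i + 1 < data.length && data[i + 1] == '\n') ? "\\r\\n" : "\\r";
            }
        }
        return "";
    }

    public static void main(String[] args) {
        byte[] sample = "   ws  year  mo\n     6  1963".getBytes();
        System.out.println(detectTerminator(sample)); // \n, not the declared \n\r
    }
}
```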
>>>
>>> The second record is the first record of data.  Its structure is:  5
>>> spaces then '6'
>>>
>>> We are using HSQL's Text Table functionality.  This essentially binds a
>>> ddl definition in the database to an external data file (the text
>>> file).  The functionality provided by hsql is not flexible enough to
>>> handle regular expressions for the column separators (which would be
>>> required in this instance).  It is also not possible to parse a fixed
>>> format text file (which this particular file appears to be).
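The translation step this limitation implies -- collapsing runs of whitespace into a single delimiter hsqldb can handle before loading -- might look like this (hypothetical helper):

```java
public class WhitespaceSplit {
    /** Pre-split one record on runs of whitespace (spaces and tabs,
     *  as in the hexdump above) and re-emit it with a single-character
     *  delimiter that an hsqldb text table can parse. */
    static String normalize(String record) {
        String[] fields = record.trim().split("\\s+");
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        String rec = "     6  1963   6   67.500   0.300";
        System.out.println(normalize(rec)); // 6,1963,6,67.500,0.300
    }
}
```

Note this sketch cannot distinguish two adjacent delimiters (a missing value) from mere padding, which is exactly the ambiguity discussed earlier in the thread.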
>>>
>>> Essentially there are two problems with this dataset:  The metadata does
>>> not adequately describe the data format, and the hsql functionality
>>> does not support this type of text file structure.
>>>
>>> I'm not exactly certain what we can do here.  I don't know if EML can
>>> describe a fixed format file.  And even if it did, we'd have to do some
>>> work to parse the fixed format file and insert into the database table.
>>>
>>> Kevin
>>>
>
> _______________________________________________
> Kepler-dev mailing list
> Kepler-dev at ecoinformatics.org
> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
>
>

