[kepler-dev] Bad dataset from workshop. Re: wf
Kevin Ruland
kruland at ku.edu
Fri Jan 13 08:54:26 PST 2006
I do recall seeing the code to parse fixed format files. I don't know
offhand how this code gets selected to execute, but in this case it
is not.
I don't understand how there can be a variable number of delimiters
between tokens. In a delimited file, two adjacent delimiters would be
interpreted as a "missing value" in that particular column. This
particular datafile does define specific values for missing data (-3.0
IIRC). We could, in theory, use this information to infer that the
table is delimited by a regular expression. This would not be reliable
unless there is some documentation for the metadata definition which
defines this.
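A quick illustration of the difference (a Python sketch; the sample line is taken from the dataset quoted below): a strict split on the single declared 0x20 delimiter produces empty tokens wherever delimiters are adjacent, each of which would have to be read as a missing value, while a regex split over whitespace runs recovers the intended values.

```python
import re

# One data line from w6-pcp.txt (padding as seen in the hexdump below).
line = "     6  1963   6   67.500   0.300"

# Strict split on the single declared 0x20 delimiter: adjacent
# delimiters produce empty tokens, i.e. apparent "missing values".
naive = line.split(" ")

# Regex split on whitespace runs recovers the actual values.
tokens = re.split(r"\s+", line.strip())
```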
The stack trace that Chad, Jing, and now I are getting (a String index
problem deep in hsqldb) appears to be a bug in the delimiter parsing
routines in hsqldb. No - correction: hsqldb assumes that a delimiter
in hex is specified in exactly 4 bytes. We are passing it '\\u20' when
it expects '\\u0020'. That is a bug I can fix :)
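The fix amounts to padding the escape out to four hex digits before handing it to hsqldb. A minimal sketch (the helper name is hypothetical, not an existing Kepler or hsqldb API):

```python
def normalize_unicode_escape(delim: str) -> str:
    """Pad a short \\uXX escape to the 4-hex-digit \\uXXXX form
    that hsqldb's delimiter parser expects."""
    if delim.startswith("\\u"):
        return "\\u" + delim[2:].zfill(4)
    return delim
```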
For error recovery, I think we first need to address error generation
and reporting. We need to somehow track down the exceptions coming
from the code and determine what a reasonable message would be and
present that back to the user.
Kevin
Matt Jones wrote:
> EML can indeed describe fixed-format files. And I had thought that
> Jing had included support for those file formats in the
> EML200DataSource actor. I'm not sure how he had set up the
> interaction with hsql but my impression was that it worked. Maybe he
> converts fixed format to delimited format before passing it to hsql.
> Jing can shed more light here. In any case, the metadata provider
> would need to provide accurate metadata stating that it's a fixed
> format file, if indeed it is.
>
> In this particular case, the format may not be fixed -- rather, it may
> be delimited with a variable number of delimiter tokens between data
> values. I'd have to see more of the file to figure that out. There
> are fields in EML to describe this case.
>
> In general it seems that hsql's import routines are too limited to
> support the variety of data formats that EML describes, so we will
> probably need to do some translation in EML200DataSource before
> passing the data to hsql. Again, Jing should be able to clarify here.
>
> Overall the EML data import has proven to be fragile, which we need to
> fix. There are several consistent problems:
>
> 1) EML200DataSource doesn't use all available metadata in the EML
> 2) The metadata describing the data schema is wrong wrt the
> data structure (e.g., too many or too few columns, wrong
> delimiters, etc)
> 3) the data has missing values and therefore is not supported
> by EML200DS (and won't be till we support null tokens)
> 4) Select values in the data are illegal -- for example, columns
> described as a float contain a value like "NA" without a statement
> in EML that "NA" is used as a missing value code
> 5) The metadata says the data is available at a given URL, but when
> we try to download the data it is either not there, restricted
> access, or not actually a data file (usually a web page)
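Problems 3 and 4 in the list above suggest a pre-load validation pass. A minimal sketch, assuming we are given a numeric column's values as strings plus whatever missing-value codes the EML declares (the function name is illustrative, not an existing EML200DataSource API):

```python
def find_illegal_values(values, missing_codes=()):
    """Return the values in a numeric column that neither parse as a
    number nor match a declared missing-value code."""
    bad = []
    for v in values:
        if v in missing_codes:
            continue
        try:
            float(v)
        except ValueError:
            bad.append(v)
    return bad
```

For instance, a column containing "NA" is flagged unless "NA" is passed in as a declared missing-value code.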
>
> We need to systematically go through these issues and make this import
> robust. The only thing we can't really protect against is incorrect
> metadata. In that case, it would be nice if we gathered some
> information about what is wrong and possibly ask the user about the
> problem in order to adapt. For example, consider the following two
> questions we might be able to pose:
>
> "The metadata indicates that records are separated by newlines, but we
> have detected carriage returns in the data stream. Should we switch
> to using carriage returns as the record separator?"
>
> 'The "bodylength" column should contain numeric values, but we have
> detected repeated use of the value "N/A" in the data. Should this be
> considered a code for a missing value?'
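The first of those questions could be driven by a simple detector run over a sample of the downloaded bytes. A sketch (ordering matters: the two-byte delimiters must be tested before their one-byte components):

```python
def detect_record_delimiter(sample: bytes) -> bytes:
    """Guess the record delimiter actually present in a data sample."""
    for candidate in (b"\r\n", b"\n\r", b"\r", b"\n"):
        if candidate in sample:
            return candidate
    return b"\n"  # default when no terminator is seen at all
```

Comparing the result against the delimiter declared in the metadata would trigger the question above; for w6-pcp.txt, the metadata says \n\r but the detector would report plain \n.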
>
> There are probably a handful of common problems like this that we
> might be able to anticipate. Interestingly, these issues really can't
> be detected until the data is retrieved, which is especially a problem
> if we delay download of the data until the 'run' has started.
>
> Matt
>
> Kevin Ruland wrote:
>
>> Chad, et al.
>>
>> I looked at this. It appears the problem is the metadata for this
>> dataset does not accurately describe its structure.
>>
>> The metadata states:
>>
>> <physical scope="document">
>> <objectName>w6-pcp.txt</objectName>
>> <characterEncoding>ASCII</characterEncoding>
>> <dataFormat>
>> <textFormat>
>> <recordDelimiter>\n\r</recordDelimiter>
>>
>> <attributeOrientation>column</attributeOrientation>
>> <simpleDelimited>
>> <fieldDelimiter>0x20</fieldDelimiter>
>> </simpleDelimited>
>> </textFormat>
>> </dataFormat>
>> <distribution scope="document">
>> <online>
>> <url
>> function="download">http://www.hubbardbrook.org/research/data/atmos/pcp_chem/w6-pcp.txt</url>
>>
>> </online>
>> </distribution>
>> </physical>
>>
>> Note the following:
>>
>> The recordDelimiter is \n\r.
>> The fieldDelimiter is 0x20 (space).
>> It does not have a node physical/dataFormat/textFormat/numHeaderLines
>> which we interpret as meaning there are no header records.
>>
>> Here are the first two records of the dataset:
>>
>> ws year mo precip Ca Mg K Na Al NH4 pH SO4 NO3 Cl PO4 Sio2
>> 6 1963 6 67.500 0.300 0.070 0.100 0.070 -3.000 -3.000 -3.00 -3.000 -3.000 -3.000 -3.0000 -3.000
>>
>> And it's hexdump:
>>
>> 0000000: 2020 2020 7773 2020 7965 6172 2020 6d6f ws year mo
>> 0000010: 2020 2020 7072 6563 6970 2020 2020 2020 precip
>> 0000020: 4361 2020 2020 2020 4d67 2020 2020 2020 Ca Mg
>> 0000030: 204b 2020 2020 2020 4e61 2020 2020 2020 K Na
>> 0000040: 416c 2020 2020 204e 4834 2020 2020 2020 Al NH4
>> 0000050: 7048 2020 2020 2053 4f34 2020 2020 204e pH SO4 N
>> 0000060: 4f33 2020 2020 2020 436c 2020 2020 2050 O3 Cl P
>> 0000070: 4f34 2020 2020 5369 6f32 2020 2020 2020 O4 Sio2
>> 0000080: 2020 2020 2020 2020 2020 2020 2020 2020
>> 0000090: 2020 2020 0909 0920 2020 2009 0a20 2020 ... ..
>> 00000a0: 2020 3620 2031 3936 3320 2020 3620 2020 6 1963 6
>> 00000b0: 2036 372e 3530 3020 2020 302e 3330 3020 67.500 0.300
>> 00000c0: 2020 302e 3037 3020 2020 302e 3130 3020 0.070 0.100
>> 00000d0: 2020 302e 3037 3020 202d 332e 3030 3020 0.070 -3.000
>> 00000e0: 202d 332e 3030 3020 2020 2d33 2e30 3020 -3.000 -3.00
>> 00000f0: 202d 332e 3030 3020 202d 332e 3030 3020 -3.000 -3.000
>> 0000100: 202d 332e 3030 3020 2d33 2e30 3030 3020 -3.000 -3.0000
>> 0000110: 202d 332e 3030 300a 2020 2020 2036 2020 -3.000. 6
>> The top record is the header. The metadata did not state it had a
>> header.
>>
>> Note there are 4 spaces before 'ws' then 2 spaces before 'year'. There
>> is a bunch of white space padding (0x20 and 0x09 tabs). Also the line
>> terminator for this record is 0x0a (nl) whereas the metadata states the
>> line terminator is \n\r which should be (0x0a 0x0d).
>>
>> The second record is the first record of data. Its structure is: 5
>> spaces then '6'
>>
>> We are using HSQL's Text Table functionality. This essentially binds a
>> DDL definition in the database to an external data file (the text
>> file). The functionality provided by hsql is not flexible enough to
>> handle regular expressions for the column separators (which would be
>> required in this instance). It is also not possible to parse a fixed
>> format text file (which this particular file appears to be).
>>
>> Essentially there are two problems with this dataset: the metadata
>> does not adequately describe the data format, and the hsql
>> functionality does not support this type of text file structure.
>>
>> I'm not exactly certain what we can do here. I don't know if EML can
>> describe a fixed format file. And even if it did, we'd have to do some
>> work to parse the fixed format file and insert it into the database table.
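One workable translation step, whether the file is treated as fixed-format or regex-delimited, would be to rewrite it as a genuinely simpleDelimited file before handing it to hsql's Text Table. A sketch, assuming whitespace-run padding as in w6-pcp.txt:

```python
import re

def to_simple_delimited(lines, delimiter=","):
    """Collapse runs of space/tab padding into a single delimiter so
    the result can be loaded by hsql's Text Table as-is."""
    return [delimiter.join(re.split(r"[ \t]+", line.strip()))
            for line in lines]
```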
>>
>> Kevin
>>