[kepler-dev] Bad dataset from workshop. Re: wf
tao at nceas.ucsb.edu
Fri Jan 13 10:50:43 PST 2006
In the EML data source actor, getting the data involves two steps: first
run the HSQLDB query; if that yields nothing, run the simple format
parser or the fixed format parser:

resultset = run(hsqldb query);
if (resultset == null)
    run simple format parser (or fixed format parser);

The reason the fixed format parser doesn't run could be that the
resultset is not null (but is empty), or that the metadata from Metacat
is not correct. I will take a look.
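A minimal sketch of that fallback decision, including the empty-but-not-null case described above (all names here are illustrative, not the actual EML200DataSource code):

```java
// Sketch of the fallback logic: run the HSQLDB query first, and fall
// back to a text parser when the resultset is null OR empty.
// Names are illustrative, not the actual Kepler API.
public class DataSourceFallback {

    enum Parser { NONE, DELIMITED, FIXED }

    /** Decide which parser to run after the HSQLDB attempt. */
    static Parser chooseFallback(boolean resultSetIsNull,
                                 int rowCount,
                                 boolean fixedFormat) {
        // The original check only looked at null, so an empty (but
        // non-null) resultset silently produced no data at all.
        boolean needFallback = resultSetIsNull || rowCount == 0;
        if (!needFallback) {
            return Parser.NONE;
        }
        return fixedFormat ? Parser.FIXED : Parser.DELIMITED;
    }
}
```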
National Center for Ecological
Analysis and Synthesis (NCEAS)
735 State St. Suite 204
Santa Barbara, CA 93101
On Fri, 13 Jan 2006, Kevin Ruland wrote:
> Date: Fri, 13 Jan 2006 10:54:26 -0600
> From: Kevin Ruland <kruland at ku.edu>
> To: Matt Jones <jones at nceas.ucsb.edu>
> Cc: Kepler-Dev <kepler-dev at ecoinformatics.org>
> Subject: Re: [kepler-dev] Bad dataset from workshop. Re: wf
> I do recall seeing the code to parse fixed format files. I don't know
> offhand how this code gets selected to execute. But in this case, it
> is not.
> I don't understand how there can be a variable number of delimiters
> between tokens. In a delimited file, two adjacent delimiters would be
> interpreted as a "missing value" in that particular column. This
> particular datafile does define specific values for missing data (-3.0
> IIRC). We could, in theory, use this information to infer that the
> table is delimited by a regular expression. This would not be reliable
> unless there is some documentation for the metadata definition which
> defines this.
> The stack trace that Chad, Jing, and now I are getting (String index
> problem deep in hsqldb) appears to be a bug in the delimiter parsing
> routines in hsqldb. No, correction: hsqldb assumes that a delimiter
> in hex is specified in exactly four hex digits. We are passing it
> '\\u20' when it expects '\\u0020'. That is a bug I can fix :)
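The fix could be as simple as left-padding the hex escape before handing the delimiter to hsqldb. A sketch (the method name is hypothetical, not the actual patch):

```java
// Sketch: HSQLDB's Text Table code expects a Unicode escape with exactly
// four hex digits (e.g. "\u0020"), but the actor was emitting "\u20".
// Pad the hex part before passing the delimiter to HSQLDB.
public class DelimiterEscape {

    /** Normalize "\\uXX"-style escapes to the "\\uXXXX" form HSQLDB expects. */
    static String padUnicodeEscape(String delim) {
        if (delim.startsWith("\\u")) {
            String hex = delim.substring(2);
            while (hex.length() < 4) {
                hex = "0" + hex;          // left-pad to four hex digits
            }
            return "\\u" + hex;
        }
        return delim;                     // not a unicode escape; leave as-is
    }
}
```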
> For error recovery, I think we first need to address error generation
> and reporting. We need to somehow track down the exceptions coming
> from the code, determine what a reasonable message would be, and
> present that back to the user.
> Matt Jones wrote:
>> EML can indeed describe fixed-format files. And I had thought that
>> Jing had included support for those file formats in the
>> EML200DataSource actor. I'm not sure how he had set up the
>> interaction with hsql but my impression was that it worked. Maybe he
>> converts fixed format to delimited format before passing it to hsql.
>> Jing can shed more light here. In any case, the metadata provider
>> would need to provide accurate metadata stating that it's a fixed
>> format file, if it indeed is.
>> In this particular case, the format may not be fixed -- rather, it may
>> be delimited with a variable number of delimiter tokens between data
>> values. I'd have to see more of the file to figure that out. There
>> are fields in EML to describe this case.
>> In general it seems that hsql's import routines are too limited to
>> support the variety of data formats that EML describes, so we will
>> probably need to do some translation in EML200DataSource before
>> passing the data to hsql. Again, Jing should be able to clarify here.
>> Overall the EML data import has proven to be fragile, which we need to
>> fix. There are several consistent problems:
>> 1) EML200DataSource doesn't use all available metadata in the EML
>> 2) The metadata describing the data schema is wrong wrt the
>> data structure (e.g., too many or too few columns, wrong
>> delimiters, etc)
>> 3) the data has missing values and therefore is not supported
>> by EML200DS (and won't be till we support null tokens)
>> 4) Select values in the data are illegal -- for example, columns
>> described as a float contain a value like "NA" without a statement
>> in EML that "NA" is used as a missing value code
>> 5) The metadata says the data is available at a given URL, but when
>> we try to download the data it is either not there, restricted
>> access, or not actually a data file (usually a web page)
>> We need to systematically go through these issues and make this import
>> robust. The only thing we can't really protect against is incorrect
>> metadata. In that case, it would be nice if we gathered some
>> information about what is wrong and possibly ask the user about the
>> problem in order to adapt. For example, consider the following two
>> questions we might be able to pose:
>> "The metadata indicates that records are separated by newlines, but we
>> have detected carriage returns in the data stream. Should we switch
>> to using carriage returns as the record separator?"
>> 'The "bodylength" column should contain numeric values, but we have
>> detected repeated use of the value "N/A" in the data. Should this be
>> considered a code for a missing value?'
>> There are probably a handful of common problems like this that we
>> might be able to anticipate. Interestingly, these issues really can't
>> be detected until the data is retrieved, which is especially a problem
>> if we delay download of the data until the 'run' has started.
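The second question above could be backed by a simple scan of the offending column once the data is retrieved. A sketch, assuming we have the raw string values in hand (all names hypothetical):

```java
import java.util.*;

// Sketch of the detection step: scan a column declared numeric and flag
// a token that repeatedly fails to parse, as a candidate missing-value
// code to confirm with the user. Purely illustrative.
public class MissingValueGuess {

    /** Return the most frequent non-numeric token, or null if the column is clean. */
    static String guessMissingValueCode(List<String> columnValues) {
        Map<String, Integer> badTokens = new HashMap<>();
        for (String v : columnValues) {
            try {
                Double.parseDouble(v);
            } catch (NumberFormatException e) {
                badTokens.merge(v, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 1;                 // require at least two occurrences
        for (Map.Entry<String, Integer> e : badTokens.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```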
>> Kevin Ruland wrote:
>>> Chad, et al.
>>> I looked at this. It appears the problem is the metadata for this
>>> dataset does not accurately describe its structure.
>>> The metadata states:
>>> <physical scope="document">
>>> <distribution scope="document">
>>> Note the following:
>>> The recordDelimiter is \n\r.
>>> The fieldDelimiter is 0x20 (space).
>>> It does not have a node physical/dataFormat/textFormat/numHeaderLines
>>> which we interpret as meaning there are no header records.
>>> Here are the first two records of the dataset:
>>> ws year mo precip Ca Mg K Na Al NH4 pH SO4 NO3 Cl PO4 Sio2
>>> 6 1963 6 67.500 0.300 0.070 0.100 0.070 -3.000 -3.000 -3.00 -3.000 -3.000 -3.000 -3.0000 -3.000
>>> And it's hexdump:
>>> 0000000: 2020 2020 7773 2020 7965 6172 2020 6d6f ws year mo
>>> 0000010: 2020 2020 7072 6563 6970 2020 2020 2020 precip
>>> 0000020: 4361 2020 2020 2020 4d67 2020 2020 2020 Ca Mg
>>> 0000030: 204b 2020 2020 2020 4e61 2020 2020 2020 K Na
>>> 0000040: 416c 2020 2020 204e 4834 2020 2020 2020 Al NH4
>>> 0000050: 7048 2020 2020 2053 4f34 2020 2020 204e pH SO4 N
>>> 0000060: 4f33 2020 2020 2020 436c 2020 2020 2050 O3 Cl P
>>> 0000070: 4f34 2020 2020 5369 6f32 2020 2020 2020 O4 Sio2
>>> 0000080: 2020 2020 2020 2020 2020 2020 2020 2020
>>> 0000090: 2020 2020 0909 0920 2020 2009 0a20 2020 ... ..
>>> 00000a0: 2020 3620 2031 3936 3320 2020 3620 2020 6 1963 6
>>> 00000b0: 2036 372e 3530 3020 2020 302e 3330 3020 67.500 0.300
>>> 00000c0: 2020 302e 3037 3020 2020 302e 3130 3020 0.070 0.100
>>> 00000d0: 2020 302e 3037 3020 202d 332e 3030 3020 0.070 -3.000
>>> 00000e0: 202d 332e 3030 3020 2020 2d33 2e30 3020 -3.000 -3.00
>>> 00000f0: 202d 332e 3030 3020 202d 332e 3030 3020 -3.000 -3.000
>>> 0000100: 202d 332e 3030 3020 2d33 2e30 3030 3020 -3.000 -3.0000
>>> 0000110: 202d 332e 3030 300a 2020 2020 2036 2020 -3.000. 6
>>> The top record is the header. The metadata did not state it had a
>>> header record.
>>> Note there are 4 spaces before 'ws' then 2 spaces before 'year'. There
>>> is a bunch of white space padding (0x20 and 0x09 tabs). Also the line
>>> terminator for this record is 0x0a (nl) whereas the metadata states the
>>> line terminator is \n\r which should be (0x0a 0x0d).
>>> The second record is the first record of data. Its structure is: 5
>>> spaces, then '6', and so on.
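A small byte-level check could catch the terminator mismatch noted above (a bare 0x0a in the data versus the \n\r the metadata claims). A sketch; the name is illustrative:

```java
// Sketch of a line-terminator sanity check: peek at the raw bytes of the
// first record to see which terminator the file actually uses, before
// trusting the recordDelimiter from the metadata.
public class RecordDelimiterCheck {

    /** Return the terminator actually used by the first record, or null. */
    static String detectTerminator(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                return (i + 1 < data.length && data[i + 1] == '\n') ? "\r\n" : "\r";
            }
            if (data[i] == '\n') {
                return (i + 1 < data.length && data[i + 1] == '\r') ? "\n\r" : "\n";
            }
        }
        return null;  // no terminator found in this chunk
    }
}
```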
>>> We are using HSQL's Text Table functionality. This essentially binds a
>>> ddl definition in the database to an external data file (the text
>>> file). The functionality provided by hsql is not flexible enough to
>>> handle regular expressions for the column separators (which would be
>>> required in this instance). It is also not possible to parse a fixed
>>> format text file (which this particular file appears to be).
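One way around that limitation would be to pre-process each record with a regular-expression split before handing the file to the Text Table. A sketch under that assumption (names hypothetical):

```java
// Sketch: since HSQLDB's Text Tables cannot split on a regular
// expression, collapse runs of spaces and tabs into a single delimiter
// before loading. Illustrative only.
public class WhitespaceNormalizer {

    /** Split one data record on arbitrary runs of spaces/tabs. */
    static String[] splitRecord(String line) {
        // trim() drops the leading padding (e.g. the 5 spaces before "6"),
        // and "\\s+" treats any run of whitespace as one field delimiter.
        return line.trim().split("\\s+");
    }

    /** Rewrite a record as single-space-delimited text HSQLDB can ingest. */
    static String normalize(String line) {
        return String.join(" ", splitRecord(line));
    }
}
```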
>>> Essentially there are two problems with this dataset: The metadata does
>>> not adequately describe the data format, and, the hsql functionality
>>> does not support this type of text file structure.
>>> I'm not exactly certain what we can do here. I don't know if eml can
>>> describe a fixed format file. And even if it did, we'd have to do some
>>> work to parse the fixed format file and insert into the database table.
> Kepler-dev mailing list
> Kepler-dev at ecoinformatics.org