[kepler-dev] Bad dataset from workshop. Re: wf
Shawn Bowers
sbowers at ucdavis.edu
Fri Jan 13 08:41:55 PST 2006
Just a quick observation: Wouldn't it also be a good thing to detect
these problems when the dataset/metadata is provided to a repository
such as EcoGrid? This would probably avoid a lot of these problems.
Of course, it doesn't solve the case where the metadata was created
locally and loaded into Kepler ...
-shawn
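
A submission-time check of this kind might compare the declared delimiters and attribute count against a sample of the data before the record is accepted. The following is only a minimal sketch, not EcoGrid or Kepler code; the function name and its parameters are hypothetical stand-ins for values that would be read from the EML.

```python
def find_field_count_mismatches(sample, record_delimiter, field_delimiter, expected_fields):
    """Return (record_number, field_count) for records that disagree with the metadata.

    Hypothetical helper: the arguments stand in for the record delimiter,
    field delimiter and attribute count declared in the metadata.
    """
    mismatches = []
    records = sample.split(record_delimiter)
    # A trailing delimiter yields one empty final record; ignore it.
    if records and records[-1] == "":
        records.pop()
    for number, record in enumerate(records, start=1):
        count = len(record.split(field_delimiter))
        if count != expected_fields:
            mismatches.append((number, count))
    return mismatches


sample = "a b c\n1 2 3\n4  5 6\n"
# The third record has a doubled space, so it splits into four fields.
print(find_field_count_mismatches(sample, "\n", " ", 3))
```

A non-empty result at upload time would be a reason to ask the provider to correct the metadata before the dataset is published.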
Matt Jones wrote:
> EML can indeed describe fixed-format files. And I had thought that Jing
> had included support for those file formats in the EML200DataSource
> actor. I'm not sure how he had set up the interaction with hsql but my
> impression was that it worked. Maybe he converts fixed format to
> delimited format before passing it to hsql. Jing can shed more light
> here. In any case, the metadata provider would need to provide accurate
> metadata stating that it's a fixed format file if indeed it is.
>
> In this particular case, the format may not be fixed -- rather, it may
> be delimited with a variable number of delimiter tokens between data
> values. I'd have to see more of the file to figure that out. There are
> fields in EML to describe this case.
>
> In general it seems that hsql's import routines are too limited to
> support the variety of data formats that EML describes, so we will
> probably need to do some translation in EML200DataSource before passing
> the data to hsql. Again, Jing should be able to clarify here.
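
One possible shape for such a translation step, sketched in Python rather than in the actor's Java: whitespace-aligned records are rewritten as strictly comma-delimited text before a limited importer sees them. The helper name is hypothetical, and the sketch assumes that no value contains embedded whitespace and that no field is empty.

```python
import csv
import io


def whitespace_to_csv(text):
    """Rewrite whitespace-aligned records as comma-delimited text.

    Hypothetical helper: runs of spaces and tabs are treated as a single
    separator, and leading/trailing padding is dropped.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()  # split on any run of whitespace
        if fields:             # skip blank lines
            writer.writerow(fields)
    return out.getvalue()


raw = "    ws  year  mo    precip \t\t\n     6  1963   6    67.500\n"
print(whitespace_to_csv(raw))
```

A file with genuinely empty fields would lose its column alignment under this approach, so it only suits data where every record has a value in every column.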
>
> Overall the EML data import has proven to be fragile, which we need to
> fix. There are several consistent problems:
>
> 1) EML200DataSource doesn't use all available metadata in the EML
> 2) The metadata describing the data schema is wrong wrt the
> data structure (e.g., too many or too few columns, wrong
> delimiters, etc.)
> 3) the data has missing values and therefore is not supported
> by EML200DS (and won't be till we support null tokens)
> 4) Select values in the data are illegal -- for example, columns
> described as a float contain a value like "NA" without a statement
> in EML that "NA" is used as a missing value code
> 5) The metadata says the data is available at a given URL, but when
> we try to download the data it is either not there, restricted
> access, or not actually a data file (usually a web page)
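
For problem 5, a cheap first test is to look at the leading bytes of whatever was downloaded. This is a heuristic sketch with a hypothetical function name; it inspects only the start of the payload and will not catch every error page.

```python
def looks_like_web_page(head):
    """Heuristically decide whether the first bytes of a download are HTML."""
    start = head.lstrip().lower()
    return start.startswith((b"<!doctype html", b"<html", b"<head", b"<body"))


print(looks_like_web_page(b"<!DOCTYPE html><html><body>Not found</body></html>"))
print(looks_like_web_page(b"    ws  year  mo\n     6  1963   6\n"))
```

The HTTP Content-Type header of the response is a second signal that could be checked alongside this one.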
>
> We need to systematically go through these issues and make this import
> robust. The only thing we can't really protect against is incorrect
> metadata. In that case, it would be nice if we gathered some
> information about what is wrong and possibly ask the user about the
> problem in order to adapt. For example, consider the following two
> questions we might be able to pose:
>
> "The metadata indicates that records are separated by newlines, but we
> have detected carriage returns in the data stream. Should we switch to
> using carriage returns as the record separator?"
>
> 'The "bodylength" column should contain numeric values, but we have
> detected repeated use of the value "N/A" in the data. Should this be
> considered a code for a missing value?'
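
The second question could be driven by a scan like the one below: collect the values in a numeric column that fail to parse and propose the most frequent one as a missing value code. This is a sketch only; the function name and the `min_repeats` threshold are hypothetical.

```python
from collections import Counter


def suggest_missing_value_code(values, min_repeats=2):
    """Return the most common non-numeric token in a numeric column, or None.

    Hypothetical helper; min_repeats is an arbitrary threshold.
    """
    non_numeric = Counter()
    for value in values:
        token = value.strip()
        try:
            float(token)
        except ValueError:
            non_numeric[token] += 1
    if not non_numeric:
        return None
    token, count = non_numeric.most_common(1)[0]
    return token if count >= min_repeats else None


column = ["12.5", "N/A", "13.1", "N/A", "11.9"]
code = suggest_missing_value_code(column)
if code is not None:
    print(f'Treat "{code}" as a missing value code?')
```

Note that Python's `float()` accepts strings such as "nan" and "inf", so those tokens would pass as numeric here and need separate handling.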
>
> There are probably a handful of common problems like this that we might
> be able to anticipate. Interestingly, these issues really can't be
> detected until the data is retrieved, which is especially a problem if
> we delay download of the data until the 'run' has started.
>
> Matt
>
> Kevin Ruland wrote:
>> Chad, et al.
>>
>> I looked at this. It appears the problem is the metadata for this
>> dataset does not accurately describe its structure.
>>
>> The metadata states:
>>
>> <physical scope="document">
>> <objectName>w6-pcp.txt</objectName>
>> <characterEncoding>ASCII</characterEncoding>
>> <dataFormat>
>> <textFormat>
>> <recordDelimiter>\n\r</recordDelimiter>
>> <attributeOrientation>column</attributeOrientation>
>> <simpleDelimited>
>> <fieldDelimiter>0x20</fieldDelimiter>
>> </simpleDelimited>
>> </textFormat>
>> </dataFormat>
>> <distribution scope="document">
>> <online>
>> <url
>> function="download">http://www.hubbardbrook.org/research/data/atmos/pcp_chem/w6-pcp.txt</url>
>> </online>
>> </distribution>
>> </physical>
>>
>> Note the following:
>>
>> The recordDelimiter is \n\r.
>> The fieldDelimiter is 0x20 (space).
>> It does not have a node physical/dataFormat/textFormat/numHeaderLines
>> which we interpret as meaning there are no header records.
>>
>> Here are the first two records of the dataset:
>>
>> ws year mo precip Ca Mg K Na Al
>> NH4 pH SO4 NO3 Cl PO4
>> Sio2
>> 6 1963 6 67.500 0.300 0.070 0.100 0.070 -3.000
>> -3.000 -3.00 -3.000 -3.000 -3.000 -3.0000 -3.000
>>
>> And its hexdump:
>>
>> 0000000: 2020 2020 7773 2020 7965 6172 2020 6d6f ws year mo
>> 0000010: 2020 2020 7072 6563 6970 2020 2020 2020 precip
>> 0000020: 4361 2020 2020 2020 4d67 2020 2020 2020 Ca Mg
>> 0000030: 204b 2020 2020 2020 4e61 2020 2020 2020 K Na
>> 0000040: 416c 2020 2020 204e 4834 2020 2020 2020 Al NH4
>> 0000050: 7048 2020 2020 2053 4f34 2020 2020 204e pH SO4 N
>> 0000060: 4f33 2020 2020 2020 436c 2020 2020 2050 O3 Cl P
>> 0000070: 4f34 2020 2020 5369 6f32 2020 2020 2020 O4 Sio2
>> 0000080: 2020 2020 2020 2020 2020 2020 2020 2020
>> 0000090: 2020 2020 0909 0920 2020 2009 0a20 2020 ... ..
>> 00000a0: 2020 3620 2031 3936 3320 2020 3620 2020 6 1963 6
>> 00000b0: 2036 372e 3530 3020 2020 302e 3330 3020 67.500 0.300
>> 00000c0: 2020 302e 3037 3020 2020 302e 3130 3020 0.070 0.100
>> 00000d0: 2020 302e 3037 3020 202d 332e 3030 3020 0.070 -3.000
>> 00000e0: 202d 332e 3030 3020 2020 2d33 2e30 3020 -3.000 -3.00
>> 00000f0: 202d 332e 3030 3020 202d 332e 3030 3020 -3.000 -3.000
>> 0000100: 202d 332e 3030 3020 2d33 2e30 3030 3020 -3.000 -3.0000
>> 0000110: 202d 332e 3030 300a 2020 2020 2036 2020 -3.000. 6
>>
>> The top record is the header. The metadata did not state it had a header.
>>
>> Note there are 4 spaces before 'ws' then 2 spaces before 'year'. There
>> is a bunch of white space padding (0x20 and 0x09 tabs). Also the line
>> terminator for this record is 0x0a (nl) whereas the metadata states the
>> line terminator is \n\r which should be (0x0a 0x0d).
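
A mismatch like this can be detected mechanically by counting the terminators that actually occur in the downloaded bytes and comparing them with the declared one. The sketch below uses hypothetical names and scans left to right, preferring two-byte sequences over single bytes.

```python
def count_line_terminators(data):
    """Count CRLF, LFCR, bare LF and bare CR terminators in a byte string."""
    counts = {"CRLF": 0, "LFCR": 0, "LF": 0, "CR": 0}
    two_byte = {b"\r\n": "CRLF", b"\n\r": "LFCR"}
    one_byte = {b"\n": "LF", b"\r": "CR"}
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        single = data[i:i + 1]
        if pair in two_byte:
            counts[two_byte[pair]] += 1
            i += 2
        elif single in one_byte:
            counts[one_byte[single]] += 1
            i += 1
        else:
            i += 1
    return counts


declared = "LFCR"  # the \n\r recordDelimiter stated in the metadata
sample = b"    ws  year  mo \t\t\n     6  1963   6\n     6  1963   7\n"
counts = count_line_terminators(sample)
detected = max(counts, key=counts.get)
if detected != declared:
    print(f"Metadata declares {declared} but the sample mostly uses {detected}: {counts}")
```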
>>
>> The second record is the first record of data. Its structure is: 5
>> spaces then '6'
>>
>> We are using HSQL's Text Table functionality. This essentially binds a
>> ddl definition in the database to an external data file (the text
>> file). The functionality provided by hsql is not flexible enough to
>> handle regular expressions for the column separators (which would be
>> required in this instance). It is also not possible to parse a fixed
>> format text file (which this particular file appears to be).
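
If the file really is fixed format, parsing it outside the database comes down to slicing each record at known column widths. The sketch below is illustrative only: the function name is hypothetical, and the four widths are read off the hexdump above for the first four columns, not taken from the dataset's metadata.

```python
def split_fixed_width(line, widths):
    """Slice a record into fields of the given character widths."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Rebuild the start of the first data record with right-justified fields:
# five spaces then '6', two spaces then '1963', and so on.
widths = [6, 6, 4, 10]
record = "6".rjust(6) + "1963".rjust(6) + "6".rjust(4) + "67.500".rjust(10)
print(split_fixed_width(record, widths))
```

The resulting field lists could then be inserted into the table row by row instead of binding the raw file as a text table.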
>>
>> Essentially there are two problems with this dataset: The metadata does
>> not adequately describe the data format, and the hsql functionality
>> does not support this type of text file structure.
>>
>> I'm not exactly certain what we can do here. I don't know if eml can
>> describe a fixed format file. And even if it did, we'd have to do some
>> work to parse the fixed format file and insert into the database table.
>>
>> Kevin
>>
>> Chad Berkley wrote:
>>
>>
>>> <?xml version="1.0" standalone="no"?>
>>> <!DOCTYPE entity PUBLIC "-//UC Berkeley//DTD MoML 1//EN"
>>> "http://ptolemy.eecs.berkeley.edu/xml/dtd/MoML_1.dtd">
>>> <entity name="model" class="ptolemy.actor.TypedCompositeActor">
>>> <property name="_createdBy" class="ptolemy.kernel.attributes.VersionAttribute" value="5.1-alpha">
>>> </property>
>>> <property name="SDF Director" class="ptolemy.domains.sdf.kernel.SDFDirector">
>>> <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>> <configure>../kepler-docs/dev/usability/graphics/svg/director.svg</configure>
>>> </property>
>>> <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>> <configure>/actorthumbs/director-sm.gif</configure>
>>> </property>
>>> <property name="timeResolution" class="ptolemy.moml.SharedParameter" value="1E-10">
>>> </property>
>>> <property name="entityId" class="org.kepler.moml.NamedObjId" value="urn:lsid:kepler-project.org:director:1:1">
>>> </property>
>>> <property name="class" class="ptolemy.kernel.util.StringAttribute" value="ptolemy.domains.sdf.kernel.SDFDirector">
>>> <property name="id" class="ptolemy.kernel.util.StringAttribute" value="urn:lsid:kepler-project.org:directorclass:1:1">
>>> </property>
>>> </property>
>>> <property name="semanticType000" class="org.kepler.sms.SemanticType" value="urn:lsid:localhost:onto:1:1#Director">
>>> </property>
>>> <property name="_location" class="ptolemy.kernel.util.Location" value="{160, 110}">
>>> </property>
>>> </property>
>>> <property name="_windowProperties" class="ptolemy.actor.gui.WindowPropertiesAttribute" value="{bounds={82, 22, 850, 732}, maximized=false}">
>>> </property>
>>> <property name="_vergilSize" class="ptolemy.actor.gui.SizeAttribute" value="[590, 610]">
>>> </property>
>>> <property name="_vergilZoomFactor" class="ptolemy.data.expr.ExpertParameter" value="1.0">
>>> </property>
>>> <property name="_vergilCenter" class="ptolemy.data.expr.ExpertParameter" value="{295.0, 305.0}">
>>> </property>
>>> <entity name="Display" class="ptolemy.actor.lib.gui.Display">
>>> <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>> <configure>../kepler-docs/dev/usability/graphics/svg/text_disp.svg</configure>
>>> </property>
>>> <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>> <configure>/actorthumbs/text_disp-sm.gif</configure>
>>> </property>
>>> <property name="rowsDisplayed" class="ptolemy.data.expr.Parameter" value="10">
>>> </property>
>>> <property name="columnsDisplayed" class="ptolemy.data.expr.Parameter" value="40">
>>> </property>
>>> <property name="suppressBlankLines" class="ptolemy.data.expr.Parameter" value="false">
>>> </property>
>>> <property name="_windowProperties" class="ptolemy.actor.gui.WindowPropertiesAttribute" value="{bounds={270, 279, 484, 209}, maximized=false}">
>>> </property>
>>> <property name="entityId" class="org.kepler.moml.NamedObjId" value="urn:lsid:kepler-project.org:actor:7:1">
>>> </property>
>>> <property name="class" class="ptolemy.kernel.util.StringAttribute" value="ptolemy.actor.lib.gui.Display">
>>> <property name="id" class="ptolemy.kernel.util.StringAttribute" value="urn:lsid:kepler-project.org:class:883:1">
>>> </property>
>>> </property>
>>> <property name="semanticType000" class="org.kepler.sms.SemanticType" value="urn:lsid:localhost:onto:1:1#TextualOutputActor">
>>> </property>
>>> <property name="_location" class="ptolemy.kernel.util.Location" value="{310, 345}">
>>> </property>
>>> </entity>
>>> <entity name="Chemistry of Bulk Precipitation at HBEF WS-2" class="org.ecoinformatics.seek.datasource.eml.eml2.Eml200DataSource">
>>> <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>> <configure>../kepler-docs/dev/usability/graphics/svg/dataFile.svg</configure>
>>> </property>
>>> <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>> <configure>/actorthumbs/dataFile-sm.gif</configure>
>>> </property>
>>> <property name="schemaDef" class="ptolemy.kernel.util.StringAttribute" value="<schema> <table name="w6-pcp"> <field name="ws" dataType="STRING"/> <field name="year" dataType="STRING"/> <field name="mo" dataType="STRING"/> <field name="precip" dataType="FLOAT"/> <field name="Ca" dataType="FLOAT"/> <field name="Mg" dataType="FLOAT"/> <field name="K" dataType="FLOAT"/> <field name="Na" dataType="FLOAT"/> <field name="Al" dataType="FLOAT"/> <field name="NH4" dataType="FLOAT"/> <field name="pH" dataType="FLOAT"/> <field name="SO4" dataType="FLOAT"/> <field name="NO3" dataType="FLOAT"/> <field name="Cl" dataType="FLOAT"/> <field name="PO4" dataType="FLOAT"/> <field name="SiO2" dataType="FLOAT"/> </table> </schema> ">
>>> <property name="schemaDef" class="ptolemy.actor.gui.style.TextStyle">
>>> <property name="height" class="ptolemy.data.expr.Parameter" value="10">
>>> </property>
>>> <property name="width" class="ptolemy.data.expr.Parameter" value="30">
>>> </property>
>>> </property>
>>> </property>
>>> <property name="sqlDef" class="ptolemy.kernel.util.StringAttribute">
>>> <property name="sqlDef" class="ptolemy.actor.gui.style.TextStyle">
>>> <property name="height" class="ptolemy.data.expr.Parameter" value="10">
>>> </property>
>>> <property name="width" class="ptolemy.data.expr.Parameter" value="30">
>>> </property>
>>> </property>
>>> </property>
>>> <property name="Selected Entity" class="ptolemy.data.expr.StringParameter" value="w6-pcp">
>>> </property>
>>> <property name="outputType" class="ptolemy.data.expr.StringParameter" value="As Field">
>>> </property>
>>> <property name="_tableauFactory" class="org.kepler.objectmanager.data.db.QBTableauFactory">
>>> <property name="sqlName" class="ptolemy.kernel.util.StringAttribute" value="sqlDef">
>>> </property>
>>> <property name="schemaName" class="ptolemy.kernel.util.StringAttribute" value="schemaDef">
>>> </property>
>>> </property>
>>> <property name="recordid" class="ptolemy.kernel.util.StringAttribute" value="knb-lter-hbr.20.1">
>>> </property>
>>> <property name="endpoint" class="ptolemy.kernel.util.StringAttribute" value="http://ecogrid.ecoinformatics.org/knb/services/EcoGridQuery">
>>> </property>
>>> <property name="namespace" class="ptolemy.kernel.util.StringAttribute" value="eml://ecoinformatics.org/eml-2.0.0">
>>> </property>
>>> <property name="w6-pcp" class="org.ecoinformatics.seek.ecogrid.ResultRecordDetail">
>>> </property>
>>> <property name="_location" class="ptolemy.kernel.util.Location" value="[140.0, 285.0]">
>>> </property>
>>> <port name="ws" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="year" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="mo" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="precip" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="Ca" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="Mg" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="K" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="Na" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="Al" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="NH4" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="pH" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="SO4" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="NO3" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="Cl" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="PO4" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> <port name="SiO2" class="ptolemy.actor.TypedIOPort">
>>> <property name="output"/>
>>> </port>
>>> </entity>
>>> <relation name="relation" class="ptolemy.actor.TypedIORelation">
>>> <property name="width" class="ptolemy.data.expr.Parameter" value="1">
>>> </property>
>>> </relation>
>>> <link port="Display.input" relation="relation"/>
>>> <link port="Chemistry of Bulk Precipitation at HBEF WS-2.pH" relation="relation"/>
>>> </entity>
>>>
>>>
>>
>> _______________________________________________
>> Kepler-dev mailing list
>> Kepler-dev at ecoinformatics.org
>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
>