[kepler-dev] Bad dataset from workshop. Re: wf

Matt Jones jones at nceas.ucsb.edu
Fri Jan 13 08:34:42 PST 2006


EML can indeed describe fixed-format files.  And I had thought that Jing 
had included support for those file formats in the EML200DataSource 
actor.  I'm not sure how he had set up the interaction with hsql but my 
impression was that it worked. Maybe he converts fixed format to 
delimited format before passing it to hsql.  Jing can shed more light 
here.  In any case, the metadata provider would need to provide accurate 
metadata stating that its a fixed format file if in indeed it is.

In this particular case, the format may not be fixed -- rather, it may 
be delimited with a variable number of delimiter tokens between data 
values.  I'd have to see more of the file to figure that out.  There are 
fields in EML to describe this case.

In general it seems that hsql's import routines are too limited to 
support the variety of data formats that EML describes, so we will 
probably need to do some translation in EML200DataSource before passing 
the data to hsql.  Again, Jing should be able to clarify here.

Overall the EML data import has proven to be fragile, which we need to 
fix.  There are several consistent problems:

1) EML200DataSource doesn't use all available metadata in the EML
2) The metadata describing the data schema is wrong wrt the
    data structure (e.g., too many or too few columns, wrong
    delimeters, etc)
3) the data has missing values and therefore is not supported
    by EML200DS (and won't be till we support null tokens)
4) Select values in the data are illegal -- for example, columnns
    described as a float contain a value like "NA" without a statement
    in EML that "NA" is used as a missing value code
5) The metadata says the data is available at a given URL, but when
    we try to download the data it is either not there, restricted
    access, or not actually a data file (usually a web page)

We need to systematically go through these issues and make this import 
robust.  The only thing we can't really protect against is incorrect 
metadata.  In that case, it would be nice if we gathered some 
information about what is wrong and possibly ask the user about the 
problem in order to adapt.  For example, consider the following two 
questions we might be able to pose:

"The metadata indicates that records are separated by newlines, but we 
have detected carriage returns in the data stream.  Should we switch to 
using carriage returns as the record separator?"

'The "bodylength" column should contian numeric values, but we have 
detected repeated use of the value "N/A" in the data.  Should this be 
considered a code for a missing value?'

There are probably a handful of common problems like this that we might 
be able to anticipate.  Interestingly, these issues really can't be 
detected until the data is retrieved, which is especially a problem if 
we delay download of the data until the 'run' has started.

Matt

Kevin Ruland wrote:
> Chad, et al.
> 
> I looked at this.  It appears the problem is the metadata for this
> dataset does not accurately describe its structure.
> 
> The metadata states:
> 
>             <physical scope="document">
>                 <objectName>w6-pcp.txt</objectName>
>                 <characterEncoding>ASCII</characterEncoding>
>                 <dataFormat>
>                     <textFormat>
>                         <recordDelimiter>\n\r</recordDelimiter>
>                         <attributeOrientation>column</attributeOrientation>
>                         <simpleDelimited>
>                             <fieldDelimiter>0x20</fieldDelimiter>
>                         </simpleDelimited>
>                     </textFormat>
>                 </dataFormat>
>                 <distribution scope="document">
>                     <online>
>                         <url
> function="download">http://www.hubbardbrook.org/research/data/atmos/pcp_chem/w6-pcp.txt</url>
>                     </online>
>                 </distribution>
>             </physical>
> 
> Note the following:
> 
> The recordDelimiter is \n\r.
> The fieldDelimiter is 0x20 (space).
> It does not have a node physical/dataFormat/textFormat/numHeaderLines
> which we interpret as meaning there are no header records.
> 
> Here are the first two records of the dataset:
> 
>     ws  year  mo    precip      Ca      Mg       K      Na      Al    
> NH4      pH     SO4     NO3      Cl     PO4   
> Sio2                                             
>      6  1963   6    67.500   0.300   0.070   0.100   0.070  -3.000 
> -3.000   -3.00  -3.000  -3.000  -3.000 -3.0000  -3.000
> 
> And it's hexdump:
> 
> 0000000: 2020 2020 7773 2020 7965 6172 2020 6d6f      ws  year  mo
> 0000010: 2020 2020 7072 6563 6970 2020 2020 2020      precip     
> 0000020: 4361 2020 2020 2020 4d67 2020 2020 2020  Ca      Mg     
> 0000030: 204b 2020 2020 2020 4e61 2020 2020 2020   K      Na     
> 0000040: 416c 2020 2020 204e 4834 2020 2020 2020  Al     NH4     
> 0000050: 7048 2020 2020 2053 4f34 2020 2020 204e  pH     SO4     N
> 0000060: 4f33 2020 2020 2020 436c 2020 2020 2050  O3      Cl     P
> 0000070: 4f34 2020 2020 5369 6f32 2020 2020 2020  O4    Sio2     
> 0000080: 2020 2020 2020 2020 2020 2020 2020 2020                 
> 0000090: 2020 2020 0909 0920 2020 2009 0a20 2020      ...    ..  
> 00000a0: 2020 3620 2031 3936 3320 2020 3620 2020    6  1963   6  
> 00000b0: 2036 372e 3530 3020 2020 302e 3330 3020   67.500   0.300
> 00000c0: 2020 302e 3037 3020 2020 302e 3130 3020    0.070   0.100
> 00000d0: 2020 302e 3037 3020 202d 332e 3030 3020    0.070  -3.000
> 00000e0: 202d 332e 3030 3020 2020 2d33 2e30 3020   -3.000   -3.00
> 00000f0: 202d 332e 3030 3020 202d 332e 3030 3020   -3.000  -3.000
> 0000100: 202d 332e 3030 3020 2d33 2e30 3030 3020   -3.000 -3.0000
> 0000110: 202d 332e 3030 300a 2020 2020 2036 2020   -3.000.     6 
> 
> The top record is the header.  The metadata did not state it had a header.
> 
> Note there are 4 spaces before 'ws' then 2 spaces before 'year'.  There
> is a bunch of white space padding (0x20 and 0x09 tabs).  Also the line
> terminator for this record is 0x0a (nl) whereas the metadata states the
> line terminator is \n\r which should be (0x0a 0x0d).
> 
> The second record is the first record of data.  It's structure is:  5
> spaces then '6'
> 
> We are using HSQL's Text Table functionality.  This essentially binds a
> ddl definition in the database to an external data file (the text
> file).  The functionality provided by hsql is not flexible enough to
> handle regular expressions for the column seperators (which would be
> required in this instance).  It is also not possible to parse a fixed
> format text file (which this particular file appears to be).
> 
> Essentially there are two problems with this dataset:  The metadata does
> not adequately describe the data format, and, the hsql functionality
> does not support this type of text file structure.
> 
> I'm not exactly certain what we can do here.  I don't know if eml can
> describe a fixed format file.  And even if it did, we'd have to do some
> work to parse the fixed format file and insert into the database table.
> 
> Kevin
> 
> Chad Berkley wrote:
> 
> 
>><?xml version="1.0" standalone="no"?>
>><!DOCTYPE entity PUBLIC "-//UC Berkeley//DTD MoML 1//EN"
>>   "http://ptolemy.eecs.berkeley.edu/xml/dtd/MoML_1.dtd">
>><entity name="model" class="ptolemy.actor.TypedCompositeActor">
>>   <property name="_createdBy" class="ptolemy.kernel.attributes.VersionAttribute" value="5.1-alpha">
>>   </property>
>>   <property name="SDF Director" class="ptolemy.domains.sdf.kernel.SDFDirector">
>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>           <configure>../kepler-docs/dev/usability/graphics/svg/director.svg</configure>
>>       </property>
>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>           <configure>/actorthumbs/director-sm.gif</configure>
>>       </property>
>>       <property name="timeResolution" class="ptolemy.moml.SharedParameter" value="1E-10">
>>       </property>
>>       <property name="entityId" class="org.kepler.moml.NamedObjId" value="urn:lsid:kepler-project.org:director:1:1">
>>       </property>
>>       <property name="class" class="ptolemy.kernel.util.StringAttribute" value="ptolemy.domains.sdf.kernel.SDFDirector">
>>           <property name="id" class="ptolemy.kernel.util.StringAttribute" value="urn:lsid:kepler-project.org:directorclass:1:1">
>>           </property>
>>       </property>
>>       <property name="semanticType000" class="org.kepler.sms.SemanticType" value="urn:lsid:localhost:onto:1:1#Director">
>>       </property>
>>       <property name="_location" class="ptolemy.kernel.util.Location" value="{160, 110}">
>>       </property>
>>   </property>
>>   <property name="_windowProperties" class="ptolemy.actor.gui.WindowPropertiesAttribute" value="{bounds={82, 22, 850, 732}, maximized=false}">
>>   </property>
>>   <property name="_vergilSize" class="ptolemy.actor.gui.SizeAttribute" value="[590, 610]">
>>   </property>
>>   <property name="_vergilZoomFactor" class="ptolemy.data.expr.ExpertParameter" value="1.0">
>>   </property>
>>   <property name="_vergilCenter" class="ptolemy.data.expr.ExpertParameter" value="{295.0, 305.0}">
>>   </property>
>>   <entity name="Display" class="ptolemy.actor.lib.gui.Display">
>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>           <configure>../kepler-docs/dev/usability/graphics/svg/text_disp.svg</configure>
>>       </property>
>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>           <configure>/actorthumbs/text_disp-sm.gif</configure>
>>       </property>
>>       <property name="rowsDisplayed" class="ptolemy.data.expr.Parameter" value="10">
>>       </property>
>>       <property name="columnsDisplayed" class="ptolemy.data.expr.Parameter" value="40">
>>       </property>
>>       <property name="suppressBlankLines" class="ptolemy.data.expr.Parameter" value="false">
>>       </property>
>>       <property name="_windowProperties" class="ptolemy.actor.gui.WindowPropertiesAttribute" value="{bounds={270, 279, 484, 209}, maximized=false}">
>>       </property>
>>       <property name="entityId" class="org.kepler.moml.NamedObjId" value="urn:lsid:kepler-project.org:actor:7:1">
>>       </property>
>>       <property name="class" class="ptolemy.kernel.util.StringAttribute" value="ptolemy.actor.lib.gui.Display">
>>           <property name="id" class="ptolemy.kernel.util.StringAttribute" value="urn:lsid:kepler-project.org:class:883:1">
>>           </property>
>>       </property>
>>       <property name="semanticType000" class="org.kepler.sms.SemanticType" value="urn:lsid:localhost:onto:1:1#TextualOutputActor">
>>       </property>
>>       <property name="_location" class="ptolemy.kernel.util.Location" value="{310, 345}">
>>       </property>
>>   </entity>
>>   <entity name="Chemistry of Bulk Precipitation at HBEF WS-2" class="org.ecoinformatics.seek.datasource.eml.eml2.Eml200DataSource">
>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>           <configure>../kepler-docs/dev/usability/graphics/svg/dataFile.svg</configure>
>>       </property>
>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>           <configure>/actorthumbs/dataFile-sm.gif</configure>
>>       </property>
>>       <property name="schemaDef" class="ptolemy.kernel.util.StringAttribute" value="&lt;schema&gt;&#10;  &lt;table name=&quot;w6-pcp&quot;&gt;&#10;    &lt;field name=&quot;ws&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;year&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;mo&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;precip&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Ca&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Mg&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;K&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Na&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Al&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;NH4&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;pH&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;SO4&quot; dataType=&quot;FLOAT&quot;
/&gt;&#10;    &lt;field name=&quot;NO3&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Cl&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;PO4&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;SiO2&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;  &lt;/table&gt;&#10;&lt;/schema&gt;&#10;">
>>           <property name="schemaDef" class="ptolemy.actor.gui.style.TextStyle">
>>               <property name="height" class="ptolemy.data.expr.Parameter" value="10">
>>               </property>
>>               <property name="width" class="ptolemy.data.expr.Parameter" value="30">
>>               </property>
>>           </property>
>>       </property>
>>       <property name="sqlDef" class="ptolemy.kernel.util.StringAttribute">
>>           <property name="sqlDef" class="ptolemy.actor.gui.style.TextStyle">
>>               <property name="height" class="ptolemy.data.expr.Parameter" value="10">
>>               </property>
>>               <property name="width" class="ptolemy.data.expr.Parameter" value="30">
>>               </property>
>>           </property>
>>       </property>
>>       <property name="Selected Entity" class="ptolemy.data.expr.StringParameter" value="w6-pcp">
>>       </property>
>>       <property name="outputType" class="ptolemy.data.expr.StringParameter" value="As Field">
>>       </property>
>>       <property name="_tableauFactory" class="org.kepler.objectmanager.data.db.QBTableauFactory">
>>           <property name="sqlName" class="ptolemy.kernel.util.StringAttribute" value="sqlDef">
>>           </property>
>>           <property name="schemaName" class="ptolemy.kernel.util.StringAttribute" value="schemaDef">
>>           </property>
>>       </property>
>>       <property name="recordid" class="ptolemy.kernel.util.StringAttribute" value="knb-lter-hbr.20.1">
>>       </property>
>>       <property name="endpoint" class="ptolemy.kernel.util.StringAttribute" value="http://ecogrid.ecoinformatics.org/knb/services/EcoGridQuery">
>>       </property>
>>       <property name="namespace" class="ptolemy.kernel.util.StringAttribute" value="eml://ecoinformatics.org/eml-2.0.0">
>>       </property>
>>       <property name="w6-pcp" class="org.ecoinformatics.seek.ecogrid.ResultRecordDetail">
>>       </property>
>>       <property name="_location" class="ptolemy.kernel.util.Location" value="[140.0, 285.0]">
>>       </property>
>>       <port name="ws" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="year" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="mo" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="precip" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="Ca" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="Mg" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="K" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="Na" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="Al" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="NH4" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="pH" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="SO4" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="NO3" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="Cl" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="PO4" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>       <port name="SiO2" class="ptolemy.actor.TypedIOPort">
>>           <property name="output"/>
>>       </port>
>>   </entity>
>>   <relation name="relation" class="ptolemy.actor.TypedIORelation">
>>       <property name="width" class="ptolemy.data.expr.Parameter" value="1">
>>       </property>
>>   </relation>
>>   <link port="Display.input" relation="relation"/>
>>   <link port="Chemistry of Bulk Precipitation at HBEF WS-2.pH" relation="relation"/>
>></entity>
>> 
>>
> 
> 
> _______________________________________________
> Kepler-dev mailing list
> Kepler-dev at ecoinformatics.org
> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Matt Jones                                   Ph: 907-789-0496
jones at nceas.ucsb.edu                    SIP #: 1-747-626-7082
National Center for Ecological Analysis and Synthesis (NCEAS)
UC Santa Barbara     http://www.nceas.ucsb.edu/ecoinformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


More information about the Kepler-dev mailing list