[kepler-dev] Bad dataset from workshop. Re: wf

Shawn Bowers sbowers at ucdavis.edu
Fri Jan 13 08:41:55 PST 2006


Just a quick observation: Wouldn't it also be a good thing to detect
these problems when the dataset/metadata is provided to a repository
such as EcoGrid?  This would probably avoid a lot of these problems.
Of course, it doesn't solve the case where the metadata was created
locally and loaded into Kepler ...

-shawn

Matt Jones wrote:
> EML can indeed describe fixed-format files.  And I had thought that Jing 
> had included support for those file formats in the EML200DataSource 
> actor.  I'm not sure how he had set up the interaction with hsql but my 
> impression was that it worked. Maybe he converts fixed format to 
> delimited format before passing it to hsql.  Jing can shed more light 
> here.  In any case, the metadata provider would need to provide accurate 
> metadata stating that its a fixed format file if in indeed it is.
> 
> In this particular case, the format may not be fixed -- rather, it may 
> be delimited with a variable number of delimiter tokens between data 
> values.  I'd have to see more of the file to figure that out.  There are 
> fields in EML to describe this case.
> 
> In general it seems that hsql's import routines are too limited to 
> support the variety of data formats that EML describes, so we will 
> probably need to do some translation in EML200DataSource before passing 
> the data to hsql.  Again, Jing should be able to clarify here.
> 
> Overall the EML data import has proven to be fragile, which we need to 
> fix.  There are several consistent problems:
> 
> 1) EML200DataSource doesn't use all available metadata in the EML
> 2) The metadata describing the data schema is wrong wrt the
>     data structure (e.g., too many or too few columns, wrong
>     delimeters, etc)
> 3) the data has missing values and therefore is not supported
>     by EML200DS (and won't be till we support null tokens)
> 4) Select values in the data are illegal -- for example, columnns
>     described as a float contain a value like "NA" without a statement
>     in EML that "NA" is used as a missing value code
> 5) The metadata says the data is available at a given URL, but when
>     we try to download the data it is either not there, restricted
>     access, or not actually a data file (usually a web page)
> 
> We need to systematically go through these issues and make this import 
> robust.  The only thing we can't really protect against is incorrect 
> metadata.  In that case, it would be nice if we gathered some 
> information about what is wrong and possibly ask the user about the 
> problem in order to adapt.  For example, consider the following two 
> questions we might be able to pose:
> 
> "The metadata indicates that records are separated by newlines, but we 
> have detected carriage returns in the data stream.  Should we switch to 
> using carriage returns as the record separator?"
> 
> 'The "bodylength" column should contian numeric values, but we have 
> detected repeated use of the value "N/A" in the data.  Should this be 
> considered a code for a missing value?'
> 
> There are probably a handful of common problems like this that we might 
> be able to anticipate.  Interestingly, these issues really can't be 
> detected until the data is retrieved, which is especially a problem if 
> we delay download of the data until the 'run' has started.
> 
> Matt
> 
> Kevin Ruland wrote:
>> Chad, et al.
>>
>> I looked at this.  It appears the problem is the metadata for this
>> dataset does not accurately describe its structure.
>>
>> The metadata states:
>>
>>             <physical scope="document">
>>                 <objectName>w6-pcp.txt</objectName>
>>                 <characterEncoding>ASCII</characterEncoding>
>>                 <dataFormat>
>>                     <textFormat>
>>                         <recordDelimiter>\n\r</recordDelimiter>
>>                         <attributeOrientation>column</attributeOrientation>
>>                         <simpleDelimited>
>>                             <fieldDelimiter>0x20</fieldDelimiter>
>>                         </simpleDelimited>
>>                     </textFormat>
>>                 </dataFormat>
>>                 <distribution scope="document">
>>                     <online>
>>                         <url
>> function="download">http://www.hubbardbrook.org/research/data/atmos/pcp_chem/w6-pcp.txt</url>
>>                     </online>
>>                 </distribution>
>>             </physical>
>>
>> Note the following:
>>
>> The recordDelimiter is \n\r.
>> The fieldDelimiter is 0x20 (space).
>> It does not have a node physical/dataFormat/textFormat/numHeaderLines
>> which we interpret as meaning there are no header records.
>>
>> Here are the first two records of the dataset:
>>
>>     ws  year  mo    precip      Ca      Mg       K      Na      Al    
>> NH4      pH     SO4     NO3      Cl     PO4   
>> Sio2                                             
>>      6  1963   6    67.500   0.300   0.070   0.100   0.070  -3.000 
>> -3.000   -3.00  -3.000  -3.000  -3.000 -3.0000  -3.000
>>
>> And it's hexdump:
>>
>> 0000000: 2020 2020 7773 2020 7965 6172 2020 6d6f      ws  year  mo
>> 0000010: 2020 2020 7072 6563 6970 2020 2020 2020      precip     
>> 0000020: 4361 2020 2020 2020 4d67 2020 2020 2020  Ca      Mg     
>> 0000030: 204b 2020 2020 2020 4e61 2020 2020 2020   K      Na     
>> 0000040: 416c 2020 2020 204e 4834 2020 2020 2020  Al     NH4     
>> 0000050: 7048 2020 2020 2053 4f34 2020 2020 204e  pH     SO4     N
>> 0000060: 4f33 2020 2020 2020 436c 2020 2020 2050  O3      Cl     P
>> 0000070: 4f34 2020 2020 5369 6f32 2020 2020 2020  O4    Sio2     
>> 0000080: 2020 2020 2020 2020 2020 2020 2020 2020                 
>> 0000090: 2020 2020 0909 0920 2020 2009 0a20 2020      ...    ..  
>> 00000a0: 2020 3620 2031 3936 3320 2020 3620 2020    6  1963   6  
>> 00000b0: 2036 372e 3530 3020 2020 302e 3330 3020   67.500   0.300
>> 00000c0: 2020 302e 3037 3020 2020 302e 3130 3020    0.070   0.100
>> 00000d0: 2020 302e 3037 3020 202d 332e 3030 3020    0.070  -3.000
>> 00000e0: 202d 332e 3030 3020 2020 2d33 2e30 3020   -3.000   -3.00
>> 00000f0: 202d 332e 3030 3020 202d 332e 3030 3020   -3.000  -3.000
>> 0000100: 202d 332e 3030 3020 2d33 2e30 3030 3020   -3.000 -3.0000
>> 0000110: 202d 332e 3030 300a 2020 2020 2036 2020   -3.000.     6 
>>
>> The top record is the header.  The metadata did not state it had a header.
>>
>> Note there are 4 spaces before 'ws' then 2 spaces before 'year'.  There
>> is a bunch of white space padding (0x20 and 0x09 tabs).  Also the line
>> terminator for this record is 0x0a (nl) whereas the metadata states the
>> line terminator is \n\r which should be (0x0a 0x0d).
>>
>> The second record is the first record of data.  It's structure is:  5
>> spaces then '6'
>>
>> We are using HSQL's Text Table functionality.  This essentially binds a
>> ddl definition in the database to an external data file (the text
>> file).  The functionality provided by hsql is not flexible enough to
>> handle regular expressions for the column seperators (which would be
>> required in this instance).  It is also not possible to parse a fixed
>> format text file (which this particular file appears to be).
>>
>> Essentially there are two problems with this dataset:  The metadata does
>> not adequately describe the data format, and, the hsql functionality
>> does not support this type of text file structure.
>>
>> I'm not exactly certain what we can do here.  I don't know if eml can
>> describe a fixed format file.  And even if it did, we'd have to do some
>> work to parse the fixed format file and insert into the database table.
>>
>> Kevin
>>
>> Chad Berkley wrote:
>>
>>
>>> <?xml version="1.0" standalone="no"?>
>>> <!DOCTYPE entity PUBLIC "-//UC Berkeley//DTD MoML 1//EN"
>>>   "http://ptolemy.eecs.berkeley.edu/xml/dtd/MoML_1.dtd">
>>> <entity name="model" class="ptolemy.actor.TypedCompositeActor">
>>>   <property name="_createdBy" class="ptolemy.kernel.attributes.VersionAttribute" value="5.1-alpha">
>>>   </property>
>>>   <property name="SDF Director" class="ptolemy.domains.sdf.kernel.SDFDirector">
>>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>../kepler-docs/dev/usability/graphics/svg/director.svg</configure>
>>>       </property>
>>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>/actorthumbs/director-sm.gif</configure>
>>>       </property>
>>>       <property name="timeResolution" class="ptolemy.moml.SharedParameter" value="1E-10">
>>>       </property>
>>>       <property name="entityId" class="org.kepler.moml.NamedObjId" value="urn:lsid:kepler-project.org:director:1:1">
>>>       </property>
>>>       <property name="class" class="ptolemy.kernel.util.StringAttribute" value="ptolemy.domains.sdf.kernel.SDFDirector">
>>>           <property name="id" class="ptolemy.kernel.util.StringAttribute" value="urn:lsid:kepler-project.org:directorclass:1:1">
>>>           </property>
>>>       </property>
>>>       <property name="semanticType000" class="org.kepler.sms.SemanticType" value="urn:lsid:localhost:onto:1:1#Director">
>>>       </property>
>>>       <property name="_location" class="ptolemy.kernel.util.Location" value="{160, 110}">
>>>       </property>
>>>   </property>
>>>   <property name="_windowProperties" class="ptolemy.actor.gui.WindowPropertiesAttribute" value="{bounds={82, 22, 850, 732}, maximized=false}">
>>>   </property>
>>>   <property name="_vergilSize" class="ptolemy.actor.gui.SizeAttribute" value="[590, 610]">
>>>   </property>
>>>   <property name="_vergilZoomFactor" class="ptolemy.data.expr.ExpertParameter" value="1.0">
>>>   </property>
>>>   <property name="_vergilCenter" class="ptolemy.data.expr.ExpertParameter" value="{295.0, 305.0}">
>>>   </property>
>>>   <entity name="Display" class="ptolemy.actor.lib.gui.Display">
>>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>../kepler-docs/dev/usability/graphics/svg/text_disp.svg</configure>
>>>       </property>
>>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>/actorthumbs/text_disp-sm.gif</configure>
>>>       </property>
>>>       <property name="rowsDisplayed" class="ptolemy.data.expr.Parameter" value="10">
>>>       </property>
>>>       <property name="columnsDisplayed" class="ptolemy.data.expr.Parameter" value="40">
>>>       </property>
>>>       <property name="suppressBlankLines" class="ptolemy.data.expr.Parameter" value="false">
>>>       </property>
>>>       <property name="_windowProperties" class="ptolemy.actor.gui.WindowPropertiesAttribute" value="{bounds={270, 279, 484, 209}, maximized=false}">
>>>       </property>
>>>       <property name="entityId" class="org.kepler.moml.NamedObjId" value="urn:lsid:kepler-project.org:actor:7:1">
>>>       </property>
>>>       <property name="class" class="ptolemy.kernel.util.StringAttribute" value="ptolemy.actor.lib.gui.Display">
>>>           <property name="id" class="ptolemy.kernel.util.StringAttribute" value="urn:lsid:kepler-project.org:class:883:1">
>>>           </property>
>>>       </property>
>>>       <property name="semanticType000" class="org.kepler.sms.SemanticType" value="urn:lsid:localhost:onto:1:1#TextualOutputActor">
>>>       </property>
>>>       <property name="_location" class="ptolemy.kernel.util.Location" value="{310, 345}">
>>>       </property>
>>>   </entity>
>>>   <entity name="Chemistry of Bulk Precipitation at HBEF WS-2" class="org.ecoinformatics.seek.datasource.eml.eml2.Eml200DataSource">
>>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>../kepler-docs/dev/usability/graphics/svg/dataFile.svg</configure>
>>>       </property>
>>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>/actorthumbs/dataFile-sm.gif</configure>
>>>       </property>
>>>       <property name="schemaDef" class="ptolemy.kernel.util.StringAttribute" value="&lt;schema&gt;&#10;  &lt;table name=&quot;w6-pcp&quot;&gt;&#10;    &lt;field name=&quot;ws&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;year&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;mo&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;precip&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Ca&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Mg&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;K&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Na&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Al&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;NH4&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;pH&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;SO4&quot; dataType=&quot;FLOAT&quo
t;
> /&gt;&#10;    &lt;field name=&quot;NO3&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Cl&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;PO4&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;SiO2&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;  &lt;/table&gt;&#10;&lt;/schema&gt;&#10;">
>>>           <property name="schemaDef" class="ptolemy.actor.gui.style.TextStyle">
>>>               <property name="height" class="ptolemy.data.expr.Parameter" value="10">
>>>               </property>
>>>               <property name="width" class="ptolemy.data.expr.Parameter" value="30">
>>>               </property>
>>>           </property>
>>>       </property>
>>>       <property name="sqlDef" class="ptolemy.kernel.util.StringAttribute">
>>>           <property name="sqlDef" class="ptolemy.actor.gui.style.TextStyle">
>>>               <property name="height" class="ptolemy.data.expr.Parameter" value="10">
>>>               </property>
>>>               <property name="width" class="ptolemy.data.expr.Parameter" value="30">
>>>               </property>
>>>           </property>
>>>       </property>
>>>       <property name="Selected Entity" class="ptolemy.data.expr.StringParameter" value="w6-pcp">
>>>       </property>
>>>       <property name="outputType" class="ptolemy.data.expr.StringParameter" value="As Field">
>>>       </property>
>>>       <property name="_tableauFactory" class="org.kepler.objectmanager.data.db.QBTableauFactory">
>>>           <property name="sqlName" class="ptolemy.kernel.util.StringAttribute" value="sqlDef">
>>>           </property>
>>>           <property name="schemaName" class="ptolemy.kernel.util.StringAttribute" value="schemaDef">
>>>           </property>
>>>       </property>
>>>       <property name="recordid" class="ptolemy.kernel.util.StringAttribute" value="knb-lter-hbr.20.1">
>>>       </property>
>>>       <property name="endpoint" class="ptolemy.kernel.util.StringAttribute" value="http://ecogrid.ecoinformatics.org/knb/services/EcoGridQuery">
>>>       </property>
>>>       <property name="namespace" class="ptolemy.kernel.util.StringAttribute" value="eml://ecoinformatics.org/eml-2.0.0">
>>>       </property>
>>>       <property name="w6-pcp" class="org.ecoinformatics.seek.ecogrid.ResultRecordDetail">
>>>       </property>
>>>       <property name="_location" class="ptolemy.kernel.util.Location" value="[140.0, 285.0]">
>>>       </property>
>>>       <port name="ws" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="year" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="mo" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="precip" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Ca" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Mg" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="K" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Na" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Al" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="NH4" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="pH" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="SO4" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="NO3" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Cl" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="PO4" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="SiO2" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>   </entity>
>>>   <relation name="relation" class="ptolemy.actor.TypedIORelation">
>>>       <property name="width" class="ptolemy.data.expr.Parameter" value="1">
>>>       </property>
>>>   </relation>
>>>   <link port="Display.input" relation="relation"/>
>>>   <link port="Chemistry of Bulk Precipitation at HBEF WS-2.pH" relation="relation"/>
>>> </entity>
>>>
>>>
>>
>> _______________________________________________
>> Kepler-dev mailing list
>> Kepler-dev at ecoinformatics.org
>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
> 



More information about the Kepler-dev mailing list