[kepler-dev] Bad dataset from workshop. Re: wf

Jing Tao tao at nceas.ucsb.edu
Fri Jan 13 11:44:35 PST 2006


hi, matt:

Since hsqldb doesn't support fixed delimiter text file table, so i have to 
use fixed delimiter reader to read data file and load it into hsqldb 
(this is not in memory, it is in hsqldb data file). So kepler can handle 
dataset from hsqldb.

but this case is simple delimited dataset, we only make a mapping between 
datset file and hsqldb table.

Jing

Jing Tao
National Center for Ecological
Analysis and Synthesis (NCEAS)
735 State St. Suite 204
Santa Barbara, CA 93101

On Fri, 13 Jan 2006, Matt Jones wrote:

> Date: Fri, 13 Jan 2006 07:34:42 -0900
> From: Matt Jones <jones at nceas.ucsb.edu>
> To: Kevin Ruland <kruland at ku.edu>
> Cc: Kepler-Dev <kepler-dev at ecoinformatics.org>
> Subject: Re: [kepler-dev] Bad dataset from workshop.  Re: wf
> 
> EML can indeed describe fixed-format files.  And I had thought that Jing
> had included support for those file formats in the EML200DataSource
> actor.  I'm not sure how he had set up the interaction with hsql but my
> impression was that it worked. Maybe he converts fixed format to
> delimited format before passing it to hsql.  Jing can shed more light
> here.  In any case, the metadata provider would need to provide accurate
> metadata stating that its a fixed format file if in indeed it is.
>
> In this particular case, the format may not be fixed -- rather, it may
> be delimited with a variable number of delimiter tokens between data
> values.  I'd have to see more of the file to figure that out.  There are
> fields in EML to describe this case.
>
> In general it seems that hsql's import routines are too limited to
> support the variety of data formats that EML describes, so we will
> probably need to do some translation in EML200DataSource before passing
> the data to hsql.  Again, Jing should be able to clarify here.
>
> Overall the EML data import has proven to be fragile, which we need to
> fix.  There are several consistent problems:
>
> 1) EML200DataSource doesn't use all available metadata in the EML
> 2) The metadata describing the data schema is wrong wrt the
>    data structure (e.g., too many or too few columns, wrong
>    delimeters, etc)
> 3) the data has missing values and therefore is not supported
>    by EML200DS (and won't be till we support null tokens)
> 4) Select values in the data are illegal -- for example, columnns
>    described as a float contain a value like "NA" without a statement
>    in EML that "NA" is used as a missing value code
> 5) The metadata says the data is available at a given URL, but when
>    we try to download the data it is either not there, restricted
>    access, or not actually a data file (usually a web page)
>
> We need to systematically go through these issues and make this import
> robust.  The only thing we can't really protect against is incorrect
> metadata.  In that case, it would be nice if we gathered some
> information about what is wrong and possibly ask the user about the
> problem in order to adapt.  For example, consider the following two
> questions we might be able to pose:
>
> "The metadata indicates that records are separated by newlines, but we
> have detected carriage returns in the data stream.  Should we switch to
> using carriage returns as the record separator?"
>
> 'The "bodylength" column should contian numeric values, but we have
> detected repeated use of the value "N/A" in the data.  Should this be
> considered a code for a missing value?'
>
> There are probably a handful of common problems like this that we might
> be able to anticipate.  Interestingly, these issues really can't be
> detected until the data is retrieved, which is especially a problem if
> we delay download of the data until the 'run' has started.
>
> Matt
>
> Kevin Ruland wrote:
>> Chad, et al.
>>
>> I looked at this.  It appears the problem is the metadata for this
>> dataset does not accurately describe its structure.
>>
>> The metadata states:
>>
>>             <physical scope="document">
>>                 <objectName>w6-pcp.txt</objectName>
>>                 <characterEncoding>ASCII</characterEncoding>
>>                 <dataFormat>
>>                     <textFormat>
>>                         <recordDelimiter>\n\r</recordDelimiter>
>>                         <attributeOrientation>column</attributeOrientation>
>>                         <simpleDelimited>
>>                             <fieldDelimiter>0x20</fieldDelimiter>
>>                         </simpleDelimited>
>>                     </textFormat>
>>                 </dataFormat>
>>                 <distribution scope="document">
>>                     <online>
>>                         <url
>> function="download">http://www.hubbardbrook.org/research/data/atmos/pcp_chem/w6-pcp.txt</url>
>>                     </online>
>>                 </distribution>
>>             </physical>
>>
>> Note the following:
>>
>> The recordDelimiter is \n\r.
>> The fieldDelimiter is 0x20 (space).
>> It does not have a node physical/dataFormat/textFormat/numHeaderLines
>> which we interpret as meaning there are no header records.
>>
>> Here are the first two records of the dataset:
>>
>>     ws  year  mo    precip      Ca      Mg       K      Na      Al
>> NH4      pH     SO4     NO3      Cl     PO4
>> Sio2
>>      6  1963   6    67.500   0.300   0.070   0.100   0.070  -3.000
>> -3.000   -3.00  -3.000  -3.000  -3.000 -3.0000  -3.000
>>
>> And it's hexdump:
>>
>> 0000000: 2020 2020 7773 2020 7965 6172 2020 6d6f      ws  year  mo
>> 0000010: 2020 2020 7072 6563 6970 2020 2020 2020      precip
>> 0000020: 4361 2020 2020 2020 4d67 2020 2020 2020  Ca      Mg
>> 0000030: 204b 2020 2020 2020 4e61 2020 2020 2020   K      Na
>> 0000040: 416c 2020 2020 204e 4834 2020 2020 2020  Al     NH4
>> 0000050: 7048 2020 2020 2053 4f34 2020 2020 204e  pH     SO4     N
>> 0000060: 4f33 2020 2020 2020 436c 2020 2020 2050  O3      Cl     P
>> 0000070: 4f34 2020 2020 5369 6f32 2020 2020 2020  O4    Sio2
>> 0000080: 2020 2020 2020 2020 2020 2020 2020 2020
>> 0000090: 2020 2020 0909 0920 2020 2009 0a20 2020      ...    ..
>> 00000a0: 2020 3620 2031 3936 3320 2020 3620 2020    6  1963   6
>> 00000b0: 2036 372e 3530 3020 2020 302e 3330 3020   67.500   0.300
>> 00000c0: 2020 302e 3037 3020 2020 302e 3130 3020    0.070   0.100
>> 00000d0: 2020 302e 3037 3020 202d 332e 3030 3020    0.070  -3.000
>> 00000e0: 202d 332e 3030 3020 2020 2d33 2e30 3020   -3.000   -3.00
>> 00000f0: 202d 332e 3030 3020 202d 332e 3030 3020   -3.000  -3.000
>> 0000100: 202d 332e 3030 3020 2d33 2e30 3030 3020   -3.000 -3.0000
>> 0000110: 202d 332e 3030 300a 2020 2020 2036 2020   -3.000.     6
>>
>> The top record is the header.  The metadata did not state it had a header.
>>
>> Note there are 4 spaces before 'ws' then 2 spaces before 'year'.  There
>> is a bunch of white space padding (0x20 and 0x09 tabs).  Also the line
>> terminator for this record is 0x0a (nl) whereas the metadata states the
>> line terminator is \n\r which should be (0x0a 0x0d).
>>
>> The second record is the first record of data.  It's structure is:  5
>> spaces then '6'
>>
>> We are using HSQL's Text Table functionality.  This essentially binds a
>> ddl definition in the database to an external data file (the text
>> file).  The functionality provided by hsql is not flexible enough to
>> handle regular expressions for the column seperators (which would be
>> required in this instance).  It is also not possible to parse a fixed
>> format text file (which this particular file appears to be).
>>
>> Essentially there are two problems with this dataset:  The metadata does
>> not adequately describe the data format, and, the hsql functionality
>> does not support this type of text file structure.
>>
>> I'm not exactly certain what we can do here.  I don't know if eml can
>> describe a fixed format file.  And even if it did, we'd have to do some
>> work to parse the fixed format file and insert into the database table.
>>
>> Kevin
>>
>> Chad Berkley wrote:
>>
>>
>>> <?xml version="1.0" standalone="no"?>
>>> <!DOCTYPE entity PUBLIC "-//UC Berkeley//DTD MoML 1//EN"
>>>   "http://ptolemy.eecs.berkeley.edu/xml/dtd/MoML_1.dtd">
>>> <entity name="model" class="ptolemy.actor.TypedCompositeActor">
>>>   <property name="_createdBy" class="ptolemy.kernel.attributes.VersionAttribute" value="5.1-alpha">
>>>   </property>
>>>   <property name="SDF Director" class="ptolemy.domains.sdf.kernel.SDFDirector">
>>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>../kepler-docs/dev/usability/graphics/svg/director.svg</configure>
>>>       </property>
>>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>/actorthumbs/director-sm.gif</configure>
>>>       </property>
>>>       <property name="timeResolution" class="ptolemy.moml.SharedParameter" value="1E-10">
>>>       </property>
>>>       <property name="entityId" class="org.kepler.moml.NamedObjId" value="urn:lsid:kepler-project.org:director:1:1">
>>>       </property>
>>>       <property name="class" class="ptolemy.kernel.util.StringAttribute" value="ptolemy.domains.sdf.kernel.SDFDirector">
>>>           <property name="id" class="ptolemy.kernel.util.StringAttribute" value="urn:lsid:kepler-project.org:directorclass:1:1">
>>>           </property>
>>>       </property>
>>>       <property name="semanticType000" class="org.kepler.sms.SemanticType" value="urn:lsid:localhost:onto:1:1#Director">
>>>       </property>
>>>       <property name="_location" class="ptolemy.kernel.util.Location" value="{160, 110}">
>>>       </property>
>>>   </property>
>>>   <property name="_windowProperties" class="ptolemy.actor.gui.WindowPropertiesAttribute" value="{bounds={82, 22, 850, 732}, maximized=false}">
>>>   </property>
>>>   <property name="_vergilSize" class="ptolemy.actor.gui.SizeAttribute" value="[590, 610]">
>>>   </property>
>>>   <property name="_vergilZoomFactor" class="ptolemy.data.expr.ExpertParameter" value="1.0">
>>>   </property>
>>>   <property name="_vergilCenter" class="ptolemy.data.expr.ExpertParameter" value="{295.0, 305.0}">
>>>   </property>
>>>   <entity name="Display" class="ptolemy.actor.lib.gui.Display">
>>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>../kepler-docs/dev/usability/graphics/svg/text_disp.svg</configure>
>>>       </property>
>>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>/actorthumbs/text_disp-sm.gif</configure>
>>>       </property>
>>>       <property name="rowsDisplayed" class="ptolemy.data.expr.Parameter" value="10">
>>>       </property>
>>>       <property name="columnsDisplayed" class="ptolemy.data.expr.Parameter" value="40">
>>>       </property>
>>>       <property name="suppressBlankLines" class="ptolemy.data.expr.Parameter" value="false">
>>>       </property>
>>>       <property name="_windowProperties" class="ptolemy.actor.gui.WindowPropertiesAttribute" value="{bounds={270, 279, 484, 209}, maximized=false}">
>>>       </property>
>>>       <property name="entityId" class="org.kepler.moml.NamedObjId" value="urn:lsid:kepler-project.org:actor:7:1">
>>>       </property>
>>>       <property name="class" class="ptolemy.kernel.util.StringAttribute" value="ptolemy.actor.lib.gui.Display">
>>>           <property name="id" class="ptolemy.kernel.util.StringAttribute" value="urn:lsid:kepler-project.org:class:883:1">
>>>           </property>
>>>       </property>
>>>       <property name="semanticType000" class="org.kepler.sms.SemanticType" value="urn:lsid:localhost:onto:1:1#TextualOutputActor">
>>>       </property>
>>>       <property name="_location" class="ptolemy.kernel.util.Location" value="{310, 345}">
>>>       </property>
>>>   </entity>
>>>   <entity name="Chemistry of Bulk Precipitation at HBEF WS-2" class="org.ecoinformatics.seek.datasource.eml.eml2.Eml200DataSource">
>>>       <property name="_svgIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>../kepler-docs/dev/usability/graphics/svg/dataFile.svg</configure>
>>>       </property>
>>>       <property name="_thumbnailRasterIcon" class="ptolemy.kernel.util.ConfigurableAttribute">
>>>           <configure>/actorthumbs/dataFile-sm.gif</configure>
>>>       </property>
>>>       <property name="schemaDef" class="ptolemy.kernel.util.StringAttribute" value="&lt;schema&gt;&#10;  &lt;table name=&quot;w6-pcp&quot;&gt;&#10;    &lt;field name=&quot;ws&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;year&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;mo&quot; dataType=&quot;STRING&quot;/&gt;&#10;    &lt;field name=&quot;precip&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Ca&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Mg&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;K&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Na&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Al&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;NH4&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;pH&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;SO4&quot; dataType=&quot;FLOAT&quot;
> /&gt;&#10;    &lt;field name=&quot;NO3&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;Cl&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;PO4&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;    &lt;field name=&quot;SiO2&quot; dataType=&quot;FLOAT&quot;/&gt;&#10;  &lt;/table&gt;&#10;&lt;/schema&gt;&#10;">
>>>           <property name="schemaDef" class="ptolemy.actor.gui.style.TextStyle">
>>>               <property name="height" class="ptolemy.data.expr.Parameter" value="10">
>>>               </property>
>>>               <property name="width" class="ptolemy.data.expr.Parameter" value="30">
>>>               </property>
>>>           </property>
>>>       </property>
>>>       <property name="sqlDef" class="ptolemy.kernel.util.StringAttribute">
>>>           <property name="sqlDef" class="ptolemy.actor.gui.style.TextStyle">
>>>               <property name="height" class="ptolemy.data.expr.Parameter" value="10">
>>>               </property>
>>>               <property name="width" class="ptolemy.data.expr.Parameter" value="30">
>>>               </property>
>>>           </property>
>>>       </property>
>>>       <property name="Selected Entity" class="ptolemy.data.expr.StringParameter" value="w6-pcp">
>>>       </property>
>>>       <property name="outputType" class="ptolemy.data.expr.StringParameter" value="As Field">
>>>       </property>
>>>       <property name="_tableauFactory" class="org.kepler.objectmanager.data.db.QBTableauFactory">
>>>           <property name="sqlName" class="ptolemy.kernel.util.StringAttribute" value="sqlDef">
>>>           </property>
>>>           <property name="schemaName" class="ptolemy.kernel.util.StringAttribute" value="schemaDef">
>>>           </property>
>>>       </property>
>>>       <property name="recordid" class="ptolemy.kernel.util.StringAttribute" value="knb-lter-hbr.20.1">
>>>       </property>
>>>       <property name="endpoint" class="ptolemy.kernel.util.StringAttribute" value="http://ecogrid.ecoinformatics.org/knb/services/EcoGridQuery">
>>>       </property>
>>>       <property name="namespace" class="ptolemy.kernel.util.StringAttribute" value="eml://ecoinformatics.org/eml-2.0.0">
>>>       </property>
>>>       <property name="w6-pcp" class="org.ecoinformatics.seek.ecogrid.ResultRecordDetail">
>>>       </property>
>>>       <property name="_location" class="ptolemy.kernel.util.Location" value="[140.0, 285.0]">
>>>       </property>
>>>       <port name="ws" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="year" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="mo" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="precip" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Ca" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Mg" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="K" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Na" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Al" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="NH4" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="pH" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="SO4" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="NO3" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="Cl" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="PO4" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>       <port name="SiO2" class="ptolemy.actor.TypedIOPort">
>>>           <property name="output"/>
>>>       </port>
>>>   </entity>
>>>   <relation name="relation" class="ptolemy.actor.TypedIORelation">
>>>       <property name="width" class="ptolemy.data.expr.Parameter" value="1">
>>>       </property>
>>>   </relation>
>>>   <link port="Display.input" relation="relation"/>
>>>   <link port="Chemistry of Bulk Precipitation at HBEF WS-2.pH" relation="relation"/>
>>> </entity>
>>>
>>>
>>
>>
>> _______________________________________________
>> Kepler-dev mailing list
>> Kepler-dev at ecoinformatics.org
>> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
>
> -- 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Matt Jones                                   Ph: 907-789-0496
> jones at nceas.ucsb.edu                    SIP #: 1-747-626-7082
> National Center for Ecological Analysis and Synthesis (NCEAS)
> UC Santa Barbara     http://www.nceas.ucsb.edu/ecoinformatics
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> _______________________________________________
> Kepler-dev mailing list
> Kepler-dev at ecoinformatics.org
> http://mercury.nceas.ucsb.edu/ecoinformatics/mailman/listinfo/kepler-dev
>
>


More information about the Kepler-dev mailing list