[eml-dev] [Bug 5427] New: round-trip encoding of missing values uploaded then queried from a db table is lost
bugzilla-daemon at ecoinformatics.org
bugzilla-daemon at ecoinformatics.org
Thu Jun 23 14:37:02 PDT 2011
http://bugzilla.ecoinformatics.org/show_bug.cgi?id=5427
Summary: round-trip encoding of missing values uploaded then
queried from a db table is lost
Product: EML
Version: 2.1.0
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: datamanager
AssignedTo: tao at nceas.ucsb.edu
ReportedBy: marygastil at yahoo.com
QAContact: eml-dev at ecoinformatics.org
Estimated Hours: 0.0
Round-trip encoding of missing values in EML datasets uploaded to and queried
from a database table is lost with current version of DML.
Short link to this doc is http://goo.gl/2mq9T
Might pertain to:
EML Data Manager Library
PASTA workflows
EML-parsed data delivered from a DML-loaded database
Might be a feature request for a future iteration,... Far in the future!
This is about handling missing values in data tables, whether they are stored
in a database table as codes or as nulls, and how they are then coalesce()-ed
in a VIEW of that table.
Background:
EML allows multiple missing value codes for the same data table column, as
there may be more than one reason for missing a value. This is good. The EML
Data Manager Library (DML) compares missing value codes with a string
comparison, not a numeric comparison. So -9999.0 does not match -9999. A
numeric column may have a text missing value code such as NaN or na in a column
of float type.
Issue:
When the data is inserted into the database table, obviously the non-numeric
string missing value codes cannot be inserted literally. They are inserted as
nulls. I looked at that part of the DML code. Any datum which matches one of
the missing value codes for its column gets inserted as a null.*
Since any missing value code is collapsed into a simple null, then the original
information about what kind of missing value code it was has been lost.
When querying that data table, either a VIEW specifically written for that
table or the code constructing that query could use the EML to assign a missing
value code to nulls using coalesce(), but only if there were only one missing
value code per column. Where multiple codes exist, it would be wrong to just
arbitrarily assign the first-listed code to all nulls of a column.
Proposed Solution:
Alternatively, the DML could store missing value codes, assigning numeric codes
to replace non-numeric codes where necessary (a tricky feat since it implies
knowledge of the range of valid values, which may not be specified in the EML.)
Then a corresponding query would have to be stored as a VIEW, with a CASE
wrapping that column to translate back to the original codes.
Details:
*Notes relating to actual java code are below.
In the DML DatabaseAdapter class, in the method generateInsertSQL() gets three
inputs: the attributeList, the tableName, and oneRowData. Each attribute value
is compared as a literal string to the possible missing value codes for that
attribute using the private method issMissingValue on line 523
In generateInsertSQL() line 267 if a value is a missing value then it jumps to
the next attribute in the list. The insert statement then does not upload that
column. This is equivalent to inserting a null into that column. Since the DML
does not seem to put any NON NULL constraints on any columns, (let alone any
constraints at all), that in itself does not generate an error. It does,
however, mean that missing value codes are not stored in the database table and
so if there are multiple codes for a column that information is lost.
This was looking at svn revision number 2195 of Duanes branch
https://code.ecoinformatics.org/code/eml/branches/DATAMANAGER_QUALITY/src/org/ecoinformatics/datamanager/database/DatabaseAdapter.java
on 23June2011
--
Configure bugmail: http://bugzilla.ecoinformatics.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
More information about the Eml-dev
mailing list