[Bug 1132] fix access control ambiguities

Fri Aug 15 10:20:30 PDT 2003

http://bugzilla.ecoinformatics.org/show_bug.cgi?id=1132

We found some issues that need to be discussed regarding access control 
in the EML2 specification.  We have run into major problems while trying 
to implement the specified access control procedures for Metacat and 
suspect that these problems are not fixable without a change in the EML2 
access control specification.  Though we are having these problems 
within Metacat, we believe them to be general to any system that is 
trying to be EML2 compliant.

In EML2 there are two possible places where a processor may encounter 
access control: one is at the resource level and the other is at the 
additionalMetadata level. According to the EML spec, resource level 
access control applies to the  whole document and additionalMetadata 
rules apply to a specific subtree for finer grained access control of 
EML subtrees. This allows one to have a general access policy and then 
make specific exceptions or changes for particular subtrees.

The problems arise when a processor must remove a controlled subtree and 
deliver it to the user.  Once the user changes the document and 
resubmits it, the subtree that was removed must be put back in its valid 
and correct location.

1) Take this document for instance:
   <a>
      <b>b</b>
      <d>d</d>
   </a>

If a user has permission to write to the whole document (permission 
comes from top level access control) and doesn't have permission to read 
subtree d (restriction comes from addtionalmetadata access control) when 
he tries to download the document he will get part of the document like:
   <a>
     <b>b</b>
   </a>

The user adds the elements c and e to the document.
   <a>
     <b>b</b>
     <c>c</c>
     <e>e</e>
   </a>

Once the document is submitted back to the processor, the processor must 
figure out that element d (that was removed before) must fit in between 
c and e like so:
  <a>
     <b>b</b>
     <c>c</c>
     <d>d</d>
     <e>e</e>
  </a>

This may seem simple, but first of all, the only way to know where d is 
supposed to go when you remove it is to store its parent id and its most
immediate sibling(s) id(s).  In this case d's parent is the same (a) but 
in the original document b was it's most immediate sibling.  If d is
inserted below b, the document becomes invalid.  The only way to 
possibly know where d is allowed to be reinserted is to parse the schema 
which could still fail because element d could be legally allowed in 
many different locations (ie, it is not necessarily deterministic wrt 
node placement).

2) Nested subtrees also present a problem.
   <a id="100">
     <b>
       <c id="200">c</c>
       <d>d</d>
     </b>
   </a>

An access module in additionalMetadata could specify that a user has 
read access to c but not a.  If the processor simply returns c but not a 
or sub-elements (besides c) of a, the resulting document makes no sense. 
We need some sort of cascade rule that says that once read has been 
taken away for a node, none of its children can be made 'readable'.

3) Previously we stated that there are two palaces for access 
information to exist.  This is actually not quite correct. In EML2 each 
of the four resource level modules (dataset, software, citation and 
protocol) have their own embedded access module. Even though a document 
has only one resource level module, the other resource level modules are 
embedded in each other.  For example, you can have a citation within a 
dataset.  That citation has its own access module.  We have not defined 
in the EML spec how that is to be handled by a processor.  Should the 
top-level resource access description take precedence?  Probably. Should 
the lower level elements be ignored, or used in a manner similar to
additionalMetadata?  If the latter, to what do they apply, themselves, 
or their parent resource (unlike additionalMetadata, there is no 
describes element here to clarify the situtation)?

Proposed solution:

Changing EML at this late date is hugely problematic.  We feel that we 
should maintain our commitment to make changes in EML backwards 
compatible (ie, EML 2.0.0 docs would be valid 2.0.1 docs).  However, we 
feel that this is an important bug that compromises the usefulness of 
EML, and so fixing it now is the right thing to do.  Nevertheless, we 
should minimize the disruptiveness of the change by 1) trying not to 
change the schema structure, and 2) redefining semantics of access 
control in a more tractable way.

We propose to alter EML to allow only two levels of access control.  The 
first would be document wide control, accomplished by a new "access" 
element on the root "eml" document.  The second would be data control 
for specific data files, accomplished by an optional "access" element in 
the physical distribution module that applies to the data object being 
described.  We should remove access from the eml-resource module (now 
that it is in the eml module itself), although this would be an 
incompatible schema change.  Alternatively we could simply define in the 
spec that access elements on the "resource" module are to be ignored. 
Restricting access control to the metadata and data respectively would 
greatly simplify the processing of EML, although it would limit the 
granularity of access control within the EML document.

Here's a fragment that shows what this new model might look like:

<eml>
   ...
   <access>...</access>           <-- defines overall access to
                                      all metadata
   ...
   <dataset>
     <access>...</access>         <-- this is ignored
     <dataTable>
       <physical>
         <distribution>
           <access>...</access>   <-- defines access to the data object
                                      in inline, online, or offline
                                      elements (ie, not the metadata
                                      itself, just the data)
           <inline>...</inline>
         </distribution>
       </physical>
     </dataTable>
   </dataset>
</eml>

Of course, these changes would make an access element that is present in 
the schema (under dataset, for example) be ignored.  Which is certainly 
confusing. We have to choose the lesser of two evils: 1) keep it and 
ignore it, which is confusing but allows schema compatibility with 
2.0.0, or 2) delete it, which is clearer but makes all 2.0.0 documents 
that use it invalid and must be transformed to become valid EML 2.0.1 
documents.  This is a tough choice.

We also need to clarify how to interpret the values found in the 
'permission' element, in that we should make it clear that 
'changePermission' permission is needed to change an access block, not 
just 'write' permission.  Currently the values we have (read, write, 
changePermission, all) are only tersely defined.

Comments or suggestions are welcome!
Jing, Chad, Matt, Dan, and Chris

-- 
-------------------------------------------------------------------
Matt Jones                                     jones at nceas.ucsb.edu
http://www.nceas.ucsb.edu/    Fax: 425-920-2439    Ph: 907-789-0496
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara
Interested in ecological informatics? http://www.ecoinformatics.org
-------------------------------------------------------------------