distribution element issues

Thu May 16 12:42:14 PDT 2002

Ok so let me see if I can summarize where we are. I knew the minute I'd sent
it that the file:// example didn't make my point!!

1 - we may need some kind of alternate URL field (or at least an attribute
to warn us) for when data have been made available online, but not in a way
that lets us fully automate access to them.

2 - we will consider supporting token substitutions for parameters that can
be embedded into URLs in order to make a single URL string work for several
physical objects within a dataset.

3 - content models for connection parameters are too proprietary to the
local storage/driver/application combination and are probably limited to a
handful that are actually used by any given site. That information is also
unlikely to be shared with outside users, but may be shared on an
intranet to support local users.

4 - we eventually will support WSDL for describing connection information.

5 - Sites (and authors of editing wizards) will for now work with their own
content models for parameters and use them in that restricted context. In
most cases, they will provide translation of those into a URL if they choose
to share that information via EML. It will be up to them to determine how to
manage this proprietary connection information together with the rest of
their EML content. For users designing their own RDBMS storage this is easy,
but what about users who might want to use EML itself for storage?

Here are a couple of interim variations for beta9 that might allow them to do
it without having to create their own schemas (forgive any errors in DTD
syntax - I'm too used to schema now):

<!ELEMENT distribution ((connection|offlineMedium)+)>
<!ELEMENT connection ((directURL|indirectURL), parameterList?)>
<!ATTLIST connection name CDATA #IMPLIED
                     scheme CDATA #IMPLIED> 
<!ELEMENT directURL (#PCDATA)>
<!ELEMENT indirectURL (#PCDATA)>
<!ELEMENT parameterList (parameter+)>
<!ELEMENT parameter (#PCDATA)>
<!ATTLIST parameter name CDATA #IMPLIED> 

OR:

<!ELEMENT distribution ((connection|offlineMedium)+)>
<!ELEMENT connection ((onlineURL), parameterList?)>
<!ATTLIST connection name CDATA #IMPLIED
                     scheme CDATA #IMPLIED> 
<!ELEMENT onlineURL (#PCDATA)>
<!-- DTDs have no boolean type, so an enumeration stands in for it -->
<!ATTLIST onlineURL directAccess (true|false) #REQUIRED>
<!ELEMENT parameterList ANY>

Either one of these would force users to put in either a direct or indirect
URL link to the data (which is probably all we ever want to show end users),
but also provide an alternate set of parameters of their own design for that
scheme to be used locally.
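
To make the first variant concrete, here is a minimal sketch in Python of how
a local application might read such a connection element. Only the element
and attribute names come from the DTD; the sample instance, the "smb" scheme
value, the parameter names, the example.org URL, and the read_connection
helper are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the first variant; all values are illustrative.
SAMPLE = """
<distribution>
  <connection name="lab-share" scheme="smb">
    <indirectURL>http://example.org/catalog/dataset42</indirectURL>
    <parameterList>
      <parameter name="domain">LTER</parameter>
      <parameter name="server">maricopa</parameter>
      <parameter name="folder">proj/lter/po10</parameter>
    </parameterList>
  </connection>
</distribution>
"""


def read_connection(conn):
    """Return (url, is_direct, params) for one <connection> element."""
    direct = conn.find("directURL")
    node = direct if direct is not None else conn.find("indirectURL")
    if node is None:
        raise ValueError("connection needs a directURL or an indirectURL")
    # parameterList is optional; findall returns [] when it is absent.
    params = {p.get("name"): (p.text or "").strip()
              for p in conn.findall("parameterList/parameter")}
    return (node.text or "").strip(), direct is not None, params


root = ET.fromstring(SAMPLE)
for conn in root.findall("connection"):
    url, is_direct, params = read_connection(conn)
    print(url, "direct" if is_direct else "indirect", params)
```

The URL is what would be shown to outside users, while the parameter
dictionary stays available to local applications.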

Peter McCartney (peter.mccartney at asu.edu)
Center for Environmental Studies
Arizona State University
480-965-6791 

-----Original Message-----
From: Matt Jones [mailto:jones at nceas.ucsb.edu]
Sent: Thursday, May 16, 2002 10:37 AM
To: Peter McCartney
Cc: eml-dev
Subject: Re: distribution element issues

Hi Peter,

Thanks for the well-reasoned response.  My comments are inline...

Peter McCartney wrote:
> Well this was the very reason why I proposed providing a choice of
> parameter models that were specific to schemes. I appreciate the
> ambiguity of your "database" example, but only because I don't recognize
> the scheme. By asking you to pick the scheme (MS SQL Server) from a
> controlled list that is documented in EML I could then force you to
> enter version=7.0, host=maricopa, port=1433, networkProtocol=named
> pipes, database=arthropods. There would be no ambiguity as to semantics
> and you would not have to know the exact syntax of how to build the URL
> string for whatever driver I wish to have available to use.

So, you seem to want a finite number of well-defined connection types,
each of which would have its own set of parameters.  In theory I think
this is fine, but in practice I think it will only work for you (because
you'll pick the schemes and parameters that are right for your systems).
The diversity of connection types is large and is growing, and I don't
think we can hope to enumerate even a part of them.  In addition, for
any given protocol, the details of the connection are complex, and
enumerating the parameters and their semantics is far beyond our ability.
Take, for example, smb connections.  The IETF working draft that
enumerates the URL syntax and semantics for these well-known connections
is 17 pages long
(http://www.ietf.org/internet-drafts/draft-crhertel-smb-url-02.txt).
I have read the detailed mailing list archives on this topic, and
the subtleties in the parameters are deep, especially when
differentiating smb connections from cifs connections (people want a
separate URL scheme for cifs), even though most clients like Windows
machines handle the two protocols with one user interface.  Personally,
I am not up to the task of even providing a comprehensive list of
parameters for the simple protocols like http, https, ftp, sftp, smb,
cifs.  I can't fathom the complexity of jdbc and odbc, or the oracle
call interface, or sde.

I would far prefer not to implement something in EML that is a partial,
hacked-together solution when there is a standardized mechanism for
providing well-structured information for a protocol (IETF URL schemes).
If a protocol is sufficiently well known in the community then someone
should have developed a URL scheme for describing connection
information.  If they haven't, I don't think we can realistically
develop the spec for that connection type ourselves.

> To blow this problem off in favor of just using urls I think puts us 
> (almost) back where we were a year ago. 

I'm not trying to blow this off at all.  I am very concerned about
implementing a partial, ambiguous solution to a very complex problem.

> I do see more utility in URLs
> after our discussion of providing URLs for a specific driver. But it is
> still a problem for users who want to use my data and neither have that
> particular driver nor know how to rewrite the URL string into an
> equivalent one for an alternate driver (although I can mitigate that
> somewhat by trying to provide URLs for as many different connection
> protocols out there as I can anticipate). There is also a problem in
> that I do not see evidence that ALL online connections can indeed be
> described by a structured URL string (I can't find one for an SDE
> connection, although I did find one for modem dialups). I'm also not
> convinced that URLs carry enough information. If I give you
> file://maricopa.asu.edu/proj/lter/filename.txt it's a crap shoot whether
> it will work for you because I haven't told you that maricopa.asu.edu is
> an NT server located in the LTER domain.

That's because you used the wrong URL format.  You should have used the
"smb" URL if it is on an SMB server like NT or SAMBA.

> Similarly, with JDBC there is a
> keyword for the driver in the URL string, but JDBC isn't smart enough
> to parse the URL and figure out what driver to use - you need to
> separately provide the class name of the driver that the URL is for.

That is an interesting issue for JDBC.  It's actually a bit of a
chicken-and-egg problem, because the driver class is what determines how to
interpret the driver-specific parameters.  I don't think there is even a
registry of JDBC driver names (although I could be wrong on that --
haven't looked).
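
The pairing problem can be sketched roughly like this in Python. A JDBC URL
has the form jdbc:<subprotocol>:<subname>, so a site could keep its own table
from subprotocol to driver class name. The table entries, the sample URL, and
the driver_class_for helper are assumptions for illustration only; they are
not part of JDBC, and a site would check class names against its own driver
documentation.

```python
# Site-maintained lookup from JDBC subprotocol to driver class name.
# Illustrative assumption: entries are examples, not an official registry.
DRIVER_CLASSES = {
    "postgresql": "org.postgresql.Driver",
    "oracle": "oracle.jdbc.OracleDriver",
}


def driver_class_for(jdbc_url):
    """Look up the driver class for a 'jdbc:<subprotocol>:<subname>' URL."""
    parts = jdbc_url.split(":", 2)
    if len(parts) != 3 or parts[0] != "jdbc" or not parts[1]:
        raise ValueError(f"not a JDBC URL: {jdbc_url!r}")
    subprotocol = parts[1]
    try:
        return DRIVER_CLASSES[subprotocol]
    except KeyError:
        raise LookupError(f"no driver registered for {subprotocol!r}") from None


print(driver_class_for("jdbc:postgresql://maricopa/arthropods"))
```

The point of the sketch is that the URL alone only narrows the choice; the
mapping to a class name still has to live somewhere outside the URL.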

> Finally, I really question whether users can be expected to know the
> proper structure for providing a URL string for most service connections
> - we will have to provide wizards to help them with that. Those wizards
> will have to be based on content models of parameters for each known
> scheme, so why the heck don't we make them part of EML in the first place?

Turning that on its head -- the syntax for encoding the parameters
(e.g., a URL) has nothing to do with the user interface presentation.
If a user needs to input the information for, say, an LDAP over TLS
connection, will we have those parameters in place for EML?  I think not.
Seems to me that all of this user-input stuff is going to have to be
application generated independently of the EML schemas -- we just want
EML to be able to encode it in a standard way for transport.

> Part of the problem I think we're having here is the difference between
> connection info we share with the world versus connection information we
> want to use locally. I need a metadata format that allows me to generate
> a display in our data catalog for local users (or my local web
> application) to know how to find a file while they are sitting in the
> lab (e.g. network protocol: MS Windows networking, domain: LTER,
> server: maricopa, folder: proj\lter\po10\, filename: xxxx).  Perhaps one
> solution is to make URL a required connection type but provide some form
> of parameter model as an option. Editors could generate the URL version
> from the parameters but the parameters would remain in the metadata for
> local applications.

I agree.  Public connection info versus private connection info seems to 
be the crux of the matter. Maybe making url required but still providing 
the other fields would work.  But I suspect it'll cause problems for you 
for those connections that you say don't have a URL representation.
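
A rough sketch of the "required URL plus optional parameters" idea, in
Python: the editor keeps the local parameters and derives the public URL
from them. The build_url function and the parameter keys (server, folder,
filename, port) are assumptions for illustration, and the result is only the
generic scheme://host/path layout, not the full smb syntax from the IETF
draft.

```python
from urllib.parse import quote


def build_url(scheme, params):
    """Derive a scheme://host[:port]/path URL from local connection parameters.

    Expects the hypothetical keys 'server' and 'filename', plus optional
    'folder' and 'port'. Backslashes in 'folder' are treated as separators.
    """
    host = params["server"]
    if "port" in params:
        host = f"{host}:{params['port']}"
    folder = params.get("folder", "").replace("\\", "/").strip("/")
    segments = [s for s in folder.split("/") if s] + [params["filename"]]
    # Percent-encode each path segment separately.
    path = "/".join(quote(s, safe="") for s in segments)
    return f"{scheme}://{host}/{path}"


local = {"server": "maricopa", "folder": "proj\\lter\\po10\\", "filename": "xxxx"}
print(build_url("smb", local))
```

This only works for schemes that have a URL form at all, which is exactly
the gap for connections like SDE.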

> I certainly agree that if there is an unambiguous way of describing a
> URL to a connection, that should be preferred. But I'm pretty sure that
> if this is the only way of defining a connection in EML, many sites
> using server connections or local file system addresses (myself
> included) will wind up extending EML with their own locally defined
> connection description schemas to solve some of the problems I mention
> above. If I'm on my own on this, then I'm likely to just locally use my
> original content models for each kind of connection scheme we use at CAP
> and simply build URLs in XSL when generating valid EML documents. Now
> maybe this isn't so bad if I am not inclined to show that detailed info
> to the public anyway.  I guess it all depends on how much we want EML to
> set standards for managing metadata at the internal site level, but I
> see some advantages to a solution that is itself part of EML so that we
> don't have a bazillion different solutions to the same problem.

I'm open to discussion on the extent to which EML is used for 
data/metadata exchange versus internal site management.  I think it 
makes sense for a public exchange mechanism. I'm not sure it is as 
compelling for site-specific details, but I can see your argument for it.

> Before we drop this, has anyone looked at how the SRB MCAT stores
> connection information? It seems like it has a similar problem in having
> to deal with a lot of different kinds of connections. Does it manage to
> do all this with a single URL field?

It is very proprietary, and thus somewhat limited in terms of 
extensibility.  It does not use URLs.  Rather, there is a C driver for 
each type of physical resource connection (UNIX filesystems, ftp, http,
Oracle, DB2, etc), and configuration info for each driver is stored 
partly in text config files and partly in the database schema.  It does 
not give generalized access to databases in the way we are discussing -- 
rather, it gives access to particular hardcoded SQL queries.

> On a totally separate note, I like the idea of token substitutions for
> defining URLs in such a way that they can be used more generically -
> this neatly allows you to define the host and path of an ftp connection
> once, and then substitute the filename for datasets that have several
> files on one ftp site. So I say add that feature, regardless of how we
> resolve the url/parameter debate.

OK, I'll try to develop this further for the next checkin. I'm not quite
sure how it would work fully.  Anybody got any further suggestions/insights?
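
As one possible starting point, here is a minimal sketch of token
substitution using Python's string.Template. The ${filename} token syntax,
the ftp host and path, the file names, and the expand helper are all
assumptions for illustration; nothing here is settled EML syntax.

```python
from string import Template
from urllib.parse import quote

# One URL pattern shared by every file in the dataset (hypothetical host and path).
URL_PATTERN = Template("ftp://ftp.example.org/pub/lter/${filename}")


def expand(pattern, **tokens):
    """Fill the tokens in a URL pattern, percent-encoding each value."""
    encoded = {name: quote(str(value), safe="") for name, value in tokens.items()}
    # substitute() raises KeyError if the pattern names a token we were not given.
    return pattern.substitute(encoded)


for name in ["arthropods_1998.txt", "arthropods 1999.txt"]:
    print(expand(URL_PATTERN, filename=name))
```

The host and path are written once, and each physical file only supplies
its own token value. Encoding the value before substitution keeps a file
name with spaces or slashes from breaking the URL.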

> But this feature raises another question. For web apps that don't expose
> their form parameters in the URL via GET, the token substitution trick
> still won't help us automate running these applications. How do we
> reference an online application for which further interactive user input
> cannot be avoided in order to get the data? Do we enter these under
> "connections" or is an onlineApplicationURL different from an onlineURL?

There seem to be 2 issues here.  First, some applications do not expose 
a GET interface over HTTP, but rather only allow POST.  They need a 
different parameter encoding than the GET request, which isn't satisfied 
by a URL. It is interesting that an HTTP url implies a GET, when in fact 
it is only 1 of several possible http methods. I'll have to think about 
this some more.
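
To illustrate the difference, a small sketch with Python's urllib: the same
parameters go into the URL for a GET but into the request body for a POST,
so a bare URL cannot carry them. The endpoint and parameter names are
hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/cgi-bin/query"  # hypothetical application
params = {"dataset": "arthropods", "year": "1998"}
encoded = urlencode(params)

# GET: the parameters are part of the URL itself.
get_req = Request(f"{ENDPOINT}?{encoded}")

# POST: the URL stays bare and the parameters travel in the request body.
post_req = Request(ENDPOINT, data=encoded.encode("ascii"))

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

So for a POST-only application, EML would need somewhere besides the URL to
record the method and the body parameters.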

Second, some applications absolutely require user interaction to get to
the data, so there is no way to provide complete connection information.
I think these are out of scope for us, meaning that someone can
provide an informational URL, but it's not going to get us to the data.
In that case, I do not think it belongs in the distribution element, but
rather in some other more descriptive metadata section.

Well, that's about it.  Hopefully some of the other people on this list 
will chime in and help out with these discussions.  Thanks again for the 
thoughtful comments, Peter.

Matt

> -----Original Message-----
> From: Chad Berkley [mailto:berkley at nceas.ucsb.edu]
> Sent: Wednesday, May 15, 2002 1:35 PM
> To: Matt Jones
> Cc: eml-dev at ecoinformatics.org
> Subject: Re: distribution element issues
> 
> 
> I think we should eliminate the parameters altogether.  I don't see the
> point of them since all of the information that they can encode can be
> more precisely encoded in a URL.
> 
> chad
> 
> On Wed, 2002-05-15 at 12:58, Matt Jones wrote:
>  > Hey --
>  >
>  > I pointed out some problems with the "distribution" element that I am
>  > trying to resolve in my second comment on bug 480:
>  >    http://bugzilla.ecoinformatics.org/show_bug.cgi?id=480#c2
>  >
>  > I could really use some feedback on this to see what others think
>  > before I finalize the changes. This is a plea for help!  Thanks.
>  >
>  > Matt
>  >
>  > --
>  > *******************************************************************
>  > Matt Jones                                    jones at nceas.ucsb.edu
>  > http://www.nceas.ucsb.edu/    Fax: 425-920-2439   Ph: 907-789-0496
>  > National Center for Ecological Analysis and Synthesis (NCEAS)
>  >
>  > Interested in ecological informatics? http://www.ecoinformatics.org
>  > *******************************************************************
>  >
>  > _______________________________________________
>  > eml-dev mailing list
>  > eml-dev at ecoinformatics.org
>  > http://www.ecoinformatics.org/mailman/listinfo/eml-dev
