[seek-dev] SCIA [SChema Integration Assistant] tool news

Joseph Goguen goguen at cs.ucsd.edu
Thu May 6 16:21:05 PDT 2004


Dear Fellow SEEKers,

We are working on a schema matching tool, tentatively called SCIA, that
perhaps could also be useful for semantic transformations of data flowing
among pipelined components in a scientific workflow where components have
structurally different but semantically compatible interfaces.

For some background details, please see the following published paper and
submitted drafts:

  http://www.cse.ucsd.edu/~guilian/reports/apweb_long.ps
  http://www.cse.ucsd.edu/users/goguen/pps/vldb04.ps
  http://www.cse.ucsd.edu/users/goguen/pps/lisbon04.ps

At a recent SDSC SEEK meeting, we discussed using our tool for executable
view generation (i.e., transformation scripts), given path correspondences
obtained from the registered mappings to the global ontologies.  The
following example is essentially the one suggested by Shawn Bowers:

Input : (1) a source schema grouped by book, 
        (2) a target schema grouped by author,
        (3) a text file storing path correspondences between the two schemas

Note: Our tool does not need input (3), the mapping file, since it generates
the correspondences itself and then generates the executable view from them,
*but it can also take advantage of such existing partial or full
correspondences*.  Here are (1) and (2) for this problem:

****** the source DTD: books_book.dtd

<!-- from w3c use case, http://www.bn.com/bib.xml -->
<!ELEMENT bib  (book*)>
<!ELEMENT book  (title, author+, publisher, price)>
<!ATTLIST book  year CDATA  #REQUIRED >
<!ELEMENT author  (last, first )>
<!ELEMENT title  (#PCDATA )>
<!ELEMENT last  (#PCDATA )>
<!ELEMENT first  (#PCDATA )>
<!ELEMENT publisher  (#PCDATA )>
<!ELEMENT price  (#PCDATA )>

***** The target DTD: books_author.dtd

<!ELEMENT books  (author*)>
<!ELEMENT book  (title, publisher, price)>
<!ATTLIST book  year CDATA  #REQUIRED >
<!ELEMENT author  (last, first, book+)>
<!ELEMENT title  (#PCDATA )>
<!ELEMENT last  (#PCDATA )>
<!ELEMENT first  (#PCDATA )>
<!ELEMENT publisher  (#PCDATA )>
<!ELEMENT price  (#PCDATA )>

The file of path correspondences for mapping the above source schema to the
target schema follows (the numerical values are similarity values generated
by our tool in its matching step):

***** Mapping: books_book --> books_author

/bib --> /books  0.13
/bib/book/author --> /books/author  0.78
/bib/book/author/last --> /books/author/last  0.84
/bib/book/author/first --> /books/author/first  0.84
/bib/book --> /books/author/book  0.78 [author/first=$author/first]
/bib/book/@year --> /books/author/book/@year  0.84
/bib/book/title --> /books/author/book/title  0.84
/bib/book/publisher --> /books/author/book/publisher  0.84
/bib/book/price --> /books/author/book/price  0.84

The condition [author/first=$author/first] in the 5th line above, for getting
all books by a given author, was added manually, although everything else was
generated by the tool (in general, user interactions are necessary, but for
this simple matching task, the only user input needed was the above
condition, plus a confirmation for the automatically generated matches).  Our
algorithm does not try to guess such conditions; although it seems possible
in some special cases, the accuracy would generally be pretty low.  The tool
automatically identifies such "critical points" (where there are inconsistent
parent contexts) and, as part of the interaction, asks the user to specify
the condition(s).
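
To make the notion of a "critical point" concrete, here is a minimal Python
sketch that parses the correspondence lines above and flags those whose
parent contexts disagree.  It is only one reading of "inconsistent parent
contexts", not SCIA's actual code: the line format is inferred from the
listing, and the names parse_mapping, critical_points and Correspondence are
hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# The nine correspondences listed above, copied verbatim.
MAPPING = """
/bib --> /books  0.13
/bib/book/author --> /books/author  0.78
/bib/book/author/last --> /books/author/last  0.84
/bib/book/author/first --> /books/author/first  0.84
/bib/book --> /books/author/book  0.78 [author/first=$author/first]
/bib/book/@year --> /books/author/book/@year  0.84
/bib/book/title --> /books/author/book/title  0.84
/bib/book/publisher --> /books/author/book/publisher  0.84
/bib/book/price --> /books/author/book/price  0.84
"""

# Assumed line format: <source> --> <target>  <similarity> [optional condition]
LINE_RE = re.compile(
    r"^(?P<src>\S+)\s+-->\s+(?P<tgt>\S+)\s+(?P<sim>\d*\.\d+|\d+)"
    r"(?:\s+\[(?P<cond>.+)\])?\s*$"
)


class Correspondence(NamedTuple):
    source: str
    target: str
    similarity: float
    condition: Optional[str]


def parse_mapping(text: str) -> List[Correspondence]:
    """Parse one correspondence per non-blank line."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError("unrecognized mapping line: " + line)
        parsed.append(
            Correspondence(m["src"], m["tgt"], float(m["sim"]), m["cond"])
        )
    return parsed


def parent(path: str) -> str:
    """Return the path with its last step removed ('/books' -> '')."""
    return path.rpartition("/")[0]


def critical_points(correspondences: List[Correspondence]) -> List[Correspondence]:
    """Flag correspondences whose parent contexts disagree.

    Assumption: a match s --> t is consistent when the source path matched
    to t's parent is an ancestor of s.  Otherwise the nesting changes
    direction and a join condition is needed.
    """
    source_of = {c.target: c.source for c in correspondences}
    flagged = []
    for c in correspondences:
        parent_source = source_of.get(parent(c.target))
        if parent_source is None:
            continue  # the target's parent is unmatched (e.g. the root)
        if not c.source.startswith(parent_source + "/"):
            flagged.append(c)
    return flagged


correspondences = parse_mapping(MAPPING)
for c in critical_points(correspondences):
    print(c.source, "-->", c.target, "| condition:", c.condition)
```

On the listing above, this flags only /bib/book --> /books/author/book: the
target's parent, /books/author, is matched to /bib/book/author, which lies
below /bib/book in the source rather than above it.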

***** Output: the generated view

<books>
  FOR $author IN  document("books_book.xml")//author RETURN
  <author>
    FOR $last IN $author/last RETURN
    <last> $last/text()
    </last>,

    FOR $first IN $author/first RETURN
    <first> $first/text()
    </first>,

    FOR $book IN document("books_book.xml")//book[author/first=$author/first]
    RETURN
    <book
      year = $book/@year/text() >

      FOR $title IN $book/title RETURN
      <title> $title/text()
      </title>,

      FOR $publisher IN $book/publisher RETURN
      <publisher> $publisher/text()
      </publisher>,

      FOR $price IN $book/price RETURN
      <price> $price/text()
      </price>
    </book>
  </author>
</books>
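
For readers without an XQuery engine at hand, the regrouping performed by the
view can be approximated in Python with the standard xml.etree.ElementTree
module.  This is a sketch for illustration only, not output of the tool; the
function name regroup_by_author and the constant SOURCE are hypothetical, and
the whitespace of the tool's real output is not reproduced.

```python
import xml.etree.ElementTree as ET

# The sample source document from this message, without the DOCTYPE line.
SOURCE = """<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price> 65.95</price>
  </book>
  <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <price> 39.95</price>
  </book>
  <book year="1995">
    <title>FOUNDATIONS of DATABASES</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Hull</last><first>Richard</first></author>
    <author><last>Vianu</last><first>Victor</first></author>
    <publisher>Addison-Wesley</publisher>
    <price> 54.95</price>
  </book>
</bib>"""


def regroup_by_author(bib: ET.Element) -> ET.Element:
    """Build a <books> tree grouped by author, mirroring the view's loops."""
    books = ET.Element("books")
    # FOR $author IN document(...)//author
    for author in bib.iter("author"):
        out_author = ET.SubElement(books, "author")
        for tag in ("last", "first"):
            for node in author.findall(tag):
                ET.SubElement(out_author, tag).text = node.text
        first = author.findtext("first")
        # FOR $book IN document(...)//book[author/first=$author/first]
        for book in bib.iter("book"):
            if any(a.findtext("first") == first for a in book.findall("author")):
                out_book = ET.SubElement(out_author, "book", year=book.get("year"))
                for tag in ("title", "publisher", "price"):
                    for node in book.findall(tag):
                        ET.SubElement(out_book, tag).text = node.text
    return books


result = regroup_by_author(ET.fromstring(SOURCE))
print(ET.tostring(result, encoding="unicode"))
```

Like the view, the sketch emits one <author> element per author occurrence in
the source, so an author of several books is repeated, each time with the
full list of matching books.  It also joins on the first name alone, exactly
as the manually added condition does.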


Below is a sample document to illustrate correctness of the transformation
that arises from executing the above view:

****** source document books_book.xml

<?xml version="1.0"?>
<!DOCTYPE bib SYSTEM "books_book.dtd">
<bib>
    <book year="1994">
        <title>TCP/IP Illustrated</title>
        <author><last>Stevens</last><first>W.</first></author>
        <publisher>Addison-Wesley</publisher>
        <price> 65.95</price>
    </book>
 
    <book year="2000">
        <title>Data on the Web</title>
        <author><last>Abiteboul</last><first>Serge</first></author>
        <author><last>Buneman</last><first>Peter</first></author>
        <author><last>Suciu</last><first>Dan</first></author>
        <publisher>Morgan Kaufmann Publishers</publisher>
        <price> 39.95</price>
    </book>
 
    <book year="1995">
        <title>FOUNDATIONS of DATABASES</title>
        <author><last>Abiteboul</last><first>Serge</first></author>
        <author><last>Hull</last><first>Richard</first></author>
        <author><last>Vianu</last><first>Victor</first></author>
        <publisher>Addison-Wesley</publisher>
        <price> 54.95</price>
    </book>
 
</bib>

******* the result document

<?xml version="1.0"?>
<books>
  <author>
    <last>      Stevens    </last>
    <first>      W.    </first>
    <book year="1994">
      <title>        TCP/IP Illustrated      </title>
      <publisher>        Addison-Wesley      </publisher>
      <price>         65.95      </price>
    </book>
  </author>
  <author>
    <last>      Abiteboul    </last>
    <first>      Serge    </first>
    <book year="2000">
      <title>        Data on the Web      </title>
      <publisher>        Morgan Kaufmann Publishers      </publisher>
      <price>         39.95      </price>
    </book>
    <book year="1995">
      <title>        FOUNDATIONS of DATABASES      </title>
      <publisher>        Addison-Wesley      </publisher>
      <price>         54.95      </price>
    </book>
  </author>
  <author>
    <last>      Buneman    </last>
    <first>      Peter    </first>
    <book year="2000">
      <title>        Data on the Web      </title>
      <publisher>        Morgan Kaufmann Publishers      </publisher>
      <price>         39.95      </price>
    </book>
  </author>
  <author>
    <last>      Suciu    </last>
    <first>      Dan    </first>
    <book year="2000">
      <title>        Data on the Web      </title>
      <publisher>        Morgan Kaufmann Publishers      </publisher>
      <price>         39.95      </price>
    </book>
  </author>
  <author>
    <last>      Abiteboul    </last>
    <first>      Serge    </first>
    <book year="2000">
      <title>        Data on the Web      </title>
      <publisher>        Morgan Kaufmann Publishers      </publisher>
      <price>         39.95      </price>
    </book>
    <book year="1995">
      <title>        FOUNDATIONS of DATABASES      </title>
      <publisher>        Addison-Wesley      </publisher>
      <price>         54.95      </price>
    </book>
  </author>
  <author>
    <last>      Hull    </last>
    <first>      Richard    </first>
    <book year="1995">
      <title>        FOUNDATIONS of DATABASES      </title>
      <publisher>        Addison-Wesley      </publisher>
      <price>         54.95      </price>
    </book>
  </author>
  <author>
    <last>      Vianu    </last>
    <first>      Victor    </first>
    <book year="1995">
      <title>        FOUNDATIONS of DATABASES      </title>
      <publisher>        Addison-Wesley      </publisher>
      <price>         54.95      </price>
    </book>
  </author>
</books>
<!-- end of document -->

There are still many changes in progress to make the tool more general, and
there are some bugs to fix, including slow execution on Windows due to
multiple threads in communication among windows.  When it is ready, we will
put the code in the SEEK CVS repository.

Any comments will be appreciated!

Joseph & Jenny


