[seek-kr-sms] information extraction links

Serguei Krivov Serguei.Krivov at uvm.edu
Wed May 3 12:57:33 PDT 2006

I hope the following notes on Information Extraction are useful.



Information Extraction problem: There is a class C with m attributes
a1, ..., am. The class and its attributes correspond to well-known concepts of
natural language. Populate the knowledge base with instances of that
class using information from free-text sources.
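
To make the problem statement concrete, here is a minimal sketch in Python.
The class Person, its two attributes, the regular expression and the sample
sentences are all hypothetical illustrations, not part of any IE platform;
a real system would use far richer linguistic analysis than a single pattern.

```python
import re
from dataclasses import dataclass


# Hypothetical target class C with m = 2 attributes (a1 = name, a2 = affiliation).
@dataclass(frozen=True)
class Person:
    name: str
    affiliation: str


# Illustrative pattern only: "<Name> works at <Organisation>".
PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+) works at "
    r"(?P<affiliation>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def populate(text: str) -> list[Person]:
    """Return one Person instance per pattern match found in free text."""
    return [Person(m["name"], m["affiliation"]) for m in PATTERN.finditer(text)]


# The "knowledge base" here is just a list of extracted instances.
knowledge_base = populate(
    "Jane Smith works at Example University. "
    "Later, John Brown works at Acme Labs in Boston."
)
```

Each match fills every attribute of one instance; attributes that the text
does not mention would need an explicit "unknown" value in a fuller design.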


Here are the two leading information extraction platforms, both of which
offer many libraries to facilitate information extraction.

Both GATE (Cunningham et al. 2002, GATEURL) and UIMA are open-source
language engineering infrastructures. In addition to corpus annotation
facilities, they provide a set of language processing components (e.g.,
tokeniser, part-of-speech tagger, named entity recogniser), services for
persistent storage of language resources, and extensive visualisation and
evaluation tools aimed at facilitating the development and deployment of
language processing applications.



The leading forum for IE research was the Message Understanding
Conferences (MUC), held from 1987 to 1998 (R. Grishman 1996). The
other most significant IE-related programs are Automatic Content
Extraction (ACEURL) and TREC (TRECURL). One of the most significant
outcomes of the MUC conferences is the development of the standard TIPSTER
Common Architecture (Grishman, R. 1996), which provides a specification of
generic APIs for document detection, data extraction, and the associated
document management functions. The important feature of this architecture is
support for "plug and play" interoperability of the modules responsible for
different parts of the IE process, which allows the interchange of modules
from different suppliers. The interoperability is achieved by enforcing a
common generic format for document annotations in both the input and output
documents of every module. This format, widely known as the TIPSTER format,
has been used in many IE applications.

For the sake of review, we can classify existing IE tools into three
categories.
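
Before turning to those categories, the "plug and play" idea can be made
concrete with a small Python sketch. It is not the TIPSTER API: the
Annotation and Document classes and both modules are hypothetical names
chosen for illustration. The only point is that every module reads and
writes the same stand-off annotation format, so modules can be swapped
freely.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Annotation:
    type: str    # e.g. "Token" or "Capitalised"
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list[Annotation] = field(default_factory=list)


# Every module has the same signature, which is what makes it pluggable.
Module = Callable[[Document], Document]


def whitespace_tokeniser(doc: Document) -> Document:
    """Add one Token annotation per whitespace-separated word."""
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        pos = start + len(word)
        doc.annotations.append(Annotation("Token", start, pos))
    return doc


def capitalised_tagger(doc: Document) -> Document:
    """Consume Token annotations from any upstream tokeniser."""
    for ann in [a for a in doc.annotations if a.type == "Token"]:
        if doc.text[ann.start].isupper():
            doc.annotations.append(
                Annotation("Capitalised", ann.start, ann.end, {"source": "sketch"})
            )
    return doc


def run_pipeline(doc: Document, modules: list[Module]) -> Document:
    for module in modules:
        doc = module(doc)
    return doc


doc = run_pipeline(
    Document("MUC ran from 1987 to 1998"),
    [whitespace_tokeniser, capitalised_tagger],
)
```

Because the tagger depends only on Token annotations and not on how they were
produced, a different tokeniser could replace whitespace_tokeniser without
any change to the downstream module.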


IE tools, 1st generation. The earlier IE systems have hard-coded rule bases
developed by experienced language engineers. They make use of human
intuition and require only a small amount of training data, but developing
such systems can be very time-consuming. An example of such a system is ANNIE
(ANNIEURL), incorporated in GATE. It uses 21 phases, 187 rules, and 9 entity
types (on average 20.8 rules per entity type).
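
The flavour of a hand-coded rule base can be sketched in a few lines of
Python. These are not ANNIE's rules or its rule language; the word lists,
the two rules and the sample sentence are invented for illustration, and a
real system has many more phases and rules.

```python
import re

# Phase 1: hand-built gazetteer lists (entries are illustrative only).
TITLES = {"Dr", "Prof", "Mr", "Ms"}
ORG_SUFFIXES = {"University", "Inc", "Ltd"}

# Phase 2: hand-written rules applied to sequences of capitalised words.
CAP_SEQ = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def annotate(text: str) -> list[tuple[str, str]]:
    """Return (entity type, entity text) pairs found by the two rules."""
    entities = []
    for m in CAP_SEQ.finditer(text):
        words = m.group().split()
        # Rule 1: a title followed by capitalised words marks a Person.
        if words[0] in TITLES and len(words) > 1:
            entities.append(("Person", " ".join(words[1:])))
        # Rule 2: a capitalised sequence ending in a known suffix
        # marks an Organization.
        elif words[-1] in ORG_SUFFIXES and len(words) > 1:
            entities.append(("Organization", m.group()))
    return entities


result = annotate("The speaker was Dr Jane Smith from Sheffield University.")
```

Every new entity type or text genre means writing and debugging more rules
like these by hand, which is where the development cost comes from.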


IE tools, 2nd generation. The next generation of IE systems uses supervised
learning methods to generate rule sets. Thus, developing new applications
with those systems does not require language engineering (LE) expertise. An
example of such a system is Amilcare.
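
As a rough illustration of learning rules from annotated text, the Python
sketch below induces left-context rules from a tiny hand-annotated sample.
It is a deliberately simplified stand-in and not the algorithm used by
Amilcare; the training data, the two-token context window and the precision
threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical training data: token lists plus the index of the token that a
# human annotated as the speaker (None when the text contains no speaker).
TRAINING = [
    (["Speaker", ":", "Smith", "will", "talk"], 2),
    (["Speaker", ":", "Jones", "on", "Monday"], 2),
    (["Room", ":", "B12", "at", "noon"], None),
]


def induce_rules(examples, min_precision=0.9):
    """Keep each two-token left context that reliably precedes a target."""
    fires, correct = Counter(), Counter()
    for tokens, target in examples:
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            fires[context] += 1          # the context occurred here
            if i == target:
                correct[context] += 1    # and it preceded an annotated token
    return {c for c in fires if correct[c] / fires[c] >= min_precision}


def apply_rules(rules, tokens):
    """Tag every token whose two-token left context matches a learned rule."""
    return [
        tokens[i]
        for i in range(2, len(tokens))
        if (tokens[i - 2], tokens[i - 1]) in rules
    ]


rules = induce_rules(TRAINING)
found = apply_rules(rules, ["Speaker", ":", "Brown", "at", "3pm"])
```

The user only supplies annotated examples; the rule set itself is derived
automatically, which is why no rule-writing expertise is needed.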


IE tools, 3rd generation. IE tools of the 2nd generation require large amounts
of annotated text and some human intervention during the learning process.
Certain newer systems, such as Armadillo, use redundancy of information to
cross-check the correctness of derived rules. Redundancy is apparent in
the presence of multiple citations of the same facts in superficially
different formats. This redundancy can be exploited to bootstrap the
annotation process needed for Information Extraction. For example, the fact
that a system knows the name of an author can be used to identify a number
of other author names using resources present on the Internet, instead of
using rule-based or statistical applications, or hand-built gazetteers. By
combining a multiplicity of information sources, internal and external to
the system, texts can be annotated with a high degree of accuracy with
minimal or no manual intervention.
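
The bootstrapping idea can be sketched as follows in Python. This is not
Armadillo's algorithm: the three "pages", the name pattern and the rule that
a candidate must appear in at least two sources are assumptions made purely
to illustrate how one known author name plus redundancy can yield new names.

```python
import re
from collections import defaultdict

# Hypothetical pages citing the same facts in superficially different formats.
SOURCES = {
    "page_a": "Paper by Jane Smith. Paper by Ravi Patel.",
    "page_b": "Authors: Jane Smith; Authors: Ravi Patel; Authors: Home Page",
    "page_c": "Written by Ravi Patel and edited later.",
}
NAME = r"([A-Z][a-z]+ [A-Z][a-z]+)"  # crude "Firstname Lastname" pattern


def bootstrap(seeds, sources, min_sources=2):
    # Step 1: learn left contexts from places where a known name occurs.
    contexts = set()
    for text in sources.values():
        for seed in seeds:
            for m in re.finditer(r"(\w+:?) " + re.escape(seed), text):
                contexts.add(m.group(1))
    # Step 2: use the learned contexts to propose candidates in every source.
    support = defaultdict(set)
    for page, text in sources.items():
        for ctx in contexts:
            for m in re.finditer(re.escape(ctx) + " " + NAME, text):
                support[m.group(1)].add(page)
    # Step 3: redundancy check, keep candidates seen in enough sources.
    confirmed = {n for n, pages in support.items() if len(pages) >= min_sources}
    return confirmed - set(seeds)


new_names = bootstrap({"Jane Smith"}, SOURCES)
```

Here the spurious candidate "Home Page" is proposed by one context on one
page only, so the redundancy check discards it, while "Ravi Patel" is
confirmed by several independent pages.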







(ACEURL) - http://www.itl.nist.gov/iad/894.01/tests/ace/





1. R. Grishman and B. Sundheim. Message understanding conference - 6: A
brief history. In Proceedings of the 16th International Conference on
Computational Linguistics, Copenhagen, June 1996.

2. Hamish Cunningham. Information Extraction - a User Guide. Research
memo CS - 99 - 07: http://www.dcs.shef.ac.uk/~hamish/IE/userguide/main.html

3. J. Cowie and W. Lehnert. Information Extraction. Communications of
the ACM, 39(1):80-91, 1996.

4. D. Appelt. An Introduction to Information Extraction. Artificial
Intelligence Communications, ?(?), 1999.

5. R. Grishman. TIPSTER Text Architecture Design, Version 3.1.
Technical Report, DARPA, 1996. Available from:

6. Jerry R. Hobbs. Generic Information Extraction System. Available at

7. Diana Maynard. Information Extraction - why Google doesn't even come
close: http://www.gate.ac.uk/sale/talks/bcs-03-cheltenham.ppt

8. S. Bird and M. Liberman. A Formal Framework for Linguistic Annotation.
Technical Report MS-CIS-99-01, Department of Computer and Information
Science, University of Pennsylvania, 1999.

9. Marin Dimitrov. A Light-weight Approach to Coreference Resolution
for Named Entities in Text. MSc thesis, 2002.

10. Joseph F. McCarthy. A Trainable Approach To Coreference Resolution
For Information Extraction. 1996.

11. Saliha Azzam, Kevin Humphreys, Robert Gaizauskas. Coreference
Resolution in a Multilingual Information Extraction System.



H. Cunningham, K. Bontcheva, D. Maynard, V. Tablan. GATE -- A New Release.
ELSNews, 11(1), 2002.

H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and
Graphical Development Environment for Robust NLP Tools and Applications.
Proceedings of the 40th Anniversary Meeting of the Association for
Computational Linguistics (ACL'02). Philadelphia, July 2002.
<http://www.gate.ac.uk/sale/acl02/acl-main.pdf>

H. Cunningham. GATE, a General Architecture for Text Engineering.
Computers and the Humanities, volume 36, pp. 223-254, 2002.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan.
GATE: an Architecture for Development of Robust HLT Applications.
40th Meeting of the Association for Computational Linguistics (ACL'2002),
July 2002. <http://www.gate.ac.uk/sale/talks/acl02/01.html>



Fabio Ciravegna, Alexiei Dingli, David Guthrie and Yorick Wilks:
"Integrating Information to Bootstrap Information Extraction from Web Sites"
in IJCAI 2003 Workshop on Information Integration on the Web, workshop in
conjunction with the 18th International Joint Conference on Artificial
Intelligence (IJCAI 2003) <http://www.ijcai-03.org/1024/index.html>,
Acapulco, Mexico, August 9-15


Fabio Ciravegna, Alexiei Dingli, David Guthrie and Yorick Wilks:
"Mining Web Sites Using Adaptive Information Extraction"
<http://www.dcs.shef.ac.uk/../../Documents/Papers/eacl2003.pdf>
in Research notes and demos section of the 10th Conference of the European
Chapter of the Association of Computational Linguistics (EACL 2003),
Budapest, Hungary, April 12-17



Fabio Ciravegna <http://www.dcs.shef.ac.uk/%7Efabio>:
"Adaptive Information Extraction from Text by Rule Induction and
Generalisation" <http://www.dcs.shef.ac.uk/%7Efabio/paperi/IJCAI01.pdf>
in Proceedings of the 17th International Joint Conference on Artificial
Intelligence (IJCAI 2001) <http://www.ijcai.org/>, Seattle, August 2001.

Fabio Ciravegna <http://www.dcs.shef.ac.uk/%7Efabio>:
"(LP)2, an Adaptive Algorithm for Information Extraction from Web-related
Texts" <http://www.dcs.shef.ac.uk/%7Efabio/paperi/Atem01.pdf>
in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and
Mining <http://www.smi.ucd.ie/ATEM2001/>, held in conjunction with the 17th
International Joint Conference on Artificial Intelligence (IJCAI-01),
Seattle, August 2001.

Fabio Ciravegna <http://www.dcs.shef.ac.uk/%7Efabio> and Daniela Petrelli:
"User Involvement in Adaptive Information Extraction: Position Paper"
<http://www.dcs.shef.ac.uk/%7Efabio/paperi/udie.pdf>
in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and
Mining <http://www.smi.ucd.ie/ATEM2001/>, held in conjunction with the 17th
International Joint Conference on Artificial Intelligence (IJCAI-01),
Seattle, August 2001.



Maria Vargas-Vera, Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt
and Fabio Ciravegna:
"MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic
Markup"
in The 13th International Conference on Knowledge Engineering and Knowledge
Management (EKAW 2002), ed. Gomez-Perez, A., Springer Verlag, 2002.




