A General Tool for Anaphora Resolution - GuiTAR (v1.1)
A. Installation
Download the file gtar1.1.zip and extract the contained files to a working directory.
Files within the zip:
gtar1.1.jar
Template.xml
txtToXML
txtXMLPipeline
Version 1.2 of GuiTAR: gtarBeta1.2.jar
Just substitute file gtar1.1.jar referred to below with this one, gtarBeta1.2.jar, and everything else should work as before.
The new version includes a few bug fixes related to determining the agreement features of noun phrases.
New version (3.0.3) of GuiTAR
B. Running an Example under Linux
(With slight changes, the scripts should also run under Windows, as long as a corresponding version of
ltchunk is available; see below.)
First make the two scripts executable as follows:
chmod 700 txtToXML
chmod 700 txtXMLPipeline
Download the following short sample text file bbcexcerpt.txt.
Then follow these steps:
I. Preprocessing-phase 1: Text-to-XML Conversion
(Here is an alternative to this step, which uses Charniak's full parser instead. If you use this alternative converter, skip 'Preprocessing-phase 1' and continue with 'Preprocessing-phase 2' below.)
Command:
txtToXML bbcexcerpt.txt bbcexcerpt.xml
Output:
None
Explanation:
The script does two things:
1. Calls another script txtXMLPipeline which produces (in this example) the file bbcexcerpt.xml
2. Runs class XMLTokeniser to produce the following file tagged.bbcexcerpt.xml (for this example)
Note: The script txtXMLPipeline makes an external call to ltchunk, a chunker which is part of the LT-XML suite of tools
developed by the University of Edinburgh's LTG.
To my understanding, an evaluation copy may be requested from them, and an online demo is available at
http://www.ltg.ed.ac.uk/~mikheev/tagger_demo.html.
I presume the online demo could be used for this example: the text, processed as far as the script (the pipeline) takes
it before calling ltchunk, would be pasted into the demo's text box, and the resulting text (processed by ltchunk) would
then be passed on to the remaining steps of the script.
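Before running the conversion it may help to confirm that the pieces the script relies on are in place. The following is
a minimal sketch, not part of the GuiTAR distribution: the function name check_prereqs is hypothetical, and it assumes
that java and ltchunk are invoked under those names from the PATH and that both scripts sit in the current directory.

```shell
# Hypothetical pre-flight check (not shipped with GuiTAR).
# Assumes: java and ltchunk are found via PATH; the two scripts are in the current directory.
check_prereqs() {
    missing=0
    # External programs the steps call.
    for tool in java ltchunk; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            missing=1
        fi
    done
    # Scripts that must have been made executable with chmod.
    for script in txtToXML txtXMLPipeline; do
        if [ ! -x "$script" ]; then
            echo "missing or not executable: $script" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_prereqs in the working directory prints one line per missing item and returns a non-zero status if anything
is missing.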
II. Preprocessing-phase 2: Syntactic heuristics
Command:
java -cp gtar1.1.jar uk.ac.essex.malexa.nlp.dp.GuiTAR.prepro.PreProSyntacticHeuristics tagged.bbcexcerpt.xml
Output:
File: tagged.bbcexcerpt.xml
Number of NEs: 26
Explanation:
For every ne marked up by the chunker, this class does the following:
1. Adds four attributes: np type and the three np agreement features - person, number and gender.
2. Marks up the premodifiers
3. Marks up the head
4. Marks up the postmodifiers
The resulting file is masxml.tagged.bbcexcerpt.xml, which is compliant with the format MAS-XML.
III. Anaphora Resolution
Command:
java -cp gtar1.1.jar GTAR_Runner masxml.tagged.bbcexcerpt.xml > general.log
Output:
Spooled into file general.log.
Explanation:
This is the actual anaphora resolution module, which takes as an input a MAS-XML compliant file and adds new mark-up
holding anaphoric information ( elements). The resulting file for this example is
processed.masxml.tagged.bbcexcerpt.xml.
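Steps I to III can be chained in a small wrapper. The sketch below is only an illustration under stated assumptions: the
function names guitar_names and run_guitar and the JAR variable are hypothetical, the input file is assumed to be a .txt
file in the current directory, and every step is assumed to write its output there under the file names reported above.

```shell
# Hypothetical wrapper around steps I to III (not shipped with GuiTAR).
# JAR may be overridden, e.g. JAR=gtarBeta1.2.jar, to use the newer jar.
JAR="${JAR:-gtar1.1.jar}"

# Derive the intermediate file names that the steps above report for a text file.
guitar_names() {
    base=$(basename "$1" .txt)
    xml="$base.xml"                       # written by txtXMLPipeline
    tagged="tagged.$xml"                  # written by XMLTokeniser
    masxml="masxml.$tagged"               # written by PreProSyntacticHeuristics
    processed="processed.$masxml"         # written by GTAR_Runner
}

# Run the three steps in order, stopping at the first one that fails.
# Assumes the scripts are called from the current directory (hence ./).
run_guitar() {
    guitar_names "$1"
    ./txtToXML "$1" "$xml" &&
    java -cp "$JAR" uk.ac.essex.malexa.nlp.dp.GuiTAR.prepro.PreProSyntacticHeuristics "$tagged" &&
    java -cp "$JAR" GTAR_Runner "$masxml" > general.log &&
    echo "result: $processed"
}
```

With these definitions loaded, run_guitar bbcexcerpt.txt would reproduce the example, leaving the resolver's output in
general.log as before.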
IV. Anaphora Resolution Evaluation
This example does not make use of the evaluation module which is also included in the jar file, because the text file
employed in the example has not been annotated, thus no reference annotation is available. If a reference annotation is
available it should come in a MAS-XML compliant file with additional elements holding the anaphoric information (the
format is exactly the same as for the elements added by the system, the only difference being the tag name
itself).
The following command may be issued to invoke the evaluation module:
java -cp gtar1.1.jar GTAR_Evaluation masxml.referenceAnnotation.xml > performance.log
Then performance.log may be imported into Excel using the symbol ^ as the field separator.
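If a spreadsheet is not at hand, the same log can be turned into a tab-separated file on the command line. This is a
minimal sketch: the function name caret_to_tsv is hypothetical, and it assumes only what is stated above, namely that ^
separates the fields (and, additionally, that no field contains a tab).

```shell
# Hypothetical helper: read ^-separated lines on standard input,
# write the same lines with tabs between the fields.
caret_to_tsv() {
    tr '^' '\t'
}
```

For example, caret_to_tsv < performance.log > performance.tsv produces a file that most spreadsheet programs and tools
such as cut can read directly.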
Finally here is the (fairly complete, but a bit out of date :-) API specification of the system (generated by javadoc including private methods).
The source code is also available here, released
under the GPL Licence.
Feedback most welcome! (contact)
Last updated: Jan 2007.