A General Tool for Anaphora Resolution - GuiTAR (v1.1)

A. Installation

Download the file gtar1.1.zip and extract its contents to a working directory. Files within the zip:
gtar1.1.jar
Template.xml
txtToXML
txtXMLPipeline

Version 1.2 of GuiTAR: gtarBeta1.2.jar

Simply use gtarBeta1.2.jar in place of gtar1.1.jar wherever the latter is referred to below; everything else should work as before.

The new version includes a few bug fixes related to determining the agreement features of noun phrases.

New version (3.0.3) of GuiTAR

B. Running an Example under Linux

(With slight changes to the scripts, the example should also run under Windows, as long as a corresponding version of ltchunk is available; see below.)

First make the two scripts executable as follows:

chmod 700 txtToXML
chmod 700 txtXMLPipeline

Download the following short sample text file bbcexcerpt.txt.

Then follow these steps:

I. Preprocessing-phase 1: Text-to-XML Conversion

(Here is an alternative to this step, which uses Charniak's full parser instead. If you use this alternative converter, skip 'Preprocessing-phase 1' and continue with 'Preprocessing-phase 2' below.)

Command:

txtToXML bbcexcerpt.txt bbcexcerpt.xml

Output:

None

Explanation:

The script does two things:
1. Calls another script, txtXMLPipeline, which produces (in this example) the file bbcexcerpt.xml
2. Runs the class XMLTokeniser to produce (in this example) the file tagged.bbcexcerpt.xml

Note: The script txtXMLPipeline makes an external call to ltchunk, a chunker that is part of the LT-XML suite of tools developed by the University of Edinburgh's LTG. As far as I understand, an evaluation copy may be requested from them, and an online demo is available at http://www.ltg.ed.ac.uk/~mikheev/tagger_demo.html.

I presume the online demo could be used for this example: paste the text, processed as far as the corresponding point in the script's pipeline, into the demo's text box, then pass the result (the text processed by ltchunk) on to the remaining steps of the script.
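
Before running txtToXML it may help to confirm that the external tools can be found. The following is only a sketch: it assumes ltchunk is invoked by that name from the PATH (txtXMLPipeline may call it by a different name or path), and have_tool and check_tools are hypothetical helpers, not part of GuiTAR.

```shell
#!/bin/sh
# have_tool is a hypothetical helper (not part of GuiTAR).
# It succeeds if the shell can find the named command.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report each tool passed as an argument; return non-zero if any is missing.
# The name "ltchunk" used below is an assumption; check txtXMLPipeline for
# the name or path it actually calls.
check_tools() {
    missing=0
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_tools java ltchunk
```

If ltchunk is reported missing, the first preprocessing step cannot complete as written, and the alternatives mentioned in this section apply.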

II. Preprocessing-phase 2: Syntactic heuristics

Command:

java -cp gtar1.1.jar uk.ac.essex.malexa.nlp.dp.GuiTAR.prepro.PreProSyntacticHeuristics tagged.bbcexcerpt.xml

Output:

File: tagged.bbcexcerpt.xml
Number of NEs: 26

Explanation:

For every NE marked up by the chunker, this class does the following:
1. Adds four attributes: NP type and the three NP agreement features (person, number and gender).
2. Marks up the premodifiers
3. Marks up the head
4. Marks up the postmodifiers

The resulting file is masxml.tagged.bbcexcerpt.xml, which is compliant with the format MAS-XML.

III. Anaphora Resolution

Command:

java -cp gtar1.1.jar GTAR_Runner masxml.tagged.bbcexcerpt.xml > general.log

Output:

Spooled into file general.log.

Explanation:

This is the actual anaphora resolution module, which takes as an input a MAS-XML compliant file and adds new mark-up holding anaphoric information ( elements). The resulting file for this example is processed.masxml.tagged.bbcexcerpt.xml.
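
The three steps of this example can be chained in one shell script. This is a sketch under assumptions, not a script shipped with GuiTAR: run_guitar and stage_outputs are hypothetical names, the scripts and gtar1.1.jar are assumed to sit in the current directory, and the output file names are the ones given for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the file name each stage is described as
# producing for a given XML file name, one per line.
stage_outputs() {
    xml="$1"
    echo "tagged.$xml"
    echo "masxml.tagged.$xml"
    echo "processed.masxml.tagged.$xml"
}

# Hypothetical wrapper: run the three steps for one .txt input file.
# Assumes txtToXML and gtar1.1.jar are in the current directory.
run_guitar() {
    txt="$1"
    xml="${txt%.txt}.xml"

    # Phase 1: text-to-XML conversion (needs ltchunk, see the note above).
    ./txtToXML "$txt" "$xml" || return 1

    # Phase 2: syntactic heuristics, producing the MAS-XML file.
    java -cp gtar1.1.jar \
        uk.ac.essex.malexa.nlp.dp.GuiTAR.prepro.PreProSyntacticHeuristics \
        "tagged.$xml" || return 1

    # Phase 3: anaphora resolution, with console output kept in general.log.
    java -cp gtar1.1.jar GTAR_Runner "masxml.tagged.$xml" > general.log || return 1

    # Stop with an error if an expected output file is missing.
    for f in $(stage_outputs "$xml"); do
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    done
}

# Usage: run_guitar bbcexcerpt.txt
```

Stopping at the first failing step keeps a missing ltchunk or jar from producing a misleading, half-processed result.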

IV. Anaphora Resolution Evaluation

This example does not use the evaluation module, which is also included in the jar file, because the sample text has not been annotated and so no reference annotation is available. If a reference annotation is available, it should be supplied as a MAS-XML compliant file with additional elements holding the anaphoric information (the format is exactly the same as for the elements added by the system, the only difference being the tag name itself).

The following command may be issued to invoke the evaluation module:
java -cp gtar1.1.jar GTAR_Evaluation masxml.referenceAnnotation.xml > performance.log

Then performance.log may be imported into Excel using the symbol ^ as the field separator.
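
Without Excel, the same log can be handled with standard Unix tools. A minimal sketch, assuming only what is stated above, namely that fields are separated by ^; the columns of performance.log are not documented here, and caret_to_tsv and caret_field are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: read caret-separated lines on standard input and
# write them tab-separated on standard output.
caret_to_tsv() {
    tr '^' '\t'
}

# Hypothetical helper: print only the N-th caret-separated field of each line.
caret_field() {
    cut -d '^' -f "$1"
}

# Usage:
#   caret_to_tsv < performance.log > performance.tsv
#   caret_field 2 < performance.log
```

Tabs are used as the target separator because they are less likely than commas to occur inside a field.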

Finally, here is the (fairly complete, but a bit out of date :-) API specification of the system (generated by javadoc, including private methods).

The source code is also available here, released under the GPL licence.

Feedback most welcome! (contact)

Last updated: Jan 2007.