ACE Reader for the 2004 and 2005 datasets.

Overview

Dataset annotation guidelines: https://www.ldc.upenn.edu/collaborations/past-projects/ace/annotation-tasks-and-specifications

Dataset download links: ACE-2004, ACE-2005

Implementation details

Each document is read into a TextAnnotation instance with the following views defined in the ViewNames class:

Usage

Directory Structure

The reader expects the dataset to be laid out in the following structure:

corpusHomeDir
├── bc
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
├── bn
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
├── cts
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
└── newswire_nw
    ├── apf.dtd
    └── <other files (*.apf.xml, *.sgm)>

Each sub-directory represents a section, and each section has its own text-parsing logic. The reader expects each section directory name to end with a suffix that identifies the parser to use, according to the following suffix logic:

Note: The version of the 2005 corpus for which this reader was developed kept the markup files (.xml, .sgm, etc.) in a single directory, timex2norm, under each section. The reader should work with that directory structure too.
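
Before pointing the reader at a corpus, it can help to confirm that the directory layout matches the tree above. The following is a minimal sketch of such a check, not part of ACEReader: the class name AceLayoutCheck and its methods are hypothetical, it uses only the Java standard library, and it covers only the flat layout shown in the tree (not the timex2norm variant). It assumes a section is located by testing whether a directory name ends with the requested suffix, which is a simplification of whatever matching the reader actually performs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for the flat corpus layout; not part of ACEReader. */
final class AceLayoutCheck {

    /** Returns a description of each problem found; the list is empty if none were found. */
    static List<String> findProblems(Path corpusHome, String[] sectionSuffixes) throws IOException {
        List<String> problems = new ArrayList<>();
        for (String suffix : sectionSuffixes) {
            Path section = findSection(corpusHome, suffix);
            if (section == null) {
                problems.add("no section directory ending in '" + suffix + "'");
                continue;
            }
            // Each section is expected to hold apf.dtd plus *.apf.xml and *.sgm files.
            for (String glob : new String[] {"apf.dtd", "*.apf.xml", "*.sgm"}) {
                if (!hasMatch(section, glob)) {
                    problems.add(section.getFileName() + ": no file matching " + glob);
                }
            }
        }
        return problems;
    }

    /** Finds a sub-directory whose name ends with the suffix (assumed matching rule). */
    private static Path findSection(Path corpusHome, String suffix) throws IOException {
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(corpusHome)) {
            for (Path dir : dirs) {
                if (Files.isDirectory(dir) && dir.getFileName().toString().endsWith(suffix)) {
                    return dir;
                }
            }
        }
        return null;
    }

    /** True if at least one entry directly inside dir matches the glob pattern. */
    private static boolean hasMatch(Path dir, String glob) throws IOException {
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, glob)) {
            return matches.iterator().hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AceLayoutCheck <corpusHomeDir>
        String[] suffixes = {"bc", "bn", "cts", "nw"};
        for (String problem : findProblems(Paths.get(args[0]), suffixes)) {
            System.out.println(problem);
        }
    }
}
```

An empty result only means the files are where the tree says they should be; it says nothing about whether their contents parse.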

Java Usage

import edu.illinois.cs.cogcomp.nlp.corpusreaders.ACEReader;

// Read all sections in ACE-2004
ACEReader reader2004 = new ACEReader("data/ace2004/data/English", true);

// Read all sections in ACE-2005
ACEReader reader2005 = new ACEReader("data/ace2005/data/English", false);

// Read limited sections only
String[] sections = new String[] { "nw", "bn" };
ACEReader reader2004Partial = new ACEReader("data/ace2004/data/English", sections, true);

ACEReader implements the Iterable<TextAnnotation> interface.

Sample Usage:

while (reader.hasNext()) {
	TextAnnotation doc = reader.next();
	...
}

or

for (TextAnnotation doc : reader) {
	...
}

Caveats