CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
Dataset Annotation Guidelines [link] (https://www.ldc.upenn.edu/collaborations/past-projects/ace/annotation-tasks-and-specifications)
Dataset Download links: ACE-2004 ACE-2005
Each document is read into a TextAnnotation instance with the following views defined in the ViewNames class:

TOKENS: The basic TokenLabelView view generated by a Tokenizer from the raw dataset text.

MENTION_ACE: SpanLabelView with overlapping constituents, where each constituent represents an entity extent. Head-word information is stored as attributes, and the label is the Coarse Entity Type. The Fine Entity Type is also stored as an attribute of the Entity constituent. Relations between entities are represented as edges between Entity constituents.

COREF_HEAD, COREF_EXTENT: CoreferenceView that uses copies of mentions from the NER_ACE_COARSE_* views, adds the longest mention as the canonical mention, and adds coreference edges to the other mentions of the same entity.

The reader expects the dataset to be in the following structure:
corpusHomeDir
├── bc
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
├── bn
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
├── cts
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
└── newswire_nw
    ├── apf.dtd
    └── <other files (*.apf.xml, *.sgm)>
Each of the sub-directories represents a section and has different text parsing logic. The reader expects the section directories to end with a suffix representing the parser to be used according to the following suffix logic:
Note: The version of the 2005 corpus for which this reader was developed had the markup files (.xml, .sgm, etc.) in a single directory, timex2norm, under each section. The reader should work for this directory structure too.
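The idea of choosing a parser from the end of a section directory name can be sketched as below. This is only an illustration and not the reader's own code: the class SectionSuffixes, the method parserSuffixFor, and the suffix list (copied from the directory names in the tree above) are assumptions, and the real reader may recognize other suffixes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration; it is not part of ACEReader.
class SectionSuffixes {
    // Assumed suffixes, taken from the example directory tree.
    static final List<String> KNOWN_SUFFIXES = Arrays.asList("bc", "bn", "cts", "nw");

    // Returns the known suffix that the directory name ends with, if any.
    static Optional<String> parserSuffixFor(String sectionDirName) {
        String name = sectionDirName.toLowerCase(Locale.ROOT);
        for (String suffix : KNOWN_SUFFIXES) {
            if (name.endsWith(suffix)) {
                return Optional.of(suffix);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String dir : Arrays.asList("bc", "newswire_nw", "misc")) {
            String suffix = parserSuffixFor(dir).orElse("(no matching parser)");
            System.out.println(dir + " -> " + suffix);
        }
    }
}
```

Under this convention a directory named newswire_nw would be handled like nw, while a directory whose name matches no known suffix would be left without a parser.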
import edu.illinois.cs.cogcomp.nlp.corpusreaders.ACEReader;
// Read all sections in ACE-2004
ACEReader reader2004 = new ACEReader("data/ace2004/data/English", true);
// Read all sections in ACE-2005
ACEReader reader2005 = new ACEReader("data/ace2005/data/English", false);
// Read limited sections only
String[] sections = new String[] { "nw", "bn" };
ACEReader reader2004Partial = new ACEReader("data/ace2004/data/English", sections, true);
ACEReader implements the Iterable<TextAnnotation> interface.
Sample Usage:
while (reader.hasNext()) {
    TextAnnotation doc = reader.next();
    ...
}
or
for (TextAnnotation doc : reader) {
    ...
}
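Both loop styles above are possible when one object acts as an iterator and as an iterable at the same time. The following minimal sketch shows that pattern with a hypothetical InMemoryReader that yields plain strings. It does not use ACEReader or TextAnnotation, and whether ACEReader is built exactly this way is an assumption.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a corpus reader; yields Strings instead of TextAnnotations.
class InMemoryReader implements Iterable<String>, Iterator<String> {
    private final List<String> docs;
    private int position = 0;

    InMemoryReader(List<String> docs) {
        this.docs = docs;
    }

    @Override
    public boolean hasNext() {
        return position < docs.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return docs.get(position++);
    }

    // In this sketch the reader returns itself, so it supports a single pass.
    @Override
    public Iterator<String> iterator() {
        return this;
    }

    public static void main(String[] args) {
        InMemoryReader reader = new InMemoryReader(Arrays.asList("doc-1", "doc-2"));
        for (String doc : reader) {
            System.out.println(doc);
        }
    }
}
```

Because iterator() returns the same object in this sketch, a second loop over an exhausted reader would see no documents; a fresh reader would be needed for another pass.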