CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
Penn Attribution Relation Corpus 3.0 (PARC 3.0) paper link
Contact the corpus owner, Silvia Pareti, for access to the corpus (you will need valid LDC licenses for PTB and PDTB).
Given a source directory, the reader looks for all files with the “.xml” extension in all nested subdirectories. Each document is read into a TextAnnotation instance with the following views, whose names are defined in the ViewNames class:
TOKENS: a TokenLabelView that keeps the gold tokenization from the corpus.
SENTENCE: a TokenLabelView that keeps the gold sentence splits from the corpus.
ATTRIBUTION_RELATION: a PredicateArgumentView. Each attribution relation corresponds to one predicate-argument set. The “cue” in each attribution relation serves as the predicate, and the “source”s and “span”s in that relation serve as its arguments.
POS: a TokenLabelView that keeps the POS tags from the corpus.
LEMMA: a TokenLabelView that keeps the lemma of each token from the corpus.
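To make the ATTRIBUTION_RELATION mapping more concrete, here is a minimal, library-independent sketch of how one attribution relation breaks down into a predicate (the cue) and its arguments (sources and spans). The `AttributionRelationSketch`, `TokenRange`, and `AttributionRelation` names and the example sentence are hypothetical and invented for illustration; they are not part of the CogComp API or the corpus, and the real view is a PredicateArgumentView.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types for illustration; NOT the CogComp PredicateArgumentView API.
class AttributionRelationSketch {

    // A labeled token range: role name plus [start, end) token offsets.
    static final class TokenRange {
        final String role;
        final int start;
        final int end;

        TokenRange(String role, int start, int end) {
            this.role = role;
            this.start = start;
            this.end = end;
        }
    }

    // One attribution relation: the cue is the predicate; sources and spans are arguments.
    static final class AttributionRelation {
        final TokenRange cue;
        final List<TokenRange> sources;
        final List<TokenRange> spans;

        AttributionRelation(TokenRange cue, List<TokenRange> sources, List<TokenRange> spans) {
            this.cue = cue;
            this.sources = sources;
            this.spans = spans;
        }

        // All arguments of the predicate: sources first, then spans.
        List<TokenRange> arguments() {
            List<TokenRange> args = new ArrayList<>(sources);
            args.addAll(spans);
            return args;
        }
    }

    // Joins the tokens covered by a range into a single string.
    static String text(String[] tokens, TokenRange r) {
        return String.join(" ", Arrays.copyOfRange(tokens, r.start, r.end));
    }

    public static void main(String[] args) {
        // Invented example sentence, not taken from the corpus.
        String[] tokens = {"The", "company", "said", "profits", "rose", "."};
        AttributionRelation rel = new AttributionRelation(
                new TokenRange("cue", 2, 3),
                List.of(new TokenRange("source", 0, 2)),
                List.of(new TokenRange("span", 3, 5)));

        System.out.println("predicate: " + text(tokens, rel.cue));
        for (TokenRange arg : rel.arguments()) {
            System.out.println(arg.role + ": " + text(tokens, arg));
        }
    }
}
```

In this sketch a relation has exactly one cue but may carry several sources and spans, which mirrors the one-predicate, many-arguments shape described above.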
Standard WSJ directory structure.
\PARC3
    \train
        \00
            wsj-0001.xml
            ...
        \01
            wsj-0101.xml
            ...
        ...
    \test
        \23
            ...
    \dev
        \24
            ...
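The recursive lookup of “.xml” files over a layout like this can be sketched with the Java standard library alone. This is an illustration of the described behavior, not PARC3Reader's actual implementation; `XmlFileFinder` and `findXmlFiles` are hypothetical names, and the default path is only an assumption taken from the usage example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the CogComp library.
class XmlFileFinder {

    // Returns every regular file under root whose name ends in ".xml", sorted by path.
    static List<Path> findXmlFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".xml"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            // Surface I/O problems (for example, a missing root) as unchecked exceptions.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed default location; pass a different directory as the first argument.
        Path root = Path.of(args.length > 0 ? args[0] : "data/PARC3/train");
        if (!Files.isDirectory(root)) {
            System.err.println("Not a directory: " + root);
            return;
        }
        for (Path file : findXmlFiles(root)) {
            System.out.println(file);
        }
    }
}
```

Sorting the paths makes the traversal order deterministic, which is convenient when results need to be reproducible across runs.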
import edu.illinois.cs.cogcomp.nlp.corpusreaders.parcReader.PARC3Reader;
import edu.illinois.cs.cogcomp.nlp.corpusreaders.parcReader.PARC3ReaderConfigurator;
// Read all training data with the default settings (discard gold POS and LEMMA)
PARC3Reader reader = new PARC3Reader("data/PARC3/train");
Alternatively, specify your own settings by creating a *.properties file. See PARC3ReaderConfigurator for the fields you can specify.
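import edu.illinois.cs.cogcomp.core.utilities.configuration.ResourceManager;
// Read data using the settings from a custom properties file
PARC3Reader reader = new PARC3Reader(new ResourceManager("my-parc3-config.properties"));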
PARC3Reader implements the Iterable<TextAnnotation> interface.
while (reader.hasNext()) {
    TextAnnotation doc = reader.next();
    ...
}
or
for (TextAnnotation doc : reader) {
    ...
}
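Both loop forms work on the same object when a reader acts as an Iterator and also returns itself from iterator(). The sketch below illustrates that general pattern with a hypothetical `SinglePassReader` over plain strings; it is an assumption about the pattern, not PARC3Reader's source code, and the document names are made up.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical reader for illustration; not part of the CogComp library.
class SinglePassReader implements Iterable<String>, Iterator<String> {
    private final List<String> docs;
    private int cursor = 0; // index of the next document to return

    SinglePassReader(List<String> docs) {
        this.docs = docs;
    }

    @Override
    public boolean hasNext() {
        return cursor < docs.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return docs.get(cursor++);
    }

    // Returning this lets the for-each loop continue from the current position.
    @Override
    public Iterator<String> iterator() {
        return this;
    }

    public static void main(String[] args) {
        SinglePassReader reader =
                new SinglePassReader(List.of("wsj-0001", "wsj-0002", "wsj-0003"));

        // Explicit iterator style: consume the first document.
        String first = reader.next();
        System.out.println("first: " + first);

        // For-each style: picks up the remaining documents.
        for (String doc : reader) {
            System.out.println("rest: " + doc);
        }
    }
}
```

In this sketch the reader is exhausted after one pass, so mixing the two loop styles on the same instance shares a single cursor; a fresh instance is needed to iterate again.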