PARC 3.0 Reader

Overview

Penn Attribution Relation Corpus 3.0 (PARC 3.0) paper link

Contact the owner of the corpus, Silvia Pareti, for access. You will need valid LDC licenses for PTB and PDTB.

Implementation details

Given a source directory, the reader looks for all files with the “.xml” extension in all nested sub-directories. Each document is read into a TextAnnotation instance whose views are named in the ViewNames class.

Directory Structure

Standard WSJ directory structure.

\PARC3
    \train
        \00
            wsj-0001.xml
            ...
        \01
            wsj-0101.xml
            ...
        ...
    \test
        \23
            ...
    \dev
        \24
            ...
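
The recursive lookup for “.xml” files described under Implementation details can be illustrated with the standard library alone. The following is a minimal sketch, not the reader's actual code: the class name XmlFileFinder and the method findXmlFiles are hypothetical, and sorting the results is an assumption made here only to keep the order deterministic.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the PARC3Reader API.
public class XmlFileFinder {

    // Recursively collect every regular file under root whose name ends in ".xml".
    // Sorting is an assumption of this sketch, used to get a stable order.
    public static List<Path> findXmlFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".xml"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the example path from this README when no argument is given.
        Path root = Paths.get(args.length > 0 ? args[0] : "data/PARC3/train");
        for (Path file : findXmlFiles(root)) {
            System.out.println(file);
        }
    }
}
```

Running it against a corpus directory before constructing the reader is a quick way to check that the layout matches the tree above.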

Usage in Java

import edu.illinois.cs.cogcomp.nlp.corpusreaders.parcReader.PARC3Reader;
import edu.illinois.cs.cogcomp.nlp.corpusreaders.parcReader.PARC3ReaderConfigurator;

// Read all training data with default settings (discard gold POS and LEMMA)
PARC3Reader reader = new PARC3Reader("data/PARC3/train");

Alternatively, specify your own settings in a *.properties file. See PARC3ReaderConfigurator for the fields you should specify.

import edu.illinois.cs.cogcomp.core.utilities.configuration.ResourceManager;
PARC3Reader reader = new PARC3Reader(new ResourceManager("my-parc3-config.properties"));

PARC3Reader implements the Iterable<TextAnnotation> interface. Documents can be read one at a time:

while (reader.hasNext()) {
	TextAnnotation doc = reader.next();
	...
}

or

for (TextAnnotation doc : reader) {
	...
}
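
Both loop styles above work on the same object when a class implements Iterator as well as Iterable. The sketch below shows that pattern with a hypothetical stand-in class, SketchReader, which yields Strings instead of TextAnnotation objects. It is not the PARC3Reader source, and returning `this` from iterator() is an assumption of the sketch, not a statement about how PARC3Reader behaves.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a corpus reader; yields Strings instead of TextAnnotation.
public class SketchReader implements Iterable<String>, Iterator<String> {
    private final List<String> docs;
    private int position = 0;

    public SketchReader(List<String> docs) {
        this.docs = docs;
    }

    @Override
    public boolean hasNext() {
        return position < docs.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return docs.get(position++);
    }

    // Returning this makes for-each share the cursor used by hasNext()/next(),
    // so in this sketch the reader can only be traversed once.
    @Override
    public Iterator<String> iterator() {
        return this;
    }

    public static void main(String[] args) {
        SketchReader reader = new SketchReader(Arrays.asList("wsj-0001", "wsj-0002"));
        for (String doc : reader) {
            System.out.println(doc);
        }
        // The cursor is exhausted after the loop above.
        System.out.println(reader.hasNext());
    }
}
```

With a design like this, mixing the two styles on one instance continues from wherever the cursor currently is, so creating a fresh reader per pass is the safer habit.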