CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
See the parent package reademe first
A mention is a reference or representation of an entity or an object that appeared in texts. Mentions can have different mention types and the entity a mention is referencing can have different entity types. For example, in the sentence “He is Obama, the president.”, “He”, “Obama” and “the president” are referencing to the same object (Barack Obama himself), so they are all mentions to the entity Barack Obama.
This is a mention detection module that aims to annotate all mentions in given texts. It has an annotator that accepts a TextAnnotation
and give it a new view which contains all the mentions as Constituent
.
This package tags all the tokens in a text into either “B/I/O” or “B/I/O/L/U” schema, then train/test on each single token. After predicting all tokens, we interpret this representation back to mentions.
The mentions predicted from the previous step is only mention heads. After detecting all mention-heads first, this package also has extent classification for each heads. In the extent classifier training process, each token in the extent to both right and left of a given mention head is formed into pairs to train a binary classifier, indicating whether the token is in the extent or not. To add negative examples, one token to the left and right of the mention extent is added. In the evaluating process, given a mention head, the package forms pairs with tokens to the left and right, until finding a token that is predicted to be not in the extent.
Head Boundary
Train\Eval (F1) | ACE | ERE |
---|---|---|
ACE | 89.6 | 83.8 |
ERE | 81.7 | 86.7 |
ACE+ERE | 88.4 | 86.5 |
Extent boundary accuracy given Head
Accuracy | |
---|---|
ACE | 89.45 |
ERE | 88.74 |
ACE+ERE | 86.65 |
If you want to use the illinois-md package independently, you can add a maven dependency in your pom.xml. Please replace the VERSION
with the latest version of the parent package.
<dependency>
<groupId>edu.illinois.cs.cogcomp</groupId>
<artifactId>illinois-md</artifactId>
<version>VERSION</version>
</dependency>
The recommended way to use this package is through the MentionAnnotator
. MentionAnnotator
can be initialized with or without an mode parameter.
The mode parameter has four options:
MentionAnnotator("ACE_NONTYPE")
loads the model trained on ACE without type.MentionAnnotator("ACE_TYPE")
loads the model trained on ACE with type.MentionAnnotator("ERE_NONTYPE")
loads the model trained on ERE without type.MentionAnnotator("ERE_TYPE")
loads the model trained on ERE with type.The initialization of MentionAnnotator()
without parameter set mode to "ACE_NONTYPE"
There is a static function getHeadConstituent(Constituent extent, String viewName)
built into the annotator, which returns the mention-heads.
import edu.illinois.cs.cogcomp.annotation.AnnotatorException;
import edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation;
import edu.illinois.cs.cogcomp.core.datastructures.textannotation.Constituent;
import edu.illinois.cs.cogcomp.core.datastructures.textannotation.View;
import edu.illinois.cs.cogcomp.nlp.utility.TokenizerTextAnnotationBuilder;
import edu.illinois.cs.cogcomp.annotation.TextAnnotationBuilder;
import edu.illinois.cs.cogcomp.core.datastructures.ViewNames;
import edu.illinois.cs.cogcomp.nlp.tokenizer.StatefulTokenizer;
import org.cogcomp.md.MentionAnnotator;
import java.io.IOException;
import java.util.List;
public class app
{
public static void main( String[] args ) throws IOException, AnnotatorException
{
String text1 = "Good afternoon, gentlemen. I am a HAL-9000 "
+ "computer. I was born in Urbana, Il. in 1992";
String corpus = "2001_ODYSSEY";
String textId = "001";
// Create a TextAnnotation using the LBJ sentence splitter
// and tokenizers.
TextAnnotationBuilder tab;
// don't split on hyphens, as NER models are trained this way
boolean splitOnHyphens = false;
tab = new TokenizerTextAnnotationBuilder(new StatefulTokenizer(splitOnHyphens));
TextAnnotation ta = tab.createTextAnnotation(corpus, textId, text1);
MentionAnnotator mentionAnnotator = new MentionAnnotator();
mentionAnnotator.addView(ta);
View mentionView = ta.getView(ViewNames.MENTION);
List<Constituent> predictedMentions = mentionView.getConstituents();
for (Constituent extent : predictedMentions){
//Get the head if needed
Constituent head = MentionAnnotator.getHeadConstituent(extent, "MENTION_HEAD");
}
}
}
This package also supports custom training. Through this, the user can produce models that is not pre-loaded in the annotator.
To use this, put data/
models/
tmp/
at the root of the cogcomp-nlp package, then modify/run tests from BIOTester
, ExtentTester
or AnnotatorTester
Please note that the user need to check the inner implementation of the tests to ensure data paths and output paths are correct.
mvn exec:java -Dexec.mainClass="org.cogcomp.md.BIOTester [METHOD]"
Supported Methods:
mvn exec:java -Dexec.mainClass="org.cogcomp.md.ExtentTester [METHOD]"
Supported Methods:
If you use this tool, please cite the following works.
@inproceedings{PengChRo15,
author = {Haoruo Peng, Kai-Wei Chang Dan Roth},
title = {A Joint Framework for Coreference Resolution and Mention Head Detection},
booktitle = {CoNLL},
pages = {10},
month = {7},
year = {2015},
address = {University of Illinois, Urbana-Champaign, Urbana, IL, 61801},
publisher = {ACL},
url = "http://cogcomp.org/papers/MentionDetection.pdf",
}
@inproceedings{RatinovRo09,
author = {L. Ratinov and D. Roth},
title = {Design Challenges and Misconceptions in Named Entity Recognition},
booktitle = {CoNLL},
month = {6},
year = {2009},
url = "http://cogcomp.org/papers/RatinovRo09.pdf",
funding = {MIAS, SoD, Library},
projects = {IE},
comment = {Named entity recognition; information extraction; knowledge resources; word class models; gazetteers; non-local features; global features; inference methods; BIO vs. BIOLU; text chunk representation},
}