CogComp-DatalessClassifier

Given a label ontology and textual descriptions of its labels, Dataless-Classifier classifies arbitrary text into that ontology.

It is particularly useful when gathering enough training data for a supervised text classifier is difficult or expensive. Dataless-Classifier uses the semantic meaning of the labels to bypass the need for explicit supervision. For more information, please visit our main project page.
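The idea of classifying without training data can be illustrated with a minimal sketch: represent the text and each label description in the same vector space, then pick the label whose description is most similar. The sketch below is not this project's API. It substitutes plain bag-of-words counts for the ESA and Word2Vec representations, and the function names, labels, and descriptions are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a toy stand-in for ESA or Word2Vec)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, label_descriptions):
    """Return the label whose description is most similar to the text."""
    doc = bow(text)
    return max(
        label_descriptions,
        key=lambda label: cosine(doc, bow(label_descriptions[label])),
    )


# Hypothetical labels and descriptions, for illustration only.
LABELS = {
    "sports": "sports baseball hockey game team player",
    "science": "science space medicine electronics research",
}
```

No labeled documents are involved: calling `classify("the hockey team won the game", LABELS)` compares the text only against the label descriptions. A richer semantic representation would let the same procedure match text that shares no surface words with a description, which bag-of-words counts cannot do.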

Some key points:

Label Hierarchy

Dataless Classification requires the end user to specify a label hierarchy (with label descriptions) into which text is classified. The label hierarchy must be provided in a specific format:

We provide a sample 20newsgroups hierarchy with label descriptions inside data/hierarchy/20newsgroups, where:

We also provide improved 20newsgroups label descriptions in labelDesc_Kws_embellished.txt, which correspond to the label descriptions used in [2], whereas labelDesc_Kws_simple.txt corresponds to those used in [1].
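To show how a label hierarchy with descriptions can drive classification, here is a minimal sketch of one possible strategy: start at the root and, at each level, descend into the child whose description is most similar to the text. This is an illustration under stated assumptions, not the project's implementation. The in-memory dictionaries do not reflect the hierarchy file format, the descriptions are invented rather than taken from the shipped files, and bag-of-words counts stand in for the real embeddings.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a toy stand-in for ESA or Word2Vec)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical in-memory hierarchy: parent label -> child labels.
CHILDREN = {
    "root": ["rec", "sci"],
    "rec": ["rec.sport.hockey", "rec.autos"],
    "sci": ["sci.space", "sci.med"],
}

# Hypothetical label descriptions (not the contents of the shipped files).
DESCRIPTIONS = {
    "rec": "recreation sport hockey autos car game",
    "sci": "science space medicine",
    "rec.sport.hockey": "hockey game team player",
    "rec.autos": "car autos engine driver",
    "sci.space": "space nasa orbit launch",
    "sci.med": "medicine doctor disease treatment",
}


def classify_top_down(text, children=CHILDREN, descriptions=DESCRIPTIONS, root="root"):
    """Walk down from the root, choosing the most similar child at each level."""
    doc = bow(text)
    path = []
    node = root
    while node in children:  # stop once a leaf label is reached
        node = max(children[node], key=lambda c: cosine(doc, bow(descriptions[c])))
        path.append(node)
    return path
```

The function returns the chosen label at every level, so `classify_top_down("the driver tuned the car engine")` yields a path from a top-level label down to a leaf. A greedy descent like this cannot recover from a wrong choice near the root, which is one reason the quality of the label descriptions matters.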

Embeddings

ESA and Word2Vec embeddings are fetched from the DataStore on demand.

Config

A sample config file with the default values is provided in the config folder: config/project.properties

To check whether the project is set up correctly, run:

If you use this software for research, please cite the following papers:

[1] Chang, Ming-Wei, et al. “Importance of Semantic Representation: Dataless Classification.” AAAI. Vol. 2. 2008.

[2] Song, Yangqiu, and Dan Roth. “On Dataless Hierarchical Text Classification.” AAAI. Vol. 7. 2014.