Transliteration

This is a Java port of Jeff Pasternack's C# code from the paper Learning Better Transliterations.

See examples in TestTransliteration or Runner.

Training data

To train a model, you need pairs of names. A common source is Wikipedia interlanguage links. For example, see this data from Transliterating From All Languages by Anne Irvine et al.

The standard data format expected is:

foreign<tab>english
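
For example, a Russian–English training file might contain lines like the following (hypothetical pairs, with <tab> standing in for a literal tab character):

Владимир<tab>Vladimir
Москва<tab>Moscow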

That said, the Utils class has readers for many different datasets (including Anne Irvine’s data).

Training a model

The standard class is the SPModel. Use it as follows:

// Read training pairs from trainfile.
List<Example> training = Utils.readWikiData(trainfile);
SPModel model = new SPModel(training);
model.Train(10);             // run 10 training iterations
model.WriteProbs(modelfile); // save the learned model to modelfile

This trains a model and writes it to the path specified by modelfile.

SPModel has another useful function called Probability(source, target), which will return the transliteration probability of a given pair.
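For example, to score a single pair with the model trained above (a minimal sketch, assuming Probability takes the source and target strings and returns a double; the pair shown is hypothetical):

double p = model.Probability("Владимир", "Vladimir");
System.out.println("transliteration probability: " + p);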

Annotating

A trained model can be used immediately after training, or you can initialize SPModel using a previously trained and saved modelfile.

// Load a previously trained model from modelfile.
SPModel model = new SPModel(modelfile);
model.setMaxCandidates(10); // return at most 10 candidates per input
// Generate ranked transliteration candidates for the test string.
TopList<Double,String> predictions = model.Generate(testexample);

Because we limited the maximum number of candidates to 10, predictions will contain at most 10 elements, sorted by score from highest to lowest; the first element is the best candidate.

Interactive

Once you have trained a model, it is often helpful to try interacting with it. Use interactive.sh for this:

$ ./scripts/interactive.sh models/modelfile