TCatNG Toolkit :: Text Categorization via N-Grams

What is TCatNG?

The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.

N-Grams are short sequences of bytes or letters, and their statistics provide valuable information about the byte sequences and strings they are drawn from. N-Gram-based approaches work well for text categorization because every string is decomposed into small overlapping parts, so an error affects only a limited number of those parts and leaves the remainder intact.
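
The decomposition described above can be illustrated with a minimal Java sketch. It is not part of the TCatNG API: the class name NGramSketch and the methods charNGrams and sharedNGrams are hypothetical names chosen for this example, and the sketch works on Java characters (UTF-16 code units) rather than raw bytes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical helper, not a TCatNG class.
final class NGramSketch {

    /** Returns every overlapping character n-gram of length n, in order of appearance. */
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the distinct n-grams that occur in both strings. */
    static int sharedNGrams(String a, String b, int n) {
        Set<String> shared = new HashSet<>(charNGrams(a, n));
        shared.retainAll(new HashSet<>(charNGrams(b, n)));
        return shared.size();
    }

    public static void main(String[] args) {
        // "categorization" has 12 trigrams: cat, ate, teg, ego, gor, ori, riz, iza, zat, ati, tio, ion
        System.out.println(charNGrams("categorization", 3));
        // A one-letter spelling difference changes only the 3 trigrams that contain it,
        // so this prints 9.
        System.out.println(sharedNGrams("categorization", "categorisation", 3));
    }
}
```

In this example a single changed letter leaves 9 of the 12 trigrams untouched, which is why the statistics of the whole string stay largely stable when the input contains errors.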

The use of character N-Grams also does not require a separator to be specified, explicitly or implicitly, as is necessary for words. Consequently, N-Gram analysis is a valuable approach for text written in any language that is based on an alphabet and builds text by concatenation, and it eliminates the need for complex tokenization, stemming, and/or lemmatization.

There are many possible applications: categorizing documents by topic, detecting the author of a text, or recognizing the language and encoding of a sequence of bytes (e.g. in a search engine, to determine the language of a document). Language recognition is the application this software package was originally designed for, but many other uncharted areas are left for you to explore.

In sum, this package offers a robust research framework for experimenting with text categorization using character N-Grams.

People

TCatNG was developed at the XLDB group of the Department of Informatics of the Faculty of Sciences of the University of Lisbon in Portugal. It was created to support the research paper "Language Identification in Web Pages".

TCatNG was written by Bruno Martins and Nuno Seco.

Research

TCatNG is a Java package that implements the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization".

The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category, and compare this with the fingerprints of a number of documents for which the categories are known. The categories of the closest matches are output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency. Fingerprints are compared with a simple "out-of-place" metric.
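
As a rough illustration of the fingerprint and "out-of-place" idea, here is a minimal Java sketch. It is a simplification written for this page, not the TCatNG implementation: the class FingerprintSketch and its methods are hypothetical names, the n-gram lengths and fingerprint size are arbitrary illustrative parameters, ties in frequency are broken alphabetically as an assumption, and an n-gram missing from the category fingerprint is charged a penalty equal to that fingerprint's length.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical sketch, not the TCatNG API.
final class FingerprintSketch {

    /** Builds a fingerprint: the topK most frequent n-grams of length 1..maxN, most frequent first. */
    static List<String> fingerprint(String text, int maxN, int topK) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Descending frequency; ties broken alphabetically (an assumption) for determinism.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topK, ranked.size())));
    }

    /** Sums, over the document's n-grams, how far each one's rank is from its rank in the category. */
    static int outOfPlace(List<String> document, List<String> category) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < category.size(); i++) {
            rank.put(category.get(i), i);
        }
        int maxPenalty = category.size(); // charged when the n-gram is absent from the category
        int distance = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer r = rank.get(document.get(i));
            distance += (r == null) ? maxPenalty : Math.abs(r - i);
        }
        return distance;
    }

    /** Returns the label whose fingerprint has the smallest out-of-place distance to the text. */
    static String classify(String text, Map<String, String> trainingTexts, int maxN, int topK) {
        List<String> doc = fingerprint(text, maxN, topK);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> e : trainingTexts.entrySet()) {
            int d = outOfPlace(doc, fingerprint(e.getValue(), maxN, topK));
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> training = new LinkedHashMap<>();
        training.put("en", "the quick brown fox jumps over the lazy dog");
        training.put("pt", "o rato roeu a roupa do rei de roma");
        System.out.println(classify("the dog jumps over the fox", training, 3, 20));
    }
}
```

Real fingerprints are built from much larger training samples than the two toy sentences used here; the sketch only shows how ranking by frequency and summing rank differences fit together.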

This package also implements some extensions to the original proposal. Among other things, the software offers support for Good-Turing smoothing and new fingerprint comparison methods based on the similarity metrics proposed by Lin in "An Information-Theoretic Definition of Similarity" and by Jiang & Conrath in "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy". Classification methods other than nearest neighbour are also implemented, such as Support Vector Machines and Bayesian Logistic Regression.

Availability

TCatNG is released under the BSD License, which basically states that you can do anything you like with it as long as you mention the authors and make it clear that the library is covered by the BSD License. It also exempts us from any liability, should this library eat your hard disc or kill your cat.

Source code, samples and detailed documentation are provided in the download. The Java API documentation is also available online.

The toolkit requires very little disk space to install and run. We encourage you to try it out and let us know of any problems you find. We would also be very happy to hear from people who are using this software package.