Package pt.tumba.ngram

The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.


Interface Summary
Profile Abstract interface to model an N-Gram Profile.
 

Class Summary
DataProfile A Profile stores N-gram frequency information for a given textual string.
EntryProfile A Profile stores N-gram frequency information for a given textual string.
LanguageClass Understanding the language of a given document is an essential step in working with unstructured multilingual text.
NGram This class models a concrete and simple N-Gram.
NGramCathegorizer NGramCathegorizer implements the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization".
NGramConstants Constant values used in the TCatNG package.
ProfileReader Class to hold (static) methods to read in profile data.
ProfileWriter Write an N-Gram profile to disk.
TCatNG The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.
 

Exception Summary
TCatNGException Wrapper for Exceptions used in the TCatNG package.
 

Package pt.tumba.ngram Description

The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.
N-Grams are short sequences of bytes or letters, and their statistics provide valuable information about byte sequences and strings. N-Gram-based approaches are very powerful in text categorization because every string is decomposed into small parts, and errors tend to affect only a limited number of those parts, leaving the remainder intact.
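The decomposition described above can be illustrated with a minimal sketch. The class and method names here (NGramSketch, ngrams, shared) are hypothetical and are not part of the pt.tumba.ngram API; the sketch only shows that a single changed character disturbs just the few N-Grams that overlap it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of the TCatNG package.
class NGramSketch {

    /** Returns the set of distinct character N-Grams of length n found in text. */
    static Set<String> ngrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        // Slide a window of width n over the string, one character at a time.
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the N-Grams that occur in both sets. */
    static int shared(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        Set<String> original = ngrams("categorization", 3);
        Set<String> variant = ngrams("categorisation", 3);
        // Only the trigrams that overlap the changed character differ.
        System.out.println(shared(original, variant) + " of " + original.size()
                + " trigrams are unaffected");
    }
}
```

In this example the two spellings differ in one character, so only the three trigrams that contain it ("riz", "iza", "zat") are lost; the other nine still match.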
Character N-Grams also do not require a separator to be specified, explicitly or implicitly, as is necessary for words. Consequently, analyzing a text in terms of N-Grams is a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator, and it eliminates the need for complex text tokenization, stemming, and/or lemmatization.
There are many possible applications: categorizing documents by topic, detecting the author of a text, or recognizing the language and encoding of an arbitrary sequence of bytes (e.g. in a search engine, to determine the language of a document). Language and encoding recognition is the application this software package was first designed for, but many other uncharted areas are up to you to explore.
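As a rough illustration of language recognition with N-Gram profiles, the following sketch follows the general idea of the rank-order ("out-of-place") comparison from Cavnar & Trenkle: each text is reduced to a list of N-Grams ordered by frequency, and a document is assigned to the category whose list is closest. This is a simplified sketch under stated assumptions, not the implementation used by NGramCathegorizer. All names (RankProfileSketch, profile, distance, classify) are hypothetical, and the N-Gram length, profile size, and training strings are illustrative values only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the TCatNG package.
class RankProfileSketch {

    /** Builds a list of N-Grams (lengths 1..maxN), most frequent first, cut to maxSize entries. */
    static List<String> profile(String text, int maxN, int maxSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Descending by count; ties broken alphabetically so the order is deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.size() > maxSize ? new ArrayList<>(ranked.subList(0, maxSize)) : ranked;
    }

    /** Sums rank differences; an N-Gram missing from the category costs a fixed maximum penalty. */
    static int distance(List<String> document, List<String> category) {
        Map<String, Integer> rankInCategory = new HashMap<>();
        for (int i = 0; i < category.size(); i++) {
            rankInCategory.put(category.get(i), i);
        }
        int maxPenalty = category.size(); // assumption: penalty equals the category profile size
        int total = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer rank = rankInCategory.get(document.get(i));
            total += (rank == null) ? maxPenalty : Math.abs(rank - i);
        }
        return total;
    }

    /** Returns the label of the category profile with the smallest distance to the text. */
    static String classify(String text, Map<String, List<String>> categories) {
        List<String> doc = profile(text, 3, 300);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> entry : categories.entrySet()) {
            int d = distance(doc, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training texts; real profiles would be built from much larger samples.
        Map<String, List<String>> categories = new HashMap<>();
        categories.put("english", profile("the quick brown fox jumps over the lazy dog", 3, 300));
        categories.put("portuguese", profile("o rato roeu a rolha da garrafa do rei", 3, 300));
        System.out.println(classify("the dog and the fox", categories));
    }
}
```

The training strings above are far too short to give reliable results; in practice each category profile would be built from a substantial text sample and, in this package, stored and loaded through classes such as ProfileWriter and ProfileReader.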