pt.tumba.ngram

Package pt.tumba.ngram

The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.

Interface Summary
Profile	Abstract interface to model an N-Gram Profile.

Class Summary
DataProfile	A `Profile` stores N-gram frequency information for a given textual string.
EntryProfile	A `Profile` stores N-gram frequency information for a given textual string.
LanguageClass	Understanding the language of a given document is an essential step in working with unstructured multilingual text.
NGram	This class models a concrete and simple N-Gram.
NGramCathegorizer	`NGramCathegorizer` implements the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization".
NGramConstants	Constant values used in the TCatNG package.
ProfileReader	Class to hold (static) methods to read in profile data.
ProfileWriter	Write an N-Gram profile to disk.
TCatNG	The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.

Exception Summary
TCatNGException	Wrapper for Exceptions used in the TCatNG package.

Package pt.tumba.ngram Description

The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.
N-Grams are short sequences of bytes or letters, and their statistics provide valuable information about byte sequences and strings. N-Gram based approaches are very powerful in text categorization because every string is decomposed into small parts, so errors such as misspellings tend to affect only a limited number of those parts, leaving the remainder intact.
The use of character N-Grams also does not require, explicitly or implicitly, the specification of a separator, as is necessary for words. Consequently, analyzing a text in terms of N-Grams is a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator, eliminating the need for complex text tokenization, stemming, and/or lemmatization.
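The two points above (a local error disturbs only a few N-grams, and no word separator is needed) can be illustrated with a minimal sketch. The class `NGramCountSketch` and its method `countNGrams` are hypothetical names chosen for this example and are not part of the TCatNG API; the code simply slides a fixed-size window over the raw characters, spaces included.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a TCatNG class.
class NGramCountSketch {

    // Count every character N-gram of length n in the text.
    // The window slides one character at a time, so no tokenizer is involved.
    static Map<String, Integer> countNGrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A one-letter spelling difference changes only the trigrams
        // that overlap the altered character.
        Map<String, Integer> a = countNGrams("categorization", 3);
        Map<String, Integer> b = countNGrams("categorisation", 3);
        long shared = a.keySet().stream().filter(b::containsKey).count();
        // prints "9 of 12 distinct trigrams are shared"
        System.out.println(shared + " of " + a.size() + " distinct trigrams are shared");
    }
}
```

Only the three trigrams that contain the changed letter differ, which is why frequency statistics built from N-grams degrade gracefully on noisy input.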
There are many possible applications: categorizing documents by topic, detecting the author of a text, or recognizing the language and encoding of a sequence of bytes (e.g. in a search engine, to determine the language of a document). Language recognition is the application this software package was originally designed for, but many other uncharted areas are up to you to explore.
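As a rough illustration of the language-recognition use case, the following sketch applies a simplified form of the rank-based "out-of-place" comparison from Cavnar & Trenkle. It is not the TCatNG implementation: the class `RankedProfileSketch`, its methods, the tiny training sentences, and the fixed penalty for unseen N-grams are all assumptions made for this example, and it uses trigrams only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from the TCatNG API.
class RankedProfileSketch {

    // Build a ranked trigram profile: index 0 holds the most frequent trigram.
    static List<String> profile(String text, int maxSize) {
        final Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Descending count; ties broken alphabetically for a deterministic order.
        ranked.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return ranked.subList(0, Math.min(maxSize, ranked.size()));
    }

    // Sum of rank differences; trigrams missing from the category
    // profile cost a fixed penalty (an assumption of this sketch).
    static int outOfPlace(List<String> document, List<String> category, int penalty) {
        Map<String, Integer> rankInCategory = new HashMap<>();
        for (int i = 0; i < category.size(); i++) {
            rankInCategory.put(category.get(i), i);
        }
        int distance = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer rank = rankInCategory.get(document.get(i));
            distance += (rank == null) ? penalty : Math.abs(rank - i);
        }
        return distance;
    }

    // Return the name of the category whose profile is closest to the text.
    static String classify(String text, Map<String, List<String>> categories, int maxSize) {
        List<String> doc = profile(text, maxSize);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : categories.entrySet()) {
            int d = outOfPlace(doc, e.getValue(), maxSize);
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int maxSize = 300; // profile length; an arbitrary choice for this sketch
        Map<String, List<String>> categories = new LinkedHashMap<>();
        categories.put("english",
            profile("the quick brown fox jumps over the lazy dog", maxSize));
        categories.put("portuguese",
            profile("o gato preto salta sobre o muro e corre para casa", maxSize));
        System.out.println(classify("the dog and the fox", categories, maxSize));
    }
}
```

Real category profiles would be built from far larger samples and, in the original technique, from N-grams of several lengths; the one-sentence profiles here only show the mechanics of ranking and comparing.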