Interface Summary | |
---|---|
Profile | Abstract interface to model an N-Gram Profile. |
Class Summary | |
---|---|
DataProfile | A Profile stores N-gram frequency information for a given textual string. |
EntryProfile | A Profile stores N-gram frequency information for a given textual string. |
LanguageClass | Understanding the language of a given document is an essential step in working with unstructured multilingual text. |
NGram | This class models a concrete and simple N-Gram. |
NGramCathegorizer | NGramCathegorizer implements the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". |
NGramConstants | Constant values used in the TCatNG package. |
ProfileReader | Class to hold (static) methods to read in profile data. |
ProfileWriter | Write an N-Gram profile to disk. |
TCatNG | The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files. |
Exception Summary | |
---|---|
TCatNGException | Wrapper for Exceptions used in the TCatNG package. |
The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques
to the process of categorizing text files.
N-Grams are short sequences of bytes or letters, and their statistics provide valuable information
about byte sequences and strings. N-Gram-based approaches are very powerful in text
categorization because every string is decomposed into small parts, so errors tend to affect only a
limited number of those parts, leaving the remainder intact.
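The error-tolerance argument above can be illustrated with a minimal sketch. The class `NGramSketch` and its methods are hypothetical names chosen for this example and are not part of the TCatNG API. The sketch counts character trigrams in a string and checks how many of them survive a one-letter spelling change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of the TCatNG package.
public class NGramSketch {

    // Count every character N-gram of length n found in the text.
    static Map<String, Integer> countNGrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Number of distinct N-grams of 'a' that also occur in 'b'.
    static int shared(Map<String, Integer> a, Map<String, Integer> b) {
        int count = 0;
        for (String gram : a.keySet()) {
            if (b.containsKey(gram)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, Integer> clean = countNGrams("categorization", 3);
        Map<String, Integer> typo = countNGrams("categorisation", 3);
        System.out.println(shared(clean, typo) + " of " + clean.size()
                + " distinct trigrams survive the spelling change");
    }
}
```

With these inputs, 9 of the 12 distinct trigrams of "categorization" also occur in "categorisation": the changed letter disturbs only the three trigrams that contain it.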
Character N-Grams also do not require a separator to be specified, either explicitly or
implicitly, as word-based analysis does. Consequently, analyzing a text in terms of N-Grams
is a valuable approach for text written in any language based on an alphabet and the
concatenation text-construction operator, and it eliminates the need for complex text tokenization,
stemming, or lemmatization.
There are many possible applications: categorizing documents by topic, detecting the author
of a text, or recognizing the language and encoding of a sequence of bytes (e.g., in a search engine,
to determine the language of a document). Language recognition is the first application this software
package was designed for, but many other uncharted areas are left for you to explore.
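For language recognition, the Cavnar & Trenkle paper cited in the class summary ranks the N-grams of a text by frequency and compares that ranking with one profile per category using an "out-of-place" distance. The following is a rough sketch of that idea only; it is not the implementation of `NGramCathegorizer`. The class name `OutOfPlaceSketch`, its method names, the alphabetical tie-breaking, and the use of the category profile size as the penalty for a missing N-gram are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the TCatNG package.
public class OutOfPlaceSketch {

    // Build a profile: N-grams of length 1..maxN, most frequent first,
    // truncated to at most 'limit' entries.
    static List<String> rankedProfile(String text, int maxN, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Descending frequency; ties broken alphabetically (an assumption,
        // made here only to keep the ordering deterministic).
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return grams.size() > limit ? new ArrayList<>(grams.subList(0, limit)) : grams;
    }

    // Sum of rank differences between the document and category profiles.
    // An N-gram missing from the category costs a fixed maximum penalty
    // (assumed here to be the category profile size).
    static int outOfPlace(List<String> doc, List<String> category) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < category.size(); i++) {
            rank.put(category.get(i), i);
        }
        int maxPenalty = category.size();
        int distance = 0;
        for (int i = 0; i < doc.size(); i++) {
            Integer r = rank.get(doc.get(i));
            distance += (r == null) ? maxPenalty : Math.abs(r - i);
        }
        return distance;
    }

    // Return the label of the category profile closest to the text,
    // or null if no categories are given.
    static String classify(String text, Map<String, List<String>> categories,
                           int maxN, int limit) {
        List<String> doc = rankedProfile(text, maxN, limit);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : categories.entrySet()) {
            int d = outOfPlace(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

In practice each category profile would be built from a sizeable training text per language, and the unknown document would be assigned to the category with the smallest distance.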