| Interface Summary | |
|---|---|
| Profile | Abstract interface to model an N-Gram Profile. |

| Class Summary | |
|---|---|
| DataProfile | A Profile stores N-gram frequency information for a given textual string. |
| EntryProfile | A Profile stores N-gram frequency information for a given textual string. |
| LanguageClass | Understanding the language of a given document is an essential step in working with unstructured multilingual text. |
| NGram | This class models a concrete and simple N-Gram. |
| NGramCathegorizer | NGramCathegorizer implements the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". |
| NGramConstants | Constant values used in the TCatNG package. |
| ProfileReader | Class to hold (static) methods to read in profile data. |
| ProfileWriter | Write an N-Gram profile to disk. |
| TCatNG | The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files. |

| Exception Summary | |
|---|---|
| TCatNGException | Wrapper for Exceptions used in the TCatNG package. |

The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques
to the process of categorizing text files.
N-Grams are short sequences of bytes or letters, and their statistics provide valuable information
about byte sequences and strings. N-Gram-based approaches work well for text categorization
because every string is decomposed into small parts, so an error tends to affect only a
limited number of those parts and leaves the remainder intact.
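The decomposition described above can be illustrated with a minimal sketch. The class `NGramOverlapSketch` and its methods are hypothetical names chosen for this example; they are not part of the TCatNG API. The sketch slides a fixed-size window over a string and counts how many distinct N-grams two strings have in common.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative helper only; not part of the TCatNG API.
final class NGramOverlapSketch {

    // Slide a window of n characters over the text, one position at a time.
    static List<String> extract(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Number of distinct N-grams that occur in both strings.
    static int sharedDistinct(String a, String b, int n) {
        Set<String> shared = new HashSet<>(extract(a, n));
        shared.retainAll(new HashSet<>(extract(b, n)));
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(extract("text", 2)); // prints [te, ex, xt]
        // One wrong letter changes only the bigrams that overlap it.
        System.out.println(sharedDistinct("language", "lanquage", 2)); // prints 5
    }
}
```

In this example "language" has seven distinct bigrams, and the misspelling "lanquage" still shares five of them: only the two bigrams that contain the wrong letter are lost.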
Character N-Grams also do not require a separator to be specified, explicitly or implicitly,
as is necessary for words. Consequently, N-Gram analysis is a valuable approach for text
written in any language based on an alphabet and the concatenation text-construction operator,
and it eliminates the need for complex text tokenization, stemming, and/or lemmatization.
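As a rough illustration of working without a tokenizer, the sketch below counts character N-grams directly on the raw string and ranks them by frequency. Whitespace is treated like any other character. The class `RawTextProfileSketch` and its methods are assumptions made for this example and do not reflect how `DataProfile` or `EntryProfile` are actually implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative helper only; not part of the TCatNG API.
final class RawTextProfileSketch {

    // Count every character N-gram in the raw text. Spaces and punctuation
    // are ordinary characters here, so no word separator has to be defined.
    static Map<String, Integer> count(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // N-grams ordered by descending frequency; ties are broken
    // alphabetically so that the result is deterministic.
    static List<String> ranked(String text, int n) {
        Map<String, Integer> counts = count(text, n);
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ranked("banana", 2)); // prints [an, na, ba]
        System.out.println(count("to be or not to be", 2));
    }
}
```

Because the window moves over characters rather than words, the same code applies unchanged to text in any alphabet-based language.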
There are many possible applications: categorizing documents by topic, detecting the author
of a text, or recognizing the language and encoding of a sequence of bytes (e.g. in a search
engine, to determine the language of a document). Language and encoding recognition is the
application this software package was first designed for, but many other uncharted areas are
left for you to explore.
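For language recognition, one way to compare ranked N-gram profiles is the "out-of-place" measure from Cavnar & Trenkle's paper. The following is a minimal sketch of that idea, not the code of `NGramCathegorizer`. The class `OutOfPlaceSketch`, its method names, the `missingPenalty` parameter, and the tiny training strings are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not TCatNG API.
final class OutOfPlaceSketch {

    // Ranked character N-grams: most frequent first, ties alphabetical.
    static List<String> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return grams;
    }

    // Sum of rank differences between the two profiles. An N-gram that is
    // missing from the category profile is charged missingPenalty (assumed).
    static int distance(List<String> document, List<String> category, int missingPenalty) {
        Map<String, Integer> rankInCategory = new HashMap<>();
        for (int i = 0; i < category.size(); i++) {
            rankInCategory.put(category.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer rank = rankInCategory.get(document.get(i));
            total += (rank == null) ? missingPenalty : Math.abs(rank - i);
        }
        return total;
    }

    // Name of the category whose profile is closest to the document profile.
    static String classify(String text, Map<String, String> trainingTexts, int n, int missingPenalty) {
        List<String> document = profile(text, n);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> entry : trainingTexts.entrySet()) {
            int d = distance(document, profile(entry.getValue(), n), missingPenalty);
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training data; real profiles need far more text per language.
        Map<String, String> training = new HashMap<>();
        training.put("english", "the quick brown fox jumps over the lazy dog");
        training.put("italian", "la volpe veloce salta sopra il cane pigro");
        System.out.println(classify("the dog and the fox", training, 2, 100));
    }
}
```

The value used for `missingPenalty` and the length of the profiles both influence the result, so they would need tuning on real data.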