java.lang.Object
  pt.tumba.ngram.TCatNG
public class TCatNG
The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.
N-Grams are short sequences of bytes or letters, and their statistics provide valuable information about byte sequences and strings. N-Gram based approaches are very powerful in text categorization because every string is decomposed into small parts, so errors such as misspellings tend to affect only a limited number of those parts, leaving the remainder intact.
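The decomposition described above can be illustrated with a minimal sketch. The class `NGramSketch` and its methods `charNGrams` and `sharedNGrams` are hypothetical names chosen for this example; they are not part of the TCatNG API, whose internals are not documented on this page.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: not part of pt.tumba.ngram.
public class NGramSketch {

    /** Returns every contiguous character sequence of length n in text, in order. */
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the distinct N-Grams that two strings have in common. */
    static int sharedNGrams(String a, String b, int n) {
        Set<String> common = new HashSet<>(charNGrams(a, n));
        common.retainAll(new HashSet<>(charNGrams(b, n)));
        return common.size();
    }

    public static void main(String[] args) {
        String correct = "categorization";
        String variant = "categorisation"; // one substituted letter
        System.out.println(charNGrams(correct, 3));
        System.out.println(sharedNGrams(correct, variant, 3) + " trigrams shared");
    }
}
```

Each of the two words yields 12 trigrams, and the single substituted letter changes only the three trigrams that contain it, so 9 trigrams remain shared.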
Character N-Grams also do not require a separator to be specified, explicitly or implicitly, as is necessary for words. Consequently, N-Gram analysis is a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator, eliminating the need for complex text tokenization, stemming, and/or lemmatization.
There are many possible applications: categorizing documents by topic, detecting the author of a text, or recognizing the language and encoding of a sequence of bytes (e.g. in a search engine, to determine the language of a document). The latter is the first application this software package was designed for, but many other uncharted areas are yours to explore.
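As a rough illustration of language recognition with N-Grams, the sketch below labels a text with the category whose trigram profile is most similar to it, using cosine similarity over trigram counts. This is an assumed, simplified technique chosen for the example, not a description of how TCatNG's private `classify*` methods work; the class `NearestProfileSketch`, its methods, and the tiny training strings are all hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a nearest-profile classifier, not the TCatNG implementation.
public class NearestProfileSketch {

    /** Counts how often each character N-Gram of length n occurs in text. */
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two count profiles; 0 when either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the label whose training text has the profile most similar to text. */
    static String classify(String text, Map<String, String> training, int n) {
        Map<String, Integer> target = profile(text, n);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, String> e : training.entrySet()) {
            double score = cosine(target, profile(e.getValue(), n));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical, deliberately tiny training data (label -> sample text).
        Map<String, String> training = new LinkedHashMap<>();
        training.put("english", "the quick brown fox jumps over the lazy dog and the cat");
        training.put("portuguese", "a raposa castanha salta sobre o gato e corre para casa");
        System.out.println(classify("the dog and the fox", training, 3));
    }
}
```

With these two samples, the test sentence shares no trigram with the Portuguese text, so the program prints `english`. Realistic use would need far larger training texts per category.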
| Constructor Summary | |
|---|---|
| TCatNG() | |
| Method Summary | |
|---|---|
| private static java.lang.String[] | classifyCompression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath) |
| private static java.lang.String[] | classifyCompression(java.lang.String[] names, java.lang.String trainingPath) |
| private static java.lang.String[] | classifyCompression(java.lang.String path, java.lang.String trainingPath) |
| private static java.lang.String[] | classifyLogRegression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath) |
| private static java.lang.String[] | classifyLogRegression(java.lang.String[] names, java.lang.String trainingPath) |
| private static java.lang.String[] | classifyLogRegression(java.lang.String path, java.lang.String trainingPath) |
| private static java.lang.String[] | classifyNN(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath) |
| private static java.lang.String[] | classifyNN(java.lang.String[] names, java.lang.String trainingPath) |
| private static java.lang.String[] | classifyNN(java.lang.String path, java.lang.String trainingPath) |
| private static java.lang.String[] | classifySVM(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath) |
| private static java.lang.String[] | classifySVM(java.lang.String[] names, java.lang.String trainingPath) |
| private static java.lang.String[] | classifySVM(java.lang.String path, java.lang.String trainingPath) |
| static void | main(java.lang.String[] args): The main method of this package. |
| private static void | printUsage(): Prints command usage information to the standard output. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public TCatNG()
| Method Detail |
|---|
private static java.lang.String[] classifyNN(java.lang.String path, java.lang.String trainingPath)

private static java.lang.String[] classifyNN(java.lang.String[] names, java.lang.String trainingPath)

private static java.lang.String[] classifyNN(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)

private static java.lang.String[] classifyCompression(java.lang.String path, java.lang.String trainingPath)

private static java.lang.String[] classifyCompression(java.lang.String[] names, java.lang.String trainingPath)

private static java.lang.String[] classifyCompression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)

private static java.lang.String[] classifySVM(java.lang.String path, java.lang.String trainingPath)

private static java.lang.String[] classifySVM(java.lang.String[] names, java.lang.String trainingPath)

private static java.lang.String[] classifySVM(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)

private static java.lang.String[] classifyLogRegression(java.lang.String path, java.lang.String trainingPath)

private static java.lang.String[] classifyLogRegression(java.lang.String[] names, java.lang.String trainingPath)

private static java.lang.String[] classifyLogRegression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)

private static void printUsage()

public static void main(java.lang.String[] args)
  args - The command line arguments, tokenized.