java.lang.Object
  pt.tumba.ngram.TCatNG
public class TCatNG
The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.
N-Grams are short sequences of bytes or letters, and their statistics provide valuable information about byte sequences and strings. N-Gram based approaches are very powerful in text categorization because every string is decomposed into small parts, and errors tend to affect only a limited number of those parts, leaving the remainder intact.
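The decomposition described above can be illustrated with a minimal sketch. The class `NGramSketch` and its methods are hypothetical names introduced for this example and are not part of the TCatNG API; the sketch only assumes character N-Grams taken as a sliding window over a string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of pt.tumba.ngram.
final class NGramSketch {

    /** Returns every contiguous character sequence of length n, in order. */
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the distinct N-Grams that occur in both strings. */
    static int sharedNGrams(String a, String b, int n) {
        Set<String> shared = new HashSet<>(charNGrams(a, n));
        shared.retainAll(new HashSet<>(charNGrams(b, n)));
        return shared.size();
    }

    public static void main(String[] args) {
        // 14 characters yield 12 trigrams.
        System.out.println(charNGrams("categorization", 3));
        // A one-letter misspelling changes only the trigrams that overlap it.
        System.out.println(sharedNGrams("categorization", "catagorization", 3)); // prints 9
    }
}
```

In this example a single wrong letter alters 3 of the 12 trigrams, so 9 still match, which is the robustness to errors that the paragraph above refers to.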
The use of character N-Grams also does not explicitly or implicitly require the specification of a separator, as is necessary for words. Consequently, analyzing a text in terms of N-Grams constitutes a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator, eliminating the need for complex text tokenization, stemming, and/or lemmatization.
There are many possible applications: categorizing documents by topic, detecting the author of a text, or recognizing the language and encoding of a sequence of bytes (e.g. in a search engine, to determine the language of a document). The latter is actually the first application this software package was designed for, but many other uncharted areas remain for you to explore.
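As a rough illustration of the language-recognition use case, the sketch below compares trigram frequency profiles with cosine similarity and picks the closest sample. This is an assumption made for the example, not a description of how TCatNG classifies text; the class `LanguageGuessSketch`, its methods, and the two tiny sample strings are all hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of pt.tumba.ngram.
final class LanguageGuessSketch {

    /** Builds a frequency profile of the character N-Grams in the text. */
    static Map<String, Integer> profile(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two profiles; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double va = e.getValue();
            normA += va * va;
            Integer vb = b.get(e.getKey());
            if (vb != null) {
                dot += va * vb;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the label of the sample whose profile is most similar to the text. */
    static String guess(String text, Map<String, String> samples, int n) {
        Map<String, Integer> target = profile(text, n);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> e : samples.entrySet()) {
            double score = cosine(target, profile(e.getValue(), n));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training samples; a real system would use far larger corpora.
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("english", "the quick brown fox jumps over the lazy dog and the cat");
        samples.put("portuguese", "o rato roeu a roupa do rei de roma e a rainha");
        System.out.println(guess("the dog and the fox", samples, 3)); // prints english
    }
}
```

No tokenization or stemming is involved: the text is treated purely as a character sequence, which is why the same code applies unchanged to any alphabetic language.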
Constructor Summary | |
---|---|
TCatNG() |
Method Summary | |
---|---|
private static java.lang.String[] | classifyCompression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath) |
private static java.lang.String[] | classifyCompression(java.lang.String[] names, java.lang.String trainingPath) |
private static java.lang.String[] | classifyCompression(java.lang.String path, java.lang.String trainingPath) |
private static java.lang.String[] | classifyLogRegression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath) |
private static java.lang.String[] | classifyLogRegression(java.lang.String[] names, java.lang.String trainingPath) |
private static java.lang.String[] | classifyLogRegression(java.lang.String path, java.lang.String trainingPath) |
private static java.lang.String[] | classifyNN(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath) |
private static java.lang.String[] | classifyNN(java.lang.String[] names, java.lang.String trainingPath) |
private static java.lang.String[] | classifyNN(java.lang.String path, java.lang.String trainingPath) |
private static java.lang.String[] | classifySVM(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath) |
private static java.lang.String[] | classifySVM(java.lang.String[] names, java.lang.String trainingPath) |
private static java.lang.String[] | classifySVM(java.lang.String path, java.lang.String trainingPath) |
static void | main(java.lang.String[] args) - The main method of this package. |
private static void | printUsage() - Prints command usage information to the standard output. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public TCatNG()
Method Detail |
---|
private static java.lang.String[] classifyNN(java.lang.String path, java.lang.String trainingPath)
private static java.lang.String[] classifyNN(java.lang.String[] names, java.lang.String trainingPath)
private static java.lang.String[] classifyNN(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)
private static java.lang.String[] classifyCompression(java.lang.String path, java.lang.String trainingPath)
private static java.lang.String[] classifyCompression(java.lang.String[] names, java.lang.String trainingPath)
private static java.lang.String[] classifyCompression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)
private static java.lang.String[] classifySVM(java.lang.String path, java.lang.String trainingPath)
private static java.lang.String[] classifySVM(java.lang.String[] names, java.lang.String trainingPath)
private static java.lang.String[] classifySVM(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)
private static java.lang.String[] classifyLogRegression(java.lang.String path, java.lang.String trainingPath)
private static java.lang.String[] classifyLogRegression(java.lang.String[] names, java.lang.String trainingPath)
private static java.lang.String[] classifyLogRegression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)
private static void printUsage()
public static void main(java.lang.String[] args)
Parameters:
args - The command line arguments, tokenized.