pt.tumba.ngram
Class TCatNG

java.lang.Object
  extended by pt.tumba.ngram.TCatNG

public class TCatNG
extends java.lang.Object

The TCatNG Toolkit is a Java package that you can use to apply N-Gram analysis techniques to the process of categorizing text files.

N-Grams are short sequences of bytes or letters, and their statistics provide valuable information about byte sequences and strings. N-Gram based approaches are very powerful in text categorization because every string is decomposed into small parts, so errors tend to affect only a limited number of those parts and leave the remainder intact.
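
As a rough illustration of the error-tolerance argument above, the following sketch extracts character trigrams from a word and from a variant spelling of it, then counts how many trigrams the two share. The class and method names (NGramOverlapSketch, charNGrams, sharedNGrams) are hypothetical and are not part of the TCatNG API; this is only a minimal sketch of the general idea, not the toolkit's implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration only; not part of pt.tumba.ngram.
final class NGramOverlapSketch {

    /** Returns the distinct character n-grams found by sliding a window of length n over the text. */
    static Set<String> charNGrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the distinct n-grams that appear in both strings. */
    static int sharedNGrams(String a, String b, int n) {
        Set<String> common = charNGrams(a, n);
        common.retainAll(charNGrams(b, n));
        return common.size();
    }

    public static void main(String[] args) {
        String original = "categorization";
        String variant = "categorisation"; // differs in a single character

        int total = charNGrams(original, 3).size();
        int shared = sharedNGrams(original, variant, 3);

        // A single changed character only disturbs the trigrams that overlap it.
        System.out.println(shared + " of " + total + " trigrams survive"); // prints "9 of 12 trigrams survive"
    }
}
```

Only the three trigrams that overlap the changed character differ, so a classifier working on trigram statistics still sees most of the original evidence.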

The use of character N-Grams also does not require the specification of a separator, explicitly or implicitly, as is necessary for words. Consequently, analyzing a text in terms of N-Grams is a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator, and it eliminates the need for complex text tokenization, stemming, and/or lemmatization.

There are many possible applications: categorizing documents by topic, detecting the author of a text, or recognizing the language and encoding of a sequence of bytes (e.g. in a search engine, to determine the language of a document). Language and encoding recognition is the first application this software package was designed for, but many other areas remain open for you to explore.
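
To make the language-recognition use case more concrete, here is a minimal sketch of one common way to categorize text with N-Grams: each category is summarized as a ranked profile of its most frequent character trigrams, and a document is assigned to the category whose ranking differs least from its own (a rank-order, or "out-of-place", comparison). This is an assumption-laden illustration: the class ProfileLanguageGuesser, its methods, the trigram length, the profile size, and the tiny training strings are all hypothetical, and the classifiers inside TCatNG may work quite differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration only; not part of pt.tumba.ngram.
final class ProfileLanguageGuesser {

    private static final int N = 3;              // n-gram length (assumed value)
    private static final int PROFILE_SIZE = 300; // number of top n-grams kept (assumed value)

    /** Maps each of the most frequent character n-grams to its rank (0 = most frequent). */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Slide the window over the raw text; no tokenizer or separator is needed.
        for (int i = 0; i + N <= text.length(); i++) {
            counts.merge(text.substring(i, i + N), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties are broken alphabetically so ranks are deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int r = 0; r < entries.size() && r < PROFILE_SIZE; r++) {
            ranks.put(entries.get(r).getKey(), r);
        }
        return ranks;
    }

    /** Sums, over the document's n-grams, the rank difference against the category profile. */
    static int outOfPlace(Map<String, Integer> doc, Map<String, Integer> category) {
        int distance = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = category.get(e.getKey());
            // N-grams missing from the category profile receive the maximum penalty.
            distance += (rank == null) ? PROFILE_SIZE : Math.abs(rank - e.getValue());
        }
        return distance;
    }

    /** Returns the label of the closest training profile, or null if no categories are given. */
    static String guess(String text, Map<String, String> trainingTexts) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, String> e : trainingTexts.entrySet()) {
            int d = outOfPlace(doc, profile(e.getValue()));
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training data; a real profile would be built from far larger samples.
        Map<String, String> training = new LinkedHashMap<>();
        training.put("english", "the cat and the dog");
        training.put("portuguese", "o gato e o cao");

        System.out.println(guess("the dog", training)); // prints "english"
    }
}
```

The same profile comparison would apply to topic or authorship categories if the training texts were grouped that way instead of by language.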

Author:
Bruno Martins

Constructor Summary
TCatNG()
           
 
Method Summary
private static java.lang.String[] classifyCompression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)
           
private static java.lang.String[] classifyCompression(java.lang.String[] names, java.lang.String trainingPath)
           
private static java.lang.String[] classifyCompression(java.lang.String path, java.lang.String trainingPath)
           
private static java.lang.String[] classifyLogRegression(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)
           
private static java.lang.String[] classifyLogRegression(java.lang.String[] names, java.lang.String trainingPath)
           
private static java.lang.String[] classifyLogRegression(java.lang.String path, java.lang.String trainingPath)
           
private static java.lang.String[] classifyNN(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)
           
private static java.lang.String[] classifyNN(java.lang.String[] names, java.lang.String trainingPath)
           
private static java.lang.String[] classifyNN(java.lang.String path, java.lang.String trainingPath)
           
private static java.lang.String[] classifySVM(java.io.File fi, java.lang.String[] names, java.lang.String trainingPath)
           
private static java.lang.String[] classifySVM(java.lang.String[] names, java.lang.String trainingPath)
           
private static java.lang.String[] classifySVM(java.lang.String path, java.lang.String trainingPath)
           
static void main(java.lang.String[] args)
          The main method of this package.
private static void printUsage()
          Prints command usage information to the standard output.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TCatNG

public TCatNG()

Method Detail

classifyNN

private static java.lang.String[] classifyNN(java.lang.String path,
                                             java.lang.String trainingPath)

classifyNN

private static java.lang.String[] classifyNN(java.lang.String[] names,
                                             java.lang.String trainingPath)

classifyNN

private static java.lang.String[] classifyNN(java.io.File fi,
                                             java.lang.String[] names,
                                             java.lang.String trainingPath)

classifyCompression

private static java.lang.String[] classifyCompression(java.lang.String path,
                                                      java.lang.String trainingPath)

classifyCompression

private static java.lang.String[] classifyCompression(java.lang.String[] names,
                                                      java.lang.String trainingPath)

classifyCompression

private static java.lang.String[] classifyCompression(java.io.File fi,
                                                      java.lang.String[] names,
                                                      java.lang.String trainingPath)

classifySVM

private static java.lang.String[] classifySVM(java.lang.String path,
                                              java.lang.String trainingPath)

classifySVM

private static java.lang.String[] classifySVM(java.lang.String[] names,
                                              java.lang.String trainingPath)

classifySVM

private static java.lang.String[] classifySVM(java.io.File fi,
                                              java.lang.String[] names,
                                              java.lang.String trainingPath)

classifyLogRegression

private static java.lang.String[] classifyLogRegression(java.lang.String path,
                                                        java.lang.String trainingPath)

classifyLogRegression

private static java.lang.String[] classifyLogRegression(java.lang.String[] names,
                                                        java.lang.String trainingPath)

classifyLogRegression

private static java.lang.String[] classifyLogRegression(java.io.File fi,
                                                        java.lang.String[] names,
                                                        java.lang.String trainingPath)

printUsage

private static void printUsage()
Prints command usage information to the standard output.


main

public static void main(java.lang.String[] args)
The main method of this package.

Parameters:
args - The command line arguments, tokenized.