pt.tumba.ngram
Class NGramCathegorizer

java.lang.Object
  extended by pt.tumba.ngram.NGramCathegorizer

public class NGramCathegorizer
extends java.lang.Object

NGramCathegorizer implements the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". It was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category, and compare this with the fingerprints of a number of documents of which the categories are known. The categories of the closest matches are output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency.
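The fingerprint idea above can be illustrated with a minimal sketch. It is not this package's implementation: the class name FingerprintSketch, the method fingerprint and its parameters are hypothetical, and it counts raw character n-grams over the whole string without the tokenization and padding described in the paper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of pt.tumba.ngram.
final class FingerprintSketch {

    /**
     * Returns the `size` most frequent character n-grams of length 1..maxN,
     * ordered from most to least frequent.
     */
    static List<String> fingerprint(String text, int maxN, int size) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Descending frequency; ties are broken alphabetically (an assumption
        // made here so that the ordering is deterministic).
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(size, grams.size())));
    }

    public static void main(String[] args) {
        System.out.println(fingerprint("banana", 2, 3)); // prints [a, an, n]
    }
}
```

In "banana", the unigram "a" occurs three times, while "an", "n" and "na" occur twice each, so the three-entry fingerprint keeps "a" followed by the first two of the tied n-grams.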

In the original proposal, fingerprints are compared with a simple out-of-place metric (see the article for more details). This package implements some extensions to the original proposal, namely support for Good-Turing smoothing and new fingerprint comparison methods based on the similarity metrics proposed by Lin in "An information-theoretic definition of similarity" and by Jiang & Conrath in "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy".
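As a rough illustration of the out-of-place metric, the sketch below sums, for every n-gram in a document fingerprint, how far its rank is from the rank of the same n-gram in a category fingerprint. The class and method names are hypothetical, fingerprints are represented as plain ranked lists rather than Profile objects, and the penalty for an n-gram missing from the category fingerprint is assumed to be the length of that fingerprint; deltaRank in this class may differ in these details.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the deltaRank implementation.
final class OutOfPlaceSketch {

    /** Sums the rank displacement of each document n-gram relative to the category. */
    static int outOfPlace(List<String> document, List<String> category) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < category.size(); i++) {
            rank.put(category.get(i), i);
        }
        // Assumed maximum penalty for n-grams absent from the category fingerprint.
        int maxPenalty = category.size();
        int distance = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer r = rank.get(document.get(i));
            distance += (r == null) ? maxPenalty : Math.abs(r - i);
        }
        return distance;
    }
}
```

Identical fingerprints yield a distance of 0, and the value grows as shared n-grams drift apart in rank or disappear from the category fingerprint.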

This library was made with efficiency in mind. There are a couple of parameters you may wish to tweak if you intend to use it for tasks other than language guessing.

Since the running time of the classifier grows roughly linearly with the number of models, you should consider how many models you really need. For instance, in the case of language guessing: do you really want to recognize every language ever invented?

Author:
Bruno Martins

Field Summary
static java.io.FilenameFilter NGramFilter
          A FilenameFilter for filtering directory listings, recognizing filenames for N-gram profiles.
protected  java.util.List profiles
          The list of profiles with the models for classification.
private  int similarityMetric
          The metric used to measure the distance between the profiles.
 
Constructor Summary
NGramCathegorizer()
          Construct an uninitialized categorizer that uses Lin's similarity measure.
NGramCathegorizer(int similarityMetric)
          Construct an uninitialized categorizer that uses a specific similarity measure.
NGramCathegorizer(java.lang.String dirName)
          Construct a categorizer that uses Lin's similarity measure from a directory with model profiles.
NGramCathegorizer(java.lang.String[] fileNames)
          Construct a categorizer that uses Lin's similarity measure from an array of resource file names.
NGramCathegorizer(java.lang.String[] fileNames, int similarityMetric)
          Construct a categorizer that uses a specific similarity measure from an array of resource file names.
NGramCathegorizer(java.lang.String dirName, int similarityMetric)
          Construct a categorizer that uses a specific similarity measure from a directory with model profiles.
 
Method Summary
 void addProfile(Profile prof)
          Add a new Profile to the list of models.
static double deltaRank(Profile prof1, Profile prof2)
          Calculate "the distance" between two profiles, using the metric proposed by Cavnar & Trenkle.
private  void init(java.io.File fi, java.lang.String[] names)
          Fetch the set of profiles from the disk.
static void main(java.lang.String[] args)
          Sample application for using the categorizer from the command line.
 Profile match(Profile prof)
          Match a given Profile against all the profiles constituting the models in the categorizer.
static double profileDistance(Profile prof1, Profile prof2)
          Calculate "the distance" between two profiles, according to the metric selected while instantiating this class.
static double similarityJiang(Profile prof1, Profile prof2)
          Calculate "the distance" between two profiles, using Jiang & Conrath's similarity measure, as proposed in "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy".
static double similarityLin(Profile prof1, Profile prof2)
          Calculate "the distance" between two profiles, using Lin's similarity measure as proposed in "An information-theoretic definition of similarity".
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

similarityMetric

private int similarityMetric
The metric used to measure the distance between the profiles.


NGramFilter

public static java.io.FilenameFilter NGramFilter
A FilenameFilter for filtering directory listings, recognizing filenames for N-gram profiles. Essentially, all filenames not ending with a ".corpus" extension are valid.
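A filter with the behaviour described above could look like the following sketch. The class and constant names are hypothetical, as are the example file names in the usage note; only java.io.FilenameFilter and its accept(File, String) method come from the standard library.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical illustration only; not the NGramFilter field itself.
final class ProfileFilterSketch {

    /** Accepts every filename that does not end with the ".corpus" extension. */
    static final FilenameFilter NOT_CORPUS =
            (File dir, String name) -> !name.endsWith(".corpus");
}
```

Passed to File.list(FilenameFilter), such a filter would keep a name like "english.ngram" and drop "english.corpus".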


profiles

protected java.util.List profiles
The list of profiles with the models for classification.

Constructor Detail

NGramCathegorizer

public NGramCathegorizer()
Construct an uninitialized categorizer that uses Lin's similarity measure.


NGramCathegorizer

public NGramCathegorizer(int similarityMetric)
Construct an uninitialized categorizer that uses a specific similarity measure.

Parameters:
similarityMetric - The similarity metric to be used in the categorizer: 1 for Lin's measure, 2 for Jiang & Conrath's, and 3 for Cavnar & Trenkle's.

NGramCathegorizer

public NGramCathegorizer(java.lang.String dirName)
                  throws TCatNGException,
                         java.io.FileNotFoundException
Construct a categorizer that uses Lin's similarity measure from a directory with model profiles.

Parameters:
dirName - Pathname for the directory with the profiles.
Throws:
TCatNGException - A problem occurred while reading the profiles.
java.io.FileNotFoundException - The pathname was not found.

NGramCathegorizer

public NGramCathegorizer(java.lang.String dirName,
                         int similarityMetric)
                  throws TCatNGException,
                         java.io.FileNotFoundException
Construct a categorizer that uses a specific similarity measure from a directory with model profiles.

Parameters:
dirName - Pathname for the directory with the profiles.
similarityMetric - The similarity metric to be used in the categorizer: 1 for Lin's measure, 2 for Jiang & Conrath's, and 3 for Cavnar & Trenkle's.
Throws:
TCatNGException - A problem occurred while reading the profiles.
java.io.FileNotFoundException - The pathname was not found.

NGramCathegorizer

public NGramCathegorizer(java.lang.String[] fileNames)
                  throws TCatNGException,
                         java.io.FileNotFoundException
Construct a categorizer that uses Lin's similarity measure from an array of resource file names.

Parameters:
fileNames - An array with the pathnames for the model profiles.
Throws:
TCatNGException - A problem occurred while reading the profiles.
java.io.FileNotFoundException - One of the pathnames was not found.

NGramCathegorizer

public NGramCathegorizer(java.lang.String[] fileNames,
                         int similarityMetric)
                  throws TCatNGException,
                         java.io.FileNotFoundException
Construct a categorizer that uses a specific similarity measure from an array of resource file names.

Parameters:
fileNames - An array with the pathnames for the model profiles.
similarityMetric - The similarity metric to be used in the categorizer: 1 for Lin's measure, 2 for Jiang & Conrath's, and 3 for Cavnar & Trenkle's.
Throws:
TCatNGException - A problem occurred while reading the profiles.
java.io.FileNotFoundException - One of the pathnames was not found.
Method Detail

profileDistance

public static double profileDistance(Profile prof1,
                                     Profile prof2)
Calculate "the distance" between two profiles, according to the metric selected while instantiating this class.

Parameters:
prof1 - A Profile
prof2 - Another Profile
Returns:
The distance between the two profiles. The higher the value, the smaller the similarity.

deltaRank

public static double deltaRank(Profile prof1,
                               Profile prof2)
Calculate "the distance" between two profiles, using the metric proposed by Cavnar & Trenkle. See the paper for more details.

Parameters:
prof1 - A Profile
prof2 - Another Profile
Returns:
The distance between the two profiles. The higher the value, the smaller the similarity.

similarityLin

public static double similarityLin(Profile prof1,
                                   Profile prof2)
Calculate "the distance" between two profiles, using Lin's similarity measure as proposed in "An information-theoretic definition of similarity".

Parameters:
prof1 - A Profile
prof2 - Another Profile
Returns:
The distance between the two profiles. The higher the value, the smaller the similarity.

similarityJiang

public static double similarityJiang(Profile prof1,
                                     Profile prof2)
Calculate "the distance" between two profiles, using Jiang & Conrath's similarity measure, as proposed in "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy".

Parameters:
prof1 - A Profile
prof2 - Another Profile
Returns:
The distance between the two profiles. The higher the value, the smaller the similarity.

main

public static void main(java.lang.String[] args)
Sample application for using the categorizer from the command line.

Parameters:
args - The command line arguments, tokenized

addProfile

public void addProfile(Profile prof)
Add a new Profile to the list of models.

Parameters:
prof - The new Profile to be added.

init

private final void init(java.io.File fi,
                        java.lang.String[] names)
                 throws TCatNGException,
                        java.io.FileNotFoundException
Fetch the set of profiles from the disk.

Parameters:
fi - Base directory for the profiles.
names - Filenames of the profiles to fetch.
Throws:
TCatNGException - A problem occurred while reading the profiles.
java.io.FileNotFoundException - One of the pathnames was not found.

match

public Profile match(Profile prof)
Match a given Profile against all the profiles constituting the models in the categorizer.

Parameters:
prof - A Profile.
Returns:
The closest matching Profile in the categorizer.
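Matching against every model amounts to a linear scan that keeps the model with the smallest distance, which is why classification time grows with the number of models. The sketch below shows that scan in generic form. The class ClosestMatchSketch, its closest method and the distance function parameter are hypothetical; it works on any type rather than on Profile, and it is not the code behind match.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical illustration only; not the match implementation.
final class ClosestMatchSketch {

    /** Returns the model with the smallest distance to the query, or null if there are no models. */
    static <P> P closest(P query, List<P> models, ToDoubleBiFunction<P, P> distance) {
        P best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (P model : models) {
            // One distance computation per model: cost is linear in models.size().
            double d = distance.applyAsDouble(query, model);
            if (d < bestDistance) {
                bestDistance = d;
                best = model;
            }
        }
        return best;
    }
}
```

With integers standing in for profiles and absolute difference as the distance, querying 5 against the models 1, 4 and 9 returns 4.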