pt.tumba.ngram.compression
Class CompressionCategorizer

java.lang.Object
  extended by pt.tumba.ngram.compression.CompressionCategorizer

public class CompressionCategorizer
extends java.lang.Object

Recent results in bioinformatics and observations about the Kolmogorov complexity seem to suggest that simple classification systems can be built using off-the-shelf compression algorithms. CompressionCathegorizer implements the classification technique described in the papers "Towards Parameter-Free Data Mining" and "The Similarity Metric", respectivelly by Ming Li and Keogh et al.

Essentially, these works claim that the distance between two textual strings can be given by C(xy)/(C(x)+C(y)), where C(x) is the length of string x compressed. This way of measuring similarity can be used for classification through a simple procedure: we keep example documents of which the categories are known in a directory, and for a document with an unknown category we simply compress it together with all the example documents and select the category achieving the best similarity as the classification. This implementation can use both the Zip compression algorithm available with the Java SDK, or a more efficient arithmethic coding compressor.

Author:
Bruno Martins

Field Summary
static java.io.FilenameFilter CompressionFilter
          A FilenameFilter for filtering directory listings, recognizing filenames for class profiles.
protected  java.util.List profiles
          A List of example documents, for which the classes are known.
protected  boolean useDistance
          A boolean flag indicating if we should use the normalized compression distance or the compression dissimilarity.
protected  boolean usePPM
          A boolean flag indicating if we should use PPM compression or Zip compression.
 
Constructor Summary
CompressionCategorizer()
          Construct an uninitialized Cathegorizer.
CompressionCategorizer(java.lang.String dirName)
          Construct an Cathegorizer from a whole Directory of resources.
CompressionCategorizer(java.lang.String[] fileNames)
          Construct an Cathegorizer from a List of resource file names.
 
Method Summary
private  byte[] compress(byte[] b)
          Compress a given byte array.
private  byte[] compress(java.io.File f)
          Compress a given File.
private  byte[] compress(java.lang.String s)
          Compress a given String.
private  double compressionDissimilarity(byte[] a, byte[] b)
          Calculates the compression dissimilarity between two byte arrays.
private static double compressionDissimilarity(byte[] conc, byte[] a, byte[] b)
          Returns the compression dissimilarity.
private  double compressionDissimilarity(java.io.File a, java.io.File b)
          Calculates the compression dissimilarity between two Files.
private  double compressionDissimilarity(java.lang.String a, java.lang.String b)
          Calculates the compression dissimilarity between two Strings.
private static byte[] compressPPM(byte[] b)
          Compress a given byte array with the arithmetic coding algorithm.
private static byte[] compressPPM(java.io.File f)
          Compress a given File with the arithmetic coding algorithm.
private static byte[] compressPPM(java.lang.String s)
          Compress a given String with the arithmetic coding algorithm.
private static byte[] compressZip(byte[] b)
          Compress a given byte array with the Zip algorithm.
private static byte[] compressZip(java.io.File f)
          Compress a given File with the Zip algorithm.
private static byte[] compressZip(java.lang.String s)
          Compress a given String with the Zip algorithm.
private  void init(java.io.File fi, java.lang.String[] names)
          Fetch the set of profiles from the disk.
static void main(java.lang.String[] args)
          Sample application to use the Cathegorizer from the command line.
 java.lang.String match(java.io.File f)
          Match a given File against all the Files constituting the models in the cathegorizer.
private  double normalizedCompressionDistance(byte[] a, byte[] b)
          Calculates the compression dissimilarity between two byte arrays.
private static double normalizedCompressionDistance(byte[] conc, byte[] a, byte[] b)
          Calculates the normalized compression distance
private  double normalizedCompressionDistance(java.io.File a, java.io.File b)
          Calculates the normalized compression distance between two Files.
private  double normalizedCompressionDistance(java.lang.String a, java.lang.String b)
          Calculates the normalized compression distance between two Strings.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CompressionFilter

public static java.io.FilenameFilter CompressionFilter
A FilenameFilter for filtering directory listings, recognizing filenames for class profiles. Essentially, all filenames ending with a ".corpus" extension are valid.


profiles

protected java.util.List profiles
A List of example documents, for which the classes are known.


usePPM

protected boolean usePPM
A boolean flag indicating if we should use PPM compression or Zip compression.


useDistance

protected boolean useDistance
A boolean flag indicating if we should use the normalized compression distance or the compression dissimilarity.

Constructor Detail

CompressionCategorizer

public CompressionCategorizer()
Construct an uninitialized Cathegorizer.


CompressionCategorizer

public CompressionCategorizer(java.lang.String dirName)
                       throws TCatNGException,
                              java.io.FileNotFoundException
Construct an Cathegorizer from a whole Directory of resources.

Parameters:
dirName - Pathname for the directory with the profiles.
Throws:
TCatNGException - A problem occured while reading the profiles.
java.io.FileNotFoundException - The pathname was not found.

CompressionCategorizer

public CompressionCategorizer(java.lang.String[] fileNames)
                       throws TCatNGException,
                              java.io.FileNotFoundException
Construct an Cathegorizer from a List of resource file names.

Parameters:
fileNames - An array with the pathnames for the profiles.
Throws:
TCatNGException - A problem occured while reading the profiles.
java.io.FileNotFoundException - One of the pathnames was not found.
Method Detail

init

private final void init(java.io.File fi,
                        java.lang.String[] names)
                 throws TCatNGException,
                        java.io.FileNotFoundException
Fetch the set of profiles from the disk.

Parameters:
fi - Base directory for the profiles.
names - Filenames of the profiles to fetch.
Throws:
TCatNGException - A problem occured while reading the profiles.
java.io.FileNotFoundException - One of the pathnames was not found.

compress

private byte[] compress(byte[] b)
Compress a given byte array.

Parameters:
b - A byte array.
Returns:
A byte array with the compressed data.

compress

private byte[] compress(java.lang.String s)
Compress a given String.

Parameters:
b - A String.
Returns:
A byte array with the compressed data.

compress

private byte[] compress(java.io.File f)
Compress a given File.

Parameters:
b - A File.
Returns:
A byte array with the compressed data.

compressZip

private static byte[] compressZip(java.lang.String s)
Compress a given String with the Zip algorithm.

Parameters:
b - A String.
Returns:
A byte array with the compressed data.

compressZip

private static byte[] compressZip(byte[] b)
Compress a given byte array with the Zip algorithm.

Parameters:
b - A byte array.
Returns:
A byte array with the compressed data.

compressZip

private static byte[] compressZip(java.io.File f)
Compress a given File with the Zip algorithm.

Parameters:
b - A File.
Returns:
A byte array with the compressed data.

compressPPM

private static byte[] compressPPM(java.lang.String s)
Compress a given String with the arithmetic coding algorithm.

Parameters:
b - A String.
Returns:
A byte array with the compressed data.

compressPPM

private static byte[] compressPPM(byte[] b)
Compress a given byte array with the arithmetic coding algorithm.

Parameters:
b - A byte array.
Returns:
A byte array with the compressed data.

compressPPM

private static byte[] compressPPM(java.io.File f)
Compress a given File with the arithmetic coding algorithm.

Parameters:
b - A File.
Returns:
A byte array with the compressed data.

normalizedCompressionDistance

private static double normalizedCompressionDistance(byte[] conc,
                                                    byte[] a,
                                                    byte[] b)
Calculates the normalized compression distance

Parameters:
-

compressionDissimilarity

private static double compressionDissimilarity(byte[] conc,
                                               byte[] a,
                                               byte[] b)
Returns the compression dissimilarity.

Parameters:
-

normalizedCompressionDistance

private double normalizedCompressionDistance(java.lang.String a,
                                             java.lang.String b)
Calculates the normalized compression distance between two Strings.

Parameters:
a - A String.
b - Another String.
Returns:
The normalized compression distance between the two Strings.

compressionDissimilarity

private double compressionDissimilarity(java.lang.String a,
                                        java.lang.String b)
Calculates the compression dissimilarity between two Strings.

Parameters:
a - A String.
b - Another String.
Returns:
The compression dissimilarity between the two Strings.

normalizedCompressionDistance

private double normalizedCompressionDistance(java.io.File a,
                                             java.io.File b)
Calculates the normalized compression distance between two Files.

Parameters:
a - A File.
b - Another File.
Returns:
The normalized compression distance between the two Files.

compressionDissimilarity

private double compressionDissimilarity(java.io.File a,
                                        java.io.File b)
Calculates the compression dissimilarity between two Files.

Parameters:
a - A File.
b - Another File.
Returns:
The compression dissimilarity between the two Files.

normalizedCompressionDistance

private double normalizedCompressionDistance(byte[] a,
                                             byte[] b)
Calculates the compression dissimilarity between two byte arrays.

Parameters:
a - A byte array.
b - Another byte array.
Returns:
The compression dissimilarity between the two byte arrays.

compressionDissimilarity

private double compressionDissimilarity(byte[] a,
                                        byte[] b)
Calculates the compression dissimilarity between two byte arrays.

Parameters:
a - A byte array.
b - Another byte array.
Returns:
The compression dissimilarity between the two byte arrays.

match

public java.lang.String match(java.io.File f)
Match a given File against all the Files constituting the models in the cathegorizer.

Parameters:
f - A File.
Returns:
The closest matching class (given by the File name) in the cathegorizer.

main

public static void main(java.lang.String[] args)
Sample application to use the Cathegorizer from the command line.

Parameters:
args - The command line arguments, tokenized