java.lang.Object pt.tumba.ngram.compression.CompressionCategorizer
public class CompressionCategorizer
Recent results in bioinformatics and observations about Kolmogorov complexity suggest that simple classification systems can be built using off-the-shelf compression algorithms.
CompressionCategorizer implements the classification technique described in the papers "Towards Parameter-Free Data Mining" and "The Similarity Metric", by Keogh et al. and Li et al., respectively.
Essentially, these works claim that the dissimilarity between two textual strings can be given by C(xy)/(C(x)+C(y)), where C(x) is the length of string x after compression and xy is the concatenation of x and y. This measure can be used for classification through a simple procedure: we keep example documents whose categories are known in a directory, and for a document with an unknown category we compress it together with each of the example documents and select the category of the closest example (the one with the lowest dissimilarity) as the classification. This implementation can use either the Zip compression algorithm available with the Java SDK or a more efficient arithmetic coding (PPM) compressor.
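The dissimilarity formula described above can be sketched in a few lines of Java. This is a minimal illustration and not the code of this class: the class name CompressionDissimilaritySketch and its method names are hypothetical, and java.util.zip.Deflater is assumed as a stand-in for the Zip compressor.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical illustration class; not part of the documented API.
final class CompressionDissimilaritySketch {

    /** Returns C(input): the number of bytes produced by DEFLATE compression. */
    static int compressedLength(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // only the byte count matters here
        }
        deflater.end();
        return total;
    }

    /** Computes C(xy) / (C(x) + C(y)); lower values indicate more shared structure. */
    static double dissimilarity(byte[] x, byte[] y) {
        byte[] xy = new byte[x.length + y.length];
        System.arraycopy(x, 0, xy, 0, x.length);
        System.arraycopy(y, 0, xy, x.length, y.length);
        return (double) compressedLength(xy)
                / (compressedLength(x) + compressedLength(y));
    }

    public static void main(String[] args) {
        byte[] a = "The quick brown fox jumps over the lazy dog while the cat sleeps."
                .getBytes(StandardCharsets.UTF_8);
        byte[] b = "Stock markets fell sharply on Tuesday as interest rates changed."
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("d(a, a) = " + dissimilarity(a, a));
        System.out.println("d(a, b) = " + dissimilarity(a, b));
    }
}
```

A string compared with itself should score noticeably lower than two unrelated strings, because the compressor can encode the second copy as references to the first.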
Field Summary | |
---|---|
static java.io.FilenameFilter | CompressionFilter - A FilenameFilter for filtering directory listings, recognizing filenames for class profiles. |
protected java.util.List | profiles - A List of example documents, for which the classes are known. |
protected boolean | useDistance - A boolean flag indicating if we should use the normalized compression distance or the compression dissimilarity. |
protected boolean | usePPM - A boolean flag indicating if we should use PPM compression or Zip compression. |
Constructor Summary | |
---|---|
CompressionCategorizer() - Construct an uninitialized Categorizer. | |
CompressionCategorizer(java.lang.String dirName) - Construct a Categorizer from a whole directory of resources. | |
CompressionCategorizer(java.lang.String[] fileNames) - Construct a Categorizer from a list of resource file names. | |
Method Summary | |
---|---|
private byte[] | compress(byte[] b) - Compress a given byte array. |
private byte[] | compress(java.io.File f) - Compress a given File. |
private byte[] | compress(java.lang.String s) - Compress a given String. |
private double | compressionDissimilarity(byte[] a, byte[] b) - Calculates the compression dissimilarity between two byte arrays. |
private static double | compressionDissimilarity(byte[] conc, byte[] a, byte[] b) - Returns the compression dissimilarity. |
private double | compressionDissimilarity(java.io.File a, java.io.File b) - Calculates the compression dissimilarity between two Files. |
private double | compressionDissimilarity(java.lang.String a, java.lang.String b) - Calculates the compression dissimilarity between two Strings. |
private static byte[] | compressPPM(byte[] b) - Compress a given byte array with the arithmetic coding algorithm. |
private static byte[] | compressPPM(java.io.File f) - Compress a given File with the arithmetic coding algorithm. |
private static byte[] | compressPPM(java.lang.String s) - Compress a given String with the arithmetic coding algorithm. |
private static byte[] | compressZip(byte[] b) - Compress a given byte array with the Zip algorithm. |
private static byte[] | compressZip(java.io.File f) - Compress a given File with the Zip algorithm. |
private static byte[] | compressZip(java.lang.String s) - Compress a given String with the Zip algorithm. |
private void | init(java.io.File fi, java.lang.String[] names) - Fetch the set of profiles from the disk. |
static void | main(java.lang.String[] args) - Sample application to use the Categorizer from the command line. |
java.lang.String | match(java.io.File f) - Match a given File against all the Files constituting the models in the categorizer. |
private double | normalizedCompressionDistance(byte[] a, byte[] b) - Calculates the normalized compression distance between two byte arrays. |
private static double | normalizedCompressionDistance(byte[] conc, byte[] a, byte[] b) - Calculates the normalized compression distance. |
private double | normalizedCompressionDistance(java.io.File a, java.io.File b) - Calculates the normalized compression distance between two Files. |
private double | normalizedCompressionDistance(java.lang.String a, java.lang.String b) - Calculates the normalized compression distance between two Strings. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static java.io.FilenameFilter CompressionFilter
A FilenameFilter for filtering directory listings, recognizing filenames for class profiles. Essentially, all filenames ending with a ".corpus" extension are valid.
protected java.util.List profiles
protected boolean usePPM
protected boolean useDistance
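The behaviour described for CompressionFilter (accepting filenames that end with ".corpus") could look roughly like the following. This is a sketch under assumptions: the class name CorpusFilterSketch and the constant CORPUS_FILTER are hypothetical, and the match is assumed to be case-sensitive, which the documentation does not state.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical illustration class; not part of the documented API.
final class CorpusFilterSketch {

    /** Accepts any filename ending in ".corpus" (case-sensitive by assumption). */
    static final FilenameFilter CORPUS_FILTER = new FilenameFilter() {
        @Override
        public boolean accept(File dir, String name) {
            return name.endsWith(".corpus");
        }
    };

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list(CORPUS_FILTER); // null if dir is not a readable directory
        if (names == null) {
            System.out.println("Not a readable directory: " + dir);
            return;
        }
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```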
Constructor Detail |
---|
public CompressionCategorizer()
public CompressionCategorizer(java.lang.String dirName) throws TCatNGException, java.io.FileNotFoundException
dirName - Pathname for the directory with the profiles.
TCatNGException - A problem occurred while reading the profiles.
java.io.FileNotFoundException - The pathname was not found.
public CompressionCategorizer(java.lang.String[] fileNames) throws TCatNGException, java.io.FileNotFoundException
fileNames - An array with the pathnames for the profiles.
TCatNGException - A problem occurred while reading the profiles.
java.io.FileNotFoundException - One of the pathnames was not found.
Method Detail |
---|
private final void init(java.io.File fi, java.lang.String[] names) throws TCatNGException, java.io.FileNotFoundException
fi - Base directory for the profiles.
names - Filenames of the profiles to fetch.
TCatNGException - A problem occurred while reading the profiles.
java.io.FileNotFoundException - One of the pathnames was not found.
private byte[] compress(byte[] b)
b - A byte array.
private byte[] compress(java.lang.String s)
s - A String.
private byte[] compress(java.io.File f)
f - A File.
private static byte[] compressZip(java.lang.String s)
s - A String.
private static byte[] compressZip(byte[] b)
b - A byte array.
private static byte[] compressZip(java.io.File f)
f - A File.
private static byte[] compressPPM(java.lang.String s)
s - A String.
private static byte[] compressPPM(byte[] b)
b - A byte array.
private static byte[] compressPPM(java.io.File f)
f - A File.
private static double normalizedCompressionDistance(byte[] conc, byte[] a, byte[] b)
private static double compressionDissimilarity(byte[] conc, byte[] a, byte[] b)
private double normalizedCompressionDistance(java.lang.String a, java.lang.String b)
a - A String.
b - Another String.
private double compressionDissimilarity(java.lang.String a, java.lang.String b)
a - A String.
b - Another String.
private double normalizedCompressionDistance(java.io.File a, java.io.File b)
a - A File.
b - Another File.
private double compressionDissimilarity(java.io.File a, java.io.File b)
a - A File.
b - Another File.
private double normalizedCompressionDistance(byte[] a, byte[] b)
a - A byte array.
b - Another byte array.
private double compressionDissimilarity(byte[] a, byte[] b)
a - A byte array.
b - Another byte array.
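For readers unfamiliar with the normalized compression distance, it is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The sketch below illustrates that definition only. The class NcdSketch and its methods are hypothetical, java.util.zip.Deflater is assumed as the compressor, and the three-argument overload takes compressed lengths, which is an assumption and not necessarily how the conc, a and b parameters above are used.

```java
import java.util.zip.Deflater;

// Hypothetical illustration class; not part of the documented API.
final class NcdSketch {

    /** Returns the number of bytes produced by DEFLATE compression of the input. */
    static int compressedLength(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** NCD from compressed lengths: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). */
    static double ncd(int cxy, int cx, int cy) {
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    /** NCD of two byte arrays, compressing x, y and their concatenation. */
    static double ncd(byte[] x, byte[] y) {
        byte[] xy = new byte[x.length + y.length];
        System.arraycopy(x, 0, xy, 0, x.length);
        System.arraycopy(y, 0, xy, x.length, y.length);
        return ncd(compressedLength(xy), compressedLength(x), compressedLength(y));
    }
}
```

Unlike the plain ratio C(xy)/(C(x)+C(y)), this form subtracts the smaller compressed length, so near-identical inputs score close to 0 and unrelated inputs close to 1.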
public java.lang.String match(java.io.File f)
Match a given File against all the Files constituting the models in the categorizer.
f - A File.
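The matching step can be pictured as a nearest-example search over the known profiles. The following sketch works on in-memory byte arrays instead of Files so that it stays self-contained. The class NearestProfileSketch, the Map-based profile store and the use of java.util.zip.Deflater are all assumptions made for illustration, not the behaviour of match(java.io.File) itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.Deflater;

// Hypothetical illustration class; not part of the documented API.
final class NearestProfileSketch {

    static int compressedLength(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** C(xy) / (C(x) + C(y)), as in the class description. */
    static double dissimilarity(byte[] x, byte[] y) {
        byte[] xy = new byte[x.length + y.length];
        System.arraycopy(x, 0, xy, 0, x.length);
        System.arraycopy(y, 0, xy, x.length, y.length);
        return (double) compressedLength(xy)
                / (compressedLength(x) + compressedLength(y));
    }

    /** Returns the category whose example is closest to the document, or null if there are none. */
    static String match(byte[] document, Map<String, byte[]> profiles) {
        String best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, byte[]> entry : profiles.entrySet()) {
            double score = dissimilarity(entry.getValue(), document);
            if (score < bestScore) { // ties keep the first category encountered
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, byte[]> profiles = new LinkedHashMap<>();
        profiles.put("sports", "The home team scored twice and won the football match."
                .getBytes(StandardCharsets.UTF_8));
        profiles.put("finance", "Stock markets fell as investors reacted to interest rates."
                .getBytes(StandardCharsets.UTF_8));
        byte[] doc = "Investors reacted to interest rates.".getBytes(StandardCharsets.UTF_8);
        System.out.println(match(doc, profiles));
    }
}
```

With very short examples the compressor's fixed overhead dominates the score, so realistic profiles should be considerably longer than the strings used here.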
public static void main(java.lang.String[] args)
args - The command line arguments, tokenized.