Package pt.tumba.ngram.compression

Implementation of the compression-based classification technique described in the papers "Towards Parameter-Free Data Mining" and "The Similarity Metric", by Keogh et al. and Ming Li et al., respectively.


Interface Summary
ArithCodeModel Interface for an adaptive statistical model of a stream to be used as a basis for arithmetic coding and decoding.
 

Class Summary
AdaptiveUnigramModel Provides an adaptive model based on bytes observed in the input stream.
ArithCodeInputStream An input stream which uses a statistical model and arithmetic coding for decompression of encoded bytes read from an underlying input stream.
ArithCodeOutputStream A filter output stream which uses a statistical model and arithmetic coding to compress the bytes written to it, passing the encoded output to an underlying arithmetic encoder.
ArithDecoder Performs arithmetic decoding, converting bit input into cumulative probability interval output.
ArithEncoder Performs arithmetic encoding, converting cumulative probability interval input into bit output.
BitInput Reads input from an underlying input stream a bit at a time.
BitOutput Writes to an underlying output stream a bit at a time.
ByteBuffer Stores a queue of bytes in a buffer with a maximum size.
ByteSet A set of bytes.
CompressionCategorizer A classifier built on compression algorithms. Recent results in bioinformatics and observations about Kolmogorov complexity suggest that simple classification systems can be built using off-the-shelf compression algorithms.
ExcludingAdaptiveUnigramModel Package class for use by the PPMModel.
PPMCompress Command-line function for compressing files or streams.
PPMDecompress Command-line function for decompressing files or streams.
PPMModel Provides a cumulative, adaptive byte model implementing prediction by partial matching up to a specified maximum context size.
PPMNode A node in a depth-bounded suffix tree that represents counts of sequences of bytes.
Test Runs the test suite for arithmetic coding and decoding with all of the supplied compression models from Test.main(java.lang.String[]).
TestSet Package-local helper class to compute statistics for a set of compression experiments.
TestStatistics Package-local helper class to compute statistics for a single compression experiment.
UniformModel A singleton uniform distribution byte model.
 

Package pt.tumba.ngram.compression Description

Implementation of the compression-based classification technique described in the papers "Towards Parameter-Free Data Mining" and "The Similarity Metric", by Keogh et al. and Ming Li et al., respectively. It can use either the Zip compression algorithm available with the Java SDK or a more efficient arithmetic coding compressor.
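The idea behind compression-based classification can be sketched with the JDK's own java.util.zip.Deflater. The example below is only an illustration and does not reproduce the API of CompressionCategorizer: the class name CompressionDistanceSketch and the methods compressedSize, ncd and nearestLabel are hypothetical, and the distance used is the normalized compression distance (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), one common practical formulation in this line of work, assumed here for concreteness.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.Deflater;

// Hypothetical sketch; not part of pt.tumba.ngram.compression.
final class CompressionDistanceSketch {

    /** Number of bytes DEFLATE produces for the given input. */
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // only the byte count matters here
        }
        deflater.end();
        return total;
    }

    /** Normalized compression distance between x and y (assumed formulation). */
    static double ncd(byte[] x, byte[] y) {
        byte[] xy = new byte[x.length + y.length];
        System.arraycopy(x, 0, xy, 0, x.length);
        System.arraycopy(y, 0, xy, x.length, y.length);
        int cx = compressedSize(x);
        int cy = compressedSize(y);
        int cxy = compressedSize(xy);
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    /** Returns the label whose sample text is closest to the document. */
    static String nearestLabel(byte[] document, Map<String, byte[]> samples) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, byte[]> entry : samples.entrySet()) {
            double distance = ncd(document, entry.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, byte[]> samples = new LinkedHashMap<>();
        samples.put("text", "the quick brown fox jumps over the lazy dog near the river bank"
                .getBytes(StandardCharsets.UTF_8));
        samples.put("digits", "0123456789 9876543210 5647382910 1928374650 0192837465"
                .getBytes(StandardCharsets.UTF_8));
        byte[] document = "the lazy dog jumps over the quick brown fox"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(nearestLabel(document, samples));
    }
}
```

A document is assigned to the category whose sample it compresses best together with, so no feature extraction or parameter tuning is needed. With very short inputs the fixed overhead of the compressed format dominates, so distances are more meaningful on longer texts.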
Arithmetic coding is a general technique for coding the outcome of a stochastic process based on an adaptive model. The expected bit rate is the cross-entropy rate of the model versus the actual process. PPM, prediction by partial matching, is an adaptive statistical model of a symbol sequence which models the likelihood of the next byte based on a (relatively short) suffix of the sequence of previous bytes.
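The relationship between an adaptive model and the resulting bit rate can be illustrated with a small sketch. It does not use the classes of this package: AdaptiveCodeLengthSketch and idealBits are hypothetical names, and the model is a simple order-0 byte model whose counts all start at one (an assumption made for this example). For each byte the sketch adds -log2(p), the ideal code length under the model's current estimate p, and then updates the counts. An arithmetic coder driven by the same model would be expected to emit close to this many bits.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not part of pt.tumba.ngram.compression.
final class AdaptiveCodeLengthSketch {

    /**
     * Ideal code length, in bits, of the data under an adaptive order-0
     * byte model with all 256 counts initialised to one.
     */
    static double idealBits(byte[] data) {
        int[] counts = new int[256];
        Arrays.fill(counts, 1);
        int total = 256;
        double bits = 0.0;
        for (byte b : data) {
            int symbol = b & 0xFF;                 // treat the byte as unsigned
            double p = (double) counts[symbol] / total;
            bits += -Math.log(p) / Math.log(2.0);  // -log2(p)
            counts[symbol]++;                      // adapt after coding the symbol
            total++;
        }
        return bits;
    }

    public static void main(String[] args) {
        byte[] repetitive = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa".getBytes(StandardCharsets.US_ASCII);
        byte[] varied = "the quick brown fox jumps over!!".getBytes(StandardCharsets.US_ASCII);
        System.out.println(idealBits(repetitive));
        System.out.println(idealBits(varied));
    }
}
```

A PPM model refines this idea by conditioning the estimate p on the preceding bytes rather than on global byte counts alone, which usually lowers the code length on text with strong local structure.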