pt.tumba.ngram.compression
Class AdaptiveUnigramModel

java.lang.Object
  extended by pt.tumba.ngram.compression.AdaptiveUnigramModel
All Implemented Interfaces:
ArithCodeModel

public final class AdaptiveUnigramModel
extends java.lang.Object
implements ArithCodeModel

Provides an adaptive model based on bytes observed in the input stream. Each byte count is initialized at 1 and incremented by 1 for each instance seen. If incrementing an outcome causes the total count to exceed MAX_COUNT, then all counts are divided by 2 and rounded up. Estimation is by frequency (also known as a maximum likelihood estimate).

Author:
Bruno Martins

Field Summary
private  int[] _count
          Counts for each outcome.
private static int EOF_INDEX
          Index in the count array for the end-of-file outcome.
private static int MAX_COUNT
          Maximum count before rescaling.
private static int NUM_BYTES
          Total number of bytes.
private static int TOTAL_INDEX
          Index in the count array for the cumulative total of all outcomes.
 
Fields inherited from interface pt.tumba.ngram.compression.ArithCodeModel
EOF, ESCAPE
 
Constructor Summary
AdaptiveUnigramModel()
          Construct an adaptive unigram model, initializing all byte counts and end-of-file to 1.
 
Method Summary
 boolean escaped(int symbol)
          Returns true if current context has no count interval for given symbol.
 void exclude(int i)
          Excludes outcome from occurring in next estimate.
private  int highCount(int i)
          The cumulative count of all outcomes below given outcome plus the count of the outcome.
 void increment(int i)
          Increments the model as if it had just encoded or decoded the specified symbol in the stream.
 void interval(int symbol, int[] result)
          Calculates {low count, high count, total count} for the given symbol in the current context.
private  int lowCount(int i)
          The cumulative count of all outcomes below given outcome.
 int pointToSymbol(int midCount)
          Returns the symbol whose interval of low and high counts contains the given count.
private  void rescale()
          Rescale the counts by adding 1 to all counts and dividing by 2.
 int totalCount()
          Returns the total count for the current context.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_count

private int[] _count
Counts for each outcome. Indices 0 to 255 for the usual counts, 256 for end-of-file, and 257 for total. Each outcome i between 0-256 is coded by interval (_count[i],_count[i+1],_count[257]).


MAX_COUNT

private static final int MAX_COUNT
Maximum count before rescaling.

See Also:
Constant Field Values

NUM_BYTES

private static final int NUM_BYTES
Total number of bytes.

See Also:
Constant Field Values

EOF_INDEX

private static final int EOF_INDEX
Index in the count array for the end-of-file outcome.

See Also:
Constant Field Values

TOTAL_INDEX

private static final int TOTAL_INDEX
Index in the count array for the cumulative total of all outcomes.

See Also:
Constant Field Values
Constructor Detail

AdaptiveUnigramModel

public AdaptiveUnigramModel()
Construct an adaptive unigram model, initializing all byte counts and end-of-file to 1.

Method Detail

interval

public void interval(int symbol,
                     int[] result)
Calculates {low count, high count, total count} for the given symbol in the current context. The symbol is either an integer representation of a byte (0-255) or -1 to denote end-of-file. The cumulative counts in the return must be such that 0 <= low count < high count <= total count.

This method will be called exactly once for each symbol being encoded or decoded, and the calls will be made in the order in which they appear in the original file. Adaptive models may only update their state to account for seeing a symbol after returning its current interval.

Specified by:
interval in interface ArithCodeModel
Parameters:
symbol - The next symbol to decode.
result - Array into which to write range.

pointToSymbol

public int pointToSymbol(int midCount)
Returns the symbol whose interval of low and high counts contains the given count. Ordinary outcomes are positive integers, and the two special constants EOF or ESCAPE, which are negative.

Specified by:
pointToSymbol in interface ArithCodeModel
Parameters:
count - The given count.
Returns:
The symbol whose interval contains the given count.

totalCount

public int totalCount()
Returns the total count for the current context.

Specified by:
totalCount in interface ArithCodeModel
Returns:
Total count for the current context.

escaped

public boolean escaped(int symbol)
Returns true if current context has no count interval for given symbol. Successive calls to escaped(symbol) followed by interval(ESCAPE) must eventually lead to a a false return from escaped(symbol) after a number of calls equal to the maximum context size. The integer representation of symbol is as in interval.

Specified by:
escaped in interface ArithCodeModel
Parameters:
symbol - Symbol to test whether it is encoded.
Returns:
true if given symbol is not represented in the current context.

exclude

public void exclude(int i)
Excludes outcome from occurring in next estimate. A symbol must not be excluded and then coded or decoded. Exclusions in the model must be coordinated for encoding and decoding.

Specified by:
exclude in interface ArithCodeModel
Parameters:
symbol - Symbol which can be excluded from the next outcome.

increment

public void increment(int i)
Increments the model as if it had just encoded or decoded the specified symbol in the stream. May be used to prime models by "injecting" a symbol into the model's stream without coding/decoding it in the stream of coded bytes. Calls must be coordinated for encoding and decoding. Will be called automatically by the models for symbols they encode or decode.

Specified by:
increment in interface ArithCodeModel
Parameters:
symbol - Symbol to add to the model.

lowCount

private int lowCount(int i)
The cumulative count of all outcomes below given outcome.

Parameters:
i - Index of given outcome.
Returns:
Low count of interval for given symbol.

highCount

private int highCount(int i)
The cumulative count of all outcomes below given outcome plus the count of the outcome.

Parameters:
i - Index of given outcome.
Returns:
High count of interval for given symbol.

rescale

private void rescale()
Rescale the counts by adding 1 to all counts and dividing by 2.