pt.tumba.ngram.compression
Class PPMModel

java.lang.Object
  extended by pt.tumba.ngram.compression.PPMModel
All Implemented Interfaces:
ArithCodeModel

public final class PPMModel
extends java.lang.Object
implements ArithCodeModel

Provides a cumulative, adaptive byte model implementing prediction by partial matching up to a specified maximum context size. Uses Method C for estimation. Constants that control behavior include the maximum total count before rescaling, and the minimum count to retain after rescaling (an escape is always maintained with a count of at least 1).

Author:
Bruno Martins

Field Summary
private  ExcludingAdaptiveUnigramModel _backoffModel
          Model to use for short contexts.
private  ByteBuffer _buffer
          Bytes buffered for use as context.
private  int _byteCount
          Count of bytes coded to use in pruning.
private  int _contextLength
          Current context length.
private  PPMNode _contextNode
          Current context node.
private  PPMNode[] _contexts
          Nodes at depth 1 in the model.
private  ByteSet _excludedBytes
          Storage for the excluded bytes
private  int _maxContextLength
          Maximum context length to search in trie.
private  PPMNode _rootNode
          Root of the trie structure of counts.
private static int MIN_CONTEXT_LENGTH
          Minimum context length to look down sequence of nodes.
private static int PRUNE_INTERVAL
          Period between prunings in number of bytes.
 
Fields inherited from interface pt.tumba.ngram.compression.ArithCodeModel
EOF, ESCAPE
 
Constructor Summary
PPMModel(int maxContextLength)
          Construct a new model with the specified maximum length of context to use for prediction.
 
Method Summary
 boolean escaped(int symbol)
          Returns true if current context has no count interval for given symbol.
 void exclude(ByteSet bytesToExclude)
          Exclude all of the bytes in the specified byte set.
 void exclude(int i)
          Excludes outcome from occurring in next estimate.
private  void getContextNodeBinarySearch()
          Use binary search to set the context node up to the currently specified context length.
private  void getContextNodeLongToShort()
          Starting at the longest context, count down in length to set a valid context or give up.
private  void increment(byte b)
          Adds counts for given byte to model in current context and then updates the current context.
 void increment(int i)
          Increments the model as if it had just encoded or decoded the specified symbol in the stream.
 void interval(int symbol, int[] result)
          Calculates {low count, high count, total count} for the given symbol in the current context.
private  void intervalByte(int i, int[] result)
          Returns interval for byte specified as an integer in 0 to 255 range.
private  void intervalEscape(int[] result)
          Returns interval for escape in current context.
private static PPMNode lookup(PPMNode node, byte[] bytes, int offset, int length)
          Looks up a node from the given bytes, offset and length starting from the specified node.
private  PPMNode lookupNode(int contextLength)
          Returns node from the current byte buffer of the specified context length, or null if there isn't one.
 int pointToSymbol(int count)
          Returns the symbol whose interval of low and high counts contains the given count.
private  void prune()
          Method used for pruning (edited out).
 int totalCount()
          Returns the total count for the current context.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_byteCount

private int _byteCount
Count of bytes coded to use in pruning.


_backoffModel

private final ExcludingAdaptiveUnigramModel _backoffModel
Model to use for short contexts.


_contexts

private final PPMNode[] _contexts
Nodes at depth 1 in the model. All order 0 nodes are included in the unigram


_maxContextLength

private final int _maxContextLength
Maximum context length to search in trie. Maximum count will be for maximum context length plus one.


_rootNode

private final PPMNode _rootNode
Root of the trie structure of counts. Dummy byte as symbol.


_contextLength

private int _contextLength
Current context length.


_contextNode

private PPMNode _contextNode
Current context node.


_buffer

private final ByteBuffer _buffer
Bytes buffered for use as context.


_excludedBytes

private final ByteSet _excludedBytes
Storage for the excluded bytes


MIN_CONTEXT_LENGTH

private static final int MIN_CONTEXT_LENGTH
Minimum context length to look down sequence of nodes. Shorter contexts use backoff model.

See Also:
Constant Field Values

PRUNE_INTERVAL

private static final int PRUNE_INTERVAL
Period between prunings in number of bytes.

See Also:
Constant Field Values
Constructor Detail

PPMModel

public PPMModel(int maxContextLength)
Construct a new model with the specified maximum length of context to use for prediction.

Parameters:
maxContextLength - Maximum length of context to use for prediction.
Method Detail

escaped

public boolean escaped(int symbol)
Returns true if current context has no count interval for given symbol. Successive calls to escaped(symbol) followed by interval(ESCAPE) must eventually lead to a a false return from escaped(symbol) after a number of calls equal to the maximum context size. The integer representation of symbol is as in interval.

Specified by:
escaped in interface ArithCodeModel
Parameters:
symbol - Symbol to test whether it is encoded.
Returns:
true if given symbol is not represented in the current context.

exclude

public void exclude(int i)
Excludes outcome from occurring in next estimate. A symbol must not be excluded and then coded or decoded. Exclusions in the model must be coordinated for encoding and decoding.

Specified by:
exclude in interface ArithCodeModel
Parameters:
symbol - Symbol which can be excluded from the next outcome.

interval

public void interval(int symbol,
                     int[] result)
Calculates {low count, high count, total count} for the given symbol in the current context. The symbol is either an integer representation of a byte (0-255) or -1 to denote end-of-file. The cumulative counts in the return must be such that 0 <= low count < high count <= total count.

This method will be called exactly once for each symbol being encoded or decoded, and the calls will be made in the order in which they appear in the original file. Adaptive models may only update their state to account for seeing a symbol after returning its current interval.

Specified by:
interval in interface ArithCodeModel
Parameters:
symbol - The next symbol to decode.
result - Array into which to write range.

pointToSymbol

public int pointToSymbol(int count)
Returns the symbol whose interval of low and high counts contains the given count. Ordinary outcomes are positive integers, and the two special constants EOF or ESCAPE, which are negative.

Specified by:
pointToSymbol in interface ArithCodeModel
Parameters:
count - The given count.
Returns:
The symbol whose interval contains the given count.

totalCount

public int totalCount()
Returns the total count for the current context.

Specified by:
totalCount in interface ArithCodeModel
Returns:
Total count for the current context.

increment

public void increment(int i)
Increments the model as if it had just encoded or decoded the specified symbol in the stream. May be used to prime models by "injecting" a symbol into the model's stream without coding/decoding it in the stream of coded bytes. Calls must be coordinated for encoding and decoding. Will be called automatically by the models for symbols they encode or decode.

Specified by:
increment in interface ArithCodeModel
Parameters:
symbol - Symbol to add to the model.

exclude

public void exclude(ByteSet bytesToExclude)
Exclude all of the bytes in the specified byte set.

Parameters:
bytesToExclude - Set of bytes to exclude from outcome.
Since:
1.1

intervalByte

private void intervalByte(int i,
                          int[] result)
Returns interval for byte specified as an integer in 0 to 255 range.

Parameters:
i - Integer specification of byte in 0 to 255 range.
result - Array specifying cumulative probability for byte i.

intervalEscape

private void intervalEscape(int[] result)
Returns interval for escape in current context.

Parameters:
result - Array for specifying cumulative probability for escape symbol in current context.

prune

private void prune()
Method used for pruning (edited out).


increment

private void increment(byte b)
Adds counts for given byte to model in current context and then updates the current context. Rescales counts if necessary. Called by both encoding and deocding.

Parameters:
b - Byte to add to model.

getContextNodeBinarySearch

private void getContextNodeBinarySearch()
Use binary search to set the context node up to the currently specified context length. May set it to null if not found.


getContextNodeLongToShort

private void getContextNodeLongToShort()
Starting at the longest context, count down in length to set a valid context or give up. This version finds the shortest deterministic context <= in length to the current context length, but if there is no deterministic context, returns longest context length that exists that is <= in length to the current context. Could also implement this in short to long order


lookupNode

private PPMNode lookupNode(int contextLength)
Returns node from the current byte buffer of the specified context length, or null if there isn't one.

Parameters:
contextLength - Number of bytes of context used.
Returns:
Node found at that context.

lookup

private static PPMNode lookup(PPMNode node,
                              byte[] bytes,
                              int offset,
                              int length)
Looks up a node from the given bytes, offset and length starting from the specified node.

Parameters:
node - Node from which to search.
bytes - Sequence of bytes to search.
offset - Offset into sequence of bytes of the first byte.
length - Number of bytes to look up.
Returns:
Node found for the given bytes.