pt.tumba.ngram.compression
Class PPMNode

java.lang.Object
  extended by pt.tumba.ngram.compression.PPMNode

final class PPMNode
extends java.lang.Object

A node in a depth-bounded suffix tree that represents counts of sequences of bytes. Nodes in the trie are accessed through sequences of bytes in the same way as the Map.get(java.lang.Object) method of Map. Sequences of bytes are added as descendants of the current node if necessary and their counts are incremented and rescaled if necessary through an increment operation that works in the same way as Map.put(K, V) in Map.

The entire trie is accessible through the root node. A byte is stored with each node, and a (scaled) count of the times the the sequence of bytes leading to this node in the trie were seen. Each node contains a linked list of child nodes, the first of which is provided as a member element. Each node also represents a member of the linked list of siblings, and the next sibling is provided as a member element.

The nodes provide the cumulative probability estimates for the model through the total, interval and intervalEscape methods.

Author:
Bruno Martins
See Also:
PPMModel

Field Summary
(package private)  byte _byte
          The byte for this node.
private  short _count
          The scaled count for this node.
(package private)  PPMNode _firstChild
          The first child of this node.
(package private)  PPMNode _nextSibling
          The next sibling of this node.
private  short _numberOfOutcomes
          The scaled number of outcomes used to calculate escape likelihoods.
private static int MAX_INDIVIDUAL_COUNT
          Maximum count for daughter node before rescaling all daughters.
private static int MIN_COUNT
          Minimum count for which to retain a node during rescaling.
private static int MIN_PRUNE_COUNT
          Minimum count for a node to survive pruning.
 
Constructor Summary
PPMNode(byte b)
          Construct a node with the specified byte.
PPMNode(byte b, PPMNode nextSibling)
          Construct a node with the specified byte and next sibling.
 
Method Summary
(package private)  void complete(byte[] bytes, int offset, int length)
          Extends this node with the given sequence of bytes, specified by an array, offset and length.
private  boolean hasDaughter(byte b)
          Returns true if this node has a child with the specified byte.
(package private)  boolean hasDaughter(int i)
          Returns true if this node has a child with the specified byte, specified as an integer.
(package private)  void increment(byte[] bytes, int offset, int length)
          Increment the count of all of the nodes along the sequence of bytes determined by the specified array, beginning at the specified offset and continuing for the specified length number of bytes.
(package private)  void increment(ByteBuffer buffer)
          Increment the counts for this node for the string specified in the buffer.
private  void interval(byte b, ByteSet excludedBytes, int[] result)
          Calculates the interval for the specified byte from this node and writes it into the specified array.
(package private)  void interval(int i, ByteSet excludedBytes, int[] result)
          Calculates the interval for the specified byte from this node and writes it into the specified array.
(package private)  void intervalEscape(ByteSet excludedBytes, int[] result)
          The interval for the escape count, less the set of excluded bytes.
(package private)  boolean isChildless(ByteSet excludedBytes)
          Returns true if this node has no children, not counting specified exclusions.
(package private)  boolean isDeterministic(ByteSet excludedBytes)
          Returns true if the number of children for this node is 1.
(package private)  int pointToSymbol(int midCount, ByteSet excludedBytes)
          Retrieves the symbol for which the midCount is between its low and high counts (inclusive on low, exclusive on high).
 PPMNode prune()
          Prunes this node and its children, returning null if the node's count is too low and pruning all children with counts too low.
private  PPMNode pruneSiblings()
          Prunes the siblings of this node, returning the next sibling or null if there aren't any.
private  void rescale()
          Rescale all of the counts of the children of this node.
private  PPMNode rescaleSiblings()
          Rescale the counts on this node and the siblings of this node.
(package private)  int totalCount(ByteSet excludedBytes)
          Total count for this node, not including those bytes in the specified set.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

_byte

final byte _byte
The byte for this node.


_count

private short _count
The scaled count for this node.


_numberOfOutcomes

private short _numberOfOutcomes
The scaled number of outcomes used to calculate escape likelihoods.


_firstChild

PPMNode _firstChild
The first child of this node.


_nextSibling

PPMNode _nextSibling
The next sibling of this node.


MIN_PRUNE_COUNT

private static final int MIN_PRUNE_COUNT
Minimum count for a node to survive pruning.

See Also:
Constant Field Values

MIN_COUNT

private static final int MIN_COUNT
Minimum count for which to retain a node during rescaling. Surprisingly insensitive.

See Also:
Constant Field Values

MAX_INDIVIDUAL_COUNT

private static final int MAX_INDIVIDUAL_COUNT
Maximum count for daughter node before rescaling all daughters. Max value is 8K; higher values cause overflow in the arithmetic coder. Higher values compress better, lower values are generally faster up to the point they cause thrashing. 8K is about .01 b/B more compressed, and about 25% slower vs. 4K. 2K is about .01 b/B less compressed, and roughly same speed as 4K.

See Also:
Constant Field Values
Constructor Detail

PPMNode

PPMNode(byte b,
        PPMNode nextSibling)
Construct a node with the specified byte and next sibling.

Parameters:
b - Byte represented by node.
nextSibling - The next daughter node in the list of daughters.

PPMNode

PPMNode(byte b)
Construct a node with the specified byte.

Parameters:
b - Byte represented by node.
Method Detail

isDeterministic

boolean isDeterministic(ByteSet excludedBytes)
Returns true if the number of children for this node is 1.

Parameters:
bytes - Bytes that have been seen in escaped context that should not be considered children.
Returns:
true if the scaled number of outcomes for this node is 1.

isChildless

boolean isChildless(ByteSet excludedBytes)
Returns true if this node has no children, not counting specified exclusions.

Parameters:
excludedBytes - Bytes to exclude as children
Returns:
true if this node has no children, not counting

totalCount

int totalCount(ByteSet excludedBytes)
Total count for this node, not including those bytes in the specified set.

Parameters:
excludedBytes - Set of bytes to exclude from counts.
Returns:
Total count for this node.

interval

void interval(int i,
              ByteSet excludedBytes,
              int[] result)
Calculates the interval for the specified byte from this node and writes it into the specified array.

Parameters:
b - Byte whose interval is calcuated.
excludedBytes - Set of bytes to exclude from counts.
result - Array in which to write the range for the specified byte.

interval

private void interval(byte b,
                      ByteSet excludedBytes,
                      int[] result)
Calculates the interval for the specified byte from this node and writes it into the specified array.

Parameters:
b - Byte whose interval is calcuated.
excludedBytes - Set of bytes to exclude from counts.
result - Array in which to write the range for the specified byte.

intervalEscape

void intervalEscape(ByteSet excludedBytes,
                    int[] result)
The interval for the escape count, less the set of excluded bytes.

Parameters:
excludedBytes - Set of bytes to exclude from counts.
result - Array into which to write the range for the specified bytes.

increment

void increment(ByteBuffer buffer)
Increment the counts for this node for the string specified in the buffer.

Parameters:
buffer - Buffer of bytes from which to read event to increment.

hasDaughter

boolean hasDaughter(int i)
Returns true if this node has a child with the specified byte, specified as an integer.

Parameters:
b - Byte coded as integer to check.
Returns:
true if there is a child node with the specified byte.

hasDaughter

private boolean hasDaughter(byte b)
Returns true if this node has a child with the specified byte.

Parameters:
b - Byte to check.
Returns:
true if there is a child node with the specified byte.

pointToSymbol

int pointToSymbol(int midCount,
                  ByteSet excludedBytes)
Retrieves the symbol for which the midCount is between its low and high counts (inclusive on low, exclusive on high).

Parameters:
midCount - Count for which to find symbol.
excludedBytes - Set of bytes to exclude from counts.
Returns:
Symbol with specified count.

complete

void complete(byte[] bytes,
              int offset,
              int length)
Extends this node with the given sequence of bytes, specified by an array, offset and length.

Parameters:
bytes - Byte array providing bytes to extend.
offset - Index of first byte in array.
length - Number of bytes to extend.

increment

void increment(byte[] bytes,
               int offset,
               int length)
Increment the count of all of the nodes along the sequence of bytes determined by the specified array, beginning at the specified offset and continuing for the specified length number of bytes.

Parameters:
bytes - Array from which to read bytes.
offset - Index of first byte to read from array.
length - Total number of bytes to read from array.

prune

public PPMNode prune()
Prunes this node and its children, returning null if the node's count is too low and pruning all children with counts too low.

Returns:
This node if it is above the minimum number of counts.

pruneSiblings

private PPMNode pruneSiblings()
Prunes the siblings of this node, returning the next sibling or null if there aren't any.

Returns:
Linked list of siblings above the pruning threshold.

rescale

private void rescale()
Rescale all of the counts of the children of this node. Divides by 2, rounding up, but eliminates all nodes that fall below count threshold. Total number of outcomes is also rescaled, but it will never fall below 1 to allow possiblity for escapes.


rescaleSiblings

private PPMNode rescaleSiblings()
Rescale the counts on this node and the siblings of this node. Divides by 2, rounding up, so no count every drops below 1. Returns rescaled node, which may not be original sibling or may be null if siblings scale below.