pt.tumba.ngram
Class NGram

java.lang.Object
  extended by pt.tumba.ngram.NGram
All Implemented Interfaces:
java.lang.Comparable

public class NGram
extends java.lang.Object
implements java.lang.Comparable

This class models a simple, concrete N-gram. To make it slightly more interesting (and efficient), the class follows the Flyweight pattern, so that only one instance exists in the system for each distinct N-gram.

Author:
Bruno Martins

Field Summary
protected  byte[] bytes
          Array of bytes storing the N-gram.
protected  double count
          Number of occurrences of this NGram.
static NGram[] known
          An array with the known N-grams.
protected static int knownCount
          Number of N-grams in the array of known N-grams.
protected static int knownStep
          Empty space to leave each time we have to increment the cache.
protected  int size
          Size of this N-gram.
protected static boolean useCache
          Boolean flag indicating the use of the N-gram cache.
 
Constructor Summary
protected NGram()
          Constructor for the NGram object.
  NGram(byte[] bytes, int start, int length)
          Constructor for the NGram object
  NGram(byte[] bytes, int start, int length, double count)
          Constructor for the NGram object
  NGram(NGram ng)
          Constructor for the NGram object which copies another N-gram.
  NGram(java.lang.String str)
          Constructor for the NGram object
 
Method Summary
private static int code(byte[] bytes, int start, int length)
          Encode a byte sequence.
 int compareTo(java.lang.Object e1)
          Compares the number of occurrences of this N-gram with another.
 boolean equals(byte[] bytes, int start, int length)
          Compares this N-gram with another one supplied as an array of bytes.
 boolean equals(java.lang.Object e1)
          Compares this N-gram with another Object (checking whether the object being compared is an N-gram).
 int getByte(int pos)
          Return a single byte out of the NGram.
 int getCount()
          Returns the number of occurrences of this N-gram.
static int getNGramCount()
          Gets the number of different N-Grams.
 int getSize()
          Return the size of this NGram.
 double getSmoothedCount()
          Returns the number of occurrences of this N-gram, using Good-Turing smoothing.
 java.lang.String getString()
          Returns a String representation of this NGram.
 int hashCode()
          Overrides hashCode so that NGrams can be hashed against short byte sequences.
 void inc()
          Increments the number of occurrences of this N-gram.
static NGram newNGram(byte[] bytes)
          QuasiConstructor.
static NGram newNGram(byte[] bytes, int start)
          QuasiConstructor.
static NGram newNGram(byte[] bytes, int start, int length)
          QuasiConstructor.
static NGram newNGram(byte[] bytes, int start, int length, double count)
          QuasiConstructor.
static NGram newNGram(java.lang.String str)
          QuasiConstructor.
 java.lang.String toString()
          Returns a String representation of this NGram that also includes its occurrence frequency.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

known

public static NGram[] known
An array with the known N-grams. The size must be a power of 2.
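
A power-of-two table size is commonly what lets a hash code be mapped to a slot with a bit mask instead of a modulo. The sketch below illustrates only that general idea. The names PowerOfTwoIndex, isPowerOfTwo and slotFor are hypothetical, and it is an assumption (not confirmed by this page) that NGram indexes its cache this way.

```java
// Hypothetical illustration; not taken from the NGram source.
final class PowerOfTwoIndex {
    private PowerOfTwoIndex() {}

    /** Returns true if n is a positive power of two. */
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    /** Maps any hash code (including negative ones) to a slot in [0, tableSize). */
    static int slotFor(int hash, int tableSize) {
        if (!isPowerOfTwo(tableSize)) {
            throw new IllegalArgumentException("tableSize must be a power of 2");
        }
        // With a power-of-two size, masking keeps exactly the low-order bits.
        return hash & (tableSize - 1);
    }
}
```

With a size that is not a power of two, the mask would leave some slots unreachable, which is one plausible reason for the constraint stated above.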


knownCount

protected static int knownCount
Number of N-grams in the array of known N-grams.


knownStep

protected static int knownStep
Empty space to leave each time we have to increment the cache.


useCache

protected static boolean useCache
Boolean flag indicating the use of the N-gram cache.


bytes

protected byte[] bytes
Array of bytes storing the N-gram.


size

protected int size
Size of this N-gram.


count

protected double count
Number of occurrences of this NGram.

Constructor Detail

NGram

protected NGram()
Constructor for the NGram object.


NGram

public NGram(byte[] bytes,
             int start,
             int length,
             double count)
Constructor for the NGram object

Parameters:
bytes - An array of bytes with the N-gram.
start - Starting position in the array of bytes.
length - Ending position in the array of bytes.
count - Occurrence frequency for this NGram.

NGram

public NGram(byte[] bytes,
             int start,
             int length)
Constructor for the NGram object

Parameters:
bytes - An array of bytes with the N-gram.
start - Starting position in the array of bytes.
length - Ending position in the array of bytes.

NGram

public NGram(NGram ng)
Constructor for the NGram object which copies another N-gram.

Parameters:
ng - An N-gram.

NGram

public NGram(java.lang.String str)
Constructor for the NGram object

Parameters:
str - A string with the N-gram.

Method Detail

code

private static int code(byte[] bytes,
                        int start,
                        int length)
Encode a byte sequence.

Parameters:
bytes - An array of bytes.
start - Starting position in the array of bytes.
length - Ending position in the array of bytes.
Returns:
A hash code for the array of bytes.

getNGramCount

public static int getNGramCount()
Gets the number of different N-Grams.

Returns:
The number of different N-Grams.

newNGram

public static NGram newNGram(byte[] bytes)
QuasiConstructor. Because of the Flyweight pattern, we first have to check whether the current N-gram is already known.

Parameters:
bytes - Sequence of bytes with the N-gram
Returns:
The N-gram in the sequence of bytes.

newNGram

public static NGram newNGram(byte[] bytes,
                             int start)
QuasiConstructor. Because of the Flyweight pattern, we first have to check whether the current N-gram is already known.

Parameters:
bytes - Sequence of bytes with the N-gram
start - Starting position in the sequence of bytes.
Returns:
The N-gram in the sequence of bytes.

newNGram

public static NGram newNGram(java.lang.String str)
QuasiConstructor. Because of the Flyweight pattern, we first have to check whether the current N-gram is already known.

Parameters:
str - A string with the N-gram.
Returns:
The N-gram in the String.

newNGram

public static NGram newNGram(byte[] bytes,
                             int start,
                             int length)
QuasiConstructor. Because of the Flyweight pattern, we first have to check whether the current N-gram is already known.

Parameters:
bytes - Sequence of bytes with the N-gram
start - Starting position in the sequence of bytes.
length - Ending position in the sequence of bytes.
Returns:
The N-gram in the sequence of bytes.

newNGram

public static NGram newNGram(byte[] bytes,
                             int start,
                             int length,
                             double count)
QuasiConstructor. Because of the Flyweight pattern, we first have to check whether the current N-gram is already known.

Parameters:
bytes - Sequence of bytes with the N-gram
start - Starting position in the sequence of bytes.
length - Ending position in the sequence of bytes.
count - Occurrence frequency for this NGram.
Returns:
The N-gram in the sequence of bytes.
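
The look-up-before-create behaviour described for the newNGram methods can be sketched as follows. This is a minimal, self-contained illustration of a flyweight factory and not the pt.tumba.ngram.NGram implementation: the class InternedGram, its HashMap-backed pool, and the methods of and distinctCount are all hypothetical, and the sketch assumes that the third argument is a number of bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a flyweight N-gram; not the NGram source.
final class InternedGram {
    // One shared pool: each distinct byte sequence maps to a single instance.
    private static final Map<String, InternedGram> POOL = new HashMap<>();

    private final byte[] bytes;
    private double count;

    private InternedGram(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Returns the pooled instance for the given region, creating it on first use. */
    static synchronized InternedGram of(byte[] src, int start, int length) {
        // ISO-8859-1 maps every byte to exactly one char, so the key is lossless.
        String key = new String(src, start, length, StandardCharsets.ISO_8859_1);
        return POOL.computeIfAbsent(key,
                k -> new InternedGram(k.getBytes(StandardCharsets.ISO_8859_1)));
    }

    /** Number of distinct N-grams created so far. */
    static synchronized int distinctCount() {
        return POOL.size();
    }

    void inc() { count++; }

    double getCount() { return count; }

    int getSize() { return bytes.length; }
}
```

Because both calls return the same object, incrementing through either reference updates one shared count, which is the practical benefit of using a factory method instead of a public constructor.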

equals

public boolean equals(byte[] bytes,
                      int start,
                      int length)
Compares this N-gram with another one supplied as an array of bytes.

Parameters:
bytes - An array of bytes with an N-gram.
start - Starting position in the array of bytes.
length - Ending position in the array of bytes.
Returns:
true if both N-grams are equal and false otherwise.
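
Comparing against a region of a larger array avoids copying bytes just to test for equality. The following is a hedged sketch of such a comparison. ByteRegion and regionEquals are hypothetical names, and the sketch assumes that the third argument is a number of bytes rather than an end index.

```java
// Hypothetical illustration; not taken from the NGram source.
final class ByteRegion {
    private ByteRegion() {}

    /**
     * Returns true if gram matches other[start .. start + length - 1] byte for byte.
     * An out-of-range region is reported as "not equal" instead of throwing.
     */
    static boolean regionEquals(byte[] gram, byte[] other, int start, int length) {
        if (length != gram.length || start < 0 || start > other.length - length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (gram[i] != other[start + i]) {
                return false;
            }
        }
        return true;
    }
}
```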

equals

public boolean equals(java.lang.Object e1)
Compares this N-gram with another Object (checking whether the object being compared is an N-gram).

Overrides:
equals in class java.lang.Object
Parameters:
e1 - An object.
Returns:
true if the Object being compared is an N-gram and both N-grams are equal; false otherwise.

getByte

public int getByte(int pos)
Return a single byte out of the NGram.

Parameters:
pos - Position of the byte to return (1st, 2nd, 3rd, ...).
Returns:
The byte value.
Throws:
java.lang.ArrayIndexOutOfBoundsException - The NGram does not contain the given position.

getSize

public int getSize()
Return the size of this NGram.

Returns:
The size of this NGram.

getString

public java.lang.String getString()
Returns a String representation of this NGram.

Returns:
A String representation of this NGram.

hashCode

public int hashCode()
Overrides hashCode so that NGrams can be hashed against short byte sequences.

Overrides:
hashCode in class java.lang.Object
Returns:
A hash code for this NGram.
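
For an N-gram to be found from a raw byte region, the hash of the object and the hash of the region have to be computed the same way. The algorithm used by the private code method is not documented on this page, so the sketch below substitutes a plain polynomial hash purely for illustration. ByteHash and hash are hypothetical names, and the third argument is assumed to be a number of bytes.

```java
// Hypothetical illustration; the actual encoding used by NGram is not documented here.
final class ByteHash {
    private ByteHash() {}

    /** Polynomial (base 31) hash over bytes[start .. start + length - 1]. */
    static int hash(byte[] bytes, int start, int length) {
        int h = 0;
        for (int i = start; i < start + length; i++) {
            // Mask to treat each byte as an unsigned value in 0..255.
            h = 31 * h + (bytes[i] & 0xFF);
        }
        return h;
    }
}
```

Hashing a region in place means that a lookup for an N-gram inside a long input buffer needs no temporary array.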

toString

public java.lang.String toString()
Returns a String representation of this NGram that also includes its occurrence frequency.

Overrides:
toString in class java.lang.Object
Returns:
A String representation of this NGram.

compareTo

public int compareTo(java.lang.Object e1)
Compares the number of occurrences of this N-gram with another.

Specified by:
compareTo in interface java.lang.Comparable
Parameters:
e1 - An object (must be an instance of NGram)
Returns:
the value 0 if the argument is an NGram with equal occurrence frequency; a value less than 0 if the argument is an NGram occurring more frequently; and a value greater than 0 if the argument is an NGram occurring less frequently.
Throws:
java.lang.NullPointerException - if e1 is null.
java.lang.ClassCastException - if e1 is not an NGram object.
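
The ordering described above (negative when the argument occurs more frequently) amounts to comparing the two counts in ascending order. The sketch below shows one way such an ordering can be written and used to rank N-grams. CountedGram and mostFrequentFirst are hypothetical names, and the sketch uses a generic Comparable, unlike the raw Comparable interface that NGram implements.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not taken from the NGram source.
final class CountedGram implements Comparable<CountedGram> {
    final String text;
    final double count;

    CountedGram(String text, double count) {
        this.text = text;
        this.count = count;
    }

    /** Negative if other occurs more often, positive if less often, 0 if equal. */
    @Override
    public int compareTo(CountedGram other) {
        return Double.compare(this.count, other.count);
    }

    /** Returns the texts ordered from most to least frequent. */
    static List<String> mostFrequentFirst(List<CountedGram> grams) {
        List<CountedGram> sorted = new ArrayList<>(grams);
        // Natural order is ascending by count, so reverse it for a ranking.
        sorted.sort(Collections.reverseOrder());
        List<String> texts = new ArrayList<>();
        for (CountedGram g : sorted) {
            texts.add(g.text);
        }
        return texts;
    }
}
```

Note that an ordering based only on counts is not consistent with equals: two different N-grams with the same count compare as 0 without being equal, which matters if such objects are stored in a sorted set or map.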

getCount

public int getCount()
Returns the number of occurrences of this N-gram.

Returns:
The number of occurrences of this N-gram.

getSmoothedCount

public double getSmoothedCount()
Returns the number of occurrences of this N-gram, using Good-Turing smoothing. TODO: Only works if the cache is being used.

Returns:
The smoothed number of occurrences of this N-gram.
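
In its basic form, Good-Turing smoothing replaces a raw count c with c* = (c + 1) * N(c + 1) / N(c), where N(k) is the number of distinct N-grams seen exactly k times. Computing N(k) requires access to every known N-gram, which is plausibly why the method depends on the cache. The sketch below implements only this basic formula. GoodTuring, countsOfCounts and smoothed are hypothetical names, and falling back to the raw count when N(c) or N(c + 1) is zero is a simplification chosen for this example, not documented behaviour of NGram.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of basic Good-Turing smoothing; not the NGram source.
final class GoodTuring {
    private GoodTuring() {}

    /** Builds N(k): how many distinct items were observed exactly k times. */
    static Map<Integer, Integer> countsOfCounts(int[] counts) {
        Map<Integer, Integer> n = new HashMap<>();
        for (int c : counts) {
            n.merge(c, 1, Integer::sum);
        }
        return n;
    }

    /** Returns (c + 1) * N(c + 1) / N(c), or c itself when either N value is zero. */
    static double smoothed(int c, Map<Integer, Integer> n) {
        int nc = n.getOrDefault(c, 0);
        int nc1 = n.getOrDefault(c + 1, 0);
        if (nc == 0 || nc1 == 0) {
            return c; // assumption: keep the raw count when the estimate is undefined
        }
        return (c + 1) * (double) nc1 / nc;
    }
}
```

For the counts {1, 1, 1, 2, 2, 3}, N(1) = 3, N(2) = 2 and N(3) = 1, so a raw count of 1 is smoothed to 2 * 2 / 3 and a raw count of 2 to 3 * 1 / 2.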

inc

public void inc()
Increments the number of occurrences of this N-gram.