java.lang.Object pt.tumba.ngram.NGramCathegorizer
public class NGramCathegorizer
NGramCathegorizer
implements the classification technique described in
Cavnar & Trenkle, "N-Gram-Based Text Categorization". It was primarily developed
for language guessing, a task on which it is known to perform with near-perfect accuracy.
The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category, and compare this with the fingerprints of a number of documents of which the categories are known. The categories of the closest matches are output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency.
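The fingerprint idea can be illustrated with a minimal sketch. The class `FingerprintSketch` and its method `fingerprint` are hypothetical names used only for illustration and are not part of this package. The sketch counts raw character n-grams without the word-boundary padding used in the article, and breaks frequency ties alphabetically, which is an assumption made here to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of pt.tumba.ngram. */
public class FingerprintSketch {

    /**
     * Returns up to `limit` of the most frequent character n-grams of
     * length 1..maxN found in `text`, most frequent first.
     */
    public static List<String> fingerprint(String text, int maxN, int limit) {
        // Count every substring of length 1..maxN.
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        // Order by descending count; ties are broken alphabetically (assumption).
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Keep only the n-grams themselves, truncated to the requested size.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

For example, `FingerprintSketch.fingerprint("abab", 2, 3)` counts `a`, `b` and `ab` twice each and `ba` once, so the three-entry fingerprint is `a`, `ab`, `b`.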
In the original proposal, fingerprints are compared with a simple out-of-place metric (see the article for more details). This package implements some extensions to the original proposal, namely support for Good-Turing smoothing and new fingerprint comparison methods based on the similarity metrics proposed by Lin in "An information-theoretic definition of similarity" and by Jiang & Conrath in "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy".
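As a rough illustration of the out-of-place metric, the sketch below sums, for each n-gram in the document fingerprint, how far its rank is from the rank of the same n-gram in the model fingerprint. The class `OutOfPlaceSketch` is a hypothetical name, and this is not the code behind `deltaRank`. Using the model size as the penalty for n-grams missing from the model is an assumption; the article only requires some maximum value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not the library's deltaRank. */
public class OutOfPlaceSketch {

    /**
     * Sums the rank displacement of each document n-gram relative to the
     * model fingerprint. Both lists are ordered most frequent first.
     */
    public static int outOfPlace(List<String> document, List<String> model) {
        // Index the model so that each n-gram maps to its rank.
        Map<String, Integer> modelRank = new HashMap<>();
        for (int i = 0; i < model.size(); i++) {
            modelRank.put(model.get(i), i);
        }
        // Assumed penalty for n-grams that the model does not contain.
        int maxPenalty = model.size();
        int distance = 0;
        for (int i = 0; i < document.size(); i++) {
            Integer rank = modelRank.get(document.get(i));
            distance += (rank == null) ? maxPenalty : Math.abs(rank - i);
        }
        return distance;
    }
}
```

With the document fingerprint `a, b, c` and the model fingerprint `b, a, d`, the n-grams `a` and `b` are each one position out of place and `c` is missing from the model, so the distance is 1 + 1 + 3 = 5. Two identical fingerprints have distance 0.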
This library was made with efficiency in mind. There are a couple of parameters you may wish to tweak if you intend to use it for tasks other than language guessing.
Since the running time of the classifier grows roughly linearly with the number of models, you should consider how many models you really need. For instance, in the case of language guessing: do you really want to recognize every language ever invented?
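The linear cost follows from comparing the unknown document with every model in turn. The sketch below shows such a scan under the assumption that the best match is the model with the smallest distance. `NearestModelSketch` and `nearest` are hypothetical names, the profile type is left generic, and this is not the implementation of `match`.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper for illustration; not the library's match method. */
public class NearestModelSketch {

    /**
     * Returns the model with the smallest distance to `unknown`,
     * or null when `models` is empty. One distance call is made per model.
     */
    public static <P> P nearest(P unknown, List<P> models,
                                ToDoubleBiFunction<P, P> distance) {
        P best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (P model : models) {
            double d = distance.applyAsDouble(unknown, model);
            if (d < bestDistance) {
                bestDistance = d;
                best = model;
            }
        }
        return best;
    }
}
```

Every additional model adds exactly one more distance computation per classified document, so dropping models you do not need reduces the work proportionally.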
Field Summary | |
---|---|
static java.io.FilenameFilter |
NGramFilter
A FilenameFilter for filtering directory listings, recognizing
filenames for N-gram profiles. |
protected java.util.List |
profiles
The list of profiles with the models for classification. |
private int |
similarityMetric
The metric used to measure the distance between the profiles. |
Constructor Summary | |
---|---|
NGramCathegorizer()
Construct an uninitialized categorizer that uses Lin's similarity measure. |
|
NGramCathegorizer(int similarityMetric)
Construct an uninitialized categorizer that uses a specific similarity measure. |
|
NGramCathegorizer(java.lang.String dirName)
Construct a categorizer that uses Lin's similarity measure from a directory with model profiles. |
|
NGramCathegorizer(java.lang.String[] fileNames)
Construct a categorizer that uses Lin's similarity measure from an array of resource file names. |
|
NGramCathegorizer(java.lang.String[] fileNames,
int similarityMetric)
Construct a categorizer that uses a specific similarity measure from an array of resource file names. |
|
NGramCathegorizer(java.lang.String dirName,
int similarityMetric)
Construct a categorizer that uses a specific similarity measure from a directory with model profiles. |
Method Summary | |
---|---|
void |
addProfile(Profile prof)
Add a new Profile to the list of models. |
static double |
deltaRank(Profile prof1,
Profile prof2)
Calculate "the distance" between two profiles, using the metric proposed by Cavnar & Trenkle. |
private void |
init(java.io.File fi,
java.lang.String[] names)
Fetch the set of profiles from the disk. |
static void |
main(java.lang.String[] args)
Sample application for using the categorizer from the command line. |
Profile |
match(Profile prof)
Match a given Profile against all the profiles
constituting the models in the categorizer. |
static double |
profileDistance(Profile prof1,
Profile prof2)
Calculate "the distance" between two profiles, according to the metric selected while instantiating this class. |
static double |
similarityJiang(Profile prof1,
Profile prof2)
Calculate "the distance" between two profiles, using Jiang & Conrath's similarity measure, as proposed in "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy". |
static double |
similarityLin(Profile prof1,
Profile prof2)
Calculate "the distance" between two profiles, using Lin's similarity measure as proposed in "An information-theoretic definition of similarity". |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private int similarityMetric
public static java.io.FilenameFilter NGramFilter
A FilenameFilter for filtering directory listings, recognizing
filenames for N-gram profiles. Essentially, all filenames not ending with a ".corpus"
extension are valid.
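A filter with the behavior described here could look like the following minimal sketch. The names `ProfileFilterSketch` and `NOT_CORPUS` are hypothetical, and the actual `NGramFilter` may differ, for instance in how it treats case or directories.

```java
import java.io.File;
import java.io.FilenameFilter;

/** Hypothetical helper for illustration; not the library's NGramFilter. */
public class ProfileFilterSketch {

    /** Accepts every file name that does not end with ".corpus". */
    public static final FilenameFilter NOT_CORPUS =
            (File dir, String name) -> !name.endsWith(".corpus");

    /** Lists the accepted file names in a directory, or an empty array if it cannot be read. */
    public static String[] listProfiles(File directory) {
        String[] names = directory.list(NOT_CORPUS);
        return names != null ? names : new String[0];
    }
}
```

`File.list(FilenameFilter)` returns `null` when the path is not a readable directory, which is why `listProfiles` substitutes an empty array in that case.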
protected java.util.List profiles
Constructor Detail |
---|
public NGramCathegorizer()
public NGramCathegorizer(int similarityMetric)
similarityMetric
- The similarity metric to be used in the categorizer.
1 for Lin's measure, 2 for Jiang & Conrath's and 3 for Cavnar & Trenkle's.
public NGramCathegorizer(java.lang.String dirName) throws TCatNGException, java.io.FileNotFoundException
dirName
- Pathname for the directory with the profiles.
TCatNGException
- A problem occurred while reading the profiles.
java.io.FileNotFoundException
- The pathname was not found.
public NGramCathegorizer(java.lang.String dirName, int similarityMetric) throws TCatNGException, java.io.FileNotFoundException
dirName
- Pathname for the directory with the profiles.
similarityMetric
- The similarity metric to be used in the categorizer.
1 for Lin's measure, 2 for Jiang & Conrath's and 3 for Cavnar & Trenkle's.
TCatNGException
- A problem occurred while reading the profiles.
java.io.FileNotFoundException
- The pathname was not found.
public NGramCathegorizer(java.lang.String[] fileNames) throws TCatNGException, java.io.FileNotFoundException
fileNames
- An array with the pathnames for the model profiles.
TCatNGException
- A problem occurred while reading the profiles.
java.io.FileNotFoundException
- One of the pathnames was not found.
public NGramCathegorizer(java.lang.String[] fileNames, int similarityMetric) throws TCatNGException, java.io.FileNotFoundException
fileNames
- An array with the pathnames for the model profiles.
similarityMetric
- The similarity metric to be used in the categorizer.
1 for Lin's measure, 2 for Jiang & Conrath's and 3 for Cavnar & Trenkle's.
TCatNGException
- A problem occurred while reading the profiles.
java.io.FileNotFoundException
- One of the pathnames was not found.
Method Detail |
---|
public static double profileDistance(Profile prof1, Profile prof2)
prof1
- A Profile
prof2
- Another Profile
public static double deltaRank(Profile prof1, Profile prof2)
prof1
- A Profile
prof2
- Another Profile
public static double similarityLin(Profile prof1, Profile prof2)
prof1
- A Profile
prof2
- Another Profile
public static double similarityJiang(Profile prof1, Profile prof2)
prof1
- A Profile
prof2
- Another Profile
public static void main(java.lang.String[] args)
args
- The command line arguments, tokenized.
public void addProfile(Profile prof)
prof
- The new Profile to be added.
private final void init(java.io.File fi, java.lang.String[] names) throws TCatNGException, java.io.FileNotFoundException
fi
- Base directory for the profiles.
names
- Filenames of the profiles to fetch.
TCatNGException
- A problem occurred while reading the profiles.
java.io.FileNotFoundException
- One of the pathnames was not found.
public Profile match(Profile prof)
Match a given Profile against all the profiles
constituting the models in the categorizer.
prof
- A Profile.
Returns a Profile in the categorizer.