pt.tumba.ngram
Class LanguageClass

java.lang.Object
  extended by pt.tumba.ngram.LanguageClass

public class LanguageClass
extends java.lang.Object

Understanding the language of a given document is an essential step in working with unstructured multilingual text. Without this basic knowledge, applications such as information retrieval and text mining will not be able to accurately process the data, potentially leading to a loss of critical information.

This is a simple utility to find out which language a text is written in. By incorporating it, applications can take a fully automated approach to processing unknown text by quickly and accurately determining the language of incoming data.

Essentially, the classifier relies on the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". The central idea of the Cavnar & Trenkle technique is to calculate a language "fingerprint" of an unknown document and compare it with the fingerprints of a number of documents whose language is known. The language of the closest match is output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency.
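The fingerprint idea described above can be sketched in a few lines of Java. This is a minimal illustration, not the code of LanguageClass or NGramCathegorizer: the class name FingerprintSketch, its method names, the underscore used as a word separator, and the parameters maxN and topK are all assumptions made for the example. The comparison uses the "out-of-place" rank distance from the Cavnar & Trenkle paper; whether this library uses exactly that measure is not stated here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of n-gram fingerprinting; not part of pt.tumba.ngram.
class FingerprintSketch {

    /** Returns the topK most frequent character n-grams of length 1..maxN, most frequent first. */
    static List<String> fingerprint(String text, int maxN, int topK) {
        // Lower-case the text and collapse every run of non-letters into one separator.
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_");
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= normalized.length(); i++) {
                counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Descending frequency; ties are broken alphabetically so the order is deterministic.
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(topK, grams.size())));
    }

    /** Out-of-place distance: sum of rank differences, with a fixed penalty for missing n-grams. */
    static int outOfPlace(List<String> document, List<String> language) {
        int maxPenalty = language.size();
        int distance = 0;
        for (int i = 0; i < document.size(); i++) {
            int j = language.indexOf(document.get(i));
            distance += (j < 0) ? maxPenalty : Math.abs(i - j);
        }
        return distance;
    }

    /** Returns the label of the known fingerprint closest to the fingerprint of the text. */
    static String closest(String text, Map<String, List<String>> known, int maxN, int topK) {
        List<String> document = fingerprint(text, maxN, topK);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> entry : known.entrySet()) {
            int d = outOfPlace(document, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In this sketch the known fingerprints would be built once from reference texts with fingerprint(...) and stored in a map keyed by language name; closest(...) then plays the role of the classification step.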

For reliable language guessing, at least between sufficiently different languages, the classifier needs at most a few kilobytes of text, so there is no benefit in feeding it 100 KB. However, a problem occurs when trying to separate very similar languages (e.g. Portuguese from Portugal and from Brazil). To deal with this case, LanguageClass actually consists of two separate classifiers: one for the general case, and another for discriminating Portuguese from Portugal and from Brazil. The latter also uses the technique proposed by Cavnar & Trenkle, but its N-gram profiles are specifically built to select the most discriminative N-grams.
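One way to picture the two-classifier arrangement and the input-size advice is the sketch below. It is an assumption about how the pieces could fit together, not the actual logic of LanguageClass: the class TwoStageSketch, the 4096-character limit, the label "portuguese", and the use of Function objects in place of the two NGramCathegorizer instances are all hypothetical.

```java
import java.util.function.Function;

// Hypothetical sketch of a general classifier followed by a Portuguese-variant classifier.
class TwoStageSketch {

    // Assumed sample size; "a few kilobytes" is the only guidance available.
    static final int MAX_CHARS = 4096;

    /** Keeps only the first maxChars characters of the text. */
    static String truncate(String text, int maxChars) {
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    /**
     * Runs the general classifier first and consults the variant classifier
     * only when the general result is the (assumed) label "portuguese".
     */
    static String classify(String text,
                           Function<String, String> general,
                           Function<String, String> portugueseVariant) {
        String sample = truncate(text, MAX_CHARS);
        String language = general.apply(sample);
        return "portuguese".equals(language) ? portugueseVariant.apply(sample) : language;
    }
}
```

Passing the classifiers in as functions keeps the sketch independent of the real categorizer API; in practice each function would wrap one trained classifier.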

Author:
Bruno Martins

Field Summary
private  NGramCathegorizer cat1
          The classifier for the general case.
private  NGramCathegorizer cat2
          The classifier for discriminating both versions of Portuguese (Portugal and Brazil).
private  java.lang.String directory1
          Path for the datafiles for the general case classifier.
private  java.lang.String directory2
          Path for the datafiles for the Portuguese classifier.
 
Constructor Summary
LanguageClass()
          Constructor for LanguageClass.
LanguageClass(java.lang.String dir)
          Constructor for LanguageClass.
LanguageClass(java.lang.String dir1, java.lang.String dir2)
          Constructor for LanguageClass.
 
Method Summary
 java.lang.String classify(java.io.File file)
          Guess the language of a text File.
 java.lang.String classify(Profile prof)
          Guess the language of a given text according to its NGram profile.
 java.lang.String classify(java.lang.String text)
          Guess the language of a String of text.
static void main(java.lang.String[] args)
          The main method of the Language Classifier.
private static void writeProfiles()
          Calculate the language profiles for the Portuguese classifier according to corpus data.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

cat1

private NGramCathegorizer cat1
The classifier for the general case.


cat2

private NGramCathegorizer cat2
The classifier for discriminating both versions of Portuguese (Portugal and Brazil).


directory1

private java.lang.String directory1
Path for the datafiles for the general case classifier.


directory2

private java.lang.String directory2
Path for the datafiles for the Portuguese classifier.

Constructor Detail

LanguageClass

public LanguageClass()
              throws java.lang.Exception
Constructor for LanguageClass.

Throws:
java.lang.Exception - A problem occurred while reading the datafiles.

LanguageClass

public LanguageClass(java.lang.String dir)
              throws java.lang.Exception
Constructor for LanguageClass.

Parameters:
dir - Base path for the datafiles.
Throws:
java.lang.Exception - A problem occurred while reading the datafiles.

LanguageClass

public LanguageClass(java.lang.String dir1,
                     java.lang.String dir2)
              throws java.lang.Exception
Constructor for LanguageClass.

Parameters:
dir1 - Path for the datafiles for the general case classifier.
dir2 - Path for the datafiles for the Portuguese classifier.
Throws:
java.lang.Exception - A problem occurred while reading the datafiles.
Method Detail

classify

public java.lang.String classify(java.lang.String text)
                          throws java.lang.Exception
Guess the language of a String of text.

Parameters:
text - A String of text.
Returns:
The language for the given String of text.
Throws:
java.lang.Exception - A problem occurred with the classifier.

classify

public java.lang.String classify(java.io.File file)
                          throws java.lang.Exception
Guess the language of a text File.

Parameters:
file - A text File.
Returns:
The language for the given text File.
Throws:
java.lang.Exception - A problem occurred with the classifier.

classify

public java.lang.String classify(Profile prof)
                          throws java.lang.Exception
Guess the language of a given text according to its NGram profile.

Parameters:
prof - An NGram profile.
Returns:
The language of a given text according to its NGram profile.
Throws:
java.lang.Exception - A problem occurred with the classifier.

writeProfiles

private static void writeProfiles()
                           throws java.lang.Exception
Calculate the language profiles for the Portuguese classifier according to corpus data. TODO: Remove this method, tests only.

Throws:
java.lang.Exception - A problem occurred while writing the profiles.

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
The main method of the Language Classifier.

Parameters:
args - The command line arguments, tokenized.
Throws:
java.lang.Exception - A problem occurred with the classifier.