LanguageClass

pt.tumba.ngram
Class LanguageClass

java.lang.Object
  pt.tumba.ngram.LanguageClass

public class LanguageClass
extends java.lang.Object

Understanding the language of a given document is an essential step in working with unstructured multilingual text. Without this basic knowledge, applications such as information retrieval and text mining will not be able to accurately process the data, potentially leading to a loss of critical information.

This is a simple utility to find out which language a text is written in. By incorporating it, applications can take a fully automated approach to processing unknown text by quickly and accurately determining the language of incoming data.

Essentially, the classifier relies on the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". The central idea of the Cavnar & Trenkle technique is to calculate a language "fingerprint" of an unknown document and compare it with the fingerprints of a number of documents whose language is known. The language of the closest match is output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency.
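The fingerprint comparison described above can be sketched in a few lines of Java. This is only an illustrative sketch, not the code behind `LanguageClass`: the class `FingerprintSketch`, its method names, the n-gram lengths (1 to 3), the fingerprint size (300), the `_` padding, and the rank-difference ("out-of-place") distance with a maximum penalty for missing n-grams are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of pt.tumba.ngram.
class FingerprintSketch {

    /** Returns up to maxSize character n-grams (lengths 1..maxN), most frequent first. */
    public static List<String> fingerprint(String text, int maxN, int maxSize) {
        Map<String, Integer> counts = new HashMap<>();
        // Lower-case the text and mark word boundaries with '_' (an assumed convention).
        String padded = "_" + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", "_") + "_";
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Sort by descending frequency; ties are broken alphabetically for determinism.
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(maxSize, grams.size())));
    }

    /** Sums rank differences; n-grams absent from the reference get the maximum penalty. */
    public static int distance(List<String> unknown, List<String> reference) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < reference.size(); i++) {
            rank.put(reference.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < unknown.size(); i++) {
            Integer r = rank.get(unknown.get(i));
            total += (r == null) ? reference.size() : Math.abs(r - i);
        }
        return total;
    }

    /** Returns the label of the reference fingerprint closest to the given text. */
    public static String closest(String text, Map<String, List<String>> references) {
        List<String> fp = fingerprint(text, 3, 300);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : references.entrySet()) {
            int d = distance(fp, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy reference texts; real profiles would be built from much larger corpora.
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("english", fingerprint("the quick brown fox jumps over the lazy dog", 3, 300));
        refs.put("portuguese", fingerprint("a rapida raposa castanha salta sobre o cao preguicoso", 3, 300));
        System.out.println(closest("the dog jumps", refs));
    }
}
```

In practice the reference fingerprints would be computed once from known-language corpora and stored, which is presumably the role of the datafiles that the constructors read.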

For reliable language guessing (at least between sufficiently different languages), the classifier needs only a few kilobytes of text at most, so there is no need to feed it 100 KB. A problem arises, however, when trying to separate very similar languages (e.g. Portuguese from Portugal and from Brazil). To deal with this case, LanguageClass actually consists of two separate classifiers: one for the general case, and another for discriminating Portuguese from Portugal and from Brazil. The latter also uses the technique proposed by Cavnar & Trenkle, but its N-gram profiles are specifically built to select the most discriminative N-grams.
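One way to picture the two-classifier arrangement is shown below. This is a sketch under stated assumptions, not the actual `LanguageClass` logic: the class `TwoStageSketch`, the `"portuguese"` label, the 4096-character cap, and the idea that the second classifier runs only when the first one answers Portuguese are all hypothetical. The two classifiers are modelled as plain `Function<String, String>` stand-ins.

```java
import java.util.function.Function;

// Hypothetical illustration, not part of pt.tumba.ngram.
class TwoStageSketch {

    // Assumed cap: a few kilobytes of text is enough for a reliable guess.
    static final int MAX_CHARS = 4096;

    /**
     * Runs the general classifier on a capped sample and, only when it answers
     * "portuguese" (an assumed label), asks the variant classifier to refine the result.
     */
    public static String classify(String text,
                                  Function<String, String> general,
                                  Function<String, String> portugueseVariant) {
        String sample = text.length() > MAX_CHARS ? text.substring(0, MAX_CHARS) : text;
        String language = general.apply(sample);
        return "portuguese".equals(language) ? portugueseVariant.apply(sample) : language;
    }

    public static void main(String[] args) {
        // Stand-in classifiers that return fixed labels, for demonstration only.
        String result = classify("Olá, tudo bem?", t -> "portuguese", t -> "portuguese-pt");
        System.out.println(result);
    }
}
```

Capping the sample before classification keeps the cost bounded for large inputs while still giving the classifier the few kilobytes it needs.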

Author:: Bruno Martins

Field Summary
`private NGramCathegorizer`	`cat1` The classifier for the general case.
`private NGramCathegorizer`	`cat2` The classifier for discriminating both versions of Portuguese (Portugal and Brazil).
`private java.lang.String`	`directory1` Path for the datafiles for the general case classifier.
`private java.lang.String`	`directory2` Path for the datafiles for the Portuguese classifier.

Constructor Summary
`LanguageClass()` Constructor for `LanguageClass`.
`LanguageClass(java.lang.String dir)` Constructor for `LanguageClass`.
`LanguageClass(java.lang.String dir1, java.lang.String dir2)` Constructor for `LanguageClass`.

Method Summary
`java.lang.String`	`classify(java.io.File file)` Guess the language of a text File.
`java.lang.String`	`classify(Profile prof)` Guess the language of a given text according to its NGram profile.
`java.lang.String`	`classify(java.lang.String text)` Guess the language of a String of text.
`static void`	`main(java.lang.String[] args)` The main method of the Language Classifier.
`private static void`	`writeProfiles()` Calculate the language profiles for the Portuguese classifier according to corpus data.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

cat1

private NGramCathegorizer cat1

The classifier for the general case.

cat2

private NGramCathegorizer cat2

The classifier for discriminating both versions of Portuguese (Portugal and Brazil).

directory1

private java.lang.String directory1

Path for the datafiles for the general case classifier.

directory2

private java.lang.String directory2

Path for the datafiles for the Portuguese classifier.

Constructor Detail

LanguageClass

public LanguageClass()
              throws java.lang.Exception

Constructor for LanguageClass.

Throws:: java.lang.Exception - A problem occurred while reading the datafiles.

LanguageClass

public LanguageClass(java.lang.String dir)
              throws java.lang.Exception

Constructor for LanguageClass.

Parameters:: dir - Base path for the datafiles.
Throws:: java.lang.Exception - A problem occurred while reading the datafiles.

LanguageClass

public LanguageClass(java.lang.String dir1,
                     java.lang.String dir2)
              throws java.lang.Exception

Constructor for LanguageClass.

Parameters:: dir1 - Path for the datafiles for the general case classifier.; dir2 - Path for the datafiles for the Portuguese classifier.
Throws:: java.lang.Exception - A problem occurred while reading the datafiles.

Method Detail

classify

public java.lang.String classify(java.lang.String text)
                          throws java.lang.Exception

Guess the language of a String of text.

Parameters:: text - A String of text.
Returns:: The language for the given String of text.
Throws:: java.lang.Exception - A problem occurred with the classifier.

classify

public java.lang.String classify(java.io.File file)
                          throws java.lang.Exception

Guess the language of a text File.

Parameters:: file - A text File.
Returns:: The language for the given text File.
Throws:: java.lang.Exception - A problem occurred with the classifier.

classify

public java.lang.String classify(Profile prof)
                          throws java.lang.Exception

Guess the language of a given text according to its NGram profile.

Parameters:: prof - An NGram profile.
Returns:: The language of a given text according to its NGram profile.
Throws:: java.lang.Exception - A problem occurred with the classifier.

writeProfiles

private static void writeProfiles()
                           throws java.lang.Exception

Calculate the language profiles for the Portuguese classifier according to corpus data. TODO: Remove this method; it is for tests only.

Throws:: java.lang.Exception - A problem occurred while writing the profiles.

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception

The main method of the Language Classifier.

Parameters:: args - The command line arguments, tokenized.
Throws:: java.lang.Exception - A problem occurred with the classifier.
