java.lang.Object pt.tumba.ngram.LanguageClass
public class LanguageClass
Understanding the language of a given document is an essential step in working with unstructured multilingual text. Without this basic knowledge, applications such as information retrieval and text mining will not be able to accurately process the data, potentially leading to a loss of critical information.
This is a simple utility to find out which language a text is written in. By incorporating it, applications can take a fully automated approach to processing unknown text by quickly and accurately determining the language of incoming data.
Essentially, the classifier relies on the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization". The central idea of the technique is to calculate a language "fingerprint" of an unknown document and compare it with the fingerprints of a number of documents whose language is known. The language of the closest match is output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency.
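The fingerprint comparison described above can be sketched as follows. This is a minimal illustration, not the code used by `LanguageClass`: the class name `FingerprintSketch`, its method names, the tie-breaking rule and the language labels are all assumptions made for the example. The n-gram lengths (1 to 5) and profile size (300) follow values commonly quoted for the Cavnar & Trenkle technique, and the comparison uses their "out-of-place" rank distance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; NOT the pt.tumba.ngram implementation. */
final class FingerprintSketch {

    /** Returns up to {@code size} character n-grams (n = 1..maxN), most frequent first. */
    public static List<String> fingerprint(String text, int maxN, int size) {
        // Lower-case, pad with '_' and collapse every run of non-letters into one '_'.
        String norm = ("_" + text.toLowerCase(Locale.ROOT) + "_")
                .replaceAll("[^\\p{L}]+", "_");
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= norm.length(); i++) {
                counts.merge(norm.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Descending frequency; ties broken alphabetically (an assumption, for determinism).
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(size, grams.size())));
    }

    /** Out-of-place distance: sum of rank differences, with a fixed penalty for missing n-grams. */
    public static int distance(List<String> doc, List<String> lang) {
        int maxPenalty = lang.size();
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = lang.indexOf(doc.get(i));
            total += (j < 0) ? maxPenalty : Math.abs(i - j);
        }
        return total;
    }

    /** Returns the label of the known fingerprint closest to {@code doc}, or null if none. */
    public static String closest(List<String> doc, Map<String, List<String>> known) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : known.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training texts; real profiles would be built from much larger corpora.
        Map<String, List<String>> known = new HashMap<>();
        known.put("english", fingerprint("the quick brown fox jumps over the lazy dog", 5, 300));
        known.put("portuguese", fingerprint("o rato roeu a roupa do rei de roma", 5, 300));
        List<String> unknown = fingerprint("the dog and the fox", 5, 300);
        System.out.println(closest(unknown, known));
    }
}
```

The linear `indexOf` lookup keeps the sketch short; a map from n-gram to rank would likely be preferable for full-size profiles.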
For reliable language guessing (at least between sufficiently different languages), the classifier needs only a few kilobytes of text at most, so there is no need to feed it 100 KB of text. A problem arises, however, when trying to separate very similar languages (e.g. Portuguese from Portugal and Portuguese from Brazil). To deal with this case, LanguageClass actually consists of two separate classifiers: one for the general case, and another for discriminating between Portuguese from Portugal and Portuguese from Brazil. The latter also uses the technique proposed by Cavnar & Trenkle, but its N-gram profiles are specifically built to select the most discriminative N-grams.
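One plausible way to combine the two classifiers is to consult the specialised one only when the general one reports Portuguese. The sketch below illustrates that idea under stated assumptions: the documentation does not say how `LanguageClass` chains `cat1` and `cat2`, and the class name `TwoStageSketch` and the labels `"pt"`, `"pt-PT"` and `"pt-BR"` are hypothetical.

```java
import java.util.function.Function;

/** Illustrative two-stage dispatch; the real chaining inside LanguageClass is not documented. */
final class TwoStageSketch {
    private final Function<String, String> general;           // stands in for the general-case classifier
    private final Function<String, String> portugueseVariant; // stands in for the Portuguese classifier
    private final String portugueseLabel;                     // hypothetical label returned by stage one

    public TwoStageSketch(Function<String, String> general,
                          Function<String, String> portugueseVariant,
                          String portugueseLabel) {
        this.general = general;
        this.portugueseVariant = portugueseVariant;
        this.portugueseLabel = portugueseLabel;
    }

    /** Runs the general classifier first and refines the answer only for Portuguese. */
    public String classify(String text) {
        String label = general.apply(text);
        return portugueseLabel.equals(label) ? portugueseVariant.apply(text) : label;
    }

    public static void main(String[] args) {
        // Toy stand-ins for the two classifiers, for demonstration only.
        TwoStageSketch sketch = new TwoStageSketch(
                t -> t.startsWith("ola") ? "pt" : "en",
                t -> t.contains("voce") ? "pt-BR" : "pt-PT",
                "pt");
        System.out.println(sketch.classify("ola, voce"));
    }
}
```

Keeping the second stage behind the first means the discriminative Portuguese profiles are never asked to judge text in an unrelated language.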
Field Summary | |
---|---|
private NGramCathegorizer | cat1 - The classifier for the general case. |
private NGramCathegorizer | cat2 - The classifier for discriminating both versions of Portuguese (Portugal and Brazil). |
private java.lang.String | directory1 - Path for the datafiles for the general case classifier. |
private java.lang.String | directory2 - Path for the datafiles for the Portuguese classifier. |
Constructor Summary | |
---|---|
LanguageClass() | Constructor for LanguageClass. |
LanguageClass(java.lang.String dir) | Constructor for LanguageClass. |
LanguageClass(java.lang.String dir1, java.lang.String dir2) | Constructor for LanguageClass. |
Method Summary | |
---|---|
java.lang.String | classify(java.io.File file) - Guess the language of a text File. |
java.lang.String | classify(Profile prof) - Guess the language of a given text according to its NGram profile. |
java.lang.String | classify(java.lang.String text) - Guess the language of a String of text. |
static void | main(java.lang.String[] args) - The main method of the Language Classifier. |
private static void | writeProfiles() - Calculate the language profiles for the Portuguese classifier according to corpus data. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private NGramCathegorizer cat1
private NGramCathegorizer cat2
private java.lang.String directory1
private java.lang.String directory2
Constructor Detail |
---|
public LanguageClass() throws java.lang.Exception
Constructor for LanguageClass.
Throws: java.lang.Exception - A problem occurred while reading the datafiles.

public LanguageClass(java.lang.String dir) throws java.lang.Exception
Constructor for LanguageClass.
Parameters: dir - Base path for the datafiles.
Throws: java.lang.Exception - A problem occurred while reading the datafiles.

public LanguageClass(java.lang.String dir1, java.lang.String dir2) throws java.lang.Exception
Constructor for LanguageClass.
Parameters: dir1 - Path for the datafiles for the general case classifier.
Parameters: dir2 - Path for the datafiles for the Portuguese classifier.
Throws: java.lang.Exception - A problem occurred while reading the datafiles.

Method Detail |
---|
public java.lang.String classify(java.lang.String text) throws java.lang.Exception
Guess the language of a String of text.
Parameters: text - A String of text.
Throws: java.lang.Exception - A problem occurred with the classifier.

public java.lang.String classify(java.io.File file) throws java.lang.Exception
Guess the language of a text File.
Parameters: file - A text File.
Throws: java.lang.Exception - A problem occurred with the classifier.

public java.lang.String classify(Profile prof) throws java.lang.Exception
Guess the language of a given text according to its NGram profile.
Parameters: prof - An NGram profile.
Throws: java.lang.Exception - A problem occurred with the classifier.

private static void writeProfiles() throws java.lang.Exception
Calculate the language profiles for the Portuguese classifier according to corpus data.
Throws: java.lang.Exception - A problem occurred while writing the profiles.

public static void main(java.lang.String[] args) throws java.lang.Exception
The main method of the Language Classifier.
Parameters: args - The command line arguments, tokenized.
Throws: java.lang.Exception - A problem occurred with the classifier.
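
Because the class description notes that a few kilobytes of text are enough for reliable guessing, a caller may want to cap the input before handing it to `classify(java.lang.String)`. The helper below is a hypothetical sketch of such a cap: `SampleSketch` and `head` are not part of this package, and the 4096-character limit mentioned afterwards is only an example value.

```java
/** Hypothetical helper for trimming input before classification; not part of pt.tumba.ngram. */
final class SampleSketch {

    /** Returns at most {@code maxChars} leading chars of {@code text}, never splitting a surrogate pair. */
    public static String head(String text, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must not be negative");
        }
        if (text.length() <= maxChars) {
            return text;
        }
        int end = maxChars;
        // If the cut would fall between the two halves of a surrogate pair, back off by one char.
        if (end > 0 && Character.isHighSurrogate(text.charAt(end - 1))) {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String longText = "This sentence is repeated to simulate a large document. ".repeat(1000);
        System.out.println(head(longText, 4096).length());
    }
}
```

With the signatures documented above, a call might then look like `new LanguageClass().classify(SampleSketch.head(text, 4096))`, with the caller handling the declared `java.lang.Exception`.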