pt.tumba.ngram.blr
Class BayesianLogReg

java.lang.Object
  extended by pt.tumba.ngram.blr.BayesianLogReg

public class BayesianLogReg
extends java.lang.Object

Simple, easy-to-use, and efficient software for Bayesian Logistic Regression classification, based on the "Bayesian Logistic Regression Software" package by Alexander Genkin, David D. Lewis, and David Madigan. A "one-against-one" approach is used for multiclass classification.

A general binary regression classifier takes the form:

P(y=1|x,beta) = exp(beta*x) / ( 1 + exp(beta*x) ), where y is the class label (1 or -1), x is the predictor vector, beta is the vector of parameters, and beta*x denotes their inner product.
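
The formula above can be sketched in Java as follows. This is a minimal illustration only: the class LogisticSketch and its methods are hypothetical names, not part of BayesianLogReg, and the numerically stable branching is a common idiom assumed here rather than something the source describes.

```java
/** Hypothetical helper for illustration; not part of BayesianLogReg. */
final class LogisticSketch {

    /** Inner product beta*x of the parameter vector and the predictor vector. */
    static double dot(double[] beta, double[] x) {
        if (beta.length != x.length) {
            throw new IllegalArgumentException("beta and x must have the same length");
        }
        double s = 0.0;
        for (int j = 0; j < beta.length; j++) {
            s += beta[j] * x[j];
        }
        return s;
    }

    /** P(y=1|x,beta) = exp(beta*x) / (1 + exp(beta*x)), arranged to avoid overflow. */
    static double probability(double[] beta, double[] x) {
        double z = dot(beta, x);
        if (z >= 0) {
            // Algebraically the same value, but exp(-z) cannot overflow for z >= 0.
            return 1.0 / (1.0 + Math.exp(-z));
        }
        double e = Math.exp(z);
        return e / (1.0 + e);
    }

    public static void main(String[] args) {
        double[] beta = {0.5, -1.0};
        double[] x = {2.0, 1.0};
        // beta*x = 0.5*2.0 - 1.0*1.0 = 0, so the estimated probability is 0.5
        System.out.println(probability(beta, x));
    }
}
```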

This software finds the maximum a posteriori (MAP) parameter estimates under one of two priors: Gaussian or Laplace (the Laplace prior corresponds to Tibshirani's LASSO). To find the estimates, the software implements a coordinate descent algorithm that draws on the ideas of Zhang and Oles (2001). The user can set the hyperparameter (the Laplace prior parameter or the Gaussian prior variance) in two ways: specify its value explicitly, or omit it and let the program choose a default. By default, the program sets the prior variance to the inverse of the average squared value of all data elements in the training set.
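
To make the role of the two priors concrete, the sketch below evaluates the quantity that a MAP estimate minimizes: the negative log-likelihood plus a penalty contributed by the prior. It is an assumption-laden illustration, not the class's implementation. The class MapObjectiveSketch, its method names, and the parameterization of the Laplace prior by a rate `lambda` are all hypothetical, and additive constants that do not depend on beta are dropped.

```java
/** Hypothetical illustration; not taken from BayesianLogReg. */
final class MapObjectiveSketch {

    /** Negative log-likelihood: sum over i of log(1 + exp(-y_i * beta*x_i)), y_i in {+1, -1}. */
    static double negLogLikelihood(double[] beta, double[][] data, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            double z = 0.0;
            for (int j = 0; j < beta.length; j++) {
                z += beta[j] * data[i][j];
            }
            double m = -y[i] * z;
            // log(1 + exp(m)), arranged so that exp never overflows
            sum += (m > 0) ? m + Math.log1p(Math.exp(-m)) : Math.log1p(Math.exp(m));
        }
        return sum;
    }

    /** Gaussian prior with the given variance: adds beta_j^2 / (2 * variance) per parameter. */
    static double gaussianObjective(double[] beta, double[][] data, double[] y, double variance) {
        double penalty = 0.0;
        for (double b : beta) {
            penalty += b * b / (2.0 * variance);
        }
        return negLogLikelihood(beta, data, y) + penalty;
    }

    /** Laplace prior with assumed rate lambda: adds lambda * |beta_j| per parameter (the LASSO penalty). */
    static double laplaceObjective(double[] beta, double[][] data, double[] y, double lambda) {
        double penalty = 0.0;
        for (double b : beta) {
            penalty += lambda * Math.abs(b);
        }
        return negLogLikelihood(beta, data, y) + penalty;
    }

    public static void main(String[] args) {
        double[][] data = {{1.0, 0.0}, {0.0, 1.0}};
        double[] y = {1.0, -1.0};
        double[] beta = {1.0, -2.0};
        System.out.println(gaussianObjective(beta, data, y, 2.0));
        System.out.println(laplaceObjective(beta, data, y, 0.5));
    }
}
```

The absolute-value penalty of the Laplace prior is what tends to drive some parameters exactly to zero, whereas the quadratic Gaussian penalty only shrinks them.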

Logistic regression estimates the probability that a data vector belongs to the class with label 1. Classification requires a threshold: the model assigns a case to class 1 if and only if the probability estimate is greater than or equal to the threshold value. The program offers the following choices for threshold tuning criteria:

  • no tuning, threshold is equal to 0.5
  • sum of errors = b+c
  • balanced error rate = (b/(a+b) + c/(c+d))/2
  • T11U = 2*a - c
  • F1 = (2*a)/(2*a + b + c)
  • T13U = 20*a - c

The last three measures are popular in text classification. A detailed technical report describing the theoretical background, the algorithm, and experimental results can be found at http://stat.rutgers.edu/~madigan/PAPERS/shortFat-v13.ps.
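
The threshold criteria listed above can be sketched as plain functions of the four cells of a two-by-two contingency table. The source does not define a, b, c, and d; the sketch assumes a = true positives, b = false negatives, c = false positives, and d = true negatives, a reading that makes the balanced error rate the mean of the false negative and false positive rates. The class ThresholdCriteriaSketch and its methods are hypothetical names, not part of BayesianLogReg.

```java
/** Hypothetical illustration; cell meanings (a=TP, b=FN, c=FP, d=TN) are an assumption. */
final class ThresholdCriteriaSketch {

    /** Counts {a, b, c, d} when a case is assigned to class 1 iff its probability >= threshold. */
    static int[] contingency(double[] probabilities, int[] labels, double threshold) {
        int a = 0, b = 0, c = 0, d = 0;
        for (int i = 0; i < probabilities.length; i++) {
            boolean predictedPositive = probabilities[i] >= threshold;
            boolean actualPositive = labels[i] == 1;
            if (predictedPositive && actualPositive) a++;
            else if (!predictedPositive && actualPositive) b++;
            else if (predictedPositive) c++;
            else d++;
        }
        return new int[] {a, b, c, d};
    }

    static double sumOfErrors(int b, int c) {
        return b + c;
    }

    /** Returns NaN if either class is absent (a+b == 0 or c+d == 0). */
    static double balancedErrorRate(int a, int b, int c, int d) {
        return ((double) b / (a + b) + (double) c / (c + d)) / 2.0;
    }

    static double t11u(int a, int c) {
        return 2.0 * a - c;
    }

    static double f1(int a, int b, int c) {
        return (2.0 * a) / (2.0 * a + b + c);
    }

    static double t13u(int a, int c) {
        return 20.0 * a - c;
    }

    public static void main(String[] args) {
        double[] probabilities = {0.9, 0.5, 0.4, 0.2};
        int[] labels = {1, -1, 1, -1};
        int[] t = contingency(probabilities, labels, 0.5); // the "no tuning" threshold
        System.out.println("F1 = " + f1(t[0], t[1], t[2]));
        System.out.println("BER = " + balancedErrorRate(t[0], t[1], t[2], t[3]));
    }
}
```

Tuning would then amount to evaluating the chosen criterion over candidate thresholds and keeping the best one; how the program searches the candidates is not described in the source.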

    Author:
    Bruno Martins

    Field Summary
    private  double[] beta
               
    private  double[] classes
               
    private  double[][] data
              The training points
    private  double[] delta
               
    protected  java.lang.String[] names
               
    static java.io.FilenameFilter NGramFilter
              A FilenameFilter for filtering directory listings, recognizing filenames for class profiles.
    private  double[] r
               
    protected  java.util.List sortedGrams
               
    private  double[] theta
               
     
    Constructor Summary
    BayesianLogReg()
              Construct an uninitialized Categorizer.
    BayesianLogReg(java.lang.String dirName)
              Construct a Categorizer from a whole directory of resources.
    BayesianLogReg(java.lang.String[] fileNames)
              Construct a Categorizer from an array of resource file names.
     
    Method Summary
    private  double convergenceTest(double[] deltar)
               
    private static java.util.List exchangePos(java.util.List v, int p1, int p2)
              Exchange two values in a list
    private  double gaussianOptimization(int j)
               
    private  void init(java.io.File fi, java.lang.String[] names)
              Fetch the set of profiles from the disk.
    private  void initialize()
              Initialize the Bayesian Logistic Regression classifier.
    private  double laplaceOptimization(int j)
               
    static void main(java.lang.String[] args)
              Sample application to use the Categorizer from the command line.
     java.lang.String match(java.io.File f)
              Match a given File against all the classes in the categorizer.
    private  void optimization()
               
     
    Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Field Detail

    data

    private double[][] data
    The training points


    classes

    private double[] classes

    beta

    private double[] beta

    delta

    private double[] delta

    theta

    private double[] theta

    r

    private double[] r

    names

    protected java.lang.String[] names

    sortedGrams

    protected java.util.List sortedGrams

    NGramFilter

    public static java.io.FilenameFilter NGramFilter
    A FilenameFilter for filtering directory listings, recognizing filenames for class profiles. Essentially, all filenames not ending with a ".corpus" extension are valid.
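
    A filter with the behavior described above could look like the sketch below. It only mirrors the stated rule (names not ending in ".corpus" are accepted); the class ProfileFilterSketch and the constant NOT_CORPUS are hypothetical and are not the actual definition of NGramFilter.

```java
import java.io.File;
import java.io.FilenameFilter;

/** Hypothetical illustration of the described rule; not the actual NGramFilter. */
final class ProfileFilterSketch {

    /** Accepts every filename that does not end with the ".corpus" extension. */
    static final FilenameFilter NOT_CORPUS = (dir, name) -> !name.endsWith(".corpus");

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        // File.list returns null if dir is not a directory or cannot be read.
        String[] names = dir.list(NOT_CORPUS);
        if (names != null) {
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}
```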

    Constructor Detail

    BayesianLogReg

    public BayesianLogReg()
    Construct an uninitialized Categorizer.


    BayesianLogReg

    public BayesianLogReg(java.lang.String dirName)
                   throws TCatNGException,
                          java.io.FileNotFoundException
    Construct a Categorizer from a whole directory of resources.

    Parameters:
    dirName - Pathname for the directory with the profiles.
    Throws:
    TCatNGException - A problem occurred while reading the profiles.
    java.io.FileNotFoundException - The pathname was not found.

    BayesianLogReg

    public BayesianLogReg(java.lang.String[] fileNames)
                   throws TCatNGException,
                          java.io.FileNotFoundException
    Construct a Categorizer from an array of resource file names.

    Parameters:
    fileNames - An array with the pathnames for the profiles.
    Throws:
    TCatNGException - A problem occurred while reading the profiles.
    java.io.FileNotFoundException - One of the pathnames was not found.
    Method Detail

    init

    private final void init(java.io.File fi,
                            java.lang.String[] names)
                     throws TCatNGException,
                            java.io.FileNotFoundException
    Fetch the set of profiles from the disk.

    Parameters:
    fi - Base directory for the profiles.
    names - Filenames of the profiles to fetch.
    Throws:
    TCatNGException - A problem occurred while reading the profiles.
    java.io.FileNotFoundException - One of the pathnames was not found.

    exchangePos

    private static java.util.List exchangePos(java.util.List v,
                                              int p1,
                                              int p2)
    Exchange two values in a list

    Parameters:
    v - The original list
    p1 - The index of the first element
    p2 - The index of the second element
    Returns:
    The list with the two elements exchanged
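
    A minimal sketch of such an exchange is shown below. It assumes the swap happens in place and that the same list instance is returned; the source does not say whether the original method copies the list. The class ExchangeSketch is hypothetical, and generics are used here although the documented signature takes a raw java.util.List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; the in-place behavior is an assumption. */
final class ExchangeSketch {

    /** Swaps the elements at indexes p1 and p2 and returns the same list. */
    static <T> List<T> exchangePos(List<T> v, int p1, int p2) {
        T tmp = v.get(p1);
        v.set(p1, v.get(p2));
        v.set(p2, tmp);
        return v;
    }

    public static void main(String[] args) {
        List<String> grams = new ArrayList<>(Arrays.asList("a", "b", "c"));
        System.out.println(exchangePos(grams, 0, 2)); // [c, b, a]
    }
}
```

    The standard library call java.util.Collections.swap(list, i, j) performs the same in-place exchange, although it returns void.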

    initialize

    private void initialize()
    Initialize the Bayesian Logistic Regression classifier.


    convergenceTest

    private double convergenceTest(double[] deltar)
    Parameters:
    deltar -
    Returns:

    gaussianOptimization

    private double gaussianOptimization(int j)
    Parameters:
    j -
    Returns:

    laplaceOptimization

    private double laplaceOptimization(int j)
    Parameters:
    j -
    Returns:

    optimization

    private void optimization()

    match

    public java.lang.String match(java.io.File f)
    Match a given File against all the classes in the categorizer.

    Parameters:
    f - The file to classify.
    Returns:
    The closest matching class (given by the model File name) in the categorizer.

    main

    public static void main(java.lang.String[] args)
    Sample application to use the Categorizer from the command line.

    Parameters:
    args - The command line arguments, tokenized