Introduction
============

semibound is a package of scripts for evaluating various semi-supervised
generalization error bounds.  Some details on the bounds can be found in

  Matti Kaariainen and John Langford:  A comparison of tight generalization
  error bounds, ICML 2005 (to appear)

and
  
  Matti Kaariainen:  Generalization error bounds using unlabeled data,
  COLT 2005 (to appear)

The scripting is mostly by Matti Kaariainen, and the programs for evaluating
inverse binomial tails (bound.tar.gz) and the progressive validation bound
(pv_bound.tar.gz) are by John Langford.  Any damage resulting from the 
use of the scripts is solely on the responsibility of the user.  

semibound is distributed under the terms of the GNU General Public Licence.  
For details, see COPYING.

To contact the authors, email: matti.kaariainen@cs.helsinki.fi,jl@hunch.net

Installation
============

First a word of caution:  the portability of semibound has not been tested!
We have run the scripts on a few Linux machines, but we have no experience
on other architechtures.  It is assumed that you have some standard stuff
already installed, most notably including Bash, some shell utilities, and
Perl.

To install the package, do the following:

1. Uncompress and extract the file semibound.tar.gz into a directory of
  your choice.

2. Set the environment variable SEMIBOUND_DIR to point to the directory where
  you installed semibound, e.g., by typing 
  
    export SEMIBOUND_DIR=`pwd`

  in the directory into which you untarred semibound.  Also, make sure
  . is in your path.

3. Download learning algorithms from the web.  The currently supported
  algorithms are (modifications/subsets of) the following:

    C4.5 Release 8:  http://www.rulequest.com/Personal/
    libsvm 2.8:  http://www.csie.ntu.edu.tw/~cjlin/libsvm/
    svmlight:  http://svmlight.joachims.org/
    weka:  http://www.cs.waikato.ac.nz/ml/weka/

  Note that each of the above is distributed under its own licence and some 
  are not free!  This is the reason they are not redistributed as part of
  this package but have to be downloaded separately.

  The following packages by John Langford are included in the 
  "packages" directory of semibound.tar.gz:

  bound.tar.gz -- for upper and lower tails of binomials
  pv_bound.tar.gz -- for the progressive validation bound

4. Download the learning algorithms you plan to use from the web.  The
  packages containing the algorithms are to be saved into the "packages"
  sub-directory.  They can be found at

  c4.5r8.tar.gz -- http://www.rulequest.com/Personal/c4.5r8.tar.gz
  libsvm 2.8 -- http://www.csie.ntu.edu.tw/~cjlin/cgi-bin/libsvm.cgi?+http://www.csie.ntu.edu.tw/~cjlin/libsvm+tar.gz
  svmlight -- http://download.joachims.org/svm_light/current/svm_light.tar.gz
  weka -- http://prdownloads.sourceforge.net/weka/weka-3-4-4.zip
  
  Only the packages you downloaded will be installed.

5. Run the install script in the packages subdirectory to install the
  learning algorithms.  That is, run

  cd packages
  installer

6. Install the Math:CDF and Math::Random packages to Perl if you don't have 
  them yet.  The packages can be found at 

  http://search.cpan.org/~callahan/Math-CDF-0.1/ 
  http://search.cpan.org/~grommel/Math-Random-0.67/

  and should be installed in the standard way.  For example, to instal
  the first one locally to your home directory, do the following in 
  your home directory:

  tar -xvzf Math-CDF-0.1.tar.gz
  cd Math-CDF-0.1
  perl Makefile.PL PREFIX=<your home directory>
  make 
  make test
  make install

  Also, set the PERL5LIB environment variable so that perl can find
  the modules.

After doing all of the above, the bounds should be ready to use.


Running
=======

The bounds are run by the "run" scripts located in the subdirectories
containing the bounds.  For example, the script for the cross-validation
bound is cv/run.  The commands give some information on their parameters when
invoked without arguments.

The output is stored in the file "results" in the working directory.  It
contains the following fields separated by spaces -- the meaning of other 
fields that may follow these can be decrypted by looking at the scripts.  

 1. bound for final classifier, failure probability delta
 2. bound for randomized classifier, failure probability delta/2
 3. bound for distance between final and randomized classifiers, failure 
   probability delta/2
 4. error rate of the randomized classifier on the unlabeled data (meaningful
   only if the unlabeled data is labeled) -- NONSENSE if unlabeled examples
   really are unlabeled.

NOTE: test and test_cheat do not conform to the above, they output only
a test set bound and a test set error rate for the classifier learned on
(a fraction of) the labeled data.

The scripts (server, monitor, ...) in the helpers subdirectory are there 
to help distribute experiments to a cluster of machines.


Data sets
=========

The format of data sets differs from algorithm to algorithm.  This 
variance is made invisible to the scripts computing the bounds by the 
algorithm specific wrappers introduced in the next section.

Each data set is to be stored in its own directory.  The labeled examples
are stored in a file named "labeled" and the unlabeled examples in a file
called "unlabeled".  These files should contain the examples one per a 
line, following the algorithm specific format.  The unlabeled examples can 
be labeled with dummy labels.

In addition to "labeled" and "unlabeled", the data set directories are
assumed to contain some extra files depending on the algorithm.  These
are as follows:

  C4.5:  A file "names" corresponding to the .names file needed by c4.5

  libsvm and svmlight:  A file "params" that contains parameters for the
    training algorithm (e.g., kernel type, ...).  The exact contents of 
    these "params" files depend on the algorithm.

  weka:  A file "header" containing the header part of the .arff format
    data files.  This header is not to be included in the files 
    "labeled" and "unlabeled".

There should be no other files in the data set directories.  For examples,
see the directory data and its subdirectories.


Adding new algorithms
=====================

The scripts are designed so that it should be relatively easy to use them
in computing error bounds for other algorithms, too.  An algorithm is 
added roughly as follows:

1. Install the binaries of the algorithm, say foo, to algs/foo.

2. Write three scripts:  
 
  algs/foo/train ARG1:  A wrapper that invokes the learning algorithm on the 
    data set whose name is ARG1.  This script should also take care of, e.g.,
    any pre-processing and passing data set specific parameters to the
    learning algorithm.

  algs/foo/predict ARG1 ARG2 ARG3:  Use the classifier in file ARG2 to 
    predict the labels of examples in file ARG1 and output the labels to 
    file ARG3.

  algs/foo/labels:  Extract and print to standard output the label sequence 
    (one per line) from the file(s) given as arguments (or standard input).  

  These wrappers are needed to hide the invocation and data format details
  of different learning algorithms.  For examples, see the scripts provided
  with the package.  Note that train typically communicates with predict by
  writing the learned classifier to a file.

3. Try the algorithm out and hope it works...
