Friday, October 25, 2013

CL-Libsvm Usage

CL-libsvm usage

At work we decided that we needed to use a bit of machine learning to automate a problem. I recommended using LIBSVM as it very established (read old with many bindings available). I decided to see what was involved in using it this morning, but found that it took an embarrassingly long time to get it working like I expected it to out of the box. This is my short-coming, I had forgotten too much of the machine learning that I used to know. But this post will hopefully help others get it working more quickly.

First, SVMs are pretty versatile, and there are a lot of choices that you can make when you start using LIBSVM. It is often not clear what you should use for what task, and I don't have the expertise to tell you what is best. For that information, you should check out this guide by the library's author. What I will show here is a quick and dirty approach.

I will be using Gabor Melis' bindings for LIBSVM. They work well enough, but they will issue many warnings when you load them up and use them as they use a now deprecated syntax of CFFI. Just be aware.

We will be using a radial basis function (RBF) kernel to model the "a1a" training data and test data that is available on the LIBSVM website. Already this is a bit odd, as the LIBSVM people have split their data such that ~5% is in the training data set while ~95% is in the test data set. This is probably due to the somewhat expensive nature of SVMs? But, whatever the case, we will stick with it.

Here is a simple procedure to read in the data:

(ql:quickload '(:iterate :cl-ppcre :alexandria))

(use-package :iterate)

(defun parse-data-file (file)
  "Read in a data-file, return list of target values with input as a sparse
vector suitable for cl-libsvm."
  (iter (for line :in-file file :using 'read-line)
    (collecting
     (list (read-from-string line)
           (let ((result nil))
             (ppcre:do-matches-as-strings (match "\\w+:\\w+" line)
               (push
                (apply 'cons (mapcar 'read-from-string
                                     (let ((separator (position #\: match)))
                                       (list 
                                        (subseq match 0 separator)
                                        (subseq match (+ 1 separator))))))
                result))
             (coerce (reverse result) 'vector))))))

(defun randomize-order (data)
  "Return a random permutation of the data."
  (alexandria:shuffle (copy-seq data)))

The parse-data-file returns a vector where each element is a line in the data file mapped into a list of the form (target-val sparse-vector-input). A sparse vector is a vector where each element is a cons cell containing an index as its car and a value as its cdr.

We need to further divide the testing data so that we have a data set that we can use to validate the parameters that the RBF SVM uses (the "validation" set) while we use the rest of the test set to determine the accuracy of the model (the "test" set). We could just cut the training data in half, but that would make our validation calculations very slow, and as we will see is a moment, this becomes the bottleneck of the calculation. I have chosen my validation set to be 2000 point. Whatever size you choose will be a trade off between accuracy and speed. Having 2000 points in your validation set is about my limit when it comes to playing around at the REPL (I would choose something higher if it showed benefits for the models accuracy). If we are dividing the data, we must first first randomize it to ensure that no structure of how the data was collected shows up in the analysis. This can be done with the randomize-order function above and a couple calls to subseq.

(let ((test-data (randomize-order (parse-data-file #p"~/Desktop/a1a.t"))))
  (defparameter *validation-set* (subseq test-data 0 2000))
  (defparameter *test-set* (subseq test-data 2000)))

Now, to get into the nitty-gritty of CL-Libsvm. First, you'll need to install it, and it is not available in Quicklisp yet. So, clone it from Gabor's Git repo. Naturally you will also need to have LIBSVM installed (you don't actually need libsvm-dev or libsvm-tools, but why not have them anyway):

cd ~/quicklisp/local-projects
git clone http://quotenil.com/git/cl-libsvm.git

# Install LIBSVM if needed
sudo aptitude install libsvm3 libsvm-dev libsvm-tools

CL-Libsvm is basically structured like LIBSVM, so the same documentation applies here. We must first create a problem which contains inputs and the expected outputs, then create a parameter structure that contains the parameters that define how the SVM operates (numerical constants as well as the type of SVM and kernel), and finally combine these two into a model which can be used to make predictions.

This can be done easily enough for any particular RBF parameters \(C\) and \(\gamma\).

(ql:quickload :cl-libsvm)

(let ((data (parse-data-file #p"~/Desktop/a1a")))
  (defparameter *problem* (libsvm:make-problem (map 'vector 'first data)
                                               (map 'vector 'second data))))

(defparameter *parameter* (libsvm:make-parameter :c c :gamma gamma))

(defparameter *model* (libsvm:train *problem* *parameter*))

But we want to do this for many different values of \(C\) and \(\gamma\) in order to find what parameters give us the best performance. We could do several things to find the optimum values. We will be following the LIBSVM procedure and just search the parameter space on a grid. We should note from the manual that it is probably in our best interests to search in \(\log C\) and \(\log \gamma\).

(let ((data (parse-data-file #p"~/Desktop/a1a")))
  (defparameter *problem* (libsvm:make-problem (map 'vector 'first data)
                                               (map 'vector 'second data))))

(defparameter *optimization-grid*
  (iter :outer
    (for log-c :from 1.5 :to 3.5 :by .1)
    (iter (for log-gamma :from -5 :to -3.5 :by .1)
      (in :outer (collecting
                  (list* log-c
                         log-gamma
                         (let* ((params (libsvm:make-parameter
                                         :c (exp log-c)
                                         :gamma (exp log-gamma)))
                                (model (libsvm:train *problem* params)))
                           (list (quality model *validation-set* log-c log-gamma)
                                 model))))))))

Note that there is a missing function here. We never defined quality. This function is meant to take a model and some testing data and determine a measure of how good the model is performing. For this I chose to use the Matthews Correlation Coefficient with the threshold for the prediction set to \(0.5\).

(defun logistic (x)
  (/ (+ 1 (exp (- x)))))

(defun quality (model test-data log-c log-gamma)
  "Use the Matthews Correlation Coefficient to measure how well the model does"
  (iter (for (target input) :in test-data)
    (let ((p (if (< 0.5 (logistic (libsvm:predict model input)))
                 1 -1)))
      (cond ((and (= p 1) (/= target p))
             (summing 1 :into false-positives))
            ((and (= p -1) (/= target p))
             (summing 1 :into false-negatives))
            ((and (= p 1) (= target p))
             (summing 1 :into true-positives))
            ((and (= p -1) (= target p))
             (summing 1 :into true-negatives))))
    (finally
     (let ((quality
             ;; Compute quality of model
             (if (= 0 (- (* true-positives true-negatives)
                          (* false-positives false-negatives)))
                  0d0
                  (/ (- (* true-positives true-negatives)
                        (* false-positives false-negatives))
                     (sqrt
                      (float
                       (* (+ true-positives false-positives)
                          (+ true-positives false-negatives)
                          (+ true-negatives false-positives)
                          (+ true-negatives false-negatives))
                       0d0))))))
       ;; Print some output so we know what it's doing
       (format
        t "log-c = ~A, log-gamma = ~A~@
           TP = ~A, TN = ~A, FP = ~A, FN = ~A~@
           Quality = ~A~%"
        log-c log-gamma
        true-positives true-negatives false-positives false-negatives
        quality)
       (return quality)))))

When we put this all together and playing around with the ranges of the plots, we get a plot that looks like this:

From this we can tell that there is likely some kind of optimum around \(\log C = 2.10\) and \(\log \gamma = -4.27\).

You might wonder how I was able to determine, looking at those blocky heat maps, where that tiny maximum was. While you can do what LIBSVM docs suggest and move to finer and finer grids to find optimal points, I find this pretty annoying, and finicky. I opted to use some, as yet unreleased, bindings for the NLOpt optimization library. With NLOpt we can do a global optimization followed by a local optimization, which requires next to zero human intervention and finds a pretty good optimum that I doubt I would be able to otherwise. (Small caveat here, it is entirely possible that the rough nature of my MCC heat map is merely an artifact of the small validation set size. I don't have the patience to test this for a silly example)

;; Global optimization (better grab a snack)
(nlopt:nlopt-apply
 (lambda (log-c log-gamma)
   (let* ((params (libsvm:make-parameter
                   :c (exp log-c)
                   :gamma (exp log-gamma)))
          (model (libsvm:train *problem* params)))
     (- (quality model *validation-set*))))
 '(1 1)
 :nlopt-gn-crs2-lm
 :lower-bounds '(-5 -5)
 :upper-bounds '(7 7)
 :abs-xtol 1d-1)

;; Fit parameters = (2.071364331304816d0 -4.265683211751565d0)
;; Validation MCC = -0.5509856550306286d0

;; Local optimization (this should be quick)
(nlopt:nlopt-apply
 (lambda (log-c log-gamma)
   (let* ((params (libsvm:make-parameter
                   :c (exp log-c)
                   :gamma (exp log-gamma)))
          (model (libsvm:train *problem* params)))
     (- (quality model *validation-set*))))
 '(2.071364331304816d0 -4.265683211751565d0)
 :nlopt-ln-bobyqa
 :lower-bounds '(-5 -5)
 :upper-bounds '(7 7)
 :abs-xtol 1d-5)

;; Fit parameters = (2.096969188326027d0 -4.268553108908674d0)
;; Validation MCC = -0.5522135970232868d0

(let* ((params (libsvm:make-parameter
                :c (exp 2.096969188326027d0)
                :gamma (exp -4.268553108908674d0)))
       (model (libsvm:train *problem* params)))
  (quality model *test-set*))

;; Test set MCC = 0.5497032935038368d0

This gives the optimum at \(C = 2.10\) and \(\gamma = -4.27\) and a Matthew's Correlation Coefficient of \(0.55\) (measured against the test set, naturally).

5 comments:

  1. i came here via HN. great post. FYI: your site uses 60% of one core of my core 2 duo laptop. i'm running fedora 19 with chromium 27. (thought you'd want to know. sorry my comment isn't more interesting)

    ReplyDelete
    Replies
    1. Thanks. I do push the Latex rendering off onto your browser, but that typically goes away after an initial heavy load. Something seems different today. I think that this is related to Flattr being down or something. I will try to monitor this to see if it gets better.

      Delete
    2. You know, on second inspection, I am not seeing this at all. I mistook Firefox's standard hoggish behavior as something my page was doing. Chrome and Chromium show little to no CPU usage after the initial load, however I am using 33.something on both of those. I'll keep an eye out.

      Delete
  2. Cool post I found from reddit/lisp. I'm an engineer, but don't have any machine learning background. Could you explain what is happening here? How does the code find the optimization?

    ReplyDelete
    Replies
    1. Sure, let's see if hind sight provides a good summary.

      > Could you explain what is happening here?

      This post is about supervised machine learning using support vector machines or SVMs. Unlike in unsupervised machine learning there is nothing really magical happening here. We have an unknown function $f(\vec x)$ that is always either -1 or 1 and where $\vec x$ is a "feature vector" or list inputs that we suspect $f$ might depend on. We would like to model this function with our own function, $g(\vec x, \vec a)$, where $\vec a$ is a vector of fitting parameters that allows us to tune the function $g$ to match $f$, e.g. the learning part. Our $g$ in this case is defined by a SVM (to be specific, an SVM based on radial basis functions or RBF), which is a bit complicated, but you can read about it in many books or in the link to the LIBSVM manual in the post. Suffice it to say it's flexible enough to fit a lot of functions that produce either -1 or 1. This fit (or in some cases, outright solving) for parameters $\vec a$ is done by LIBSVM based on a data set called the training data set.

      However, since the SVM function is quite flexible, it is almost certain that the SVM function is tuned, via its $\vec a$ parameters, specifically for the training dataset and not the function $f$ that we are really interested in. This behavior is called overfitting. IMO the major hurtle of supervised machine learning is making sure that you don't overfit the data. To combat this, we use a separate data set, called the cross validation data set, that we use to determine if the function $g(\vec x, \vec a)$ performs well at predicting data that it has never seen. We measure the performance via the Matthew's correlation coefficient, but that is just a way to measure how well the model fits the (cross validation) data, you could use something simpler.

      So the cross validation data set tells us how well the SVM is performing but it turns out that the RBF SVM has two free parameters, $C$ and $\gamma$, that determine how it solves/fits for $\vec a$. It doesn't really matter what these parameters mean to the SVM, all we need to know is that we can tune them to optimize our performance as measured via the cross validation data set. As the post text describes, the tutorial for SVM tells you to find good values for $C$ and $\gamma$ by plotting and picking them off a heat map or contour plot. I elected to do something different, but ultimately unnecessary, see the answer to your next question.

      It should be noted that finding the values for $C$ and $\gamma$ that maximize the score on the cross validation data set is another stage in the machine learning process. There is, again, danger that we might overfit to match the combination of the training data and the cross validation data, so we need yet another dataset, the test data set, that we use to measure the final quality of the learned model.

      > How does the code find the optimization?

      I used a global and subsequent local minimization technique to find the "global" maximum (quotes are to indicate that I'm not sure that this is actually the global maximum). I use a Free Software library called NLOpt to do this, which I think is pretty well programmed. It is available here: http://ab-initio.mit.edu/wiki/index.php/NLopt They provide documentation on the algorithms used for optimization including links to the literature where they were originally published.

      As I stated, the Lisp bindings used here are not released as I feel like the interface needs improvement, there are some stubborn bugs that I haven't worked out, and I think I have some dependencies that I would like to free the binding of. Ask and I will release. However, it should be straight forward to use the NLOpt directly via your language of choice to do what I did above.

      Hope that helps.

      Delete