At work we decided that we needed to use a bit of machine learning to automate a problem. I recommended using LIBSVM as it is very established (read: old, with many bindings available). I decided to see what was involved in using it this morning, but found that it took an embarrassingly long time to get it working the way I expected it to out of the box. This is my shortcoming; I had forgotten too much of the machine learning that I used to know. But this post will hopefully help others get it working more quickly.
First, SVMs are pretty versatile, and there are a lot of choices that you can make when you start using LIBSVM. It is often not clear what you should use for what task, and I don't have the expertise to tell you what is best. For that information, you should check out this guide by the library's author. What I will show here is a quick and dirty approach.
I will be using Gabor Melis' bindings for LIBSVM. They work well enough, but they will issue many warnings when you load and use them, as they rely on a now-deprecated CFFI syntax. Just be aware.
We will be using a radial basis function (RBF) kernel to model the "a1a" training data and test data that is available on the LIBSVM website. Already this is a bit odd, as the LIBSVM people have split their data such that ~5% is in the training data set while ~95% is in the test data set. This is probably due to the computationally expensive nature of training SVMs. But, whatever the case, we will stick with it.
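For reference, the RBF kernel is defined as

\[ K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2\right), \]

so the model has two knobs for us to tune: \(\gamma\), which sets the width of the kernel, and the usual SVM soft-margin penalty \(C\). These are the two parameters we will be searching over later.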
Here is a simple procedure to read in the data:
(ql:quickload '(:iterate :cl-ppcre :alexandria))
(use-package :iterate)

(defun parse-data-file (file)
  "Read in a data-file, return list of target values with input as a
sparse vector suitable for cl-libsvm."
  (iter (for line :in-file file :using 'read-line)
        (collecting
          (list (read-from-string line)
                (let ((result nil))
                  (ppcre:do-matches-as-strings (match "\\w+:\\w+" line)
                    (push (apply 'cons
                                 (mapcar 'read-from-string
                                         (let ((separator (position #\: match)))
                                           (list (subseq match 0 separator)
                                                 (subseq match (+ 1 separator))))))
                          result))
                  (coerce (reverse result) 'vector))))))

(defun randomize-order (data)
  "Return a random permutation of the data."
  (alexandria:shuffle (copy-seq data)))
The parse-data-file function returns a list where each element corresponds to a line in the data file, mapped into a list of the form (target-val sparse-vector-input). A sparse vector is a vector where each element is a cons cell containing an index as its car and a value as its cdr.
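For concreteness, here is what one parsed element looks like, using a hypothetical first line of a1a-style input:

;; For a (hypothetical) first line "-1 3:1 11:1 14:1" this returns
;; (-1 #((3 . 1) (11 . 1) (14 . 1))): the target value followed by
;; the sparse vector of (index . value) pairs.
(first (parse-data-file #p"~/Desktop/a1a.t"))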
We need to further divide the test data so that we have a data set that we can use to validate the parameters that the RBF SVM uses (the "validation" set), while we use the rest of the test set to determine the accuracy of the model (the "test" set). We could just cut the test data in half, but that would make our validation calculations very slow, and as we will see in a moment, this becomes the bottleneck of the calculation. I have chosen my validation set to be 2000 points. Whatever size you choose will be a trade-off between accuracy and speed. Having 2000 points in your validation set is about my limit when it comes to playing around at the REPL (I would choose something higher if it showed benefits for the model's accuracy). If we are dividing the data, we must first randomize it to ensure that no structure of how the data was collected shows up in the analysis. This can be done with the randomize-order function above and a couple of calls to subseq.
(let ((test-data (randomize-order (parse-data-file #p"~/Desktop/a1a.t"))))
  (defparameter *validation-set* (subseq test-data 0 2000))
  (defparameter *test-set* (subseq test-data 2000)))
Now, to get into the nitty-gritty of CL-Libsvm. First, you'll need to install it, and it is not available in Quicklisp yet, so clone it from Gabor's Git repo. Naturally you will also need to have LIBSVM installed (you don't actually need libsvm-dev or libsvm-tools, but why not have them anyway):
cd ~/quicklisp/local-projects
git clone http://quotenil.com/git/cl-libsvm.git

# Install LIBSVM if needed
sudo aptitude install libsvm3 libsvm-dev libsvm-tools
CL-Libsvm is basically structured like LIBSVM, so the same documentation applies here. We must first create a problem which contains inputs and the expected outputs, then create a parameter structure that contains the parameters that define how the SVM operates (numerical constants as well as the type of SVM and kernel), and finally combine these two into a model which can be used to make predictions.
This can be done easily enough for any particular RBF parameters \(C\) and \(\gamma\).
(ql:quickload :cl-libsvm)

(let ((data (parse-data-file #p"~/Desktop/a1a")))
  (defparameter *problem*
    (libsvm:make-problem (map 'vector 'first data)
                         (map 'vector 'second data))))

;; Here c and gamma stand in for whatever parameter values you have chosen
(defparameter *parameter* (libsvm:make-parameter :c c :gamma gamma))

(defparameter *model* (libsvm:train *problem* *parameter*))
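Once *model* exists (assuming concrete values were supplied for c and gamma above), making a prediction is a single call. As a hypothetical example, using the input half of the first validation-set entry:

;; Predict the target value for one sparse input vector; here we take
;; the input (second element) of the first validation example.
(libsvm:predict *model* (second (elt *validation-set* 0)))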
But we want to do this for many different values of \(C\) and \(\gamma\) in order to find what parameters give us the best performance. We could do several things to find the optimum values. We will be following the LIBSVM procedure and just search the parameter space on a grid. We should note from the manual that it is probably in our best interests to search in \(\log C\) and \(\log \gamma\).
(let ((data (parse-data-file #p"~/Desktop/a1a")))
  (defparameter *problem*
    (libsvm:make-problem (map 'vector 'first data)
                         (map 'vector 'second data))))

(defparameter *optimization-grid*
  (iter :outer
        (for log-c :from 1.5 :to 3.5 :by .1)
        (iter (for log-gamma :from -5 :to -3.5 :by .1)
              (in :outer
                  (collecting
                    (list* log-c log-gamma
                           (let* ((params (libsvm:make-parameter
                                           :c (exp log-c)
                                           :gamma (exp log-gamma)))
                                  (model (libsvm:train *problem* params)))
                             (list (quality model *validation-set* log-c log-gamma)
                                   model))))))))
Note that there is a missing function here: we never defined quality. This function is meant to take a model and some testing data and return a measure of how well the model is performing. For this I chose to use the Matthews Correlation Coefficient, with the threshold for the prediction set to \(0.5\).
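For reference, in terms of the true/false positive/negative counts tallied below, the coefficient is

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \]

which ranges from \(-1\) (total disagreement) through \(0\) (no better than chance) to \(1\) (perfect prediction). Note that if any factor in the denominator is zero, the numerator must be zero as well, so returning \(0\) when the numerator vanishes (as the code below does) also guards against division by zero.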
(defun logistic (x)
  (/ (+ 1 (exp (- x)))))

(defun quality (model test-data log-c log-gamma)
  "Use the Matthews Correlation Coefficient to measure how well the model does"
  (iter (for (target input) :in test-data)
        (let ((p (if (< 0.5 (logistic (libsvm:predict model input)))
                     1 -1)))
          (cond ((and (= p 1) (/= target p))
                 (summing 1 :into false-positives))
                ((and (= p -1) (/= target p))
                 (summing 1 :into false-negatives))
                ((and (= p 1) (= target p))
                 (summing 1 :into true-positives))
                ((and (= p -1) (= target p))
                 (summing 1 :into true-negatives))))
        (finally
         (let ((quality
                 ;; Compute quality of model
                 (if (= 0 (- (* true-positives true-negatives)
                             (* false-positives false-negatives)))
                     0d0
                     (/ (- (* true-positives true-negatives)
                           (* false-positives false-negatives))
                        (sqrt (float (* (+ true-positives false-positives)
                                        (+ true-positives false-negatives)
                                        (+ true-negatives false-positives)
                                        (+ true-negatives false-negatives))
                                     0d0))))))
           ;; Print some output so we know what it's doing
           (format t "log-c = ~A, log-gamma = ~A~@
                      TP = ~A, TN = ~A, FP = ~A, FN = ~A~@
                      Quality = ~A~%"
                   log-c log-gamma
                   true-positives true-negatives
                   false-positives false-negatives
                   quality)
           (return quality)))))
When we put this all together and play around with the ranges of the plots, we get a plot that looks like this:
From this we can tell that there is likely some kind of optimum around \(\log C = 2.10\) and \(\log \gamma = -4.27\).
You might wonder how, looking at those blocky heat maps, I was able to determine where that tiny maximum was. While you can do what the LIBSVM docs suggest and move to finer and finer grids to find optimal points, I find this pretty annoying and finicky. Instead, I opted to use some as-yet-unreleased bindings for the NLOpt optimization library. With NLOpt we can do a global optimization followed by a local optimization, which requires next to zero human intervention and finds a pretty good optimum that I doubt I would be able to find otherwise. (Small caveat here: it is entirely possible that the rough nature of my MCC heat map is merely an artifact of the small validation set size. I don't have the patience to test this for a silly example.)
;; Global optimization (better grab a snack)
(nlopt:nlopt-apply
 (lambda (log-c log-gamma)
   (let* ((params (libsvm:make-parameter :c (exp log-c)
                                         :gamma (exp log-gamma)))
          (model (libsvm:train *problem* params)))
     (- (quality model *validation-set* log-c log-gamma))))
 '(1 1)
 :nlopt-gn-crs2-lm
 :lower-bounds '(-5 -5)
 :upper-bounds '(7 7)
 :abs-xtol 1d-1)

;; Fit parameters = (2.071364331304816d0 -4.265683211751565d0)
;; Validation MCC = -0.5509856550306286d0

;; Local optimization (this should be quick)
(nlopt:nlopt-apply
 (lambda (log-c log-gamma)
   (let* ((params (libsvm:make-parameter :c (exp log-c)
                                         :gamma (exp log-gamma)))
          (model (libsvm:train *problem* params)))
     (- (quality model *validation-set* log-c log-gamma))))
 '(2.071364331304816d0 -4.265683211751565d0)
 :nlopt-ln-bobyqa
 :lower-bounds '(-5 -5)
 :upper-bounds '(7 7)
 :abs-xtol 1d-5)

;; Fit parameters = (2.096969188326027d0 -4.268553108908674d0)
;; Validation MCC = -0.5522135970232868d0

(let* ((params (libsvm:make-parameter :c (exp 2.096969188326027d0)
                                      :gamma (exp -4.268553108908674d0)))
       (model (libsvm:train *problem* params)))
  (quality model *test-set* 2.096969188326027d0 -4.268553108908674d0))

;; Test set MCC = 0.5497032935038368d0
This gives the optimum at \(\log C = 2.10\) and \(\log \gamma = -4.27\), and a Matthews Correlation Coefficient of \(0.55\) (measured against the test set, naturally).