Friday, October 25, 2013

CL-Libsvm Usage

CL-libsvm usage

At work we decided that we needed to use a bit of machine learning to automate a problem. I recommended using LIBSVM as it very established (read old with many bindings available). I decided to see what was involved in using it this morning, but found that it took an embarrassingly long time to get it working like I expected it to out of the box. This is my short-coming, I had forgotten too much of the machine learning that I used to know. But this post will hopefully help others get it working more quickly.

First, SVMs are pretty versatile, and there are a lot of choices that you can make when you start using LIBSVM. It is often not clear what you should use for what task, and I don't have the expertise to tell you what is best. For that information, you should check out this guide by the library's author. What I will show here is a quick and dirty approach.

I will be using Gabor Melis' bindings for LIBSVM. They work well enough, but they will issue many warnings when you load them up and use them as they use a now deprecated syntax of CFFI. Just be aware.

We will be using a radial basis function (RBF) kernel to model the "a1a" training data and test data that is available on the LIBSVM website. Already this is a bit odd, as the LIBSVM people have split their data such that ~5% is in the training data set while ~95% is in the test data set. This is probably due to the somewhat expensive nature of SVMs? But, whatever the case, we will stick with it.

Here is a simple procedure to read in the data:

(ql:quickload '(:iterate :cl-ppcre :alexandria))

(use-package :iterate)

(defun parse-data-file (file)
"Read in a data-file, return list of target values with input as a sparse
vector suitable for cl-libsvm."
(iter (for line :in-file file :using 'read-line)
(collecting
(let ((result nil))
(ppcre:do-matches-as-strings (match "\\w+:\\w+" line)
(push
(let ((separator (position #\: match)))
(list
(subseq match 0 separator)
(subseq match (+ 1 separator))))))
result))
(coerce (reverse result) 'vector))))))

(defun randomize-order (data)
"Return a random permutation of the data."
(alexandria:shuffle (copy-seq data)))


The parse-data-file returns a vector where each element is a line in the data file mapped into a list of the form (target-val sparse-vector-input). A sparse vector is a vector where each element is a cons cell containing an index as its car and a value as its cdr.

We need to further divide the testing data so that we have a data set that we can use to validate the parameters that the RBF SVM uses (the "validation" set) while we use the rest of the test set to determine the accuracy of the model (the "test" set). We could just cut the training data in half, but that would make our validation calculations very slow, and as we will see is a moment, this becomes the bottleneck of the calculation. I have chosen my validation set to be 2000 point. Whatever size you choose will be a trade off between accuracy and speed. Having 2000 points in your validation set is about my limit when it comes to playing around at the REPL (I would choose something higher if it showed benefits for the models accuracy). If we are dividing the data, we must first first randomize it to ensure that no structure of how the data was collected shows up in the analysis. This can be done with the randomize-order function above and a couple calls to subseq.

(let ((test-data (randomize-order (parse-data-file #p"~/Desktop/a1a.t"))))
(defparameter *validation-set* (subseq test-data 0 2000))
(defparameter *test-set* (subseq test-data 2000)))


Now, to get into the nitty-gritty of CL-Libsvm. First, you'll need to install it, and it is not available in Quicklisp yet. So, clone it from Gabor's Git repo. Naturally you will also need to have LIBSVM installed (you don't actually need libsvm-dev or libsvm-tools, but why not have them anyway):

cd ~/quicklisp/local-projects
git clone http://quotenil.com/git/cl-libsvm.git

# Install LIBSVM if needed
sudo aptitude install libsvm3 libsvm-dev libsvm-tools


CL-Libsvm is basically structured like LIBSVM, so the same documentation applies here. We must first create a problem which contains inputs and the expected outputs, then create a parameter structure that contains the parameters that define how the SVM operates (numerical constants as well as the type of SVM and kernel), and finally combine these two into a model which can be used to make predictions.

This can be done easily enough for any particular RBF parameters $C$ and $\gamma$.

(ql:quickload :cl-libsvm)

(let ((data (parse-data-file #p"~/Desktop/a1a")))
(defparameter *problem* (libsvm:make-problem (map 'vector 'first data)
(map 'vector 'second data))))

(defparameter *parameter* (libsvm:make-parameter :c c :gamma gamma))

(defparameter *model* (libsvm:train *problem* *parameter*))


But we want to do this for many different values of $C$ and $\gamma$ in order to find what parameters give us the best performance. We could do several things to find the optimum values. We will be following the LIBSVM procedure and just search the parameter space on a grid. We should note from the manual that it is probably in our best interests to search in $\log C$ and $\log \gamma$.

(let ((data (parse-data-file #p"~/Desktop/a1a")))
(defparameter *problem* (libsvm:make-problem (map 'vector 'first data)
(map 'vector 'second data))))

(defparameter *optimization-grid*
(iter :outer
(for log-c :from 1.5 :to 3.5 :by .1)
(iter (for log-gamma :from -5 :to -3.5 :by .1)
(in :outer (collecting
(list* log-c
log-gamma
(let* ((params (libsvm:make-parameter
:c (exp log-c)
:gamma (exp log-gamma)))
(model (libsvm:train *problem* params)))
(list (quality model *validation-set* log-c log-gamma)
model))))))))


Note that there is a missing function here. We never defined quality. This function is meant to take a model and some testing data and determine a measure of how good the model is performing. For this I chose to use the Matthews Correlation Coefficient with the threshold for the prediction set to $0.5$.

(defun logistic (x)
(/ (+ 1 (exp (- x)))))

(defun quality (model test-data log-c log-gamma)
"Use the Matthews Correlation Coefficient to measure how well the model does"
(iter (for (target input) :in test-data)
(let ((p (if (< 0.5 (logistic (libsvm:predict model input)))
1 -1)))
(cond ((and (= p 1) (/= target p))
(summing 1 :into false-positives))
((and (= p -1) (/= target p))
(summing 1 :into false-negatives))
((and (= p 1) (= target p))
(summing 1 :into true-positives))
((and (= p -1) (= target p))
(summing 1 :into true-negatives))))
(finally
(let ((quality
;; Compute quality of model
(if (= 0 (- (* true-positives true-negatives)
(* false-positives false-negatives)))
0d0
(/ (- (* true-positives true-negatives)
(* false-positives false-negatives))
(sqrt
(float
(* (+ true-positives false-positives)
(+ true-positives false-negatives)
(+ true-negatives false-positives)
(+ true-negatives false-negatives))
0d0))))))
;; Print some output so we know what it's doing
(format
t "log-c = ~A, log-gamma = ~A~@
TP = ~A, TN = ~A, FP = ~A, FN = ~A~@
Quality = ~A~%"
log-c log-gamma
true-positives true-negatives false-positives false-negatives
quality)
(return quality)))))


When we put this all together and playing around with the ranges of the plots, we get a plot that looks like this:

From this we can tell that there is likely some kind of optimum around $\log C = 2.10$ and $\log \gamma = -4.27$.

You might wonder how I was able to determine, looking at those blocky heat maps, where that tiny maximum was. While you can do what LIBSVM docs suggest and move to finer and finer grids to find optimal points, I find this pretty annoying, and finicky. I opted to use some, as yet unreleased, bindings for the NLOpt optimization library. With NLOpt we can do a global optimization followed by a local optimization, which requires next to zero human intervention and finds a pretty good optimum that I doubt I would be able to otherwise. (Small caveat here, it is entirely possible that the rough nature of my MCC heat map is merely an artifact of the small validation set size. I don't have the patience to test this for a silly example)

;; Global optimization (better grab a snack)
(nlopt:nlopt-apply
(lambda (log-c log-gamma)
(let* ((params (libsvm:make-parameter
:c (exp log-c)
:gamma (exp log-gamma)))
(model (libsvm:train *problem* params)))
(- (quality model *validation-set*))))
'(1 1)
:nlopt-gn-crs2-lm
:lower-bounds '(-5 -5)
:upper-bounds '(7 7)
:abs-xtol 1d-1)

;; Fit parameters = (2.071364331304816d0 -4.265683211751565d0)
;; Validation MCC = -0.5509856550306286d0

;; Local optimization (this should be quick)
(nlopt:nlopt-apply
(lambda (log-c log-gamma)
(let* ((params (libsvm:make-parameter
:c (exp log-c)
:gamma (exp log-gamma)))
(model (libsvm:train *problem* params)))
(- (quality model *validation-set*))))
'(2.071364331304816d0 -4.265683211751565d0)
:nlopt-ln-bobyqa
:lower-bounds '(-5 -5)
:upper-bounds '(7 7)
:abs-xtol 1d-5)

;; Fit parameters = (2.096969188326027d0 -4.268553108908674d0)
;; Validation MCC = -0.5522135970232868d0

(let* ((params (libsvm:make-parameter
:c (exp 2.096969188326027d0)
:gamma (exp -4.268553108908674d0)))
(model (libsvm:train *problem* params)))
(quality model *test-set*))

;; Test set MCC = 0.5497032935038368d0


This gives the optimum at $C = 2.10$ and $\gamma = -4.27$ and a Matthew's Correlation Coefficient of $0.55$ (measured against the test set, naturally).

Wednesday, July 31, 2013

Rudel Survival Guide

We are gearing up for our collaborative effort for ICFP and so I figured it might be nice to write out a little "how to make it work" guide for Rudel, the collaborative editing framework for Emacs. To get this out of the way first, Rudel doesn't work out of the box. In order to use it, we had to hack a bit on the source to correct what we can only guess are legitimate errors. That said, I certainly don't have the expertise on the system in order to dictate the correct way to solve these problem.

As a note, throughout this contest our setup will be:

• Rudel v0.3
• Obby backend
• TCP transport
• Port 6522
• not using the built in Rudel/Obby encryption
• all connections must be tunneled over SSH (for encryption/access control)
• No global or user passwords

The first step for using Rudel is to, naturally, install Rudel. If you have Emacs v24, you may use the package manager via M-x package-list-packages. This will make sure that it also gets any dependencies, however I don't think there are any that aren't already part of Emacs. If you are not using Emacs v24, you will need to install package.el (and this will actually be useful as I believe it will have to track down some dependencies). This can be done like this:

cd .emacs.d
wget http://repo.or.cz/w/emacs.git/blob_plain/1a0a666f941c99882093d7bd08ced15033bc3f0c:/lisp/emacs-lisp/package.el


Then from Emacs, M-x load-file that package.el file. Then you can use continue as if you were using v24 (except where noted).

Rudel is available via the the Marmalade repository. In order to enable the Marmalade repository, you should add something like this early in your .emacs file:

;; (load "~/.emacs.d/package.el") ;; If not v24

;; Load up the new package manager
(require 'package)
(package-initialize)


The install/compile should seemingly complete without any issues. (If you are not on v24, then there might be an issue with package displaying the info page of certain packages, Rudel amongst them. Instead of using M-x package-list-packages, use M-x package-install and specify "rudel" when it asks).

Now the trouble fun begins.

Try closing and restarting Emacs. On your next start, Emacs will fail to read your .emacs file with an error like:

Symbol's function definition is void: eieio-defclass-autoload

This is because there are some bugs in Rudel that make it not load properly. My solution is to not use the standard method of loading Rudel. The first order of business is stopping it from loading the correct but broken way via the package manager. Go into the shell and move the rudel directory somewhere where it cannot be found by the package manager:

cd .emacs.d
mv elpa/rudel-0.3 ./


Now Emacs should read your .emacs without issue (because it no longer is trying to load Rudel).

The next order of business, we want to be able to load and use Rudel. In order to do this, we will run Rudel's compile script. Do a M-x load-file on .emacs.d/rudel-0.3/rudel-compile.el. This will compile and generate some files, most importantly for our purposes, it will generate rudel-loaddefs.el. Perform a M-x load-file on .emacs.d/rudel-0.3/rudel-loaddefs.el and Rudel should be operational. Try it out. Use M-x rudel-host-session to start a session (protocol obby, transport tcp, port 6522). Then join that session via M-x rudel-join-session. Try publishing a buffer with M-x rudel-publish-buffer. This should all work.

We want to make this permanent, so we should add something like:

(load-file "~/.emacs.d/rudel-0.3/rudel-loaddefs.el")


…to our .emacs file after the (package-initialize) statement.

Look, I know this is extremely hackish, but I think it will work for everybody. It is the only way I have consistently been able to get things working.

1.2 Joining and leaving sessions

So, as best I can see, this is the status of this functionality in Rudel: you can join sessions and you can leave, but as far as the server cares, you can never leave. This doesn't seem like too much of a problem at first, but here is how problems start to manifest.

1. A username must be unique: This means that each time you log in, you have to pick a unique name, not the one you used last time. This manifests as a lot of "smithzv", "smithzv2", "smithzv3", etc. Luckily, you shouldn't be disconnected often.
2. A color must be unique at login time. This one isn't as bad as you can change colors once you are in the session using rudel-change-color. This means that a good practice is to log in and pick a garish and otherwise infrequently used color and then immediately change it to something more appropriate. No need to worry about conflicts after you have logged in.

1.3 Undo and Rudel

So, one of the biggest problems that I have with collaborative editing in Rudel is that undo are treated very literally. I you undo, you implicitly expect it to apply to your code. However, with Rudel, where someone could be editing somewhere else in the file, undo is happy to go about undoing their work as they type rather than the edits you just made.

The end result is that in a collaborative editing environment, undo is a very dangerous thing; doubly so when undo means what it does in Emacs land (i.e. you can redo by undoing an undo). Basically, if you are using Rudel, you probably should not be using undo, at all. This is a pretty tall order if you have deeply internalized undo into your editing commands (as I have).

In strikes me that a helpful heuristic would be to undo only within a region that contains only your edits (or even using something like multiple-regions to allow for undo of anything that you have edited but not things that others have). This means that if you are the only person editing a buffer, undo works exactly as expected, but if there are others editing the buffer, it will only undo your edits. Note that this isn't always what you want.

However, I'm not sure that such a heuristic is possible (or more likely, it is possible but is hard to do). I'll take a look. It seems that for safe, useful undoing, you need to tell everybody that is working on that buffer that they need to stop what they are doing so you may perform your undos.

I realize that other undo systems can be used with Emacs. For instance, there is Undo-Tree. I am not sure how well these work (or really how they work). Perhaps someone who is better versed in these tools can enlighten us.

1.4 When things go wrong

There are times when, try as it might, Rudel screws up and the files become out of sync. This happens fairly infrequently, thank goodness, but when it does, Rudel does not handle this gracefully. There is no "resync" function that I can see. This means that if you find that your buffer says one thing, but someone elses says something else (this usually manifests a what looks like a typo, but if you fix it, another person goes back and changes it back to its typo state), you will have do something pretty drastic in order to get Rudel to work correctly again. There are a couple of things that must work, but they both are pretty annoying:

1. Ditch this buffer, have everybody unsubscribe, rename the buffer so it will share under a different name, then republish. This way everything has started fresh with this buffer.
2. Restart the entire Rudel session.

Of course, the first method is preferred to the second.

1.5 Dealing with Annoying Colors

If you start using Rudel, sometimes a collaborator will pick a color that doesn't work with your theme or general aesthetic. Luckily, there is something you can do about it even if they refuse to change their color (or are perhaps away from keyboard), you can turn off all coloring by author. Simply specialize the variable rudel-overlay-author-display (set it to nil) and no more colors. This is pretty necessary right now because Rudel is ignorant of the difference between light and dark themes. Thus two locally appropriate choices might be completely inappropriate for the remote party.

Thursday, July 25, 2013

A couple days ago, Canonical announced a crowd funding campaign for the Ubuntu Edge, a phone sized device that is able to function as a phone but is firmly designed to be a PC replacement. Just in case someone stumbles on this page after finding such crappy reporting as this, here is the deal with Ubuntu Edge as I see it: The Ubuntu Edge is about shaping the future.

Now, it is no secret that I like Canonical, one of the only companies fighting for (and throwing money at) GNU/Linux on the desktop. Frankly, despite recent missteps in quality control and their tendency to not go with 100% Free Software, I think they are doing more for the Free Software Movement than basically anybody else out there. But I have to say that Ubuntu Edge, and it's associated crowd funding campaign, is a particular point of brilliance for the company.

Canonical has been working for the last few years on something they call converged computing. The idea is to have every consumer electronic centered around a screen controlled on a very fundamental level by one device (one device that happens to run nearly 100% Free Software). As most of you are aware, every piece of consumer electronics out there has a small, typically underpowered, computer in it. The Ubuntu Edge vision is to take all of those computers out of the electronics and simply have the manufacturers build screens with an HDMI port and power supply attached. For all of the computation, something like the Ubuntu Edge will handle it.

The way I see Canonical's end game vision is something much like the technology seen in the Total Recall remake from a few years ago. In this movie, people have technology that they carry with them (actually implanted inside their body) and may use any piece of glass as a display for that technology by pressing their hand against it. Canonical's vision is similar. Computational devices will be mobile, private, secure, personal device that you carry with you while displays will be stationary, plentiful, pseudo-public devices you hook into when needed. In such a world, every hotel room you rent, every office, every coffee shop table, and every room in your house will have a display and input devices (keyboard and mouse, or touchscreen) and a cradle to place your "phone" into. The way I see Ubuntu Edge is the way it is presented, as a taste of the future.

But that is not entirely fair. There is one place where the Ubuntu Edge fits perfectly in the current world. If you have a computer at home and a computer at work (or if you have one laptop that you use for both and you lug it back and forth), the Ubuntu Edge makes a lot of sense for you. You should think of the Ubuntu Edge as a PC tower and UPS that fits in your pocket. There is no need to sync between different disk drives because you bring your drive with you. If you have ever emailed a file to yourself, or left your computer on at home so that you would have SSH access to it in case you needed it, or if you simply have so much data that you cannot store it in something like Dropbox or some analogous cloud file storage, an Ubuntu Edge like system is a great fit. In addition, there is no need to worry about software differences between your applications and files (no Excel 07 can't read Excel 09 XML files file problems) or incompatibilities during software development because you bring the hardware and software with you. If you have ever wasted a few hours or days configuring a new computer or tweaking a computers configuration, Ubuntu Edge seems like a useful platform. Canonical isn't the first company to sniff around this idea, but they seem to be the closest to actually achieving this.

An extra benefit of something like Ubuntu Edge is that it solves many problems that people that value freedom, privacy, and freedom from vendor lock-in have felt mounting. Ubuntu Edge resonates very well with us freedom loving people as it is a solution that is centered around a well standardized interface. It means that anybody can use this with whatever device they like. It is of course not lost on me that this device will be close to a completely freedom respecting device. There will probably be proprietary device drivers, but I expect that user land will be Free as in Freedom. This also resonates with us people that support competition in the market. Any time we have a technology where any new inventor can enter the market and leverage it for their invention, my free market loving self smiles a little bit. Standardized interfaces are great for competition (e.g. see the Internet).

Even further, this also resonates very well with us privacy loving people and people that haven't drank the "cloud" Kool-Aid. The current solution to the woes of syncing multiple computers (be they PCs, tablets, phone, or whatever) is to have something like Dropbox shuffle files around. This is actually a pretty crappy solution, if we are being honest. Don't get me wrong, it is good for the world we have constructed, but it is far from optimal. For example, on any system that I am familiar with, it is not possible to sync installed applications via something like Dropbox.

Of course there is nothing wrong with cloud services in principle, and in many instances they are extremely convenient for transferring files to friends. I'm not proposing that we move away from Internet services in total, but I think there is an argument to be made for only using Internet services for things that need to be Internet services. The problem is that cloud services are run by corporations and corporations only have allegiances to their share holders and, if required by law, the government that they operate under. In addition, in the world we currently live in corporations tend to also have a disturbing level of control over "mobile" devices. In this data syncing/sharing scheme there are many points of weakness when it comes to privacy. By buying into the "cloud", we have transitioned from the normal state where files that are not shared with others are basically safe into a world where our files are more vulnerable (to either attackers, corporations, or government agencies) because we have implicitly shared them in order to sync them. Something like Ubuntu Edge is a way out of this, it is a solution for many of the problems that the cloud was designed to solve. If my files are local (i.e. they are in my pocket) and third party devices that I interact with are nothing more than a screen with no general purpose processing capabilities, I have much less to worry about.

All of these aspects packed together makes this a pretty awesome project. But now the brilliant part. By making this a crowd funding campaign, Canonical has made it clear to the consumers what is possible. I have been hearing rumblings about a system like this for years, but perhaps the ordinary people out there have not. This is certainly a way to get the word out. On the flip side, this tells the manufacturers that there are people that are willing to pay money for something like this. Just the fact that this crowd funding campaign existed, regardless of whether it is actually funded, will affect the trajectory of technology in the future. This is a win-win scenario and it is brilliant, pure and simple. I would love to see Canonical make their goal and make their phone, but in my estimation, they have already changed the world for the better.

So, no, buying an Ubuntu Edge isn't the best fit for everyone (but to be honest, neither is a tablet, which try as you might, is a poor platform for anything but consuming information). Ubuntu Edge doesn't fit in perfectly with the world as it is. It isn't even designed to fill a gap, or at least not a gap that exists today. It is designed to be that device that is necessary to make the future that many of us envision come true. That said, it is priced competitively for the package that they promise and, for people that lug a laptop back and forth from work or for people that have a two or more computers that they struggle to keep in sync (e.g. me), this will be a useful device right now. If I had $700-830 to spend on a speculative device, I would. The real question is whether they can actually deliver what they promise. Sunday, July 7, 2013 Automatic Binding Generation With Verrazano Verrazano is a library for automatically creating Foreign Function Interface (FFI) bindings for Common Lisp. It is easy to use, as long as it works, and it can save you a lot of busy work. However, it is not well documented, and the documentation that is available is incorrect. Hopefully this post will help out people that don't want to figure it out themselves (including myself in 3 months). As a sample of how to use Verrazano, I'm going to generate a set of bindings for the NLopt library from our friends over at MIT. This library is a great tool for optimization, whether it be minimization/maximization, fitting, or any optimization as it is very flexible in the constraints that the problem can have. It puts many of the optimization facilities in GSL to shame. I first started using it because of its derivative free local optimization routines and really went on to appreciate it when I realized I needed non-trivial constraints on the parameters in my model fits (GSL can't even handle trivial ones, really). Up until now, I have been using some hand crafted bindings. This library is pretty small, so it actually isn't much of a burden to maintain for my purposes. But let's see how to generate them with Verrazano. We will use the function generate-binding which takes a list which represents a backend. The only backend available right now is CFFI, which just happens is a good portable FFI, so this is okay. To generate the NLopt bindings: (generate-binding '(:cffi :package-name :nlopt-cffi-bindings :input-files ("nlopt.h") :gccxml-flags "-I /usr/local/include/"))  After this, we have a file called "nlopt-cffi-bindings.lisp". These represent a very low-level interface for the library. It basically gets you to the level of functionality you would have if programming in C. As such, this could be used as is, but that wouldn't be very Lispy, would it? I wrapped some of these utilities and will later post my end results as a proper set of Lispy bindings on GitHub. This is already very useful, but Verrazano can do much more. Generating GSL Bindings There are already several GSL bindings available for CL (I even have my own private library that I like). The most popular, and deservedly so, is GSLL. The shear level of completeness is inspiring considering Liam wrote the bindings by hand (an admirable amount of work, but in my opinion a bit wasteful). One thing I don't like about GSLL is that it feels like it was brought just to the point of usefulness, but no further. Coding with GSLL feels like you are coding in C. This has changed a bit recently, with the introduction of Antik, but the library is still more difficult to use than my hand rolled GSL library (in fact, perhaps I'm dumb or something, but I have yet to actually productively used GSLL in a project for work or otherwise). But never mind all that, let's make our own: (generate-binding (list :cffi :package-name :cffi-gsl-bindings :input-files (mapcar #'namestring (directory #p"/usr/include/gsl/*.h")) :gccxml-flags "-I /usr/include/gsl" :standard-name-transformer-replacements (,(cl-ppcre:create-scanner "(^gsl_?)") "")))  This basically looks like the example above but with a few things added. First, we are including all of the GSL header files. Also, we use the :standard-name-transformer-replacements option to remove the prefix on GSL functions using a regular expression. If you are familiar with C, you know that this prefix is necessary because the C programming language doesn't have a way to divide the symbol namespace. This means that without these prefixes the likelihood of a name collision is very high. In Common Lisp we don't have to worry about this because we have a package system. However, when you run the command above, you get an error and the bindings were not generated. This is because GSL changed some naming conventions a while back and we need to explicitly define that we want to use the new (non-backwards compatible) functions by passing an extra argument to GCCXML. We need to pass -D GSL_DISABLE_DEPRECATED to GCCXML. Well, let's try it again: (verrazano:generate-binding (list :cffi :package-name :cffi-gsl-bindings :input-files (mapcar #'namestring (directory #p"/usr/include/gsl/*.h")) :gccxml-flags "-I /usr/include/gsl -DGSL_DISABLE_DEPRECATED" :standard-name-transformer-replacements (,(cl-ppcre:create-scanner "(^gsl_?)") "")))  Bingo, this time it worked. If we look at the binding file we see that basically all of the stuff is there that we want. However there is a bunch of stuff that we don't care about and, quite frankly, probably doesn't work. Indeed, if you define the library and attempt to load that file, then you will get several errors. Also, the file is some 30 thousand lines long, making it difficult to track them all down. The errors seem to come in a few different flavors. First, sometimes Verrazano will find a type that that it doesn't know how to deal with. These are associated with comments in the generated source file that look like: ;;; No entry found for gccxml type "long double" in *GCCXML-FUNDAMENTAL-TYPE->CFFI-TYPE*  Indeed, long doubles are typically seen as an extension to C and not standard or necessarily available on every platform, and is certainly not available in CFFI (maybe someday?). The best way to get around this is to not include those headers that provide long double support in GSL. But even that doesn't clear it all up. It turns out that if you also exclude gsl_sys.h, gsl_complex.h, gsl_types.h, gsl_test.h, and gsl_version.h then it clears all of these errors up. Which brings up two points: • If there is an error, it might show up as a comment in the generated source file without so much as a warning at the REPL. This is weird, but it is the way things are right now. These will typically manifest as errors at load time. • Only include the headers that you really need. The headers that we ended up removing turned out to be ones that we really didn't want or need in the first place. It takes some time up front to pick the right header files, but it can save you a pain later. (verrazano:generate-binding (list :cffi :package-name :cffi-gsl-bindings :input-files (remove-if (lambda (x) (or (ppcre:scan "gsl_sys\\.h$" x)
(ppcre:scan "gsl_complex\\.h$" x) (ppcre:scan "gsl_types\\.h$" x)
(ppcre:scan "gsl_version\\.h$" x) (ppcre:scan "gsl_test\\.h$" x)
(ppcre:scan "long_double\\.h$" x))) (mapcar #'namestring (directory #p"/usr/include/gsl/*.h"))) :gccxml-flags "-I /usr/include/gsl -DGSL_DISABLE_DEPRECATED" :standard-name-transformer-replacements (,(cl-ppcre:create-scanner "(^gsl_?)") "")))  Now we run into the next error, case sensitivity. Apparently Verrazano isn't really designed to handle cases where a function has multiple identifiers that differ by nothing more than case. Seems like a bad programming practice, but GSL does it, and Verrazano screws it up, so we have to work around it. One place where this seems appropriate but unfortunate is with the Bessel functions where the cylindrical functions are written as upper case letters (J, K, I, and Y) and the spherical functions are written as lower case (j, k, i, and y). We can deal with it by adding a new translation rule: (verrazano:generate-binding (list :cffi :package-name :cffi-gsl-bindings :input-files (remove-if (lambda (x) (or (ppcre:scan "gsl_sys\\.h$" x)
(ppcre:scan "gsl_complex\\.h$" x) (ppcre:scan "gsl_types\\.h$" x)
(ppcre:scan "gsl_version\\.h$" x) (ppcre:scan "gsl_test\\.h$" x)
(ppcre:scan "long_double\\.h\$" x)))
(mapcar #'namestring (directory #p"/usr/include/gsl/*.h")))
:gccxml-flags "-I /usr/include/gsl -DGSL_DISABLE_DEPRECATED"
:standard-name-transformer-replacements
(,(cl-ppcre:create-scanner "(^gsl_?)")
""
,(cl-ppcre:create-scanner "bessel_(zero_)?([jkiy])")
"spherical_bessel_\\1\\2")))


This will translate the lower case Bessel functions of the form gsl_sf_bessel_j into gsl_sf_spherical_bessel_j while leaving gsl_sf_bessel_J alone.

The much more troubling situation is when GSL uses both N and n, or F and f, in a single functions argument list. This cannot work, as I'm sure is clear as day. In order to fix this, I needed to actually hack on Verrazano a bit. I had to change the write-cffi-function function from it's current version to:

;; From...
(do-arguments-of-function (argument node)
(incf index)
(if (typep argument 'gccxml:ellipsis)
(write-string "common-lisp:&rest")
(bind ((argument-name (aif (name-of argument)
(transform-name (name-of argument) :variable)
(format nil "arg~A" index)))
(argument-type (type-of argument)))
(push argument-name previous-args)
(pprint-newline :fill)
(format t " (~A " argument-name)
(write-cffi-type argument-type)
(format t ")"))))

;; ...to...

(let ((previous-args nil))
(do-arguments-of-function (argument node)
(incf index)
(if (typep argument 'gccxml:ellipsis)
(write-string "common-lisp:&rest")
(bind ((argument-name (if (and (name-of argument)
(not (member (name-of argument)
previous-args
:test #'equal)))
(transform-name (name-of argument) :variable)
(format nil "arg~A" index)))
(argument-type (type-of argument)))
(push argument-name previous-args)
(pprint-newline :fill)
(format t " (~A " argument-name)
(write-cffi-type argument-type)
(format t ")")))))


This change keeps track of what symbols we have already used in the argument list and replace it with a generic argument of the form "arg<n>". This gets rid of the last error that stops this file from loading. Now we can load this file and, again, we are basically where we would be if were a C programmer, but we can work from Lisp to build a good interface on top of this set of generated bindings. This brings up another point:

• Verrazano is not production ready; it is hacker ready. Verrazano is useful because it is a tool that developers use every once in a while for generating bindings. It is not ready for every end user to use.

Okay, so we have something that will load. However, you may have noticed that there are crap-load of warnings that are produced when those bindings are loaded. These warnings are regarding a pretty new improvement to CFFI. Not so long ago, if you wanted to pass a structure to a function using CFFI, your only option was to pass it by address. If you needed to pass it by value you were out of luck. Luckily basically nobody uses pass by value, but it is a little limiting. Now CFFI supports this, so you need to specify structs by (:struct struct-name) and pointers to structs as (:pointer (:struct struct-name)). In order to fix this we are going to have to do a bit more hacking, but for now these are just warnings.

If you have been perusing the generated bindings, you may be wondering what all these weird functions that Verrazano is making are coming from. Some of the functions look good, but some seem very odd, for instance _ZN17gsl_vector_ushortaSERKS_. These functions look very odd and they show up because Verrazano is, fundamentally, a tool for making C++ bindings, not C bindings. It is interpreting these header files as C++ header files. It doesn't offer a "assume this is C" option. This isn't Verrazano's limitation, really, the GCCXML people are not interested in working with anything but C++. Often this doesn't matter as C++ is mostly a superset of C. However, there are differences. One difference is that when it sees structs, it has to assume that these are C++ sturcts, which are basically classes in C++ with all of their members public by default. This means that there will be a bunch of code that is generated by Verrazano like constructors, destructors, and comparison operators that have special GCC style "mangled" names. These symbols are actually not in the library at all as a simple nm -D /usr/lib/libgsl.so.0 will demonstrate. I suppose that these don't really do much harm, and I don't know how to get rid of them at the time short of post-processing the file and removing them. Plus we may want them if we are linking against a library with a C++ interface some day. At least Verrazano seems kind enough to place all these oddities at the end of our bindings file.

To prove that this works:

;; Define our library and use it (stolen from GSLL)
(cffi:define-foreign-library libgslcblas
(:darwin #+ccl #.(ccl:native-translated-namestring
(gsl-config-pathname "libgslcblas.dylib"))
#-ccl #.(gsl-config-pathname "libgslcblas.dylib"))
(:cygwin "cyggslcblas-0.dll")
(:unix (:or "libgslcblas.so.0" "libgslcblas.so"))
(t (:default "libgslcblas")))

(cffi:use-foreign-library libgslcblas)

(cffi:define-foreign-library libgsl
(:darwin #+ccl #.(ccl:native-translated-namestring
(gsl-config-pathname "libgsl.dylib"))
#-ccl #.(gsl-config-pathname "libgsl.dylib"))
(:cygwin "cyggsl-0.dll")
(:unix (:or "libgsl.so.0" "libgsl.so"))
(t (:default "libgsl")))

(cffi:use-foreign-library libgsl)

;; Define our callback function
(cffi:defcallback csin :double
((x :double)
(params :pointer))
(declare (ignore params))
(float (sin x) 0d0))

(let ((workspace (gsl-cffi-bindings:integration-workspace-alloc 10)))
(unwind-protect
(cffi:with-foreign-objects ((gsl-fn '(:struct
gsl-cffi-bindings:function-struct))
(result :double)
(abs-error :double))
;; Setup the GSL function object
(setf
(cffi:foreign-slot-value
gsl-fn '(:struct gsl-cffi-bindings:function-struct)
'gsl-cffi-bindings:function)
(cffi:callback csin))
(gsl-cffi-bindings:integration-qag
(cffi:mem-aptr
gsl-fn '(:struct gsl-cffi-bindings:function-struct))
0d0 (/ pi 2d0) 1d-10 1d-5 10
(cffi:convert-to-foreign :gsl-integ-gauss-61 'gsl-cffi-bindings:.-155)
workspace result abs-error)
;; Return the result and estimate of the error of the result
(values (cffi:mem-ref result :double)
(cffi:mem-ref abs-error :double)))
;; Clean up
(gsl-cffi-bindings:integration-workspace-free workspace)))
==>
1.0
1.1102230246251565e-14


It may be ugly, but it works. Note that everything here translates pretty directly to the C code you would write to do this using GSL. In the future we'll explore how to make this a bit easier to use.

Till next time.

Sunday, June 23, 2013

Calculating Fibonacci Numbers

Yeah, I know, it's been done, it's boring, nothing new to see here. I understand, this is probably all known, but I just wanted to show this.

As I'm sure you are all aware, the Fibonacci sequence can be defined recursively via the equation:

$F(n) = F(n-1) + F(n-2)~~~~~~~F(n) = n~~\forall~~n \in \{0, 1\}$

Now, this is fine and good, but if you translate this formula into a recursive algorithm in the most naive way possible, it is also extremely computationally expensive. The fact that each number requires the knowledge of the two previous numbers means that a naive algorithm to compute this function runs in exponential time. The Lispy way (IMHO) to deal with this exponential explosion is to use memoization as a dynamic programming method. This will cache the previous results and turn this function into a linear time algorithm. Another common way to compute this is to run through the entire sequence until you get the value you are looking for. This also runs in linear time but (usually) with a smaller constant factor and has constant space requirements. Unfortunately, it requires a substantial change from the intuitive description of the problem, making the code much harder to understand, which is why I don't like it as much as the memoized recursion. However, in the name of clarity when exploring the complexity, I will be using the "sum up from the base case method".

But, anyway, my point was that you can easily write a linear time Fibonacci number calculator. That is what we would say under normal circumstances, but this isn't really true, or rather, it is only true if you assume constant time multiplication. For large $n$, it becomes clear that this algorithm is not, actually, linear. This is because it takes $O(N)$ time to add two $N$ digit numbers, and because at least some of the numbers that we wish to add are of order $F(n)$ (which have $O(\log F(n)) \sim O(n)$ digits), we see now that the actual time is more like $O(n\log F(n)) \sim O(n^2)$. This might seem a bit surprising, but look and see for yourself.

A quadratic fit to the time scaling of the Fibonacci sequence generator function

But, if we are willing to throw clarity out the window all together, we can also use a different technique to compute the number. We can use the closed form solution to the Fibonacci sequence, Binet's formula (all of this is from Wikipedia):

$F(n) = \frac {\psi^n - \phi^n}{\psi - \phi}$

…where…

$\psi = \frac{1+\sqrt 5}{2} \\ \phi = \frac{1-\sqrt 5}{2}$

And, because $\psi = -\phi^{-1}$, we can simply say:

$F(n) = \frac {\phi^n - (- \phi)^{-n}}{\sqrt 5}$

Now, I hear you thinking, these aren't going to work if you actually want the exact number, because these have nasty things like $\sqrt 5$ and and $\phi^{-n}$ in them. And you would be correct that this is not something that you can simply type into any programming language and expect a valid number back. But the fact of the matter is, these formulas are actually correct, it is the language and the language's insistence that computation on general real numbers be performed using approximate floating point arithmetic that causes such an issue in the first place. If, however, we were to turn to the venerable bc, an arbitrary precision calculator that is already on your computer if you have a UNIX like operating system, then we can try it out for various $n$ by setting the precision to a "high enough" value (the scale variable sets how many decimal digits of precision you have).

scale=100

phi = (1+sqrt(5))/2

define fib (n) {
(phi^n - (-phi)^(-n))/sqrt(5) }

fib(5)
4.999999999999999999999999999999999999999999999999999999999999999999\
9999999999999999999999999999999988

fib(10)
54.99999999999999999999999999999999999999999999999999999999999999999\
99999999999999999999999999999999728

fib(100)
354224848179261915074.9999999999999999999999999999999999999999999999\
999999999999999999999999999999981555491928230697552907

fib(200)
280571172992510140037611932413038677189524.9999999999999999999999999\
99999999999999999999999999999997069407044903016136967720432076973572\
3021244

fib(400)
17602368064501396646822694539241125077038438330449219188672599289657\
5345044216019674.999999999999996317359669914941458065726733504366353\
6782163342769596482054432795143577733011851910134

fib(500)
13942322456169788013972438287040728395007025658769730726410896294832\
5571622863290691557658876222517646901.483330817850463056986072289669\
50027087266539483120658105031310816022996182603489404860217143670694\
09


We see that as the numbers are computed in the most significant digits, the error accumulates in the least significant. Once the error meets the important numbers, which has happened at the 500th Fibonacci number, we have reached the point where we actually don't know if the last digits are correct or not. At that point we need to increase the scale and try again.

Also, if you try this at home, you will find that the computation takes longer for larger values of $n$, despite this being a closed form expression. This isn't really surprising, the closed for expression may seem magical, but it still needs to be implemented with algorithms that multiply, divide, exponentiate, and "square root" very large or very precise numbers. More on this later.

Incidentally, the CLISP implementation of Common Lisp also has arbitrary precision arithmetic and can be used to perform the same calculation much faster (also, you can also do this in other imps with some success using rational numbers if you take the time to write a sqrt function that will return a rational approximation of the square root to within the needed accuracy). If you try this at home using CLISP, note that the scale value in bc designates decimal digits while (long-float-digits) in CLISP designates binary digits (i.e. the CLISP one needs to be larger for the same precision. For example, CLISP needs a value of 3322 to give around 1000 decimal digits of precision).

(setf (long-float-digits) 350)

(let ((phi (/ (+ 1 (sqrt 5L0)) 2L0)))
(defun fib (n)
(/ (- (expt phi n) (expt (- phi) (- n)))
(sqrt 5L0))))

(fib 100)
3.542248481792619150750000000000000000000000000000000000000000000000000\
000000000000000000000000000000000005L20

(fib 200)
2.805711729925101400376119324130386771895250000000000000000000000000000\
0000000000000000000000000000000000084L41

(fib 400)
1.760236806450139664682269453924112507703843833044921918867259928965753\
4504421601967500000000000000000000109L83

(fib 500)
1.394232245616978801397243828704072839500702565876973072641089629483255\
7162286329069155765887622252129412605L104


This works okay, but we have to constantly check to make sure that we haven't run out of precision in our number. We can fix this by computing the number of significant digits needed to accurately compute $F(n)$ by first computing an approximate $F(n)$ with floating point numbers and using that approximate value as a guess of how many $b$-ary digits you need in have by taking the $\log_b$.

(defun fib-approx (n)
(let ((phi (/ (+ 1 (sqrt 5d0)) 2d0)))
(/ (- (expt (float phi 0d0) n) (expt (- (float phi 0d0)) (- n)))
(sqrt 5d0))))

(defun fib (n)
(setf (long-float-digits) (max 64 (round (* 1.5 (log (fib-approx n) 2)))))
(let ((phi (/ (+ 1 (sqrt 5L0)) 2L0)))
(nth-value 0 (round (/ (- (expt phi n) (expt (- phi) (- n)))
(sqrt 5L0))))))

(fib 5)
5

(fib 100)
354224848179261916032

(fib 1000)
43466557686937456435688527675040625802564660517371780402481729089536\
55541794905189040387984007925516929592259308032263477520968962323987\
33224711616429964409065331879382989696499285160037044761377951668492\
28875


However, this only really works up to the point when the double floats start to overflow (Fibonacci numbers less than around 300 digits). After that our fast approximation no longer works as the Fibonacci numbers are too big to fit into any machine sized number. Is there a way to just get an approximate ballpark figure for the size of a Fibonacci number without actually calculating it? Yes. All we need to do is look at the what is important in the equation when $n$ is large. If $n$ is large, $(-\phi)^{-n}$ is extremely small since $\phi^{-1}$ is smaller than 1. This means that we can approximate the large $n$ Fibonacci numbers as:

$F(n\gg 1) = \frac{\phi^n}{\sqrt 5}$

Now, we know that the number of $b$-ary digits in a number is roughly equal to the $\log_{b}$ of the number. This means that the number of digits in a large Fibonacci number is:

$\log F(n \gg 1) = \log \phi^n - \log \sqrt 5 = n\log\phi - \log\sqrt 5$

Adding this to our Fibonacci number calculator:

(defun log-fib-approx (n &optional (base 2))
(let ((phi (/ (+ 1 (sqrt 5d0)) 2d0)))
(- (* n (log phi base)) (log (sqrt 5) base))))

(defun fib (n)
(setf (long-float-digits) (max 64 (round (* 1.5 (log-fib-approx n 2)))))
(let ((phi (/ (+ 1 (sqrt 5L0)) 2L0)))
(nth-value 0 (round (/ (- (expt phi n) (expt (- phi) (- n)))
(sqrt 5L0))))))

(fib 5)
5

(fib 100)
354224848179261915075

(fib 1000)
43466557686937456435688527675040625802564660517371780402481729089536555417\
94905189040387984007925516929592259308032263477520968962323987332247116164\
2996440906533187938298969649928516003704476137795166849228875

(fib 10000)
33644764876431.......84058580903081179711592936853976188104504236323110912


At this point, we might ask how all this compares to calculating the number via the recurrence equation? On the one hand, using intelligent algorithms we have found an $O(n^2)$ time algorithm based on the recurrence relation. On the other, it isn't totally clear what the scaling behavior of the proposed closed form algorithm is. In the Binet's formula based method, we have (roughly in order carried out in the calculation):

1. An arbitrary precision square root
2. An fixed to arbitrary precision addition (adding one to a value)
3. An arbitrary precision to fixed division (divide by two)
4. Two arbitrary precision exponentiations
5. An arbitrary precision subtraction
6. An arbitrary precision division

Looking at this list, the obvious prime suspects for the bottleneck are the square root and the exponentiation, but let's look at each. Wikipedia tells me that multiplication and division have the same complexity lower bounded with $O(N\log N)$ (probably) and that addition and subtraction run in $O(N)$ time.

Using simple divide and conquer exponentiation gives an $O(N\log N \log n)$ time scaling for exponent $n$. The only thing that is left is the scaling of the square root. Looking through the GMP docs tells me that this can be done in $O(N\log N)$ time (again assuming the Schönhage-Strassen conjecture for multiplication). This means that the most expensive part of the calculation is the exponentiation, at $O(\log F(n)\log \log F(n) \log n)$ time. What does that mean for our algorithm?

The Fibonacci sequence grows in an exponential manner. This is clear from recurrence, $F(n+1) = F(n) + F(n-1) \sim 2F(n)$. Roughly speaking, increasing $n$ by a constant factor has the effect of multiplying $F$ by a constant factor, which is basically the definition of "exponential". But, in our complexity, We are taking the $\log$ of that number, which "undoes" the exponential function. More concretely, we actually calculated a large $n$ approximation of $\log F(n)$ above, which turned out to scale linearly with $n$. All of this boils down to the relationship I mentioned before: $O(\log F(n)) \sim O(n)$. Thus, we expect this calculation to be $O(n (\log n)^2)$, whereas the complexity for the recurrence based results took $O(n^2)$. This result demonstrates that Binet's formula is, in fact, an asymptotic winner in this computational contest. When you plot the timing data, it seems very linear (I certainly cannot seem to resolve the $(\log n)^2$ portion of the scaling behavior).

If you are actually interested in calculating Fibonacci numbers very quickly and are using CLISP, then it is probably a safe bet to use the Binet's formula method rather than the more intuitive algorithm even for small numbers. On CLISP, the differences in run times are around two orders of magnitude faster than the intuitive method (although it seems to be much less predictable in terms of run time in any particular run, my prime suspect on this is the GC). Using SBCL (a much more sophisticated compiler), the intuitive algorithm runs significantly faster than on CLISP, but still is no match for Binet's method (on CLISP).

Thursday, June 6, 2013

Some Customizations to W3m-Mode

For many tasks, I find that it is easier to use a text mode browser in Emacs than it is to use an external browser. This is clearly the case when you would like to copy some text from a web page into an Emacs buffer, like when you are grabbing code examples from a manual or tutorial. One of the main uses I have of this is to quickly browse the Common Lisp Hyperspec. I have found that W3m and W3m-Mode is quite good for these tasks. W3m is a text mode browser that exists independently from Emacs. W3m-Mode is a mode that connects to W3m, sends requests from Emacs to the external process, and receives the output and displays it in an Emacs buffer. Edit: If you want to set Emacs up to use this browser, or any browser, customize the browse-url-browser-function variable.

Of course W3m won't work with a lot of "modern" websites, anything that uses Flash (of course), or anything that utilizes Javascript extensively. Even with all of those barriers, it works surprisingly well because much of the information that is out there on the Internet (and most of the information that I consume using this browser) is actually just text. Further, thanks to HTML5 and the return of the separation of content (HTML) and design (CSS) many websites can be rendered usefully in text.

However, as much as I have learned to like W3m-Mode, there are some annoying aspects of the mode that are the default. Namely, I would like to be able to move through the buffer with the standard directional keys, as well as switch tabs in the way that most other browsers do via "C-<tab>", and move forward and backward in history via "M-<right>" and "M-<left>", respectively. With a little help from the Emacs folks on /r/emacs, here are some bindings that will set this up for you.

;; W3m for browsing
(defun w3m-mode-options ()
(local-set-key (kbd "<left>") 'backward-char)
(local-set-key (kbd "<right>") 'forward-char)
(local-set-key (kbd "<up>") 'previous-line)
(local-set-key (kbd "<down>") 'next-line)
(local-set-key (kbd "M-<left>") 'w3m-view-previous-page)
(local-set-key (kbd "M-<right>") 'w3m-view-next-page)
(local-set-key (kbd "C-<tab>") 'w3m-next-buffer)
(local-set-key (kbd "q") 'bury-buffer))


That last local-set-key is to rebind the "q" key. By default, "q" tells W3m-Mode to kill all W3m sessions and close the window. This isn't the right thing to do for a couple of reasons. First, if I want to kill all W3m sessions, I will do so by killing the buffer (which will kill the sessions). I don't need a separate key for that. Second, closing windows without the user expressly asking for a window to be closed is not good Emacs interface style. The correct thing to do is to go away (whatever that means) and leave the window to be filled with another buffer. The only exception to the rule is popup windows, windows that are meant to be extremely short lived (less than a minute) and which created the window in which the reside. W3m is most definitely not a popup window. Looking at this, I felt that the best solution is to map "q" to bury-buffer and leave it at that.

There are a few other bindings that are different from the standard key bindings that a browser uses, including making a new tab and kill that tab. These exist and are reasonable (they fit in with the standard Emacs user interface), just check the help for W3m-Mode via "C-h m".

Saturday, June 1, 2013

A Multiple Cursor Trick and an Improvement

I have been baking, in the back of my mind, a way to make multiple cursors more powerful. What I wanted was a way to move through a buffer, and mark certain places where I want to place my multiple cursors, and then make my changes. I think this idea is basically baked enough to reveal that… it is not actually necessary to modify the multiple-cursors library to achieve this goal.

If you want to do this, you can simply insert a sequence of characters that is unique at each point that you wish to place a cursor (something that you would never use, like some crazy Unicode character, say "ಠ", you just have to find an easy way to insert it into the buffer). Then, select that tagging sequence and use mc/mark-all-like-this and edit away. This actually works in a pinch. For instance, you can do things like this:

However, I guess that this is slightly more hackish than I like, so I came up with a new, better method. What I decided would be pretty awesome is to have multiple-cursors be able to mark spots off the top of the mark ring (either via popping or just walking the mark ring). I defined a function mc/mark-pop that will just this. I use the suggested multiple cursors and expand region key bindings and bound mc/mark-pop to "C-S-p" (which makes sense on a Dvorak keyboard layout, if you use Qwerty, you might use "C-?"). This means that you can do some pretty awesome stuff like this:

I have submitted a pull request to Magnars which has been accepted. This is still a bit inconvenient to use, but it has promise for being an excellent building block for future multiple cursor editing tricks.

Wednesday, May 29, 2013

Hunting Firefox Bugs: Hang On Focus

I am just putting this here so that this might show up a bit easier in a Google search. I recently figured out a solution (sort of) to one of these nagging issues that I have been facing for a long time. The issue is that sometimes Firefox gets in a state where it hangs for a few seconds, pegging the CPU, whenever the window comes into focus (i.e. you click on it). This means it is very difficult to switch back and forth between Firefox and another window. This was a mystery for a long while. However, I found these two bug reports today. The earlier bug was reported around two years ago, and they supposedly pushed a fix around 8 months ago, but it is still biting (me) today.

This seems to actually be a limitation of the current implementation of the global menu in Ubuntu. This works via DBUS, which apparently isn't very efficient at populating large menus. Firefox has a bookmarks menu which means that having many entries in your bookmark database will cause Firefox to hang as that menu is populated. This was actually pretty difficult to figure out as the behavior is intermittent. Often Firefox would work as expected, but sometimes this hanging would start and it would persist until reboot of the entire system (restarting Firefox is not enough). I now suspect that this state is triggered by setting a bookmark, but I am unsure.

Anyway, the fix here is to not use bookmarks too heavily. If you delete all of your bookmarks this problem will go away, and this is just what I did (after I saved them to disk). That said, this really puts me, and people who browse the Web like I do, in a predicament.

When I find an article or web page that I am interested in, I cannot usually read it when I find it. So, I like to store it for future consumption. For this purpose, I used to use tabs, but as everybody knows, this will bog your computer down as most browsers attempt to keep them in RAM. Firefox, fairly recently, started releasing them from memory after they haven't been used, which helps matters (unfortunately they made the, IMHO, stupid decision to completely reload on restore rather than cache to disk). But, even with this memory friendly behavior from Firefox, it will still routinely bloat to greater than 1GB of resident RAM and needs to be restarted (presumably so that the tabs will be flushed out of RAM). And before the "Firefox is just making efficient use of memory" people show up, Firefox is not making good use of memory if it causes my computer to hang for 30+ seconds when switching windows that have nothing to do with Firefox (which it does in certain extreme cases), or if it causes Compiz, Xorg, or my entire computer to crash (which happens every once in a while and correlates so strongly with Firefox memory consumption that I have to suspect it as a culprit).

To combat this, a while ago I started categorizing pages I find as I want to read this within a week or I want to read sometime later in the future. The ones I want to read within a week I keep as tabs and the ones that I want to read sometime in the future I set a bookmark to find it easily later. I figured that bookmarks should take practically no resources to hold other than a bit of disk space. With this bug, however, setting bookmarks on interesting pages isn't really a good option either. This means that I now have my bookmarks in a separate HTML and JSON file while I try to come up with a better solution.

I suppose that this isn't an issue with Chromium or Chrome as those don't use the global menu for bookmarks (they have a separate menu right of the address bar).

Thursday, May 23, 2013

Quickly Searching For a String in a Buffer

So this is a bit embarrassing that I have never seen this solution before, but I recently saw this problem in a Coursera lecture.

Given a buffer sequence of length $N$ and a query sequence of length $m$, find all occurrences of the query in the buffer.

The most transparent way to solve this problem is to search each starting position in the buffer for the query string. This is clearly $O(N*m)$ (for small $m$, that is. It is really around $O((N-m)*m)$. However, in big $O$ fashion, we normally say $O(N)$ and leave it at that).

(defun get-string (i length buffer)
"Extract the string without any copying or memory use."
(make-array length
:element-type 'character
:displaced-to buffer
:displaced-index-offset i))

(defun naive-search (query buffer)
"Find all occurrences of query in buffer."
(iter (for i :to (- (length buffer) (length query)))
(when (equal query (get-string i (length query) buffer))
(collect i))))


Although this all pretty silly, as Common Lisp has a sequence search built in called search. In addition, it also has third party libraries that can do string searches and much more:

;; Can't get much simpler than...
(defun search-search (query buffer)
"Find all occurrences of query in buffer."
(iter
(for i :initially (search query buffer :start2 0)
:then (search query buffer :start2 (1+ i)))
(while i)
(collect i)))

;; If you are using strings, regular expressions are pretty versatile.
(defun regex-search (query buffer)
"Find all occurrences of query in buffer."
(let (acc)
(ppcre:do-matches (start end
query buffer
(reverse acc))
(push start acc))))


The interesting part of these is that they are probably much smarter than my naive implementation. Besides being highly tuned, they also most likely use parts of previous failed matches in subsequent matches, which means that they might run in $O(N)$ time (i.e. no $m$ factor).

The point is that either way I do this it would take $O(N)$ time to simply search the text for the value in the most naive way possible. In the lecture, however, they stated that you could sort the data and then perform a bisection search. I was taken aback as it seemed to me that you cannot sort a body of text without screwing up the order, which is needed because we are actually interested in matching a sequence of characters in a larger sequence.

The way we can do this, "sort the sequence" and keep the sequence in order, is to sort a different buffer that holds the indexes into the sequence. Sorting takes $O(N\log N)$, so a single search should actually take longer to compute, but once you have the indexing you can perform subsequent searches (with the same length or shorter queries) in $O(\log N)$. In order to accommodate this usage pattern, I have included an ad-hoc, manual, partial memoization mechanism. I return the computed sorted indexes (along with the maximum length of a valid query) as a second return value and take them as an optional argument (I use a class here so that the printed value doesn't flood your terminal). If you pass invalid cache parameters, the function detects this and recomputes.

;; Define a container for our cache
(defclass cache ()
((max-length :initarg :max-length :accessor max-length)
(buffer :initarg :buffer :accessor buffer)
(indexes :initarg :indexes :accessor indexes)))

(defun indexed-search (query buffer &optional cache)
"Find all occurrences of query in buffer."
;; Extract (or compute) the cache
(destructuring-bind (max-query-length sorted-indexes)
(if (and cache
(eq buffer (buffer cache))
(<= (length query) (max-length cache)))
(list (max-length cache) (indexes cache))
(list (length query)
(stable-sort (iter (for i :to (- (length buffer) (length query)))
(collect i :result-type vector))
#'string<
:key (lambda (i) (get-string i (length query) buffer)))))
(labels ((bisection (lo hi)
;; A simple bisection search
(let ((mid (floor (+ lo hi) 2)))
(cond ((= lo hi) lo) ; we have either found the first match or
; there are no matches
((string<= query (get-string (aref sorted-indexes mid)
(length query) buffer))
(bisection lo (min mid (- hi 1))))
(t (bisection (max mid (+ lo 1)) hi))))))
;; Perform the search and return the index and cache
(values (iter (for i :from (bisection 0 (length sorted-indexes)))
(while (string= query (get-string (aref sorted-indexes i)
(length query) buffer)))
(collect (aref sorted-indexes i)))
(make-instance 'cache :max-length max-query-length
:buffer buffer
:indexes sorted-indexes)))))


And indeed, after the $O(N\log N)$ preprocessing, this runs in $O(\log N)$ time. The timings work out that the initial sort takes a very long time (presumably due to poor performance in string<= for lexical order), but the subsequent queries are basically instantaneous (too fast to measure even for large buffer sequences). So, while you probably don't want to use this for normal searches, you might want to do this if you need to make many queries (of a limited length) on a immutable sequence, and it could result in a huge performance increase.

Well, I guess this just serves as a reminder that I still have a lot of basic stuff to learn. However, after writing this up, I doubt that I will forget it…

Thursday, May 16, 2013

ICFP Contest 2013 Anyone?

As you may have seen on Reddit this morning, Microsoft Research has put up an ICFP placeholder page.  This might seem a bit premature, but I would like to have some Common Lisp friends (new and old) to partake in ICFP this year, so I am getting started early.

The contest seems to be starting August 8th sometime and should run for the next 72 hours.  If you are at all interested in programming Common Lisp on a team ICFP attempt, please comment below, or email, or whatever your favorite method of communication.

This year I have a few new tools that should work better than last year, including a way that spectators can follow along without video streams that are susceptible to crappy resolutions (anyway, I now know someone that works on the YouTube team at Google, so he could probably get the old setup to work better).  I plan to post on those tools later to try and drum up interest as the contest approaches.

Wednesday, May 15, 2013

What is wrong with Python strings?

This is a micro-rant about Python.  If this isn't your thing, please excuse this post.  I have been trying to use Python in order to become more employable (well more employable for the sort of places I would like to work).  Here are a few grievances I have found.  These complaints are, I suspect, actually about the default Python behavior.  That doesn't actually diminish the criticisms much, however, as they are definitely barriers to entry.  For some of these things I have still not been able to find a way around them, which makes them pretty serious.
1. Python actually only allows ASCII characters?  Really?!?  If you include some odd (or not so odd) character like a lambda, it will choke.  If this doesn't seem like a big deal to you, you have clearly never dealt with any language other than English, nor have you dealt with the various fancy forms of punctuation.  You can instruct Python to accept a different encoding by putting a magic comment at the beginning of your source file that looks like "# coding: <encoding>".  As a reference, the last time this was an issue in any Lisp I tried, was around 5 years ago with CLISP (a fairly out of date Lisp at the time) and it was remedied shortly after.
2. What is going on with your string type?  If I index into the string type, I can land in the middle of a multi-byte character.  The very fact that this is possible is evidence that this string type is fundamentally screwed up.
3. Why is the printed representation of a Unicode string unreadable?  Seriously, we have terminals that can handle pretty much anything that you can throw at them, why print a bunch of line noise instead of the actual character?
4. What is wrong with the way you output in Python?  Why is it that I get an encoding error when I try to pipe or redirect output from my program to another program and a non-ASCII character is encountered?  I have no idea what is going on here, or how to get around it.  This is, quite possibly the biggest hiccup in terms of productivity.  I cannot even send my output to a file or to less in order to carefully review the output.  How hard is it to dump the output of a program into a file?  In the case of pipes, you might think that this is a shortcoming of the pipe itself, or of the program on the other side of the pipe, but this is not the case.  I routinely pipe all sorts of crazy text through pipes and to shell commands without any problems; whatever the issue is, it is rooted in Python.
In Lisp (my usual language of choice), a string is a vector of characters, not ASCII bytes or a vector of data that might be interpreted as characters, which I imagine is the primary shortcoming in Python here.  Including encodings other than ASCII seems to have been an afterthought, an ugly thing that was bolted on to make the language work with text that wasn't ASCII.  Why Python didn't decide at conception to do away with the idea of that a string is a sequence of bytes, I have no idea.  It seems like Python wanted to be a high level language, but stopped short of actually making it there, at least in the way strings are handled.  They seem to have the basic building blocks of a proper string type, but stupidly set the default behavior to only deal with ASCII strings, which is basically backwards, IMO.

People say (and I believe) that Common Lisp is simultaneously the highest level and lowest level programming language you will ever use.  In Lisp you can write code that is so abstracted you have have no idea of what code is actually being passed to the CPU, much less the data-structures or  memory usage.  But, in the same program, you can write code that translates directly to machine instructions in a predictable way (or even put inline C or assembly).  Because of its high level nature, I have never had to worry about strings with odd characters in them; they just work.  And, because of its low level nature, it is simple to convert strings down to ASCII when you have to, which is a very rare need in my experience.  You would think that Python, whose earliest versions where a full 5 years after the most recent standardization of Lisp (coming up on 20 years since then), would have done this better.

I could actually go on and on about things that I perceive Python sucking at, for instance the fact that there is a REPL but no iterative development, or the fact that the error messages I get are basically garbage (however, Lisp is not that much better in this regard), or the fact that you must explicitly have a "return" statement on your functions, and of course, of course, the accursed whitespace dependence (which isn't actually annoying because of the whitespace, but because I actually have to tab around to get the correct indentation level in Emacs).  But those things are probably just matters of taste, something that usually works out with time.  The string handling in Python, however, is basically inexcusable.