Wednesday, May 15, 2013

What is wrong with Python strings?

This is a micro-rant about Python.  If this isn't your thing, please excuse this post.  I have been trying to use Python in order to become more employable (well, more employable for the sort of places I would like to work).  Here are a few grievances I have found.  These complaints are, I suspect, really about the default Python behavior.  That doesn't diminish the criticisms much, however, as they are definitely barriers to entry.  For some of these things I have still not been able to find a way around them, which makes them pretty serious.
  1. Python actually only allows ASCII characters in source files?  Really?!?  If you include some odd (or not so odd) character like a lambda, it will choke.  If this doesn't seem like a big deal to you, you have clearly never dealt with any language other than English, nor have you dealt with the various fancy forms of punctuation.  You can instruct Python to accept a different encoding by putting a magic comment at the beginning of your source file that looks like "# coding: <encoding>".  As a reference, the last time this was an issue in any Lisp I tried was around 5 years ago with CLISP (a fairly out of date Lisp at the time) and it was remedied shortly after.
  2. What is going on with your string type?  If I index into the string type, I can land in the middle of a multi-byte character.  The very fact that this is possible is evidence that this string type is fundamentally screwed up.
  3. Why is the printed representation of a Unicode string unreadable?  Seriously, we have terminals that can handle pretty much anything that you can throw at them, why print a bunch of line noise instead of the actual character?
  4. What is wrong with the way you output in Python?  Why is it that I get an encoding error when I try to pipe or redirect output from my program to another program and a non-ASCII character is encountered?  I have no idea what is going on here, or how to get around it.  This is quite possibly the biggest hiccup in terms of productivity.  I cannot even send my output to a file or to less in order to carefully review the output.  How hard is it to dump the output of a program into a file?  In the case of pipes, you might think that this is a shortcoming of the pipe itself, or of the program on the other side of the pipe, but this is not the case.  I routinely send all sorts of crazy text through pipes and to shell commands without any problems; whatever the issue is, it is rooted in Python.
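
For anyone who hits the same wall as item 4, here is a minimal sketch of one likely explanation, written in Python 3 syntax so that it runs on a current interpreter. It assumes the cause is that Python 2 falls back to the ASCII codec when standard output is a pipe or a file rather than a terminal. The helper name `write_text` and the in-memory `io.BytesIO` buffer standing in for the pipe are illustrative choices, not anything from the project that prompted this rant.

```python
import io


def write_text(text, encoding):
    """Write text to an in-memory byte stream (a stand-in for a pipe)."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()  # push the encoded bytes into the underlying buffer
    return raw.getvalue()


# A stream that encodes as UTF-8 accepts the lambda character.
utf8_bytes = write_text("\u03bb", "utf-8")

# A stream that encodes as ASCII rejects it, which would match the
# encoding error seen when piping or redirecting.
try:
    write_text("\u03bb", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

If that assumption holds, setting the `PYTHONIOENCODING=utf-8` environment variable before running the script, or wrapping `sys.stdout` with `codecs.getwriter("utf-8")`, should let the output through a pipe or into a file.
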
In Lisp (my usual language of choice), a string is a vector of characters, not ASCII bytes or a vector of data that might be interpreted as characters, which I imagine is the primary shortcoming in Python here.  Including encodings other than ASCII seems to have been an afterthought, an ugly thing that was bolted on to make the language work with text that wasn't ASCII.  Why Python didn't decide at conception to do away with the idea that a string is a sequence of bytes, I have no idea.  It seems like Python wanted to be a high level language, but stopped short of actually making it there, at least in the way strings are handled.  They seem to have the basic building blocks of a proper string type, but stupidly set the default behavior to only deal with ASCII strings, which is basically backwards, IMO.
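
The difference between a vector of characters and a vector of bytes can be sketched in a few lines. This uses Python 3 syntax, where `bytes` plays roughly the role of the Python 2 `str` type and `str` plays the role of the Python 2 `unicode` type. The variable names are only illustrative.

```python
# One character, the Greek small letter lambda (U+03BB).
text = "\u03bb"

# Its UTF-8 encoding takes two bytes.
data = text.encode("utf-8")

char_count = len(text)  # counts characters
byte_count = len(data)  # counts bytes

# Slicing the byte string can land in the middle of the character;
# the first byte on its own is not valid UTF-8.
first_byte = data[:1]
try:
    first_byte.decode("utf-8")
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False

# Indexing the character string always yields a whole character.
first_char = text[0]
```

The character string behaves the way a Lisp string does, while the byte string is the one that lets an index fall inside a multi-byte character.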

People say (and I believe) that Common Lisp is simultaneously the highest level and lowest level programming language you will ever use.  In Lisp you can write code that is so abstracted you have no idea of what code is actually being passed to the CPU, much less the data-structures or memory usage.  But, in the same program, you can write code that translates directly to machine instructions in a predictable way (or even put inline C or assembly).  Because of its high level nature, I have never had to worry about strings with odd characters in them; they just work.  And, because of its low level nature, it is simple to convert strings down to ASCII when you have to, which is a very rare need in my experience.  You would think that Python, whose earliest versions were a full 5 years after the most recent standardization of Lisp (coming up on 20 years since then), would have done this better.

I could actually go on and on about things that I perceive Python sucking at, for instance the fact that there is a REPL but no iterative development, or the fact that the error messages I get are basically garbage (however, Lisp is not that much better in this regard), or the fact that you must explicitly have a "return" statement on your functions, and of course, of course, the accursed whitespace dependence (which isn't actually annoying because of the whitespace, but because I actually have to tab around to get the correct indentation level in Emacs).  But those things are probably just matters of taste, something that usually works out with time.  The string handling in Python, however, is basically inexcusable.


  1. Your rant is directed at the now retiring (but still the most used one) Python2. Python3 solves most of those problems you pointed out by default. In Python2, you usually solved this by putting a little `u` in front of strings to denote that they are unicode strings. If you are in Python 2.6 or 2.7, you can also use: `from __future__ import unicode_literals` to get the Python3-like behaviour.

    1. > from __future__ import unicode_literals

      Thanks for the tip. I knew there must be workarounds. I wonder if it solves the weird piping behavior?

      > the now retiring (but still the most used one) Python2

      It's unfortunate that this is basically the same thing that I was hearing 4 or 5 years ago. I guess the change will happen when "which python" points to v3 instead of v2.x.

      I hope that the fact that I wrote a rant about Python doesn't come off as though I dislike the language when all things are considered. I just needed to vent after a frustrating time using Python for a project. Sometimes I forget that people actually read this stuff every once in a while. This one isn't even informative... at least I mentioned the "# coding: " magic comment I found in the docs.