10 min read
Python is amazing.
Surprisingly, that’s a fairly ambiguous statement. What do I mean by ‘Python’? Do I mean Python the abstract interface? Do I mean CPython, the common Python implementation (and not to be confused with the similarly named Cython)? Or do I mean something else entirely? Maybe I’m obliquely referring to Jython, or IronPython, or PyPy. Or maybe I’ve really gone off the deep end and I’m talking about RPython or RubyPython (which are very, very different things).
While the technologies mentioned above are commonly-named and commonly-referenced, some of them serve completely different purposes (or, at least, operate in completely different ways).
Throughout my time working with the Python interfaces, I’ve run across tons of these .*ython tools. But not until recently did I take the time to understand what they are, how they work, and why they’re necessary (in their own ways).
In this tutorial, I’ll start from scratch and move through the various Python implementations, concluding with a thorough introduction to PyPy, which I believe is the future of the language.
It all starts with an understanding of what ‘Python’ actually is.
If you have a good understanding for machine code, virtual machines, and the like, feel free to skip ahead.
“Is Python interpreted or compiled?”
This is a common point of confusion for Python beginners.
The first thing to realize when making a comparison is that ‘Python’ is an interface. There’s a specification of what Python should do and how it should behave (as with any interface). And there are multiple implementations (as with any interface).
The second thing to realize is that ‘interpreted’ and ‘compiled’ are properties of an implementation, not an interface.
So the question itself isn’t really well-formed.
That said, for the most common Python implementation (CPython: written in C, often referred to as simply ‘Python’, and surely what you’re using if you have no idea what I’m talking about), the answer is: interpreted, with some compilation. CPython compiles* Python source code to bytecode, and then interprets this bytecode, executing it as it goes.
* Note: this isn’t ‘compilation’ in the traditional sense of the word. Typically, we’d say that ‘compilation’ is taking a high-level language and converting it to machine code. But it is a ‘compilation’ of sorts.
Let’s look at that answer more closely, as it will help us understand some of the concepts that come up later in the post.
Bytecode vs. Machine Code
It’s very important to understand the difference between bytecode vs. machine code (aka native code), perhaps best illustrated by example:
- C compiles to machine code, which is then run directly on your processor. Each instruction instructs your CPU to move stuff around.
- Java compiles to bytecode, which is then run on the Java Virtual Machine (JVM), an abstraction of a computer that executes programs. Each instruction is then handled by the JVM, which interacts with your computer.
In very brief terms: machine code is much faster, but bytecode is more portable and secure.
Machine code looks different depending on your machine, but bytecode looks the same on all machines. One might say that machine code is optimized to your setup.
Returning to CPython implementation, the toolchain process is as follows:
- CPython compiles your Python source code into bytecode.
- That bytecode is then executed on the CPython Virtual Machine.
Alternative VMs: Jython, IronPython, and More
As I mentioned earlier, Python has several implementations. Again, as mentioned earlier, the most common is CPython, but there are others that should be mentioned for the sake of this comparison guide. This a Python implementation written in C and considered the ‘default’ implementation.
But what about the alternative Python implementations? One of the more prominent is Jython, a Python implementation written Java that utilizes the JVM. While CPython produces bytecode to run on the CPython VM, Jython produces Java bytecode to run on the JVM (this is the same stuff that’s produced when you compile a Java program).
“Why would you ever use an alternative implementation?”, you might ask. Well, for one, these different Python implementations play nicely with different technology stacks.
CPython makes it very easy to write C-extensions for your Python code because in the end it is executed by a C interpreter. Jython, on the other hand, makes it very easy to work with other Java programs: you can import any Java classes with no additional effort, summoning up and utilizing your Java classes from within your Jython programs. (Aside: if you haven’t thought about it closely, this is actually nuts. We’re at the point where you can mix and mash different languages and compile them all down to the same substance. (As mentioned by Rostin, programs that mix Fortran and C code have been around for a while. So, of course, this isn’t necessarily new. But it’s still cool.))
As an example, this is valid Jython code:
[Java HotSpot(TM) 64-Bit Server VM (Apple Inc.)] on java1.6.0_51 >>> from java.util import HashSet >>> s = HashSet(5) >>> s.add("Foo") >>> s.add("Bar") >>> s [Foo, Bar]
IronPython is another popular Python implementation, written entirely in C# and targeting the .NET stack. In particular, it runs on what you might call the .NET Virtual Machine, Microsoft’s Common Language Runtime (CLR), comparable to the JVM.
You might say that Jython : Java :: IronPython : C#. They run on the same respective VMs, you can import C# classes from your IronPython code and Java classes from your Jython code, etc.
It’s totally possible to survive without ever touching a non-CPython Python implementation. But there are advantages to be had from switching, most of which are dependent on your technology stack. Using a lot of JVM-based languages? Jython might be for you. All about the .NET stack? Maybe you should try IronPython (and maybe you already have).
By the way: while this wouldn’t be a reason to use a different implementation, note that these implementations do actually differ in behavior beyond how they treat your Python source code. However, these differences are typically minor, and dissolve or emerge over time as these implementations are under active development. For example, IronPython uses Unicode strings by default; CPython, however, defaults to ASCII for versions 2.x (failing with a UnicodeEncodeError for non-ASCII characters), but does support Unicode strings by default for 3.x.
Just-in-Time Compilation: PyPy, and the Future
So we have a Python implementation written in C, one in Java, and one in C#. The next logical step: a Python implementation written in… Python. (The educated reader will note that this is slightly misleading.)
Here’s where things might get confusing. First, lets discuss just-in-time (JIT) compilation.
JIT: The Why and How
Recall that native machine code is much faster than bytecode. Well, what if we could compile some of our bytecode and then run it as native code? We’d have to pay some price to compile the bytecode (i.e., time), but if the end result was faster, that’d be great! This is the motivation of JIT compilation, a hybrid technique that mixes the benefits of interpreters and compilers. In basic terms, JIT wants to utilize compilation to speed up an interpreted system.
For example, a common approach taken by JITs:
- Identify bytecode that is executed frequently.
- Compile it down to native machine code.
- Cache the result.
- Whenever the same bytecode is set to be run, instead grab the pre-compiled machine code and reap the benefits (i.e., speed boosts).
This is what PyPy implementation is all about: bringing JIT to Python (see the Appendix for previous efforts). There are, of course, other goals: PyPy aims to be cross-platform, memory-light, and stackless-supportive. But JIT is really its selling point. As an average over a bunch of time tests, it’s said to improve performance by a factor of 6.27. For a breakdown, see this chart from the PyPy Speed Center:
PyPy is Hard to Understand
But there’s a lot of confusion around PyPy (see, for example, this nonsensical proposal to create a PyPyPy…). In my opinion, that’s primarily because PyPy is actually two things:
A Python interpreter written in RPython (not Python (I lied before)). RPython is a subset of Python with static typing. In Python, it’s “mostly impossible” to reason rigorously about types (Why is it so hard? Well consider the fact that:
x = random.choice([1, "foo"])
would be valid Python code (credit to Ademan). What is the type of
x? How can we reason about types of variables when the types aren’t even strictly enforced?). With RPython, you sacrifice some flexibility, but instead make it much, much easier to reason about memory management and whatnot, which allows for optimizations.
A compiler that compiles RPython code for various targets and adds in JIT. The default platform is C, i.e., an RPython-to-C compiler, but you could also target the JVM and others.
Solely for clarity in this Python comparison guide, I’ll refer to these as PyPy (1) and PyPy (2).
Why would you need these two things, and why under the same roof? Think of it this way: PyPy (1) is an interpreter written in RPython. So it takes in the user’s Python code and compiles it down to bytecode. But the interpreter itself (written in RPython) must be interpreted by another Python implementation in order to run, right?
Well, we could just use CPython to run the interpreter. But that wouldn’t be very fast.
Instead, the idea is that we use PyPy (2) (referred to as the RPython Toolchain) to compile PyPy’s interpreter down to code for another platform (e.g., C, JVM, or CLI) to run on our machine, adding in JIT as well. It’s magical: PyPy dynamically adds JIT to an interpreter, generating its own compiler! (Again, this is nuts: we’re compiling an interpreter, adding in another separate, standalone compiler.)
In the end, the result is a standalone executable that interprets Python source code and exploits JIT optimizations. Which is just what we wanted! It’s a mouthful, but maybe this diagram will help:
To reiterate, the real beauty of PyPy is that we could write ourselves a bunch of different Python interpreters in RPython without worrying about JIT. PyPy would then implement JIT for us using the RPython Toolchain/PyPy (2).
In fact, if we get even more abstract, you could theoretically write an interpreter for any language, feed it to PyPy, and get a JIT for that language. This is because PyPy focuses on optimizing the actual interpreter, rather than the details of the language it’s interpreting.
As a brief digression, I’d like to mention that the JIT itself is absolutely fascinating. It uses a technique called tracing, which executes as follows:
- Run the interpreter and interpret everything (adding in no JIT).
- Do some light profiling of the interpreted code.
- Identify operations you’ve performed before.
- Compile these bits of code down to machine code.
For more, this paper is highly accessible and very interesting.
To wrap up: we use PyPy’s RPython-to-C (or other target platform) compiler to compile PyPy’s RPython-implemented interpreter.
After a lengthy comparison of Python implementations, I have to ask myself: Why is this so great? Why is this crazy idea worth pursuing? I think Alex Gaynor put it well on his blog: “[PyPy is the future] because [it] offers better speed, more flexibility, and is a better platform for Python’s growth.”
- It’s fast because it compiles source code to native code (using JIT).
- It’s flexible because it adds the JIT to your interpreter with very little additional work.
- It’s flexible (again) because you can write your interpreters in RPython, which is easier to extend than, say, C (in fact, it’s so easy that there’s a tutorial for writing your own interpreters).
Appendix: Other Python Names You May Have Heard
Python 3000 (Py3k): an alternative naming for Python 3.0, a major, backwards-incompatible Python release that hit the stage in 2008. The Py3k team predicted that it would take about five years for this new version to be fully adopted. And while most (warning: anecdotal claim) Python developers continue to use Python 2.x, people are increasingly conscious of Py3k.
- Cython: a superset of Python that includes bindings to call C functions.
- Goal: allow you to write C extensions for your Python code.
- Also lets you add static typing to your existing Python code, allowing it to be compiled and reach C-like performance.
- This is similar to PyPy, but not the same. In this case, you’re enforcing typing in the user’s code before passing it to a compiler. With PyPy, you write plain old Python, and the compiler handles any optimizations.
Numba: a “just-in-time specializing compiler” that adds JIT to annotated Python code. In the most basic terms, you give it some hints, and it speeds up portions of your code. Numba comes as part of the Anaconda distribution, a set of packages for data analysis and management.
IPython: very different from anything else discussed. A computing environment for Python. Interactive with support for GUI toolkits and browser experience, etc.
- Psyco: a Python extension module, and one of the early Python JIT efforts. However, it’s since been marked as “unmaintained and dead”. In fact, the lead developer of Psyco, Armin Rigo, now works on PyPy.
Python Language Bindings
RubyPython: a bridge between the Ruby and Python VMs. Allows you to embed Python code into your Ruby code. You define where the Python starts and stops, and RubyPython marshals the data between the VMs.
PyObjc: language-bindings between Python and Objective-C, acting as a bridge between them. Practically, that means you can utilize Objective-C libraries (including everything you need to create OS X applications) from your Python code, and Python modules from your Objective-C code. In this case, it’s convenient that CPython is written in C, which is a subset of Objective-C.
PyQt: while PyObjc gives you binding for the OS X GUI components, PyQt does the same for the Qt application framework, letting you create rich graphic interfaces, access SQL databases, etc. Another tool aimed at bringing Python’s simplicity to other frameworks.