May 5, 2013
I sometimes use Python for quick-and-dirty computations. Its “batteries-included” standard library makes it convenient for all kinds of tasks. Python is an interpreted programming language, and it sometimes shows. I often complain about its performance, but today I thought I would do a benchmark to back up my frustrations.
So today, I want to do the simplest possible benchmark to see just how much slower Python is than other programming languages when it comes to CPU-bound operations. We’re going to count to a billion with a simple while loop.
I’ll start with C++ as a baseline. I’m compiling the following in g++ with no optimizations:
int i = 0;
while (i < 1000000000)
    ++i;
Normally you would use a for loop for something like this, but while loops are essentially identical in these languages.
The code above consistently takes about 1.78 seconds on my 2.6 GHz Intel Core i7 (so each iteration takes 1.78 nanoseconds). Just for fun, let’s take a look at the assembly for this loop:
jmp L2
L3:
addl $1, -20(%rbp)
L2:
cmpl $999999999, -20(%rbp)
setle %al
testb %al, %al
jne L3
Let’s go through this line by line. The first line tells the processor to jump to the L2 label, which is the conditional part of the loop. The second line is a label marking the beginning of the loop body. The third line corresponds to ++i in the source. Line 4 is the label we jumped to from line 1, marking the start of the conditional part of the loop. Computing whether i < 1000000000 is a multi-step process: first, line 5 computes i - 999999999 and updates the EFLAGS register with the result. The setle instruction on line 6 then looks at this register to determine whether i <= 999999999, storing the result in the AL (accumulator) register. The testb instruction on line 7 does a bitwise AND of AL with itself, updates ZF (the zero flag), and discards the result (i.e. this instruction tests whether AL is zero). Finally, line 8 tells the processor to jump back to the beginning of the loop if ZF is unset. This is not an efficient way to count to a billion!
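Just to make the flag dance concrete, here is a toy emulation of the five-instruction loop in Python (my illustration, not the author's code), with plain variables standing in for the memory slot, the AL register, and the zero flag. It is scaled down from 999999999 to 999 so it finishes instantly:

```python
# Toy emulation of the unoptimized loop, scaled down so it runs fast.
LIMIT = 999          # stands in for $999999999

i = 0                # the counter at -20(%rbp)
while True:
    # cmpl: compute i - LIMIT and record the result in "EFLAGS"
    diff = i - LIMIT
    # setle: AL = 1 if the comparison found i <= LIMIT
    al = 1 if diff <= 0 else 0
    # testb: AND AL with itself; ZF is set if the result is zero
    zf = (al & al) == 0
    # jne: keep looping only while ZF is unset
    if zf:
        break
    # addl: the loop body, ++i
    i += 1

print(i)  # 1000
```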
Interestingly, the math reveals that we are executing about 2.8 billion instructions per second on my 2.6 GHz CPU: five instructions per iteration, times one billion iterations, divided by 1.78 seconds. On modern superscalar processors like my Intel Core i7, it is common to retire more than one instruction per clock cycle on average, thanks to instruction-level parallelism.
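As a quick back-of-the-envelope check (my arithmetic, not the author's script):

```python
# Five instructions per iteration: addl, cmpl, setle, testb, jne.
instructions = 5 * 10**9       # one billion iterations
seconds = 1.78
ips = instructions / seconds
print(ips)                     # ~2.8 billion instructions per second
print(ips / (2.6 * 10**9))     # ~1.08 instructions per clock cycle
```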
Here is the assembly listing for the same code compiled with optimization level 1 (g++ -O1):
movl $1000000000, %edx
L2:
subl $1, %edx
jne L2
If we tell g++ to optimize the code any further, the optimizer actually removes the loop entirely! Notice that this assembly counts down from a billion rather than up. Compilers do weird things with your code. This version ran in 0.284 seconds. That’s more than 6 times faster than the version above.
The assembly is pretty straightforward. Line 1 moves the value 1000000000 into the EDX register. Line 2 is a label that marks the start of the loop body. Line 3 decrements the EDX register, setting ZF when the result reaches zero. Line 4 tells the CPU to jump back to the beginning of the loop if ZF is unset.

The core loop consists of two instructions (the decrement and the jump). 2 billion instructions in 0.284 seconds is about 7 billion instructions per second! Again, instruction-level parallelism is responsible for this incredible performance on a single core.
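The same back-of-the-envelope check for the optimized loop (again, my arithmetic):

```python
# Two instructions per iteration: subl and jne.
ips = 2 * 10**9 / 0.284
print(ips)                    # ~7.0 billion instructions per second
print(ips / (2.6 * 10**9))    # ~2.7 instructions retired per cycle
```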
Now let’s write the same code in Python.
i = 0
while i < 1000000000:
    i += 1
This is a direct line-by-line translation from the C++, and took 112.37 seconds in CPython, the most popular Python implementation. That’s over 63 times slower than unoptimized C++, and 396 times slower than optimized C++. Here are the opcodes, again, just for fun:
 0 LOAD_CONST               1 (0)
 3 STORE_FAST               0 (i)
 6 SETUP_LOOP              26 (to 35)
 9 LOAD_FAST                0 (i)
12 LOAD_CONST               2 (1000000000)
15 COMPARE_OP               0 (<)
18 POP_JUMP_IF_FALSE       34
21 LOAD_FAST                0 (i)
24 LOAD_CONST               3 (1)
27 INPLACE_ADD
28 STORE_FAST               0 (i)
31 JUMP_ABSOLUTE            9
34 POP_BLOCK
Lines 4-12 (bytecode offsets 9 through 31) are the nine opcodes run on each iteration of the loop. So 1 billion iterations * 9 opcodes / 112.37 seconds is about 80 million opcodes per second. It’s like we’re in the 1980s!
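A listing like the one above comes from the standard library's dis module; here is a sketch of how to produce it yourself (note that the exact opcodes vary by CPython version; SETUP_LOOP, for example, was removed in 3.8):

```python
import dis

def count():
    # The benchmark loop from the post, wrapped in a function
    # so we can disassemble it.
    i = 0
    while i < 1000000000:
        i += 1

# Print the bytecode; the output resembles the listing above,
# though opcode names and offsets depend on your CPython version.
dis.dis(count)
```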
So, Python is slow. What else is new? Let’s compare Python to another dynamic programming language.
PyPy—a faster Python implementation
CPython is slow because it interprets bytecode one opcode at a time. PyPy is an alternative implementation of Python that performs just-in-time (JIT) compilation for much greater performance (in both speed and memory usage). Running the Python benchmark above in PyPy takes 3.967 seconds, which is more than 28 times faster than CPython, but still 14x slower than optimized C++.
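If you want to reproduce these timings, a minimal harness along these lines works in both CPython and PyPy (the function name and the scaled-down default below are mine; times will of course vary by machine):

```python
import time

def count_to(n):
    # The benchmark loop from the post, wrapped in a function.
    i = 0
    while i < n:
        i += 1
    return i

start = time.time()
count_to(10**6)  # bump to 10**9 to reproduce the numbers in the post
print(time.time() - start)
```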
var i = 0;
while (i < 1000000000)
    ++i;