I’ve recently come across the One Billion Row Challenge by Gunnar Morling. In this article I compare a simple, naive solution written in Python and executed on two different Python implementations: CPython and PyPy.
One Billion Row Challenge
The One Billion Row Challenge (often shortened to 1BRC) is about reading a text file containing 1 billion rows of temperature measurements, aggregating min, mean, and max per weather station, and displaying the output in alphabetical order like this:
{Abha=5.0/18.0/27.4, Abidjan=15.7/26.0/34.1, Abéché=12.1/29.4/35.6, Accra=14.7/26.4/33.1, Addis Ababa=2.1/16.0/24.3, Adelaide=4.1/17.3/29.7, ...}
The original challenge’s goal was to compute these stats as quickly as possible using Java; however, the internet was quickly flooded with numerous implementations in other languages.
At the time of writing this article, the best Java solution performed these calculations in just under 8 seconds (00:07.999 to be exact).
Note that execution time is dependent on the underlying hardware, so to be able to compare the submissions, Gunnar benchmarks them all on the same type of machine - a Hetzner Cloud CCX33 instance (8 dedicated vCPUs, 32 GB RAM).
Pythons’ duel
Since I use Python in my day-to-day job, I was curious about its performance in this challenge.
Obviously, it has to be slower (probably much slower) than statically typed languages - especially if you stick with the standard library and don’t use libraries such as Polars or Pandas (which leverage the speed of Rust or C).
It was also a good opportunity to do a little comparison between regular Python (the CPython implementation) and its alternative called PyPy (which claims to be on average 4.8 times faster than CPython 3.7).
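As a side note, if you ever need to confirm which implementation is actually executing your script, the standard library can tell you:

import platform

print(platform.python_implementation())  # e.g. "CPython" or "PyPy"
print(platform.python_version())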
Reference code
Here’s the code I used for comparing the timings:
import time
from datetime import timedelta


def print_stats(stats):
    # Render each station as "name=min/mean/max", sorted alphabetically
    formatted = ", ".join(
        [
            f"{k}={v['min']}/{v['sum']/v['count']:.1f}/{v['max']}"
            for k, v in sorted(stats.items(), key=lambda x: x[0])
        ]
    )
    print("{" + formatted + "}")


def calculate_stats(filepath: str):
    stats = {}
    with open(filepath, "rt") as f:
        for row in f:  # iterate lazily; the file is too big to load at once
            city, temp_str = row.strip().split(";")
            temp = float(temp_str)
            if city in stats:
                city_stats = stats[city]
                city_stats["min"] = min(city_stats["min"], temp)
                city_stats["max"] = max(city_stats["max"], temp)
                city_stats["sum"] += temp
                city_stats["count"] += 1
            else:
                stats[city] = {
                    "min": temp,
                    "max": temp,
                    "sum": temp,
                    "count": 1,
                }
    return stats


if __name__ == "__main__":
    start_time = time.monotonic()
    measurements_path = "/path/to/1br/file.txt"
    print_stats(calculate_stats(measurements_path))
    time_elapsed = round(time.monotonic() - start_time, 2)
    print(f"Total application time: {timedelta(seconds=time_elapsed)}")
The calculate_stats function goes through the provided file line by line (we iterate to avoid loading the whole file into memory, as it weighs over 12 GB) and parses each row into a weather station string and a temperature measurement float.
The parsed data is aggregated in a stats dictionary which maps each weather station to a dictionary with its statistics (min, max, sum, and count). Sum and count are later used for calculating the average (so that we don’t have to store a list of all measurements just to compute the mean).
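To make the shape concrete, here is what a single entry could look like after two measurements for the same station (the station name and values are hypothetical):

# After reading "Hamburg;12.0" and "Hamburg;34.2":
stats = {"Hamburg": {"min": 12.0, "max": 34.2, "sum": 46.2, "count": 2}}
mean = stats["Hamburg"]["sum"] / stats["Hamburg"]["count"]  # 23.1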
If the weather station already exists, we use the built-in min and max functions to determine whether the parsed temperature is smaller/greater than the values we already have, and we increment the sum and count values. Otherwise, we instantiate a dictionary filled with the parsed temperature value and count = 1.
The print_stats function prints the aggregated results in the format expected by the 1BRC.
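As a quick sanity check, both functions can be exercised on a tiny made-up sample (the rows below are hypothetical, just to illustrate the "station;temperature" format):

import tempfile

rows = "Hamburg;12.0\nBulawayo;8.9\nHamburg;34.2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(rows)

print_stats(calculate_stats(tmp.name))
# Prints: {Bulawayo=8.9/8.9/8.9, Hamburg=12.0/23.1/34.2}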
Installing PyPy
I use pyenv for managing multiple Python versions on my local machine.
To see the available pypy versions, type:
pyenv install -l | grep pypy
I’ll use pypy3.9-7.3.11 for this experiment:
pyenv install pypy3.9-7.3.11
Let’s make it our local version:
pyenv local pypy3.9-7.3.11
And it’s ready:
$ pypy
Python 3.9.16 (feeb267ead3e6771d3f2f49b83e1894839f64fb7, Dec 29 2022, 15:36:58)
[PyPy 7.3.11 with GCC Apple LLVM 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Results
I am using a MacBook Pro 2019 with a 2.6 GHz 6-core Intel Core i7 processor (12 logical cores) and 16 GB of RAM.
For comparing CPython and PyPy, I used the 3.9 versions of both implementations. On top of that, I executed the same code using Python 3.12.0a4 (just to check how noticeable the performance improvements added in 3.11 are).
Besides the Python executions, I also tested the baseline Java application available in the 1BRC repo (there’s no point comparing it against the Python runs as the code is different - it’s here just out of my curiosity).
The results look as follows:
| interpreter/language | total time (seconds) |
|---|---|
| CPython (3.9) | 1216 |
| CPython (3.12) | 1032 |
| PyPy (3.9) | 585 |
| Java | 216 |
And if put into a simple chart: [bar chart of the execution times above]
As you can see, PyPy ran roughly 2 times faster than CPython 3.9 (1216 s / 585 s ≈ 2.08) and 1.76 times faster than CPython 3.12 (1032 s / 585 s ≈ 1.76). Nice!
Can Python do better?
Definitely - see this solution. Reading the file in chunks and processing the chunks in separate processes allowed the author to go below 1 minute. Remember that execution time depends on the underlying hardware (processor speed, available cores) - my machine needed ~2.5 minutes to process 1 billion rows with this code.
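For the curious, here is a minimal sketch of that chunk-and-merge idea using only the standard library. It is my own simplification (byte-range chunking plus a multiprocessing pool), not the linked author’s actual code:

import multiprocessing as mp
import os


def find_chunks(filepath, n_chunks):
    # Split the file into byte ranges aligned to line boundaries
    # (assumes lines are far shorter than a chunk, which holds for 1BRC data)
    size = os.path.getsize(filepath)
    approx = size // n_chunks
    offsets = [0]
    with open(filepath, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(i * approx)
            f.readline()  # advance to the start of the next full line
            offsets.append(f.tell())
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))


def process_chunk(args):
    # Aggregate min/max/sum/count for one byte range of the file
    filepath, start, end = args
    stats = {}
    with open(filepath, "rb") as f:
        f.seek(start)
        for row in f.read(end - start).splitlines():
            city, temp_str = row.split(b";")
            city = city.decode()
            temp = float(temp_str)
            if city in stats:
                s = stats[city]
                s["min"] = min(s["min"], temp)
                s["max"] = max(s["max"], temp)
                s["sum"] += temp
                s["count"] += 1
            else:
                stats[city] = {"min": temp, "max": temp, "sum": temp, "count": 1}
    return stats


def merge(partials):
    # Combine the per-chunk dictionaries into one
    total = {}
    for stats in partials:
        for city, s in stats.items():
            if city in total:
                t = total[city]
                t["min"] = min(t["min"], s["min"])
                t["max"] = max(t["max"], s["max"])
                t["sum"] += s["sum"]
                t["count"] += s["count"]
            else:
                total[city] = s
    return total


if __name__ == "__main__":
    filepath = "/path/to/1br/file.txt"
    chunks = find_chunks(filepath, mp.cpu_count())
    with mp.Pool() as pool:
        partials = pool.map(process_chunk, [(filepath, s, e) for s, e in chunks])
    stats = merge(partials)
    # stats has the same shape as in calculate_stats above,
    # so print_stats(stats) can render it

The merge step is cheap because the number of distinct stations is tiny compared to a billion rows, so almost all of the work parallelizes across cores.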
(5/52) This is the 5th post of my blogging challenge (publishing 52 posts in 2024).