Jakub Szafran

1 Billion Rows challenge followup: trying Golang


A couple of days ago I wrote a post about an attempt at solving the 1 Billion Row Challenge with Python: no multiprocessing, no optimizations, just iterating over the file and parsing it line by line.

If you’re curious about how long it took to process 1 billion rows in CPython and PyPy, here’s the article.

In this one, we’ll rewrite the Python code as a more or less equivalent Go program and measure its execution time.

Go Gopher, go!

The overall approach stays the same as in the Python version: we have a struct, StationStatistics, that keeps track of a weather station’s min, max, sum, and count values. It is the receiver of a String() method that formats those values the way the 1BR challenge expects.

Station statistics are stored in a map keyed by station name.

We use a buffered scanner to iterate over the lines of the text file: we split each line on ;, convert the temperature value to a float, and update the statistics with the parsed data.
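To make the parsing step concrete, here is a minimal standalone sketch of turning one line into a station name and a temperature. The parseLine helper is hypothetical and not part of the program in this post; it uses strings.Cut (available since Go 1.18) rather than strings.Split, and it returns an error instead of exiting, which is just one possible design.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLine is a hypothetical helper (not taken from the original program).
// It splits "station;temperature" at the first ';' and parses the number.
func parseLine(line string) (string, float64, error) {
	station, raw, found := strings.Cut(line, ";")
	if !found {
		return "", 0, fmt.Errorf("missing ';' separator in %q", line)
	}
	temp, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad temperature in %q: %w", line, err)
	}
	return station, temp, nil
}

func main() {
	station, temp, err := parseLine("Hamburg;12.3")
	fmt.Println(station, temp, err) // prints: Hamburg 12.3 <nil>
}
```

Returning an error for a line without a separator avoids the index-out-of-range panic that indexing the result of a split would cause on malformed input.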

Once all data has been processed, we collect the station names into a slice, sort it, and print each station’s statistics to standard output.
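Go maps have no defined iteration order, which is why the names are sorted before printing. The sketch below shows that step in isolation. The formatSorted helper and the sample values are made up for illustration, and the comma-plus-space separator is an assumption based on the challenge's reference output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSorted is a hypothetical helper: it takes already formatted
// "min/mean/max" strings keyed by station and renders them in name order.
func formatSorted(entries map[string]string) string {
	names := make([]string, 0, len(entries))
	for name := range entries {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random, so sort explicitly

	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, name+"="+entries[name])
	}
	// Assumed separator: ", " between stations, wrapped in braces.
	return "{" + strings.Join(parts, ", ") + "}"
}

func main() {
	fmt.Println(formatSorted(map[string]string{
		"Hamburg": "8.0/10.0/12.0",
		"Abha":    "-3.0/18.0/30.5",
	}))
	// prints: {Abha=-3.0/18.0/30.5, Hamburg=8.0/10.0/12.0}
}
```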

package main

import (
	"bufio"
	"fmt"
	"log"
	"math"
	"os"
	"sort"
	"strconv"
	"strings"
	"time"
)

type StationStatistics struct {
	station string
	min     float64
	max     float64
	sum     float64
	count   int
}

func (ss StationStatistics) String() string {
	return fmt.Sprintf("%s=%.1f/%.1f/%.1f", ss.station, ss.min, ss.sum/float64(ss.count), ss.max)
}

func calculateStats(filePath string) {
	f, err := os.Open(filePath)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	stats := make(map[string]StationStatistics)

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		data := strings.Split(scanner.Text(), ";")
		temp, err := strconv.ParseFloat(data[1], 64)
		if err != nil {
			log.Fatal(err)
		}

		station := data[0]
		if cityStats, ok := stats[station]; ok {
			cityStats.min = math.Min(cityStats.min, temp)
			cityStats.max = math.Max(cityStats.max, temp)
			cityStats.count += 1
			cityStats.sum += temp
			stats[station] = cityStats
		} else {
			stats[station] = StationStatistics{
				station: station,
				min:     temp,
				max:     temp,
				sum:     temp,
				count:   1,
			}
		}
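	}
	// Scan() returns false on both EOF and read errors, so check which one it was.
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}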

	// Collect the station names and sort them alphabetically for output.
	stations := make([]string, 0, len(stats))
	for s := range stats {
		stations = append(stations, s)
	}

	sort.Strings(stations)

	fmt.Print("{")
	sep := ", "
	for i, s := range stations {
		if i == len(stations)-1 {
			sep = ""
		}
		fmt.Print(stats[s].String(), sep)
	}
	fmt.Print("}\n")
}

func main() {
	if len(os.Args) < 2 {
		log.Fatalf("usage: %s <measurements-file>", os.Args[0])
	}
	now := time.Now()
	calculateStats(os.Args[1])
	fmt.Printf("Elapsed: %s\n", time.Since(now))
}

Results

After several runs, this Go program turned out to be marginally faster than the Java baseline application: 3 minutes 33 seconds.

Let’s add it to the previous results chart:

Results chart with Golang added

(7/52) This is the 7th post of my blogging challenge (publishing 52 posts in 2024).