A couple of days ago I wrote a post about an attempt at solving the 1 Billion Row challenge with Python - no multiprocessing, no optimizations - just iterating over the file and parsing it line by line.
If you’re curious about how long it took to process 1 billion rows in CPython and PyPy, here’s the article.
In this one, we’ll rewrite the Python code as a more or less equivalent Go program and measure its execution time.
Go Gopher, go!
The overall approach stays the same as in the Python version: we have a struct, StationStatistics, that keeps track of a weather station’s min, max, sum, and count values. Its String() method formats these values (min, mean, and max) the way the 1 Billion Row challenge expects.
Station statistics are stored in a map keyed by station name.
We use a buffered scanner to iterate over the lines of the text file: we split each line by ;, convert the temperature value to a float, and update the statistics with the parsed data.
Once all data has been processed, we create a slice of station names, sort it, and print each station’s statistics to standard output.
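The per-line step described above can be tried in isolation. The following is only a minimal sketch: parseLine is a hypothetical helper that does not appear in the program below, it uses strings.Cut (available since Go 1.18) instead of strings.Split, and a small in-memory string stands in for the real measurements file.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseLine is a hypothetical helper (not part of the original program).
// It splits one "station;temperature" line into its two fields.
func parseLine(line string) (string, float64, error) {
	// strings.Cut splits around the first ";" and reports whether it was found.
	station, raw, found := strings.Cut(line, ";")
	if !found {
		return "", 0, fmt.Errorf("missing ';' separator in %q", line)
	}
	temp, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return "", 0, fmt.Errorf("invalid temperature in %q: %w", line, err)
	}
	return station, temp, nil
}

func main() {
	// Assumption: a tiny in-memory sample replaces the measurements file.
	input := "Hamburg;12.0\nBulawayo;8.9\nHamburg;-3.5\n"

	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		station, temp, err := parseLine(scanner.Text())
		if err != nil {
			fmt.Println("skipping line:", err)
			continue
		}
		fmt.Printf("%s -> %.1f\n", station, temp)
	}
	// Scan returns false on both EOF and read errors, so check which it was.
	if err := scanner.Err(); err != nil {
		fmt.Println("read error:", err)
	}
}
```

Returning an error instead of calling log.Fatal makes the helper easy to test on its own; the full program simply stops on the first bad line.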
package main

import (
	"bufio"
	"fmt"
	"log"
	"math"
	"os"
	"sort"
	"strconv"
	"strings"
	"time"
)

// StationStatistics accumulates the measurements of a single weather station.
type StationStatistics struct {
	station string
	min     float64
	max     float64
	sum     float64
	count   int
}

// String formats the statistics as "station=min/mean/max".
func (ss StationStatistics) String() string {
	return fmt.Sprintf("%s=%.1f/%.1f/%.1f", ss.station, ss.min, ss.sum/float64(ss.count), ss.max)
}

func calculateStats(filePath string) {
	f, err := os.Open(filePath)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	stats := make(map[string]StationStatistics)

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line has the form "station;temperature".
		data := strings.Split(scanner.Text(), ";")
		if len(data) != 2 {
			log.Fatalf("malformed line: %q", scanner.Text())
		}
		temp, err := strconv.ParseFloat(data[1], 64)
		if err != nil {
			log.Fatal(err)
		}
		station := data[0]

		if cityStats, ok := stats[station]; ok {
			// The map stores struct values, so cityStats is a copy
			// that has to be written back after the update.
			cityStats.min = math.Min(cityStats.min, temp)
			cityStats.max = math.Max(cityStats.max, temp)
			cityStats.count++
			cityStats.sum += temp
			stats[station] = cityStats
		} else {
			stats[station] = StationStatistics{
				station: station,
				min:     temp,
				max:     temp,
				sum:     temp,
				count:   1,
			}
		}
	}
	// Scan also returns false on a read error, not only at the end of the file.
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// Map iteration order is random, so collect and sort the station names.
	stations := make([]string, 0, len(stats))
	for s := range stats {
		stations = append(stations, s)
	}
	sort.Strings(stations)

	// The challenge's reference output separates entries with ", ".
	fmt.Print("{")
	for i, s := range stations {
		if i > 0 {
			fmt.Print(", ")
		}
		fmt.Print(stats[s].String())
	}
	fmt.Print("}\n")
}

func main() {
	if len(os.Args) < 2 {
		log.Fatalf("usage: %s <measurements file>", os.Args[0])
	}
	now := time.Now()
	calculateStats(os.Args[1])
	fmt.Printf("Elapsed: %s\n", time.Since(now))
}
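One detail in the listing above is easy to miss: the map holds StationStatistics values, so a lookup returns a copy, and the updated copy has to be stored back with stats[station] = cityStats. A possible variation, sketched below, keeps pointers in the map so that fields can be updated in place. The stationStats type and the update function are hypothetical names introduced for this illustration, and whether this changes the run time is not measured here.

```go
package main

import (
	"fmt"
	"math"
)

// stationStats is a hypothetical, trimmed-down version of StationStatistics.
type stationStats struct {
	min, max, sum float64
	count         int
}

// update is a hypothetical helper: it records one temperature for a station.
// Because the map stores pointers, existing entries are modified in place
// and no write-back to the map is needed.
func update(stats map[string]*stationStats, station string, temp float64) {
	if s, ok := stats[station]; ok {
		s.min = math.Min(s.min, temp)
		s.max = math.Max(s.max, temp)
		s.sum += temp
		s.count++
		return
	}
	stats[station] = &stationStats{min: temp, max: temp, sum: temp, count: 1}
}

func main() {
	stats := make(map[string]*stationStats)
	update(stats, "Hamburg", 12.0)
	update(stats, "Hamburg", 8.0)

	s := stats["Hamburg"]
	// Prints min/mean/max: 8.0/10.0/12.0
	fmt.Printf("%.1f/%.1f/%.1f\n", s.min, s.sum/float64(s.count), s.max)
}
```

The trade-off is one heap allocation per distinct station in exchange for skipping the second map write on every line.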
Results
After several runs, it turned out that the performance of this Go application was marginally better than that of the Java baseline application - 3 minutes 33 seconds.
Let’s add it to the previous results chart:
(7/52) This is the 7th post of my blogging challenge (publishing 52 posts in 2024).