Have you ever needed to split a list of something into equally sized chunks? I have - and I always ended up writing my own code. While the problem is quite straightforward, Python 3.12 brings an addition to the itertools module - a batched class - which solves this problem in a convenient way with just a single line of code.
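For context, a hand-rolled version of such a helper typically looks something like the sketch below - a rough equivalent of what batched now provides, not the exact code from any particular project:

from itertools import islice

def my_batched(iterable, n):
    # manual equivalent of what itertools.batched now does out of the box
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch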
How can I use itertools.batched?
itertools.batched expects two arguments: an iterable and n (an integer representing the size of the batch). It batches data from the iterable into batches of size n (the last batch may be smaller than n). Batches are returned as tuples consisting of elements from the iterable.
Let’s take a look at a quick example:
import itertools

nums = [1, 2, 3, 4, 5, 6]

even_batches = itertools.batched(nums, 3)
uneven_batches = itertools.batched(nums, 4)
We’ve created two itertools.batched objects with different batch sizes. How can we access/consume the batches? You can iterate over them or lazily fetch the batches one by one with next():
for batch in even_batches:
    print(batch)
# (1, 2, 3)
# (4, 5, 6)

next(uneven_batches)
# (1, 2, 3, 4)
next(uneven_batches)
# (5, 6)
next(uneven_batches)
# StopIteration exception is raised
# as we've exhausted the iterator
If, for some reason, you need to produce all batches at once and store them somewhere, that is also easy to do:
import itertools

some_nums = (i for i in range(1, 10))

# materializing batches into a list (or a tuple, or whatever you need)
some_nums_batched = list(itertools.batched(some_nums, 3))

print(some_nums_batched)
# [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
My recent use case for itertools.batched
I needed to fetch some data from an external API (over HTTP). There was an API endpoint that exposed certain information about users, and you could get info about only 50 users in a single request. I needed to go over 170,000 users and, to speed things up, I decided to use async IO for this workload.
I also wanted to limit concurrent requests to avoid any problems with the API's rate limits (I decided to send a maximum of 100 concurrent requests at a time).
Taking the above requirements into consideration, I wrote something like this:
import itertools

user_ids = range(170_000)
max_concurrent_requests = 100
emails_per_request_limit = 50

batched_workload = itertools.batched(
    itertools.batched(user_ids, emails_per_request_limit),
    max_concurrent_requests,
)

# do some asynchronous stuff with batched workload
# ...
Then I’d simply iterate over batched_workload and asynchronously call this external API.
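For illustration, here is a minimal sketch of how that loop could look, assuming a hypothetical fetch_users coroutine that wraps the actual HTTP call (the real client code isn't shown in this post):

import asyncio
import itertools

async def fetch_users(user_id_batch):
    # hypothetical stand-in for the real HTTP call - it just echoes the ids
    await asyncio.sleep(0)
    return list(user_id_batch)

async def main():
    user_ids = range(170_000)
    emails_per_request_limit = 50
    max_concurrent_requests = 100

    batched_workload = itertools.batched(
        itertools.batched(user_ids, emails_per_request_limit),
        max_concurrent_requests,
    )

    results = []
    for request_batches in batched_workload:
        # each outer batch holds up to 100 inner batches,
        # so at most 100 requests are awaited concurrently
        responses = await asyncio.gather(
            *(fetch_users(batch) for batch in request_batches)
        )
        results.extend(responses)

asyncio.run(main())

Note that gather only limits concurrency within each outer batch; a semaphore would be another way to cap in-flight requests, but the nested batched approach keeps the code short.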
(17/52) This is the 17th post from my blogging challenge (publishing 52 posts in 2024).