Python’s itertools for Memory-Efficient Iteration - NBD Lite #23
Efficiently handle large dataset iteration
If you are interested in more audio explanations, you can listen to the article in the AI-Generated Podcast by NotebookLM!👇👇👇
One of the most common activities in Python is to iterate over the data.
However, the native Python iteration process can be slow and memory-consuming.
It’s especially true when we iterate over extensive data; it can become exhaustive.
Using Python’s itertools module is a great way to efficiently handle large data iterations, especially when memory usage is a concern.
The package provides memory-efficient tools and a fast way to iterate over large datasets.
In this edition, we will discuss several functions from itertools. Here is the summary of what we will discuss:
In this notebook, you can check out the whole code for this article and compare its time/memory to the native Python function.
1. Chain
Chain is a function from itertools that iterates over multiple datasets as if they were a single dataset.
It’s a valuable function because it iterates over the dataset without creating a new object to hold the data points.
The following is how to use the chain
function:
import itertools
list1 = [1, 2, 3]
list2 = [4, 5, 6]
list3 = [7, 8, 9]
for item in itertools.chain(list1, list2, list3):
print(item)
If we compared the time and memory comparison to the native Python usage, here is the result:
As you can see, the native list might be faster, but the memory usage is much bigger than the chain
function.
The chain
A function is much more memory-efficient as it doesn't create new objects, so you might want to use it even if it’s slower.
2. islice
The islice
a function to iterate some part of the dataset in a memory-efficient manner.
The function works by generating an element lazily instead of creating a copy.
You can apply the function in this code.
import itertools
large_data = range(1000000)
for item in itertools.islice(large_data, 100, 110):
print(item)
In the code above, you only iterate the data at positions 100 to 110.
Time-wise, islice
function slightly slower but more memory efficient, just like the other code.
3. cycle
The cycle
function is used to iterate over an iterable infinitely until it breaks.
It doesn’t create a new structure but keeps looping through the original data.
This is useful for repeated operations over a dataset where you'd like to restart from the beginning.
Example Python code implementation is like the following:
import itertools
counter = 0
for item in itertools.cycle([1, 2, 3]):
print(item)
counter += 1
if counter == 10:
break
In contrast to our previous functions, cycle
function memory usage is slightly higher but might be faster than the native's.
4. Count
The count
is a function to generate infinite numbers, which is helpful if number generator iterations are needed without storing them.
You can use the following code to use the function:
import itertools
for i in itertools.count(start=10, step=2):
if i > 20:
break
print(i)
You can see that the count
time and memory usage were much less than the native range
function.
It would be best if you used the count
function if it’s applicable.
5. groupby
The groupby
function groups consecutive elements in an iterable based on a key function.
It's similar to SQL's GROUP BY
but works on sorted iterables.
Let’s see how it works with the Python code.
import itertools
data = [1, 1, 2, 2, 2, 3, 3, 1]
for key, group in itertools.groupby(data):
print(f"Key: {key}, Group: {list(group)}")
It works by grouping consecutive data with the same values and assigning the value as the key.
Let’s see how the speed and memory compare to the native usage.
As you can see, the manual grouping is faster but takes much more memory than the groupby
function, which is much more memory-efficient.
6. product
The product
function is used to create a cartesian product for the input variables.
Let’s see the implementation with a code example:
import itertools
for item in itertools.product([1, 2], ['A', 'B']):
print(item)
You can see that it’s useful to get all the possible combinations from the data we pass.
Let’s see how the time and memory comparison compared to the manual iteration.
The memory consumption of product
the function is much more efficient even though it’s kinda slower.
7. Permutation
The Permutation
function is similar to the product
function where it try to find all the combinations. However, the permutation
function only handles the data combination from one set of inputs.
We can see the example in the code below.
import itertools
for item in itertools.permutations([1, 2, 3]):
print(item)
We can see all the combinations are now available to us.
Let’s see their speed and memory comparison compare to the native.
You can see that the permutation
function is faster and the memory consumption is way less than the manual permutation.
It’s a no-brainer to use the permutation
function if you could.
8. Combination
The combination
function is similar to the permutation
function in that it tries to find all the combinations from one set of inputs.
The difference is that we can set the output length using the combination
function.
import itertools
for item in itertools.combinations([1, 2, 3], 2):
print(item)
As you can see, we pass an additional parameter to the function.
Let’s see how the speed and memory compare to the native.
Just like the permutation
, the combination
function is much more efficient in terms of speed and memory. It would be best if you used them whenever you can.
That’s all the Python’s itertools function you should know!
Are there any more things you would love to discuss? Let’s talk about it together!
👇👇👇