Python’s itertools for Memory-Efficient Iteration - NBD Lite #23

Efficiently handle large dataset iteration

Oct 04, 2024

One of the most common activities in Python is to iterate over the data.

However, naive iteration patterns that build intermediate lists can be slow and memory-hungry.

This is especially true for large datasets, where the cost adds up quickly.

Using Python’s itertools module is a great way to efficiently handle large data iterations, especially when memory usage is a concern.

The module, which is part of the Python standard library, provides fast, memory-efficient tools for iterating over large datasets.

In this edition, we will discuss eight functions from itertools: chain, islice, cycle, count, groupby, product, permutations, and combinations.

In this notebook, you can check out the whole code for this article and compare its time and memory usage against the native Python approach.
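As a rough sketch of how such a time and memory comparison could be made, the snippet below uses time.perf_counter and tracemalloc from the standard library. The measure helper and the test data are assumptions made for illustration; they are not taken from the notebook, and the exact numbers will vary by machine.

```python
import itertools
import time
import tracemalloc


def measure(func):
    """Return (elapsed_seconds, peak_bytes) for a single call of func."""
    tracemalloc.start()
    start = time.perf_counter()
    func()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak


# Hypothetical input data, built before measuring so it is not counted.
parts = [list(range(100_000)) for _ in range(3)]


def with_concatenation():
    total = 0
    # The + operator allocates a new combined list before the loop starts.
    for item in parts[0] + parts[1] + parts[2]:
        total += item
    return total


def with_chain():
    total = 0
    # chain walks the existing lists one after another without combining them.
    for item in itertools.chain(*parts):
        total += item
    return total


list_time, list_peak = measure(with_concatenation)
chain_time, chain_peak = measure(with_chain)
print(f"concatenation: {list_time:.4f}s, peak {list_peak} bytes")
print(f"chain:         {chain_time:.4f}s, peak {chain_peak} bytes")
```

Timings taken while tracemalloc is active are inflated, so a careful benchmark would measure time and memory in separate runs.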

1. chain

Chain is a function from itertools that iterates over multiple datasets as if they were a single dataset.

It’s a valuable function because it iterates over the dataset without creating a new object to hold the data points.

The following is how to use the chain function:

import itertools

list1 = [1, 2, 3]
list2 = [4, 5, 6]
list3 = [7, 8, 9]

for item in itertools.chain(list1, list2, list3):
    print(item)

Comparing its time and memory usage against the native Python approach gives the following result:

As you can see, the native list approach might be faster, but it uses much more memory than the chain function.

The chain function is much more memory-efficient because it doesn't build a new combined list, so you might want to use it even if it’s slower.
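To make the difference concrete, here is a small sketch that puts list concatenation next to chain, and also shows chain.from_iterable for the case where the inputs already sit inside one container. The batches name is a hypothetical stand-in for however your data is grouped.

```python
import itertools

list1 = [1, 2, 3]
list2 = [4, 5, 6]
list3 = [7, 8, 9]

# Concatenation builds a brand-new list that holds all nine items.
combined_list = list1 + list2 + list3

# chain keeps references to the inputs and yields items on demand.
lazy_chain = itertools.chain(list1, list2, list3)
chained_items = list(lazy_chain)  # materialized here only to show the result

# chain.from_iterable takes a single iterable of iterables instead.
batches = [list1, list2, list3]  # hypothetical grouping of the data
flattened = list(itertools.chain.from_iterable(batches))

print(combined_list == chained_items == flattened)
```

Keep in mind that a chain object can be consumed only once; create a new one if you need to loop again.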

2. islice

The islice function iterates over part of a dataset in a memory-efficient manner.

It yields the selected elements lazily instead of copying a slice.

You can apply the function as in the following code.

import itertools

large_data = range(1000000)

for item in itertools.islice(large_data, 100, 110): 
    print(item)

In the code above, you only iterate over the items at indices 100 through 109; the stop index 110 is excluded, just as in regular slicing.

Time-wise, the islice function is slightly slower but more memory-efficient, just like chain.
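islice is most helpful when the source cannot be sliced at all, such as a generator. The sketch below uses a hypothetical squares generator as a stand-in for any stream of data.

```python
import itertools


def squares():
    """Hypothetical endless data stream, used only for illustration."""
    n = 0
    while True:
        yield n * n
        n += 1


# squares()[100:110] would raise TypeError because generators
# do not support slicing. islice works on any iterator.
window = list(itertools.islice(squares(), 100, 110))
print(window[0], window[-1], len(window))

# An optional fourth argument sets the step, as in regular slicing.
every_third = list(itertools.islice(squares(), 0, 10, 3))
print(every_third)
```

Note that islice still has to consume and discard the first 100 items to reach the start position, so a late start index is not free.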

3. cycle

The cycle function iterates over an iterable endlessly until you break out of the loop.

It yields each item once while saving a copy, then keeps repeating the saved items indefinitely.

This is useful for repeated operations over a dataset where you'd like to restart from the beginning.

An example implementation looks like this:

import itertools

counter = 0
for item in itertools.cycle([1, 2, 3]):
    print(item)
    counter += 1
    if counter == 10:  
        break

In contrast to the previous functions, cycle's memory usage is slightly higher, but it might be faster than the native approach.
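Because cycle never stops on its own, something else has to bound it. Besides a manual counter, two common options are pairing it with a finite iterable through zip or cutting it with islice. The task and worker names below are made up for this sketch.

```python
import itertools

tasks = ["t1", "t2", "t3", "t4", "t5"]  # hypothetical task names
workers = ["alice", "bob"]              # hypothetical worker names

# zip stops when the shorter input (tasks) runs out, so the
# infinite cycle needs no explicit break.
assignments = list(zip(tasks, itertools.cycle(workers)))
print(assignments)

# islice takes a fixed number of items from the cycle.
first_ten = list(itertools.islice(itertools.cycle([1, 2, 3]), 10))
print(first_ten)
```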

4. count

The count function generates an endless sequence of evenly spaced numbers, which is helpful when you need a running counter without storing the values.

The following code shows how to use it:

import itertools

for i in itertools.count(start=10, step=2):
    if i > 20:
        break
    print(i)

In this comparison, count used much less time and memory than the native range-based version.

Consider using the count function whenever it fits the task.
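Two other ways to work with count are sketched below: zipping it with data to number items when the length is unknown, and bounding it with takewhile instead of a manual break. The rows list is a hypothetical example input.

```python
import itertools

rows = ["alpha", "beta", "gamma"]  # hypothetical records

# Number the rows starting at 100; zip stops when rows is exhausted.
numbered = list(zip(itertools.count(start=100), rows))
print(numbered)

# takewhile ends the infinite counter once the condition fails.
evens = list(
    itertools.takewhile(lambda i: i <= 20, itertools.count(start=10, step=2))
)
print(evens)
```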

5. groupby

The groupby function groups consecutive elements in an iterable based on a key function.

It's similar to SQL's GROUP BY, but it only groups adjacent items, so the input usually needs to be sorted by the same key first.

Let’s see how it works with the Python code.

import itertools

data = [1, 1, 2, 2, 2, 3, 3, 1]
for key, group in itertools.groupby(data):
    print(f"Key: {key}, Group: {list(group)}")

It works by grouping consecutive items with the same value and using that value as the key. Because the last 1 is not adjacent to the first two, it ends up in a separate group of its own.

Let’s see how the speed and memory compare to the native usage.

As you can see, manual grouping is faster, but the groupby function uses much less memory.
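Since groupby only merges neighbouring items, sorting changes the result. The sketch below compares the unsorted and sorted cases and adds a key function; the words list is a hypothetical input that is already ordered by first letter.

```python
import itertools

data = [1, 1, 2, 2, 2, 3, 3, 1]

# Unsorted input: the trailing 1 becomes a separate group.
unsorted_groups = [(key, list(group)) for key, group in itertools.groupby(data)]
print(unsorted_groups)

# Sorted input: every key appears exactly once, as in SQL's GROUP BY.
sorted_groups = [
    (key, list(group)) for key, group in itertools.groupby(sorted(data))
]
print(sorted_groups)

# A key function groups by a derived value, here the first letter.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]
by_letter = {
    key: list(group)
    for key, group in itertools.groupby(words, key=lambda w: w[0])
}
print(by_letter)
```

Each group is a lazy iterator tied to the underlying data, so convert it with list() before moving on to the next key if you need to keep it.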

6. product

The product function creates the Cartesian product of the input iterables.

Let’s see the implementation with a code example:

import itertools

for item in itertools.product([1, 2], ['A', 'B']):
    print(item)

You can see that it’s useful for getting every possible pairing of items across the inputs we pass.

Let’s see how its time and memory usage compare to manual iteration.

The product function is much more memory-efficient, even though it’s somewhat slower.
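One way to think about product is as a lazy replacement for nested loops. The sketch below checks that equivalence and shows the repeat argument, which pairs an iterable with itself. The sizes and colors lists are invented for illustration.

```python
import itertools

sizes = ["S", "M"]         # hypothetical inputs
colors = ["red", "blue"]

# Nested loops and product yield the same tuples in the same order.
nested = [(size, color) for size in sizes for color in colors]
lazy = list(itertools.product(sizes, colors))
print(nested == lazy)

# repeat=3 is shorthand for product([0, 1], [0, 1], [0, 1]).
bits = list(itertools.product([0, 1], repeat=3))
print(len(bits), bits[0], bits[-1])
```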

7. permutations

The permutations function is similar to the product function in that it enumerates arrangements of items. However, permutations works on a single input and returns every possible ordering of its elements.

We can see the example in the code below.

import itertools

# Yields every ordering of the three items as a tuple
for item in itertools.permutations([1, 2, 3]):
    print(item)

We can see that all the possible orderings are now available to us.

Let’s see how its speed and memory usage compare to the native approach.

You can see that the permutations function is faster and uses far less memory than the manual permutation.

It’s a no-brainer to use the permutations function whenever you can.
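permutations also accepts an optional length as its second argument. The sketch below uses a small hypothetical input and checks the number of results against math.perm, which is available from Python 3.8.

```python
import itertools
import math

items = [1, 2, 3, 4]  # hypothetical input

# Without a length, every ordering of all four items is produced.
full = list(itertools.permutations(items))
print(len(full))

# With a length of 2, only ordered pairs are produced.
pairs = list(itertools.permutations(items, 2))
print(len(pairs), pairs[:3])

# The counts follow the usual formula n! / (n - r)!.
print(len(pairs) == math.perm(len(items), 2))
```

The number of results grows factorially with the input size, so iterating lazily instead of calling list() matters even more here.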

8. combinations

The combinations function is similar to the permutations function in that it builds tuples from one set of inputs.

The difference is that order does not matter, so (1, 2) and (2, 1) count as the same combination, and the output length must always be given.

import itertools

# Yields every unordered pair of the three items as a tuple
for item in itertools.combinations([1, 2, 3], 2):
    print(item)

As you can see, we pass the output length as the second argument to the function.

Let’s see how its speed and memory usage compare to the native approach.

Just like permutations, the combinations function is much more efficient in terms of speed and memory. Use both whenever you can.
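To see how combinations differs from permutations, the sketch below runs both on the same small input and also shows combinations_with_replacement, a related itertools function that allows an item to be picked more than once.

```python
import itertools
import math

items = [1, 2, 3]

# Order is ignored, so (1, 2) appears but (2, 1) does not.
combos = list(itertools.combinations(items, 2))
print(combos)

# permutations treats (1, 2) and (2, 1) as different results.
perms = list(itertools.permutations(items, 2))
print(len(perms))

# The same item may be chosen twice, e.g. (1, 1).
with_repeats = list(itertools.combinations_with_replacement(items, 2))
print(with_repeats)

# The count matches the binomial coefficient "n choose r".
print(len(combos) == math.comb(len(items), 2))
```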

Those are all the Python itertools functions you should know!

Are there any more things you would love to discuss? Let’s talk about it together!