Audio Data Separation with a Machine Learning Model
Isolate any sound you want from the audio data
Audio data is an exciting subject within machine learning research because neither the input nor the output is necessarily represented in a tabular format. Instead, audio is represented as a waveform of amplitudes sampled over time, often transformed into the frequency domain, so the time factor always has to be taken into account.
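To make that concrete, here is a minimal Python sketch of loading an audio file as a waveform and converting it into a time-frequency spectrogram. It assumes librosa is installed and a file named example.wav exists; the file name and parameter values are placeholders, not anything specific to AudioSep.

    # Load a waveform: a 1-D array of amplitudes sampled over time
    import librosa
    import numpy as np

    waveform, sample_rate = librosa.load("example.wav", sr=32000, mono=True)
    print(waveform.shape)   # (num_samples,) - amplitude over time
    print(sample_rate)      # samples per second

    # Short-time Fourier transform: the frequency-over-time view
    spectrogram = np.abs(librosa.stft(waveform, n_fft=1024, hop_length=320))
    print(spectrogram.shape)  # (num_freq_bins, num_frames) = (513, ...)

Unlike a table of features, both representations carry an explicit time axis, which is what makes audio modeling distinct.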
Much research is happening within the audio data field, from audio classification and text-to-speech to synthetic audio generation. It's an exciting field with many breakthroughs that might transform the way we work.
In August 2023, a paper by Liu et al. (2023) introduced the foundation model AudioSep, which allows the user to separate specific sounds from audio using natural language queries. By describing the sound we want, we can isolate it from the audio file.
Let’s explore further how AudioSep works.
AudioSep
Computational Auditory Scene Analysis (CASA) is a machine learning research field that aims to develop systems mimicking the human auditory system. Within CASA, a new research paradigm has emerged: language-queried audio source separation (LASS).
LASS tasks aim to separate specific sounds from audio sources using natural language queries. However, as the research grew, many challenges emerged within the field, especially around the queries themselves.
The challenge comes from text variability: a natural language query can be as complex as "The acoustic guitar played with upbeat tone" or as simple as "Speech" or "Music". Moreover, the same audio can be described with many different language queries.
This means LASS must capture these phrases and their relationships within the language description while still separating the one or more sound sources that match the query from the audio mixture, which is challenging.
To address these issues, the research group developed a new foundation model called AudioSep, which utilizes multimodal contrastive pre-training models with the framework shown in the image below.
The AudioSep framework contains two components: a text encoder and a separation model.
The text encoder uses either the contrastive language-image pre-training (CLIP) model or the contrastive language-audio pre-training (CLAP) model to process the input queries, while the separation model is based on the ResUNet model. To bridge the text encoder and the separation model, the research group utilized Feature-wise Linear Modulation (FiLM) layers.
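To make the bridging idea concrete, here is a minimal PyTorch sketch of a FiLM layer. It assumes a text embedding of size text_dim coming from CLIP or CLAP and a ResUNet-style feature map of shape (batch, channels, time, frequency); the names and sizes here are illustrative, not AudioSep's actual code.

    import torch
    import torch.nn as nn

    class FiLM(nn.Module):
        """Feature-wise Linear Modulation: scale and shift feature maps
        per channel, conditioned on a text embedding."""

        def __init__(self, text_dim: int, num_channels: int):
            super().__init__()
            # Project the text embedding to per-channel scale (gamma) and shift (beta)
            self.to_gamma = nn.Linear(text_dim, num_channels)
            self.to_beta = nn.Linear(text_dim, num_channels)

        def forward(self, features: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
            # features: (batch, channels, time, freq); text_emb: (batch, text_dim)
            gamma = self.to_gamma(text_emb)[:, :, None, None]  # broadcast over time/freq
            beta = self.to_beta(text_emb)[:, :, None, None]
            return gamma * features + beta

    # Example: modulate a dummy feature map with a dummy 512-d text embedding
    film = FiLM(text_dim=512, num_channels=64)
    features = torch.randn(1, 64, 100, 257)   # (batch, channels, time, freq)
    text_emb = torch.randn(1, 512)            # stand-in for a CLIP/CLAP embedding
    conditioned = film(features, text_emb)    # same shape as features

In this way, the text query can steer the separation model's intermediate features without changing its convolutional structure.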
Enough with the technicalities; let's see how well AudioSep isolates certain sounds from an audio mixture.
Accordion
Here is a mixture of audio data from the demo page.
With the text query “accordion”, here is the result.
The accordion sound is isolated from the audio mixture. Before we move on to another example, the sketch below gives a sense of how such a query-based interface might be used.
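Since the AudioSep code had not been released when this article was written, the exact API is unknown. The following is a hypothetical Python sketch; separate_with_query is an invented placeholder standing in for whatever interface the authors eventually release, not the real AudioSep function.

    import numpy as np
    import soundfile as sf

    def separate_with_query(mixture: np.ndarray, query: str) -> np.ndarray:
        # Placeholder stub: a real model would return only the sound
        # matching the natural language query. Here we return the
        # mixture unchanged so the sketch runs end to end.
        return mixture

    mixture, sample_rate = sf.read("mixture.wav")           # the audio mixture
    accordion = separate_with_query(mixture, "accordion")   # natural language query
    sf.write("accordion_only.wav", accordion, sample_rate)

Whatever the final API looks like, the core idea stays the same: pass in a mixture plus a free-text description, and get back the matching source.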
Laughing
Here is the mixture of audio event data.
And this is the result of the text query “laughing”.
AudioSep does a good job separating the laughing sound from the audio mixture.
Let’s see the separation result if we use more complex queries.
A man speaks then a small bird chirps.
This audio mixture contains many overlapping sounds.
We then try to separate a sequence of sounds using the query "A man speaks then a small bird chirps". Here is the result.
The AudioSep result could still follow the intended query even with the additional complexity. It indeed shows a promising application for the future.
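With the hypothetical interface sketched earlier, only the query string would change, e.g. separate_with_query(mixture, "A man speaks then a small bird chirps").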
If you want to explore all the result samples further, you can visit the demo page.
As of the time this article was written, the code and pre-trained model had not yet been released to the public, but a release is expected soon. Watch the AudioSep repository so you don't miss the update.
Conclusion
Language-queried audio source separation (LASS) is a recent research paradigm that aims to isolate specific sounds from an audio mixture. In the latest research, Liu et al. (2023) introduced a new foundation model called AudioSep to facilitate this task. As of this writing, the foundation model has yet to be released to the public, but a release is expected soon.
Thank you, everyone, for subscribing to my newsletter. If you have something you want me to write or discuss, please comment or directly message me through my social media!