Interesting Research: OneLLM Unifies Eight Modalities into One
The one framework that aligns all modalities with language
A multimodal LLM is a model that can accept input in multiple forms (text, image, audio, etc.) and produce textual language as output. Basically, the LLM can understand any data it is provided with.
One paper I found interesting is recent work by Han et al. (2023), which provides a unified LLM framework across various modalities. It has shown promising results, and I think it will be very usable in the future.
So, how does it work? Let’s get into it.
OneLLM Introduction
The paper introduces OneLLM, a novel Multimodal Large Language Model (MLLM) designed to effectively integrate eight different modalities with language using a unified framework. We can see the data that OneLLM can accept in the following image.
OneLLM supports eight different modalities, whereas many previous models support only three. An example of the model's input and output is shown in the image below.
Unlike traditional MLLMs that use modality-specific encoders, OneLLM employs a unified multimodal encoder and a progressive multimodal alignment pipeline.
This alignment process begins with an image projection module that connects a vision encoder to the Large Language Model (LLM). This is followed by the creation of a Universal Projection Module (UPM), formed by combining multiple image projection modules with dynamic routing.
Subsequently, the UPM is used to progressively align additional modalities to the LLM. Furthermore, to maximize OneLLM's instruction-following capabilities, the authors curated a comprehensive multimodal instruction dataset comprising 2 million items from various modalities such as image, audio, video, point cloud, depth/normal map, IMU, and fMRI brain activity.
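To make the dynamic routing idea more concrete, here is a minimal PyTorch sketch of several projection "experts" mixed by a soft router, in the spirit of the UPM. The dimensions, the plain linear experts, and the router design are my own simplifications for illustration, not the paper's exact implementation.
import torch
import torch.nn as nn


class UniversalProjection(nn.Module):
    def __init__(self, feat_dim: int, llm_dim: int, num_experts: int = 3):
        super().__init__()
        # Each expert maps encoder features into the LLM embedding space.
        # (Linear layers keep the sketch short; the real projection modules are larger.)
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, llm_dim) for _ in range(num_experts)]
        )
        # The router predicts a soft weight per expert for each token.
        self.router = nn.Sequential(
            nn.Linear(feat_dim, num_experts),
            nn.Softmax(dim=-1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, feat_dim) features from the unified encoder
        weights = self.router(x)  # (B, T, num_experts)
        expert_out = torch.stack(
            [expert(x) for expert in self.experts], dim=-2
        )  # (B, T, num_experts, llm_dim)
        # Dynamic routing: weighted sum of the experts' projections.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)


# Toy usage: project 256 modality tokens of width 1024 into a 4096-d LLM space.
upm = UniversalProjection(feat_dim=1024, llm_dim=4096)
tokens = torch.randn(2, 256, 1024)
print(upm(tokens).shape)  # torch.Size([2, 256, 4096])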
Overall, the OneLLM architecture is shown in the image below.
OneLLM has been evaluated across 25 diverse benchmarks, covering tasks such as multimodal captioning, question answering, and reasoning, and it demonstrates outstanding performance. OneLLM has been shown to outperform existing methods on video-text, audio-video-text, audio-text, and point cloud-text tasks, which shows great potential.
Trying out OneLLM
Currently, the OneLLM researchers have released the inference code and the model weights in their GitHub repository. You can set everything up with the following commands.
First, you need to clone the repository.
git clone https://github.com/csuhan/OneLLM
cd OneLLM
Then, you can install the packages.
conda create -n onellm python=3.9 -y
conda activate onellm
pip install -r requirements.txt
# install pointnet
cd model/lib/pointnet2
python setup.py install
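Before downloading the released weights, it is worth confirming that the environment sees a GPU, since the checkpoint is built around a 7B-scale LLM and CPU-only inference will be very slow. A quick check (assuming PyTorch was installed by requirements.txt):
import torch

# Confirm the PyTorch build and whether a CUDA GPU is visible.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))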
We can also use the HuggingFace Demo to try out the model. For example, I tried the model with an image sample.
The result showed a great understanding of the image input I gave. I can also try video data as an input.
Additionally, I can use the example point cloud data as an input.
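If you want to query the hosted demo from a script rather than the browser, the gradio_client package can talk to a Gradio Space programmatically. Note that the Space id and the predict() arguments below are assumptions on my part; run client.view_api() first to see the endpoints the demo actually exposes.
from gradio_client import Client

# NOTE: the Space id is an assumption; replace it with the actual OneLLM demo Space.
client = Client("csuhan/OneLLM")

# Print the available endpoints and their input parameters.
client.view_api()

# Hypothetical call shape, to be adapted to what view_api() reports:
# result = client.predict("sample.jpg", "Describe this image.", api_name="/chat")
# print(result)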
There are many possibilities with OneLLM, and I can see this kind of model becoming important in the future.
I hope it helps!
Thank you, everyone, for subscribing to my newsletter. If you have something you want me to write or discuss, please comment or directly message me through my social media!