Semantic Data Analysis with LOTUS: A Query Engine Powered by LLMs
NBD Lite - #44 Powerful Pandas-like API for semantically manipulating your data
With the rise of AI and LLM products, our work has been transformed significantly.
For data professionals, tools like ChatGPT and Gemini have made it easier to generate quick plans, write code, and streamline data analysis processes.
In the data analysis space, several impressive AI assistants, such as PandasAI, act as conversational agents to enhance and elevate our interactions with data.
In this article, I want to introduce another LLM-powered data analysis tool you shouldn’t miss—LOTUS.
So, what exactly is LOTUS, and how can it enhance your data analysis workflow?
Let’s get into it!
Semantic Data Analysis with LOTUS
LOTUS, or LLMs Over Tables of Unstructured and Structured Data, is a Python query engine with a Pandas-like API. It uses semantic operators to enable reasoning-based pipelines over structured and unstructured data.
The semantic operators take natural-language instructions to perform data analysis. You pass the query you want, and the library implements the logic and returns the output.
Let’s try out the library to understand it better. First, install it using the following command.
pip install lotus-ai
Then, we need to select the LLM we want to use. The LOTUS library mainly relies on LiteLLM, so any provider that LiteLLM supports works with LOTUS.
For this example, we will use the Gemini-1.5-Flash model, so remember to acquire the Gemini API Key as well. Of course, you can change the model to the one you prefer.
Next, we will set up the model. I am using Google Colab for this project and reading the key from Colab secrets, so if you run the code elsewhere, replace userdata.get('GOOGLE_API_KEY') with your actual API key.
import pandas as pd
import lotus
from lotus.models import LM
import os
from google.colab import userdata
lm = LM(model="gemini/gemini-1.5-flash", api_key=userdata.get('GOOGLE_API_KEY'))
lotus.settings.configure(lm=lm)
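If you are not on Colab, the google.colab import will fail. One way around that is a small helper that falls back to an environment variable. This is only a minimal sketch: the helper name resolve_api_key is hypothetical, and it assumes you have exported the key as GOOGLE_API_KEY in your shell.

```python
import os


def resolve_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the key from Colab secrets when available, else from the environment.

    Hypothetical helper for illustration; not part of LOTUS or LiteLLM.
    """
    try:
        # This module is only importable inside a Google Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(name)

    # Outside Colab: assume the key was exported as an environment variable.
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key
```

With this in place, the model setup could pass api_key=resolve_api_key() instead of calling userdata.get directly.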
With the model ready, let’s see how powerful LOTUS can be for data analysis.
We’ll start by creating two Pandas DataFrames: one containing Course Names and the other containing Skills.
# create dataframes with course names and skills
courses_data = {
    "Course Name": [
        "History of the Atlantic World",
        "Riemannian Geometry",
        "Operating Systems",
        "Food Science",
        "Compilers",
        "Intro to computer science",
    ]
}
skills_data = {"Skill": ["Math", "Computer Science"]}
courses_df = pd.DataFrame(courses_data)
skills_df = pd.DataFrame(skills_data)
Using LOTUS, we can join the two DataFrames in a more advanced way. Traditionally, DataFrames are joined on matching keys, which results in a straightforward merge. With LOTUS, however, we can join them using a semantic query.
For example, in the process below, we join the DataFrames by selecting which skills can be learned from each course.
res = courses_df.sem_join(skills_df, "Taking {Course Name} will help me learn {Skill}")
The result pairs each course with the skills it is relevant to. Because the LLM judges each match from the natural-language instruction, you do not have to label the course-skill pairs by hand or build a shared key column first.
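To make the contrast with a key-based merge concrete, here is a minimal sketch in plain pandas. It is not how LOTUS is implemented: the LLM is replaced by a hypothetical keyword stub (KEYWORDS, stub_predicate, and naive_semantic_join are all names made up for illustration), so it runs without an API key. The idea it shows is a cross join followed by a yes/no judgment on every pair.

```python
import pandas as pd

courses_df = pd.DataFrame(
    {
        "Course Name": [
            "History of the Atlantic World",
            "Riemannian Geometry",
            "Operating Systems",
            "Food Science",
            "Compilers",
            "Intro to computer science",
        ]
    }
)
skills_df = pd.DataFrame({"Skill": ["Math", "Computer Science"]})

# A key-based merge has nothing to match on: the frames share no column.
try:
    courses_df.merge(skills_df)
    has_shared_key = True
except pd.errors.MergeError:
    has_shared_key = False

# Hypothetical stand-in for the LLM: a crude keyword lookup per skill.
KEYWORDS = {
    "Math": ["geometry"],
    "Computer Science": ["operating systems", "compilers", "computer science"],
}


def stub_predicate(course: str, skill: str) -> bool:
    """Pretend answer to 'Taking {course} will help me learn {skill}'."""
    return any(keyword in course.lower() for keyword in KEYWORDS[skill])


def naive_semantic_join(left, right, predicate):
    """Cross join both frames, then keep the pairs the predicate accepts."""
    pairs = left.merge(right, how="cross")
    mask = [
        predicate(course, skill)
        for course, skill in zip(pairs["Course Name"], pairs["Skill"])
    ]
    return pairs[mask].reset_index(drop=True)


joined = naive_semantic_join(courses_df, skills_df, stub_predicate)
print(joined)
```

A real LLM judges meaning rather than keywords, which is the whole point of the semantic operator. The stub only keeps the example deterministic.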
You can also check how many tokens the LLM used, and what they cost, using the following code.
lm.print_total_usage()
Output>>
Total cost: $0.000075
Total prompt tokens: 942
Total completion tokens: 24
Total tokens: 966
Total cache hits: 0
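These numbers can be sanity-checked with a little arithmetic. The sketch below assumes the join made one LLM call per course-skill pair (6 courses times 2 skills, so 12 calls). That is my assumption about this run, not something the printed output states.

```python
# Values copied from the usage report above.
prompt_tokens = 942
completion_tokens = 24
total_tokens = 966

# Assumption: one LLM call per (course, skill) pair.
n_courses, n_skills = 6, 2
n_pairs = n_courses * n_skills

# The reported total is the sum of prompt and completion tokens.
tokens_add_up = prompt_tokens + completion_tokens == total_tokens

# Average tokens per pair under the one-call-per-pair assumption.
avg_prompt_per_pair = prompt_tokens / n_pairs
avg_completion_per_pair = completion_tokens / n_pairs

print(tokens_add_up, n_pairs, avg_prompt_per_pair, avg_completion_per_pair)
```

Under that assumption, each pair costs a short prompt and only a couple of completion tokens, which would fit a brief yes/no style answer. It also suggests that token usage grows with the product of the two table sizes, so larger joins deserve a quick estimate before you run them.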
LOTUS offers many other methods as well. For example, here is semantic filtering.
courses_df.sem_filter("{Course Name} requires a lot of math")
Or here is semantic mapping, which generates a study plan for each course in the DataFrame.
res.sem_map("Generate a short study plan to succeed in {Course Name}")
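If it helps to picture these two operators, they behave roughly like a boolean mask and a per-row transformation in plain pandas. The sketch below is an analogy under stated assumptions, not the LOTUS implementation: stub_requires_math and stub_study_plan are hypothetical stand-ins for the LLM, and the study_plan column name is my own choice.

```python
import pandas as pd

courses_df = pd.DataFrame(
    {
        "Course Name": [
            "History of the Atlantic World",
            "Riemannian Geometry",
            "Operating Systems",
            "Food Science",
            "Compilers",
            "Intro to computer science",
        ]
    }
)


def stub_requires_math(course: str) -> bool:
    """Hypothetical stand-in for '{Course Name} requires a lot of math'."""
    return "geometry" in course.lower()


def stub_study_plan(course: str) -> str:
    """Hypothetical stand-in for the study-plan instruction."""
    return f"Review the prerequisites for {course}, then practice weekly."


# Filter: keep only the rows for which the predicate holds.
filtered = courses_df[courses_df["Course Name"].map(stub_requires_math)]

# Map: add one generated text value per row.
mapped = courses_df.assign(
    study_plan=courses_df["Course Name"].map(stub_study_plan)
)

print(filtered)
print(mapped)
```

The difference in LOTUS is that the predicate and the transformation are written as natural-language instructions and answered by the model, so you do not have to hand-code functions like these.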
Check out the full documentation for many more things you can do. There is so much potential for improving your data analysis using LOTUS.
That’s all you need to know about LOTUS for data analysis. I hope it helps your work!
Is there anything else you’d like to discuss? Let’s dive into it together!