Build a Multimodal Chat App using LLaVA, Chainlit, and Replicate

In the rapidly evolving landscape of artificial intelligence (AI), multimodal vision models stand out as an incredible innovation. These models merge the visual understanding of images with the linguistic comprehension of text to create systems that can interpret and interact with the world in ways similar to humans.

However, the true potential of such AI technologies can only be realized when they are made accessible to a broader audience. This is where the importance of user-friendly interfaces, like Chainlit, becomes clear. By wrapping complex AI capabilities in interfaces that are intuitive and easy to navigate, these tools let users from various backgrounds use AI without deep technical expertise.

This tutorial showcases how to build a chat interface, using Chainlit for the front end and LLaVA for powering the back-end API.

You can deploy the multimodal vision chat application as configured in this guide using the Deploy to Koyeb button below.

Requirements

To successfully follow this tutorial, you will need the following:

Replicate Account: A Replicate account is required to use the Replicate API, which provides access to a LLaVA model.
Koyeb Account: A Koyeb account will be required for deploying and managing the chat application in a cloud environment, taking advantage of Koyeb’s seamless integration and deployment capabilities.
GitHub Account: A GitHub account is necessary for version control and managing the project's codebase.

Understanding of the components

Overview of Multimodal Vision Models

Multimodal vision models are a major advancement in the field of artificial intelligence, standing at the intersection of visual perception and natural language processing (NLP). These sophisticated models interpret and analyze data from multiple sources or modalities, primarily focusing on visual (images, videos) and textual (descriptions, questions) inputs.

By using convolutional neural networks (CNNs) or transformers to analyze visual data, these models extract features and patterns that define the content and context of an image. Simultaneously, they employ NLP techniques to understand and process textual data, enabling them to grasp the semantics of user queries or descriptions related to the images.

When these models integrate the processed information from both modalities, they can not only identify objects within an image but also understand their attributes, the relationships between them, and the overall context or scene depicted. This more comprehensive understanding allows the models to generate nuanced responses to queries, make inferences, and even generate descriptive texts or answer questions about unseen images accurately.

Overview of LLaVA

LLaVA represents a state-of-the-art open-source framework designed to facilitate the integration of language and vision models. It enables the seamless processing of multimodal queries, where users can ask questions or make requests that involve both textual and visual information. By combining the capabilities of language models with vision processing algorithms, LLaVA provides a robust backend solution capable of understanding and responding to complex queries about images.

It incorporates the latest advancements in AI and machine learning, including deep learning models for image recognition and natural language processing. Being open-source, it allows for customization and optimization according to specific project needs, making it a versatile choice for developers. Designed with scalability in mind, it is capable of handling a wide range of query volumes and complexities.

Overview of Replicate

Replicate is a web-based platform that allows users to deploy and scale machine learning models easily. The platform provides a simple interface for managing machine learning models and handling tasks such as data preprocessing, model training, and model deployment.

It is designed to make it easy for developers and data scientists to build and deploy machine learning applications without having to worry about the underlying infrastructure. The platform supports a variety of machine learning frameworks, including TensorFlow, PyTorch, and Scikit-learn, and allows users to deploy models in a variety of environments, including on-premises hardware, virtual machines, and cloud-based infrastructure.

It also provides features such as version control, collaboration tools, and automated testing to help teams work together more effectively and ensure that their machine-learning models are accurate and reliable. Overall, the platform is designed to simplify the process of building and deploying machine learning applications, allowing developers and data scientists to focus on building great models rather than worrying about infrastructure and deployment.

Steps

To build this chat interface you will follow these few steps:

Set Up the Environment: Here you will set up your project folder, install any dependencies, and prepare environment variables.
Set Up Chainlit: In this section, you will install Chainlit and set up the initial chat interface.
Integrate the LLaVA API from Replicate: In this section, you will integrate with a LLaVA API from Replicate that processes the images and returns the text response.
Run Examples: In this section, you will test the newly created application with a set of examples.
Deploy to Koyeb: In this section, you will deploy the application to Koyeb and make it available on the Internet.

Set Up the Environment

First, let’s start by creating a new project. To keep your Python dependencies organized you should create a virtual environment.

You can create a local folder on your computer with:

# Create and move to the new folder
mkdir VisionChat
cd VisionChat

# Create a virtual environment
python -m venv venv

# Activate the virtual environment (Windows)
.\venv\Scripts\activate.bat

# Activate the virtual environment (Linux/macOS)
source ./venv/bin/activate

Next, you can install the required dependencies:

pip install chainlit openai replicate requests python-decouple

Along with Chainlit and Replicate, this installs openai (used by Chainlit), requests for uploading image files later on, and python-decouple for loading environment variables.

Don’t forget to save your dependencies to the requirements.txt file:

pip freeze > requirements.txt

As mentioned before, you will need a Replicate account to access a LLaVA model. If you don’t have an account, you can create one on the Replicate website. After signing up, you will have access to an API key.

Next, create a .env file to store the API key and model settings for Replicate:

REPLICATE_API_KEY=<YOUR_REPLICATE_API_KEY>
REPLICATE_MODEL=yorickvp/llava-v1.6-mistral-7b
REPLICATE_MODEL_VERSION=19be067b589d0c46689ffa7cc3ff321447a441986a7694c01225973c2eafc874

For this tutorial, we will use a LLaVA model built on top of the Mistral 7B language model.
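Because the application depends on all three settings, it can help to check that they are present before starting a chat. The following is a minimal sketch rather than part of the tutorial's app.py: the helper name missing_settings is hypothetical, and it inspects a plain mapping (the process environment by default) instead of parsing the .env file, which python-decouple handles in the application itself.

```python
import os

# Setting names taken from the .env file shown in this tutorial.
REQUIRED_SETTINGS = (
    "REPLICATE_API_KEY",
    "REPLICATE_MODEL",
    "REPLICATE_MODEL_VERSION",
)


def missing_settings(env=None):
    """Return the required setting names that are absent or blank in env.

    env is any mapping of names to string values; when omitted, the
    current process environment (os.environ) is used.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]
```

Calling missing_settings() with no arguments returns an empty list when every variable is set in the environment, which is also a quick way to confirm the variables configured on a hosting platform.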

Set Up Chainlit

Now you can start implementing the Chainlit application. The code implementation will reside in one single file, so you can create a new file app.py:

import time
import chainlit as cl
import replicate
import requests
from chainlit import user_session
from decouple import config

These imports bring in the chainlit and replicate libraries used to build the chat interface and call the machine learning model. The requests and decouple libraries are used for making HTTP requests and managing configuration settings, respectively.

The first function to be called when a chat starts is on_chat_start, so let’s write its implementation:

# On chat start
@cl.on_chat_start
async def on_chat_start():
    # Message history
    message_history = []
    user_session.set("MESSAGE_HISTORY", message_history)
    # Replicate client
    client = replicate.Client(api_token=config("REPLICATE_API_KEY"))
    user_session.set("REPLICATE_CLIENT", client)

This code defines a function on_chat_start() that is decorated with @cl.on_chat_start. This decorator indicates that the function should be executed when a new chat session is started in a Chainlit application.

Inside the on_chat_start() function, two things happen:

A new empty list called message_history is created. This list will be used to store the history of messages exchanged between the user and the chatbot. The list is then stored in the user's session using the user_session.set() function.
A new instance of the replicate.Client class is created using an API token that is retrieved from the configuration using the config() function from the decouple library. The replicate.Client class is used to interact with the Replicate platform. The instance is then stored in the user's session using the user_session.set() function.

Integrate the LLaVA API from Replicate

The LLaVA model that we will use requires the uploaded image to be accessible over the internet through a remote URL. To make this possible, we will use a Replicate endpoint that accepts image uploads and returns a remote URL. This endpoint is normally used to upload images for training models, but in this case, we will use it as a file repository.

Let’s implement that function in the app.py file:

# Upload image to Replicate
def upload_image(image_path):
    # Get upload URL from Replicate (filename is hardcoded, but not relevant)
    upload_response = requests.post(
        "https://dreambooth-api-experimental.replicate.com/v1/upload/filename.png",
        headers={"Authorization": f"Token {config('REPLICATE_API_KEY')}"},
    ).json()
    # Read file (the context manager closes the file handle afterwards)
    with open(image_path, "rb") as image_file:
        file_binary = image_file.read()
    # Upload file to Replicate
    requests.put(upload_response["upload_url"], headers={'Content-Type': 'image/png'}, data=file_binary)
    # Return URL
    url = upload_response["serving_url"]
    return url

This code defines a function called upload_image() that takes a single argument, image_path, which is the file path of an image file to be uploaded to the Replicate platform.

The function performs the following steps:

It sends a POST request with an authorization header that includes the Replicate API key. The response from this request is a JSON object containing an upload URL.
It reads the binary data from the image file using the open() function with the "rb" (read binary) mode.
It sends a PUT request to the upload URL obtained previously with the binary data of the image file as the request body and a header indicating the content type. This uploads the image file to the Replicate platform.
It extracts the serving URL from the JSON response obtained previously and returns it as the function output.

Overall, this function uploads an image file to the Replicate platform and returns the serving URL that can be used to access the uploaded image.
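The function above always sends image/png as the content type, regardless of the file that was attached. One possible refinement, shown here as a sketch and not as part of the original code, is to derive the content type from the file extension with Python's standard mimetypes module. The helper name guess_content_type and the fallback value are assumptions made for illustration, and whether the upload endpoint accepts other content types is not confirmed by this tutorial.

```python
import mimetypes


def guess_content_type(image_path, default="application/octet-stream"):
    """Guess a Content-Type header value from the file extension.

    Falls back to default when the extension is not recognized.
    """
    content_type, _encoding = mimetypes.guess_type(image_path)
    return content_type or default
```

The returned value could then replace the hardcoded 'image/png' in the headers of the PUT request.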

With this helper function prepared, you can now write the core of the Chainlit application, the function that processes messages and integrates with the Replicate model:

# On message
@cl.on_message
async def main(message: cl.Message):
    # Send empty message for loading
    msg = cl.Message(
        content="",
        author="Vision Chat",
    )
    await msg.send()

    # Processing images (if any)
    images = [file for file in message.elements if file.mime and "image" in file.mime]

    # Setup prompt
    prompt = """You are a helpful Assistant that can help me with image recognition and text generation.\n\n"""
    prompt += """Prompt: """ + message.content

    # Retrieve message history
    message_history = user_session.get("MESSAGE_HISTORY")

    # Retrieve Replicate client
    client = user_session.get("REPLICATE_CLIENT")

    # Check if there are images and set input
    if len(images) >= 1:
        # Clear history (we clear history when we have a new image)
        message_history = []
        # Upload image to Replicate
        url = upload_image(images[0].path)
        # Set input with image and without history
        input_vision = {
            "image": url,
            "top_p": 1,
            "prompt": prompt,
            "max_tokens": 1024,
            "temperature": 0.5,
        }
    else:
        # Set input without image and with history
        input_vision = {
            "top_p": 1,
            "prompt": prompt,
            "max_tokens": 1024,
            "temperature": 0.5,
            "history": message_history
        }

    # Call Replicate
    output = client.run(
        f"{config('REPLICATE_MODEL')}:{config('REPLICATE_MODEL_VERSION')}",
        input=input_vision
    )

    # Process the output
    ai_message = ""
    for item in output:
        # Stream token by token
        await msg.stream_token(item)
        # Pause briefly between tokens without blocking the event loop
        await cl.sleep(0.1)
        # Append to the AI message
        ai_message += item
    # Send the message
    await msg.send()

    # Add to history
    user_text = message.content
    message_history.append("User: " + user_text)
    message_history.append("Assistant: " + ai_message)
    user_session.set("MESSAGE_HISTORY", message_history)

This code defines a function called main() that is decorated with @cl.on_message. This decorator indicates that the function should be executed when a new message is received in a Chainlit chat session.

Inside the main() function, the following steps are performed:

An empty message is sent to indicate that the chatbot is processing the user's message.
Any images attached to the user's message are extracted and stored in the images list.
A prompt is set up that includes a description of the chatbot's capabilities and the user's message.
The chat history is retrieved from the user's session using the user_session.get() function.
The replicate.Client instance is retrieved from the user's session using the user_session.get() function.
If any images are attached to the user's message, the chat history is cleared, the first image is uploaded to the Replicate platform using the upload_image() function, and the input to the LLaVA model is set to include the uploaded image and the prompt. If there are no images, the input to the LLaVA model is set to include only the prompt and the chat history.
The LLaVA model is called using the client.run() function with the appropriate input.
The output from the LLaVA model is processed token by token and streamed to the user. The output is also stored in the ai_message variable.
The final output message is sent to the user using the msg.send() function.
The user's message and the chatbot's response are added to the chat history and stored in the user's session using the user_session.set() function.
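The branching between the image and text-only cases can be isolated in a small pure function, which makes it easier to test without calling Replicate. This is a sketch under the assumption that the model accepts the same input keys used in main(); the function name build_model_input is hypothetical and does not appear in the tutorial's code.

```python
def build_model_input(prompt, message_history, image_url=None):
    """Build the input dictionary passed to the model.

    With an image URL, the image is included and the history is left out,
    mirroring how main() starts a fresh conversation for each new image.
    Without an image, the existing history is sent along with the prompt.
    """
    model_input = {
        "top_p": 1,
        "prompt": prompt,
        "max_tokens": 1024,
        "temperature": 0.5,
    }
    if image_url is not None:
        model_input["image"] = image_url
    else:
        # Copy the list so later changes to the session history
        # do not alter an input that was already built.
        model_input["history"] = list(message_history)
    return model_input
```

Keeping this logic separate from the message handler means the two cases can be checked with ordinary assertions, while main() stays focused on session handling and streaming.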

Overall, this code implements a chatbot that can handle both text and image inputs. That is all the code necessary to implement a multimodal chat with Chainlit, Replicate, and LLaVA.

Next, let’s take a look at some working examples.

Run Examples

To run the Chainlit application, execute the following command in the terminal:

chainlit run app.py

A browser window automatically opens with the landing screen for the Chainlit application:

Chainlit landing screen

Let’s see some examples in action:

Disclaimer: When using machine learning models deployed on the Replicate platform, it is important to note that the first execution of a model after a period of inactivity may take longer than subsequent executions. This is known as a "cold boot" or "cold start".

This is because when a model is not being used, the platform may shut down some of the resources allocated to it to conserve resources and reduce costs. When the model is invoked again, these resources need to be reallocated and the model needs to be loaded back into memory, which can take some time.

The duration of the cold boot period can vary depending on the size and complexity of the model, as well as the current load on the platform. In general, expect that the first image processed in the chat may take longer and might even time out. If that happens, simply upload the image and send the message again.
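Instead of resending manually, the call to the model could be retried automatically a few times. The sketch below is a generic, synchronous retry helper and not something the tutorial's code includes. The name call_with_retry and its parameters are hypothetical, and it catches every Exception because this guide does not specify which error a cold-start timeout raises.

```python
import time


def call_with_retry(func, attempts=3, delay_seconds=2.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are used up.

    Waits delay_seconds * attempt_number between failed attempts, so the
    pause grows linearly. The last error is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # assumption: any error may be a cold start
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds * attempt)
    raise last_error
```

In this application, func would be a small function or lambda wrapping the client.run() call. Because the helper blocks while it waits, an asynchronous handler would need an awaitable variant to keep the interface responsive.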

Deploy to Koyeb

Now that you have the application running locally you can also deploy it on Koyeb and make it available on the Internet.

Create a repository on your GitHub account, for instance, called VisionChat.

You can download a standard .gitignore file for Python from GitHub to exclude certain folders and files from being pushed to the repository:

curl -L https://raw.githubusercontent.com/github/gitignore/main/Python.gitignore -o .gitignore

Run the following commands in your terminal to commit and push your code to the repository:

echo "# VisionChat" >> README.md
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin [Your GitHub repository URL]
git push -u origin main

You should now have all your local code in your remote repository. Now it is time to deploy the application.

Within the Koyeb control panel, while on the Overview tab, click Create Web Service to begin:

Select GitHub as your deployment method.
Choose the repository where your code resides.
In the Builder section, click the Override toggle associated with the Run command and enter chainlit run app.py in the field.
In the Environment variables and files section, click the Add variable button to add your Replicate API key as a variable named REPLICATE_API_KEY. Also add the REPLICATE_MODEL and REPLICATE_MODEL_VERSION variables.
Choose a name for your App and Service, for example, vision-chat. Note the name will be used to create the public URL for this app. You can add a custom domain later if you’d like.
Click Deploy.

Once the application is deployed, you can visit the Koyeb service URL (ending in .koyeb.app) to access the chatbot interface.

Conclusion

In this guide, we used Chainlit and LLaVA to create a user-friendly interface for complex AI operations. These tools enable building applications that enhance our collective interaction with digital content and democratize the use of advanced AI for a broader audience.

Multimodal models combine visual understanding with linguistic understanding. They open up new possibilities for human-computer interactions and make AI technologies more accessible and interactive.

The potential implications of this development are vast, ranging from education and accessibility to entertainment and beyond. It represents a shift towards more natural and intuitive ways of interacting with technology, where conversation with images becomes as commonplace as texting.