Using OpenAI Whisper to Transcribe Podcasts on Koyeb

Introduction

Real-time automated transcription is incredibly useful for anyone who needs to capture spoken content quickly and accurately. Whether you're creating subtitles for videos, transcribing podcast episodes, or documenting meeting notes, having an automated system can save you a lot of time and effort.

In practical terms, automated transcription can be used in various real-world scenarios. For instance, journalists can transcribe interviews on the fly, educators can provide real-time captions for their lectures, and businesses can document conference calls and meetings more efficiently.

OpenAI Whisper is an open-source solution built for this purpose. It uses state-of-the-art machine learning algorithms to transcribe speech with high accuracy, even handling different accents and speaking speeds.

In this tutorial, you will learn how to set up a Streamlit application, integrate OpenAI Whisper for real-time podcast transcription, and deploy the application using Docker and Koyeb, creating a scalable transcription service.

You can consult the project repository as you work through this guide. You can also deploy the podcast transcription application as built in this tutorial using the Deploy to Koyeb button below:

Requirements

To successfully follow this tutorial, you will need the following:

Git installed
FFmpeg installed
Python 3.8 or later
A Koyeb account
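
If you want to confirm the local prerequisites before starting, a small script along these lines could help. It is only a sketch: the helper names missing_tools and python_is_recent are made up for this example, and the minimum Python version is an assumption you should adjust to match your dependencies.

```python
import shutil
import sys

# Assumed minimum for this sketch; set it to whatever your dependencies require.
MIN_PYTHON = (3, 8)


def missing_tools(tools=("git", "ffmpeg")):
    """Return the names of command-line tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_recent(minimum=MIN_PYTHON):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    if not python_is_recent():
        print("Python", ".".join(map(str, MIN_PYTHON)), "or later is expected")
    if not missing and python_is_recent():
        print("All local prerequisites found")
```

The script only checks that the executables are discoverable on your PATH; it does not verify their versions.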

Demo

Before we jump into the technical details, let me give you a sneak peek of what you will be building in this tutorial:

Understanding the components

OpenAI Whisper

OpenAI Whisper is a sophisticated speech-to-text (STT) model designed to transcribe spoken words into written text with high accuracy. Utilizing advanced machine learning algorithms, Whisper is capable of recognizing various accents, dialects, and speaking speeds. It can be integrated into voice assistants, dictation software, and real-time translation services to convert spoken language into text efficiently.

OpenAI Whisper can be used in sectors such as healthcare for medical dictation, in customer service for automated call transcriptions, and in media for generating subtitles for videos and podcasts. Its ability to handle complex speech patterns and multiple languages makes it a strong option for applications that require high-quality speech-to-text conversion.

Streamlit

Streamlit is an open-source Python library designed to create interactive data applications, often referred to as dashboards. It empowers developers to build and share data apps simply and intuitively, eliminating the need for extensive web development expertise.

Streamlit apps are written as Python scripts, which are then executed within the Streamlit environment. The library offers a set of functions for adding interactive elements to the app, such as a file upload button.

Steps

To build the transcription service, we'll complete the following steps:

Set up the environment: Start by setting up your project directory, creating a virtual environment, and installing the necessary dependencies.
Set up Streamlit: Next, install Streamlit and create the initial user interface for your application.
Transcribe podcasts with OpenAI Whisper: Use OpenAI Whisper to transcribe podcasts into text with timestamps.
Dockerize the Streamlit application: Create a Dockerfile to containerize your application for consistent deployment.
Deploy to Koyeb: Finally, deploy your application on the Koyeb platform.

Set up the environment

Let's start by preparing the project. To keep your Python dependencies organized, create a virtual environment.

First, create and navigate into a local directory:

# Create and move to the new directory
mkdir example-whisper-koyeb-gpu
cd example-whisper-koyeb-gpu

Afterwards, create and activate a new virtual environment:

# Create a virtual environment
python -m venv venv

# Activate the virtual environment (Windows)
.\venv\Scripts\activate.bat

# Activate the virtual environment (Linux or macOS)
source ./venv/bin/activate

Now, you can install the required dependencies. Create a requirements.txt file in your project directory with the following contents:

streamlit
openai-whisper
watchdog

Pass the file to pip to install the dependencies:

pip install -r requirements.txt

For the dependencies, we have included Streamlit for building the web app, OpenAI Whisper for real-time transcription, and watchdog to monitor file system events.

Optionally, pin the exact versions you installed by overwriting requirements.txt with the output of pip freeze:

pip freeze > requirements.txt

Now, let's move on to creating a new Streamlit project.

Set up Streamlit

In this step, you will set up the Streamlit UI that allows users to upload an audio file, click a button to start the transcription process, and view the segmented transcription in a user-friendly manner. All of the logic for the project will reside in a single file, so you can start by creating an app.py file with the following code:

# File: app.py

import streamlit

stream_button_styles = """
<style>
    header { display: none !important; }
</style>
"""

page_styles = """
<style>
    h1 { font-size: 2rem; font-weight: 700; }
    h2 { font-size: 1.7rem; font-weight: 600; }
    .timestamp { color: gray; font-size: 0.9rem; }
</style>
"""

page_title = "Using OpenAI Whisper to Transcribe Podcasts"

page_description = "A demo showcasing the use of OpenAI Whisper to accurately and efficiently convert spoken content from podcasts into written text."

koyeb_box = "To deploy Whisper within minutes, <a href=\"https://koyeb.com/ai\">Koyeb GPUs</a> provide the easiest and most efficient way. Koyeb offers a seamless platform for deploying AI models, leveraging high-performance GPUs to ensure fast and reliable transcriptions."

step_1 = "1. Upload Podcast"

step_2 = "2. Invoke OpenAI Whisper to transcribe podcast 👇🏻"

step_3 = "3. Transcription &nbsp; 🎉"

def unsafe_html(tag, text):
    return streamlit.markdown(f"<{tag}>{text}</{tag}>", unsafe_allow_html=True)

def main():
    # Set title for the page
    streamlit.set_page_config(page_title, layout="centered")
    # Inject hide buttons CSS
    streamlit.markdown(stream_button_styles, unsafe_allow_html=True)
    # Inject page CSS
    streamlit.markdown(page_styles, unsafe_allow_html=True)
    # Create a H1 heading on the page
    streamlit.title(page_title)
    unsafe_html("h2", page_description)
    unsafe_html("p", koyeb_box)
    audio_file = streamlit.file_uploader(step_1, type=["mp3", "mp4", "wav", "m4a"])
    if audio_file:
        # If file is received
        # Write the file on the server
        # Show next step
        unsafe_html("small", step_2)
        if streamlit.button("Transcribe"):
            # Get the transcription
            unsafe_html("small", step_3)
            # Showcase the transcription

if __name__ == "__main__":
    main()

The code above does the following:

Begins by importing the Streamlit module
Defines CSS for hiding the navigation bar and styling the headings
Defines text values for the page's title, description, and each step
Defines the unsafe_html function to dynamically create the HTML tags with content
Accepts an audio file using Streamlit's built-in file_uploader function
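
Note that unsafe_html places its text inside the tag verbatim, which is what lets koyeb_box carry a link. For text that does not come from your own code, you may prefer to escape it first. The following is a minimal sketch of that idea using only the standard library; render_tag is a hypothetical helper, not part of Streamlit or of the tutorial's app.py.

```python
import html


def render_tag(tag, text, escape=True):
    """Build an HTML snippet, escaping the text unless escape is False."""
    # html.escape converts &, <, > and quotes into HTML entities.
    body = html.escape(text) if escape else text
    return f"<{tag}>{body}</{tag}>"


if __name__ == "__main__":
    # Untrusted text is shown literally instead of being parsed as markup.
    print(render_tag("p", "a < b & c"))
    # Trusted markup, such as your own link, can opt out of escaping.
    print(render_tag("p", '<a href="https://koyeb.com/ai">Koyeb GPUs</a>', escape=False))
```

The returned string could then be passed to streamlit.markdown with unsafe_allow_html=True, in the same way unsafe_html does.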

With this, you have set up a UI that is able to accept podcast audio files from the user. Now, let's move on to transcribing the uploaded audio file.

Transcribe podcasts with OpenAI Whisper

In this step, you will invoke OpenAI Whisper's base model to transcribe an audio file. By default, the model is able to return the timestamps along with the transcription. This enables you to use the generated transcriptions as subtitles as well. Make the following additions in the app.py file:

# File: app.py

# Existing imports
# . . .
import whisper # [!code ++]

model = whisper.load_model("base") # [!code ++]

# ...

def unsafe_html(tag, text):
    # ...

# Format a segment as HTML with its start and end timestamps
def timestamp_html(segment): # [!code ++]
    return f'<span class="timestamp">[{segment["start"]:.2f} - {segment["end"]:.2f}]</span> {segment["text"]}' # [!code ++]

# Transcribe an audio file
def transcribe_audio(audio_file): # [!code ++]
    return model.transcribe(audio_file.name) # [!code ++]

# Write the audio file on server
def write_audio(audio_file): # [!code ++]
    with open(audio_file.name, "wb") as f: # [!code ++]
        f.write(audio_file.read()) # [!code ++]

def main():
    # ...
    if audio_file:
        # If file is received
        # Write the file on the server
        write_audio(audio_file) # [!code ++]
        # Show next step
        unsafe_html("small", step_2)
        if streamlit.button("Transcribe"):
            # Get the transcription
            transcript_text = transcribe_audio(audio_file) # [!code ++]
            unsafe_html("small", step_3)
            # Showcase the transcription
            for segment in transcript_text["segments"]: # [!code ++]
                unsafe_html("div", timestamp_html(segment)) # [!code ++]

if __name__ == "__main__":
    main()

The changes above do the following:

Import and instantiate the OpenAI Whisper base model
Define a timestamp_html function to display the transcription with start and end timestamps
Define a transcribe_audio function which invokes the model to generate transcriptions of the audio file
Define a write_audio function to write the audio file on the server
Write the uploaded audio file to the server once it is received
When the Transcribe button is clicked in the UI, invoke the transcribe_audio and timestamp_html functions to generate and display the transcription of the podcast
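
The write_audio function above saves the upload under its original name in the working directory, so two uploads with the same name would overwrite each other and files are never cleaned up. As one possible alternative, the sketch below writes to a uniquely named temporary file instead. The save_upload name is hypothetical, and the sketch assumes the uploaded object behaves like io.BytesIO with a name attribute, as Streamlit's uploaded files do.

```python
import os
import tempfile


def save_upload(uploaded_file):
    """Copy an uploaded file object to a uniquely named temporary file.

    Returns the path of the new file; the caller is responsible for
    deleting it once it is no longer needed.
    """
    # Keep the original extension as a hint about the audio format.
    suffix = os.path.splitext(uploaded_file.name)[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        # getvalue() returns the full contents regardless of stream position.
        tmp.write(uploaded_file.getvalue())
        return tmp.name
```

In app.py, the returned path could be passed to model.transcribe in place of audio_file.name, with os.remove(path) in a finally block to delete the file afterwards.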

Now, you can run the Streamlit application with:

streamlit run ./app.py --server.port 8000

The application should now be available at http://localhost:8000. Test it by uploading one of your favorite podcast files and viewing the generated transcription.
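
Because each segment carries start and end times, the same data can also be turned into subtitles. The following is a minimal sketch that converts segments into SubRip (SRT) text. It assumes each segment is a dictionary with the "start", "end", and "text" keys used by timestamp_html; the function names srt_timestamp and segments_to_srt are made up for this example.

```python
def srt_timestamp(seconds):
    """Format a number of seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn segments with "start", "end" and "text" keys into SRT text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        text = segment["text"].strip()
        # Each cue is a counter, a time range, and the text.
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    example = [
        {"start": 0.0, "end": 2.5, "text": " Hello"},
        {"start": 2.5, "end": 4.0, "text": " world"},
    ]
    print(segments_to_srt(example))
```

In app.py, you could call segments_to_srt(transcript_text["segments"]) and, for example, offer the result to users with Streamlit's download_button.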

Next, let's dockerize the application to ensure consistency between multiple deployments.

Dockerize the Streamlit application

Dockerizing deployments creates a consistent and reproducible environment, ensuring that the application runs the same way on any system. It simplifies dependency management and enhances scalability, making deployments more efficient and reliable. To dockerize, create a Dockerfile at the root of your project with the following content:

FROM python:3.9 AS builder

WORKDIR /app

RUN python3 -m venv venv
ENV VIRTUAL_ENV=/app/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

COPY requirements.txt .
RUN pip install -r requirements.txt

FROM python:3.9 AS runner

WORKDIR /app

RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*

COPY --from=builder /app/venv venv
COPY app.py app.py

ENV VIRTUAL_ENV=/app/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

EXPOSE 8000

CMD ["streamlit", "run", "./app.py", "--server.port", "8000"]

Beyond what a typical Dockerfile for a Python application contains, this file makes the following additions:

RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/* is used to install ffmpeg, and then clean up package lists to reduce image size
EXPOSE 8000 documents the port on which the Streamlit application listens
CMD ["streamlit", "run", "./app.py", "--server.port", "8000"] is used to define the command to start the Streamlit app on port 8000

With everything configured, let's move on to deploying the application to Koyeb.

Deploy to Koyeb

Now that you have the application running locally, you can also deploy it on Koyeb and make it available on the internet.

Create a new repository on your GitHub account so that you can push your code.

You can download a standard .gitignore file for Python from GitHub to exclude certain directories and files from being pushed to the repository:

curl -L https://raw.githubusercontent.com/github/gitignore/main/Python.gitignore -o .gitignore

Run the following commands in your terminal to commit and push your code to the repository:

git init
git add app.py Dockerfile requirements.txt .gitignore
git commit -m "first commit"
git branch -M main
git remote add origin [Your GitHub repository URL]
git push -u origin main

You should now have all of your local code in your remote repository. Now it is time to deploy the application.

Within the Koyeb control panel, while on the Overview tab, initiate the app creation and deployment process by clicking Create Service and then choosing Create web service.

Select GitHub as the deployment source.
Select your repository from the menu. Alternatively, deploy from the example repository associated with this tutorial by entering https://github.com/koyeb/example-whisper-transcription in the public repository field.
In the Instance selection, select a GPU Instance.
In the Builder section, choose Dockerfile.
Finally, click the Deploy button.

Once the application is deployed, you can visit the Koyeb service URL (ending in .koyeb.app) to access the Streamlit application.

Conclusion

In this tutorial, you built a podcast transcription application with the Streamlit framework and OpenAI Whisper. During the process, you learned how to invoke the OpenAI Whisper model in Python to generate transcription with timestamps, and how to use the Streamlit framework to quickly prototype a user interface with a functioning upload button in a few lines of code.

Because the application was deployed using the Git deployment option, each subsequent push to the deployed branch will automatically trigger a new build of your application. Changes go live once the deployment succeeds. If a deployment fails, Koyeb keeps the most recent working production deployment running, so your application stays available.