Fine-Tune MistralAI and Evaluate the Fine-Tuned Model on Koyeb Serverless GPUs
20 minMistralAI is an advanced language model designed for tasks like text generation, sentiment analysis, translation, summarization, and more.
By default, MistralAI is trained on general language data. It performs even better when fine-tuned to specific domains like finance, law, or medicine.
Fine-tuning retrains the model on domain-specific data, enabling it to understand the specific terms, patterns, and concepts used in that field. For instance, in finance, supplementary information for retraining a model includes financial reports, stock market data, or legal documents.
With fine-tuning, MistralAI becomes more accurate at understanding complex financial terms, market trends, and regulatory requirements. This enhancement makes the model more adept at predicting financial outcomes, generating insightful analysis, and supporting decision-making in areas like trading or risk management.
Requirements
To successfully complete this tutorial, you will need the following:
- GitHub Account: Needed for managing the fine-tuning code. Sign up at GitHub if you don’t have one.
- Koyeb Account: Required for accessing Koyeb’s cloud infrastructure, including GPU resources. Create an account at Koyeb if you don’t have one.
- Koyeb GPU Access: Make sure your Koyeb account has access to GPU instances for fine-tuning and to deploy GPU-enabled instances through the Koyeb dashboard.
- Basic Knowledge: Familiarity with Python (running scripts, setting up virtual environments). Basic understanding of Docker.
- NewsAPI.org API Key: Access to a NewsAPI API Key. This is needed to retrieve content for the financial data set.
- MistralAI: Access to a Mistral AI API key. This will be needed to prepare the financial data set with AI.
Steps
This tutorial is divided into the following steps:
- Cloning and Exploring the GitHub Repository
- Understanding the Fine-Tuning Workflow
- Preparing the Financial Dataset
- Preparing Training and Evaluation Datasets
- Configure the Training Script
- Deploying to Koyeb GPU
- Running the Fine-Tuning process
- Evaluating the Fine-Tuned Model
Cloning and Exploring the GitHub Repository
To start fine-tuning MistralAI, the first step is to clone the official GitHub repository, which has all the necessary scripts and settings for training.
Clone the repository with the following command:
Exploring the Key Files and Folders
After cloning the repository, take a moment to look around its structure. Understanding these files will help you customize and run the fine-tuning process effectively.
The file example/7B.yaml
is particularly important:
- This configuration file defines the model architecture and settings such as batch size, and number of training cycles.
- It is crucial for setting up the training environment and should be reviewed and adjusted based on your specific needs, especially if you are fine-tuning for a specialized area like finance.
Other important files are:
validate_data.py
:- This script is used to validate the dataset before training. It ensures that the data is complete, correctly formatted, and free of errors that could impact the training process.
- Running this script helps identify and resolve any issues with the dataset, ensuring smooth training.
reformat_data.py
:- This script is used to reformat the dataset if necessary. It ensures that the data is in the correct format required by the model for training.
- This step is important to maintain consistency and accuracy in the dataset, which is essential for effective fine-tuning.
Understanding and properly configuring the 7B.yaml
file is essential for effective fine-tuning of the MistralAI model. We will see later on the necessary settings for our fine-tune process for the Mistral 7B model.
Understanding the Fine-Tuning Workflow
Fine-tuning a language model like MistralAI involves a systematic workflow to ensure the model is properly adapted to a specific domain or task.
1. Prepare the Dataset
- Gather the Data (Content): Collect domain-specific data relevant to the task you are fine-tuning the model for. This could include financial reports, market data, customer reviews, or any other textual data that reflects the language and concepts you want the model to learn. Ensure the data is comprehensive and diverse enough to cover different scenarios and contexts within the domain.
- Proper Formatting: Format the collected data into a structure that the model can process. This typically involves organizing the text into a sequence of interactions (e.g., question-answer pairs) or continuous text segments. Ensure consistency in formatting across all data samples to avoid confusion during training. This might include tokenization, lowercasing, and handling special characters or symbols.
2. Prepare Training and Evaluation Datasets
- Splitting the Original Dataset: Divide the prepared dataset into two parts: the training set and the evaluation (or validation) set. The training set is used to teach the model, while the evaluation set is used to monitor performance and avoid overfitting. A common split ratio is 80/20 or 90/10, but this can be adjusted based on the size of your dataset and the specific requirements of your task. Ensure that both sets are representative of the full dataset, covering all relevant aspects of the domain.
- Balancing the Datasets: Check for balance in the training and evaluation datasets. For example, if the data includes different categories (e.g., different financial instruments or market conditions), ensure that each category is well-represented in both the training and evaluation sets. This step is crucial to avoid bias in the model's predictions.
3. Configure the Training Script
- Set Batch Size: Batch size determines how many samples are processed before the model's weights are updated. Larger batch sizes can make training faster but require more memory, while smaller batches can lead to better generalization but might make training slower. Experiment with different batch sizes to find the optimal setting for your hardware and data.
- Define Training Steps: Specify the number of training steps or epochs. This determines how many times the model will iterate over the entire training dataset. Monitor the model's performance during training to decide whether more or fewer steps are needed.
4. Verify the Dataset (Training + Evaluation)
- Check Data Integrity: Verify that the data in both training and evaluation sets is complete and correctly formatted. Look for missing values, corrupted files, or inconsistencies that could impact training. Run preliminary checks to ensure that the data loads correctly into the training pipeline.
- Validate Data Distribution: Confirm that the distribution of data in the training and evaluation sets aligns with the expected real-world distribution. This is especially important in domains like finance, where different market conditions need to be represented.
5. Train the Model
- Initiate Training: Begin the fine-tuning process by running the configured training script.
- Monitor Performance: Regularly evaluate the model's performance on the validation set during training.
- Save Checkpoints: Save model checkpoints at regular intervals to preserve the model’s state at different points in training.
Preparing the Financial Dataset
To prepare the dataset for training, it needs to be structured in a specific format that the MistralAI fine-tune process can understand. The format typically follows a structure similar to this:
Each JSON object contains a list of messages, with each message having a role
field to indicate the speaker (either "user" or "assistant") and a content
field to store the text of the interaction.
This file type is called a JSONL (JSON lines files), because it contains several JSON objects separated by a newline.
Preparing Training and Evaluation Datasets
To gather the data (content), we will use an API from NewsAPI.org to get financial news content. You will need to register to get a free API key from NewsAPI.org if you don’t have one.
You will also need access to an API key from MistralAI. You can register for one here if you don’t have one.
Then you can write the dataset_processing.py
script.
This Python script automates the process of collecting financial news data, formatting it for fine-tuning a MistralAI language model, and preparing it into training and evaluation datasets:
fetch_financial_news
: This function fetches financial news articles from NewsAPI based on a specified topic.save_news_to_csv
: This function saves the fetched news data into a CSV file.process_csv_to_jsonl
: This function converts the CSV data into a JSONL format, generating chat-based interactions using MistralAI.separate_jsonl_train_eval
: This function splits the JSONL data into training and evaluation datasets.
The main function orchestrates the entire workflow:
- Fetching news data.
- Saving the data to a CSV file.
- Converting the CSV data to JSONL format.
- Splitting the JSONL data into training and evaluation datasets.
This script will generate the following dataset files, which will be used to train the model (you will run this script later on the remote machine):
financial_market_news_train.jsonl
: Contains the training data, in this case, 90 records of questions and answers related to news source data.financial_market_news_eval.jsonl
: Contains the evaluation data, in this case, 10 records of questions and answers related to news source data.
Configure the Training Script
Before we deploy to the Koyeb CPU to validate the dataset and train the model, you can start preparing the training configuration file. This file, which is a YAML file, will include all the necessary settings for the training process, as mentioned earlier.
So, go ahead and create a 7B.yaml
file:
This is the important information that you need to fill in:
instruct_data
: This is the path to the training dataset. This dataset will be generated when you run thedataset_processing.py
script on the remote machine.eval_instruct_data
: This is the path to the evaluation dataset. This dataset will also be generated when you run thedataset_processing.py
script on the remote machine.model_id_or_path
: This is the identifier or path of the model you will be training. You will download this model later on the remote machine.batch_size
: You can adjust this if needed, but a batch size of 1 will work well for this case.max_steps
: This is the number of steps to train the model with. The default of 300 steps provides a good balance between speed and training capabilities. You can reduce to 100 steps for faster processing at a cost of less accuracy.run_dir
: This is the directory where the trained model will be saved.
After deployment, you will need to download the model to train, execute the dataset script, and then train the model. These settings are prepared for the commands you will execute later on.
Deploying to Koyeb GPU
To deploy the fine-tuning process to Koyeb, you will need to create a Dockerfile that sets up the environment for training the model, a repository to store the code, and finally deploy the app to Koyeb via git and built using the Dockerfile.
Create a Dockerfile
We'll start by preparing a Dockerfile to ensure we have all the necessary dependencies installed, especially for GPU support. Create a Dockerfile
with the following contents:
This Dockerfile is designed to set up an environment for fine-tuning the MistralAI language model. It automates the process of cloning the necessary repository, installing dependencies, and copying both the training configuration and the training dataset script.
Create the repository
The final step is to create a new repository on GitHub to store the project files.
Once you're ready, run the following commands in your terminal to commit and push your code to the repository:
You should now have all your local code in your remote repository. Now it is time to deploy the Dockerfile.
Deploy to Koyeb
In the Koyeb control panel, while on the Overview tab, initiate the app creation and deployment process by clicking Create App. You can select a Worker application.
On the App deployment page:
- Select GitHub as your deployment method.
- Choose the repository where your code resides. For example,
MistralFineTuning
. - Select the GPU you wish to use, for example,
A100
. The training might work on other GPUs, but for performance and training accuracy, this is the recommended. - In the Builder section, choose Dockerfile.
- In the Service name section, choose an appropriate name.
- Finally, click Deploy.
Running the Fine-Tuning process
Once the deployment is complete, you can start preparing and running the fine-tuning of the model.
The Dockerfile deployment has set up the base system needed to train the model, but it didn’t download a model, so that will be one of the first steps.
Since the next commands need interaction with the remote machine, you'll use the Koyeb CLI to access the remote machine through the terminal.
First, make sure you have the Koyeb CLI installed. You can find the installation instructions here. Then, generate an API Token, which you can do here.
Now you are ready to log in with the Koyeb CLI:
First, input your API token key when asked for it.
To see a list of running instances, use the following command:
Note the instance ID you want to connect to. Then, create a remote terminal session to the remote machine:
You now have an active remote session to the remote machine. All commands executed from now on will be on the remote machine.
As mentioned, the first step is to download the model to train, in this case the Mistral 7B Instruct:
It might take a couple of minutes for the model to be downloaded and extracted.
Next, to ensure proper compatibility, make sure that the Numpy package installed is at version 1.26.4:
Now you can install the necessary libraries for executing the dataset script:
You can then copy the necessary information for the .env
file:
Make sure to replace the values with your own API keys.
And then you can execute the dataset_processing.py
script:
It might take a couple of minutes to prepare the dataset. After it finishes, you should have two JSONL files corresponding to the training and evaluation datasets.
You can now validate those datasets with:
You should get an estimate on the ETA for the training and there should not be any validation errors. If there are errors, you can fix them with:
Validate the dataset again (if needed) and now there should be no errors:
Everything is now ready to train the model, which you can do with:
The CUDA_VISIBLE_DEVICES=0
is necessary to make sure the training script recognizes the GPU on the remote machine.
This process will take several minutes, possibly even hours. It will show an estimate of the remaining processing time.
After it is finished, you will be able to evaluate the trained model against the standard model, which we will see how to do in the next section.
Evaluating the Fine-Tuned Model
To evaluate the fine-tuned model, we first need to establish a baseline with the default model.
First, you need to install the necessary package on the remote machine:
Now you can test the default model by running:
It will ask you for a prompt. Let’s try this one:
As you can see, the default model gave a very generic answer.
Now let’s run the fine-tuned model with:
And we use the same prompt:
As you can see, the fine-tuned model gave a much more accurate and precise answer.
Impact on Domain Knowledge
Fine-tuning MistralAI on financial data significantly improves the model's ability to understand and operate within the financial domain. This process transforms the model into a specialized tool that has a deep understanding of the financial domain.
Here we have just exposed the model to a subset of recent news, but by exposing the model to more domain-specific data, such as financial reports, market analysis, and regulatory documents, it learns the precise meanings and nuances of financial terminology.
Fine-tuning also helps the model stay current with ongoing trends in the financial industry. This includes understanding the implications of market movements, economic indicators, and geopolitical events on financial markets.
This enhanced understanding and specialization enable the model to perform a wide range of finance-related tasks with greater accuracy, relevance, and compliance. This makes it an invaluable asset for financial professionals and organizations, helping them to make more informed decisions and improve their overall performance in the financial domain.
Conclusion
You've just completed this tutorial on fine-tuning MistralAI on Koyeb Serverless GPUs.
You can check out the example repository for this tutorial on Koyeb's GitHub for fine-tuning MistralAI with serverless GPUs.
While this guide focused on fine-tuning MistralAI for finance, the approach and techniques covered here are the same across various domains. Whether you're working with healthcare data, legal documents, technical manuals, or customer service interactions, fine-tuning can significantly improve the relevance and accuracy of AI models.
Have fun experimenting with your own datasets and seeing how fine-tuning can add value and improve performance in your specific area of interest!