
Qwen 2.5 7B Instruct

Deploy Qwen 2.5 7B Instruct on a Koyeb high-performance GPU

Deploy Qwen 2.5 7B Instruct on Koyeb high-performance infrastructure. Instantly launch a dedicated GPU endpoint to handle inference requests with zero configuration.

Scale to millions of requests with built-in autoscaling, scaling up under load and down to zero during idle periods.

Overview of Qwen 2.5 7B Instruct

Qwen 2.5 7B Instruct is a cutting-edge, open-source large language model built for generating high-quality text. With 7 billion parameters, it excels at content generation, conversational AI, and sophisticated data analysis.

Additionally, it supports an extended context window of up to 128K tokens, making it suitable for long-document processing and generating responses up to 8K tokens in length.

Qwen 2.5 7B Instruct is served using the vLLM inference engine, ensuring high-throughput, low-latency performance, and runs on an Nvidia A100 GPU by default. Adjust the GPU instance type to fit your workload requirements.

Quickstart

The Qwen 2.5 7B Instruct one-click model is powered by the vLLM engine. vLLM is an advanced inference engine designed for high-throughput and low-latency model serving. Optimized for large language models, it provides efficient performance and compatibility with the OpenAI API.

After you deploy the model, copy the Koyeb App public URL, which looks like https://<YOUR_DOMAIN_PREFIX>.koyeb.app, and create a simple Python file with the following content to start interacting with the model.

import os

from openai import OpenAI

# Point the client at your Koyeb App public URL. vLLM accepts any API key
# until one is configured on the Service, hence the "fake" fallback value.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "fake"),
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)

# Send a chat completion request to the deployed model.
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Tell me a joke.",
        }
    ],
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=30,
)

print(chat_completion.to_json(indent=4))

The snippet above uses the OpenAI SDK to interact with the Qwen 2.5 7B Instruct model, made possible by vLLM's OpenAI API compatibility.

Take care to replace the base_url value in the snippet with your Koyeb App public URL.

Executing the Python script will return the model's response to the input message.


python main.py

{
    "id": "chatcmpl-a94edf120cb74cc995d93ec82afc4b53",
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "A man walks into a library and asks the librarian, \"Do you have any books on Pavlov's dogs and Schrödinger's cat",
                "role": "assistant",
                "tool_calls": []
            },
            "stop_reason": null
        }
    ],
    "created": 1732135919,
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 30,
        "prompt_tokens": 40,
        "total_tokens": 70,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null
}
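
Because vLLM exposes an OpenAI-compatible API, you can also stream the response token by token instead of waiting for the full completion. The following is a minimal sketch, assuming the same endpoint and model as above:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "fake"),
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)

# stream=True yields incremental chunks as the model generates tokens,
# which reduces perceived latency for long responses.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me a joke."}],
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=30,
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta; content can be None on the final chunk.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()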

Securing the Inference Endpoint

To ensure that only authenticated requests are processed, we recommend setting up an API key to secure your inference endpoint. Follow these steps to configure the API key:

  1. Generate a strong, unique API key to use for authentication
  2. Navigate to your Koyeb Service settings
  3. Add a new environment variable named VLLM_API_KEY and set its value to your secret API key
  4. Save the changes and redeploy to update the service

Once the service is updated, all requests to the inference endpoint will require the API key.

When making requests, ensure the API key is included in the headers. If you are using the OpenAI SDK, you can provide the API key through the api_key parameter when instantiating the OpenAI client. Alternatively, you can set the API key using the OPENAI_API_KEY environment variable. For example:

OPENAI_API_KEY=<YOUR_API_KEY> python main.py
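
Alternatively, here is a minimal sketch that passes the key explicitly when instantiating the client; reading it from a local VLLM_API_KEY environment variable is an arbitrary choice for this example:

import os

from openai import OpenAI

client = OpenAI(
    # Must match the VLLM_API_KEY value configured on the Koyeb Service.
    api_key=os.environ["VLLM_API_KEY"],
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)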
