
QwQ 32B

Deploy QwQ 32B with vLLM on Koyeb GPUs for high-performance, low-latency, and efficient inference.

With one click, get a dedicated GPU-powered inference endpoint ready to handle requests, with built-in autoscaling and scale-to-zero.


Overview of QwQ 32B

QwQ 32B is a reasoning model in the Qwen series. With 32 billion parameters, QwQ 32B uses reinforcement learning to enhance its reasoning capabilities on coding and mathematical tasks, achieving performance competitive with much larger state-of-the-art reasoning models such as DeepSeek-R1.

QwQ 32B is served with the vLLM inference engine, optimized for high-throughput, low-latency model serving.

The default instance type for running this model is 2x Nvidia A100 GPUs. You are free to adjust the GPU instance type to fit your workload requirements.

Quickstart

The QwQ 32B one-click model is served using vLLM, an inference engine optimized for large language models that provides high-throughput, low-latency serving along with compatibility with the OpenAI API.
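Under the hood, the one-click app runs a vLLM server for you. For reference, a roughly equivalent invocation on a two-GPU instance looks like the following sketch; the exact flags used by the deployment may differ:

vllm serve Qwen/QwQ-32B --tensor-parallel-size 2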

After you deploy QwQ 32B, copy the Koyeb App public URL similar to https://<YOUR_DOMAIN_PREFIX>.koyeb.app and create a simple Python file with the following content to start interacting with the model.

import os

from openai import OpenAI

# Point the OpenAI client at your Koyeb App's vLLM endpoint.
client = OpenAI(
    # No key is required unless you configure one (see "Securing the Inference Endpoint")
    api_key=os.environ.get("OPENAI_API_KEY", "fake"),
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Tell me a joke.",
        }
    ],
    model="Qwen/QwQ-32B",
    max_tokens=50,
)

# Pretty-print the full API response as JSON.
print(chat_completion.to_json(indent=4))

The snippet above uses the OpenAI SDK to interact with QwQ 32B thanks to vLLM's OpenAI API compatibility.

Take care to replace the base_url value in the snippet with your Koyeb App public URL.

Executing the Python script will return the model's response to the input message.


python main.py

{
    "id": "chatcmpl-c3e33659-a3bf-90ff-abbe-87a8a742b721",
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "Okay, the user wants a joke. Let me think of something light and funny. Maybe a classic setup and punchline. Pizza jokes are usually safe and relatable.\n\nHmm, why did the scarecrow win an award? Because he was outstanding in",
                "role": "assistant",
                "tool_calls": [],
                "reasoning_content": null
            },
            "stop_reason": null
        }
    ],
    "created": 1741260121,
    "model": "Qwen/QwQ-32B",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 50,
        "prompt_tokens": 15,
        "total_tokens": 65,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null
}
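Since vLLM exposes an OpenAI-compatible API, you can also query the endpoint without the SDK, for example with curl (replace the domain with your own Koyeb App URL):

curl https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/QwQ-32B",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "max_tokens": 50
  }'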

Securing the Inference Endpoint

To ensure that only authenticated requests are processed, we recommend setting up an API key to secure your inference endpoint. Follow these steps to configure the API key:

  1. Generate a strong, unique API key to use for authentication
  2. Navigate to your Koyeb Service settings
  3. Add a new environment variable named VLLM_API_KEY and set its value to your secret API key
  4. Save the changes and redeploy to update the service

Once the service is updated, all requests to the inference endpoint will require the API key.
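For example, with curl, pass the key as a bearer token in the Authorization header, which is where vLLM expects it:

curl https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -d '{
    "model": "Qwen/QwQ-32B",
    "messages": [{"role": "user", "content": "Tell me a joke."}]
  }'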

When making requests, ensure the API key is included in the headers. If you are using the OpenAI SDK, you can provide the API key through the api_key parameter when instantiating the OpenAI client. Alternatively, you can set the API key using the OPENAI_API_KEY environment variable. For example:

OPENAI_API_KEY=<YOUR_API_KEY> python main.py
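You can also provide the key explicitly when constructing the client. A minimal sketch, reading the secret from an environment variable (the variable name here is only an example):

import os

from openai import OpenAI

client = OpenAI(
    # Use the same secret value you configured as VLLM_API_KEY on the Koyeb Service
    api_key=os.environ["VLLM_API_KEY"],
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)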
