Apr 30, 2024
4 min read

Serverless GPUs in Private Preview: L4, L40S, V100, and more

Today, we’re excited to share that Serverless GPUs are available for all your AI inference needs directly through the Koyeb platform!

Note

Serverless GPUs are now in public preview! Deploy your AI inference workloads backed by high-end GPUs in minutes.

We're starting with GPU Instances designed to support AI inference workloads including both heavy generative AI models and lighter computer vision models. These GPUs provide up to 48GB of vRAM, 733 TFLOPS and 900GB/s of memory bandwidth to support large models including LLMs and text-to-image models.

We're starting with 4 Instances, starting at $0.50/hr and billed by the second:

RTX 4000 SFF ADAV100L4L40S
GPU vRAM20GB16GB24GB48GB
GPU Memory Bandwidth280GB/s900GB/s300GB/s864GB/s
FP3219.2 TFLOPS15.7 TFLOPS30.3 TFLOPS91.6 TFLOPS
FP8--240 TFLOPS733 TFLOPS
RAM44GB44GB44GB96GB
vCPU681530
On-demand price$0.50/hr$0.85/hr$1/hr$2/hr

All these Instances have dedicated vCPUs that are equivalent to a hyperthread.

These GPUs come with the same serverless deployment experience that you know on the platform with one-click deployment of Docker containers, built-in load-balancing, and seamless horizontal autoscaling, zero downtime deployments, auto-healing, vector databases, observability, and real-time monitoring.

By the way, you can pause your GPU Instances when they're not in use. This is a great way to stretch your computing budget when you don't need to keep your GPU instances running 24/7.

Serverless GPUs on Koyeb

To access these GPU Instances, join the preview on koyeb.com/ai. We're gradually onboarding users to ensure the best experience for everyone.

If you need GPUs in volume, let us know in your request or book a call.

Getting started with GPUs

To get started and deploy your first service backed by a GPU, you can use the Koyeb CLI or the Koyeb Dashboard.

As usual, you can deploy using pre-built containers or directly connect your GitHub repository and let Koyeb handle the build of your applications.

Here is how you can deploy a Ollama service in one CLI command:

koyeb app init ollama \
  --docker ollama/ollama \
  --instance-type l4 \
  --regions fra \
  --port 11434:http \
  --route /:11434 \
  --docker-command serve

That's it! In less than 60 seconds, you will have Ollama running on Koyeb using a L4 GPU.

You can then pull your favorite models and start interacting with them.

Need more? We’re just getting started

We're super excited by this release and we believe this dramatically simplifies deploying production-grade inference workloads, APIs, and endpoints thanks to the serverless capabilities of the Koyeb platform.

Our goal is to let you easily access, build, experiment, and deploy on the best accelerators from AMD, Intel, Furiosa, Qualcomm, and Nvidia using one unified platform.

If you're looking for specific configurations, GPUs, or accelerators, we'd love to hear from you. We're currently adding more GPUs and accelerators to the platform and are working closely with early users to design our offering.

Let's get in touch! Whether you require high-performance GPUs, specialized accelerators, or unique hardware configurations, we want to hear from you.

Sign up for the platform today and join the GPU serverless private preview!

Keep up with all the latest updates by joining our vibrant and friendly serverless community or follow us on X at @gokoyeb.


Deploy AI apps to production in minutes

Koyeb is a developer-friendly serverless platform to deploy apps globally. No-ops, servers, or infrastructure management.
All systems operational
© Koyeb