Dec 11, 2024
5 min read

Scale to Zero: Optimize GPU and CPU Workloads

This December brings a pivotal milestone: Scale to Zero is now in public preview and available to everyone!

Our goal has always been to provide a true serverless experience, and Scale to Zero combined with Autoscaling makes that real. Starting today, your workloads running on GPU and CPU adapt fully automatically to traffic: they sleep and wake depending on incoming requests, and scale out horizontally according to your criteria.

We’ve worked with our customers and partners on scale-to-zero to enable key use-cases:

  • Inference Efficiency: Inference is compute-intensive: you need high-performance GPUs to answer requests quickly, but you might only need them for a few minutes every few hours. Scale-to-zero dramatically improves cost and efficiency for inference tasks with intermittent traffic, with zero infrastructure management.
  • Dedicated Services for Multi-Tenant SaaS and Platforms: Scale to zero allows you to deploy dedicated and isolated services per tenant with controlled performance and costs. You can operate fleets with thousands of services and pay only for real usage.
  • Infinite Development Environments: Software engineering teams need environments identical to production to run integration tests. Creating dozens of services to replicate your production is now cost-effective thanks to scale-to-zero and our automation tools (API, CLI, Terraform, Pulumi). Every developer in your team can have a full replica of the production setup, billed per second of usage.
  • Compute Efficiency: For apps with high CPU demands but intermittent traffic, scale-to-zero automatically optimizes your infrastructure and costs.
  • Global Deployments: Multi-region deployments can quickly become expensive. With scale-to-zero, you can deploy globally without incurring a base fee for each additional region.


We need to talk about cold starts: when a request reaches a sleeping service, it takes 1 to 5 seconds to create a new dedicated virtual machine. This works across both CPUs and GPUs - it's hardware agnostic. We call this scale-to-zero release "deep sleep".

In a couple of weeks, we will bring this down to a few hundred milliseconds with memory snapshotting. If you want to learn more about cold start optimization, we spoke about it at dotAI.
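To see what those numbers mean for the first request that wakes a service, here is a quick back-of-the-envelope calculation. The 50 ms handler time is an assumed placeholder for your own application logic, not a measured figure:

```python
# First-request latency budget for a sleeping service.
# Cold-start figures come from this post: 1-5 s for "deep sleep" today,
# a few hundred ms once memory snapshotting lands. HANDLER_MS is a
# made-up placeholder for your application's own processing time.

HANDLER_MS = 50  # assumed application processing time

def first_request_latency_ms(cold_start_ms: float) -> float:
    """Total latency of the request that wakes the service."""
    return cold_start_ms + HANDLER_MS

deep_sleep = first_request_latency_ms(3_000)  # mid-range of 1-5 s
snapshotting = first_request_latency_ms(300)  # a few hundred ms

print(deep_sleep, snapshotting)  # 3050 350
```

Every request after the first hits a warm instance and only pays the handler time.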

To sum up, scale-to-zero is designed to simplify infrastructure management and is built around the same principles as autoscaling:

  • Flexible: you can define the idle period before a service goes to sleep, so regular traffic doesn’t pay a cold-start latency hit on every request. The default is 5 minutes and can be customized on demand.
  • Controllable: scale-to-zero is an opt-in setting; you can keep the minimum scale at one if you’re running real-time applications where any cold start is unacceptable.
  • Global: similarly to autoscaling, scale-to-zero can be customized per region.
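To make these three knobs concrete, here is an illustrative sketch of such a configuration. This is invented for clarity, not Koyeb's actual API: the names (`ScaleConfig`, `idle_period_s`) and the region overrides are assumptions, and the region slugs are examples:

```python
# Illustrative only: a sketch of the scaling knobs described above,
# not Koyeb's real configuration API. All names here are invented.
from dataclasses import dataclass, field

@dataclass
class ScaleConfig:
    min_instances: int = 1   # opt in to scale-to-zero by setting this to 0
    max_instances: int = 1
    idle_period_s: int = 300  # default: sleep after 5 idle minutes

    @property
    def scale_to_zero(self) -> bool:
        return self.min_instances == 0

@dataclass
class ServiceScaling:
    """Per-region overrides, mirroring the 'Global' bullet above."""
    default: ScaleConfig = field(default_factory=ScaleConfig)
    regions: dict = field(default_factory=dict)  # region name -> ScaleConfig

    def for_region(self, region: str) -> ScaleConfig:
        return self.regions.get(region, self.default)

svc = ServiceScaling(
    default=ScaleConfig(min_instances=0, max_instances=5),
    regions={"was": ScaleConfig(min_instances=1, max_instances=5)},  # keep one region warm
)
print(svc.for_region("fra").scale_to_zero)  # True
print(svc.for_region("was").scale_to_zero)  # False
```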

On the pricing side, a sleeping service is totally free: you're only billed per second of active usage.

Read on to learn how to get started with scale-to-zero on the platform!

Getting started with scale-to-zero

Enabling scale-to-zero from the control panel or the CLI is as easy as:

  • Updating your service configuration to set the minimum number of instances to 0 using the control panel:

Scale to zero control panel

  • A single command using Koyeb CLI:
koyeb service update my-app/my-service --min-instances 0

How Does Scale to Zero Work?

To decide when to scale your services to zero, we monitor incoming requests. By default, a Service is deemed inactive if it receives no requests for 5 minutes.

When scaling to zero, the service status changes to Sleeping and all associated instances are stopped. Then, as soon as the Service receives new requests, the Service wakes up and an instance starts to handle traffic.

Scale-to-zero works with autoscaling: your Service sleeps when there are no incoming requests, and automatically scales up when load increases, according to your scaling criteria.
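The lifecycle above can be sketched as a toy state machine. This is a simplified model of the behavior we just described, not our actual implementation:

```python
# Toy model of the scale-to-zero lifecycle: a service goes to sleep after
# receiving no requests for `idle_period_s`, and wakes on the next request.
# Simplified for illustration; not the real scheduler.

IDLE_PERIOD_S = 300  # default: 5 minutes

class Service:
    def __init__(self, idle_period_s: int = IDLE_PERIOD_S):
        self.idle_period_s = idle_period_s
        self.status = "healthy"     # running with at least one instance
        self.last_request_at = 0.0

    def handle_request(self, now: float) -> str:
        if self.status == "sleeping":
            self.status = "healthy"  # wake up: start an instance
        self.last_request_at = now
        return self.status

    def tick(self, now: float) -> str:
        """Periodic check: scale to zero when idle long enough."""
        if self.status == "healthy" and now - self.last_request_at >= self.idle_period_s:
            self.status = "sleeping"  # stop all instances
        return self.status

svc = Service()
svc.handle_request(now=0)
print(svc.tick(now=100))            # healthy: only idle for 100 s
print(svc.tick(now=301))            # sleeping: idle past the 5-minute window
print(svc.handle_request(now=400))  # healthy: woken by a new request
```

In production, autoscaling then takes over from the single woken instance and scales out according to your criteria.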

Check our scale-to-zero documentation to learn more.

Pay per second of usage

Scale to Zero is available on Standard CPU and GPU instances with no extra fees:

  • per-second billing when your instances are active,
  • $0 when your instances are sleeping.

When you configure your services via the control panel, we automatically display the minimum and maximum cost of your deployment based on the number of Instances per region you define.
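Here is a back-of-the-envelope cost calculation for a workload that sleeps most of the day. The hourly rate below is a placeholder for illustration, not an actual Koyeb price:

```python
# Back-of-the-envelope billing: per-second charge while instances are
# active, $0 while sleeping. RATE_PER_HOUR is an assumed placeholder,
# not a real Koyeb price.

RATE_PER_HOUR = 0.60  # assumed instance price, USD/hour
RATE_PER_SECOND = RATE_PER_HOUR / 3600

def monthly_cost(active_seconds_per_day: int, instances: int = 1, days: int = 30) -> float:
    """Cost when the service sleeps outside its active windows."""
    return active_seconds_per_day * days * instances * RATE_PER_SECOND

always_on = monthly_cost(24 * 3600)  # never sleeps
bursty = monthly_cost(2 * 3600)      # active ~2 hours per day

print(round(always_on, 2))  # 432.0
print(round(bursty, 2))     # 36.0
```

The minimum cost of a scale-to-zero deployment is simply $0: a fully sleeping service is free.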

Scale to Zero redefines what’s possible with serverless infrastructure

Whether handling bursts of traffic or performing real-time inference, flexible infrastructure is more critical than ever. This is where scale to zero comes in:

  • Unparalleled Cost Savings: You only pay for what you use. When there’s no traffic, your instances scale to zero, so you're not wasting money.
  • Instant Readiness: Your workloads automatically spin up when new requests come in.
  • Smarter Resource Use: Focus on building and innovating while our platform handles and optimizes your infrastructure needs.

As of today, scale to zero is available in public preview. Everyone can scale their applications, APIs, and workloads in minutes.

Get started today and see how scale to zero automatically optimizes your infrastructure and dramatically improves efficiency.

