Scale-to-zero
Scale-to-zero automatically scales your Service Instances down to zero when there is no incoming traffic, so you pay only for the compute you actually use.
Note: Scale-to-zero is currently in public preview.
To enable scale-to-zero, you need to set the minimum number of instances to zero.
Then, when your Service stays idle without receiving any requests for over 5 minutes, its active Instances are automatically scaled down to zero and the Service status changes to Sleeping.
As soon as a new request is received, the Service is woken up and scaled up to one or more instances, depending on your autoscaling criteria.
How scale-to-zero works
Your Service will be scaled down to zero if all of the following criteria are met over the idle period of 5 minutes:
- No traffic is received from the Internet.
- No connection (e.g. WebSocket or HTTP/2 stream) is held open from the Internet to your Service.
- No new deployment occurred.
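The three criteria above combine as a logical AND. The following Python sketch only illustrates that rule; it is not the platform's implementation, and `ServiceActivity`, its fields, and `should_scale_to_zero` are hypothetical names introduced for this example. It also simplifies "no held connection over the idle period" to "no connection open right now".

```python
from dataclasses import dataclass

IDLE_PERIOD_SECONDS = 5 * 60  # the 5-minute idle period described above


@dataclass
class ServiceActivity:
    """Hypothetical snapshot of a Service's recent activity (illustrative only)."""
    seconds_since_last_request: float
    open_connections: int  # held WebSocket connections or HTTP/2 streams
    seconds_since_last_deployment: float


def should_scale_to_zero(activity: ServiceActivity) -> bool:
    """Return True only when all three idle criteria hold."""
    no_traffic = activity.seconds_since_last_request > IDLE_PERIOD_SECONDS
    no_held_connections = activity.open_connections == 0
    no_new_deployment = activity.seconds_since_last_deployment > IDLE_PERIOD_SECONDS
    return no_traffic and no_held_connections and no_new_deployment
```

For instance, a Service with no requests for 10 minutes but one open WebSocket connection would stay awake under this model, because the second criterion fails.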
When to use scale-to-zero?
Scale-to-zero is ideal for a wide range of use cases that involve handling intermittent traffic, like:
- Inference Efficiency: Inference is compute intensive. You need high-performance GPUs to answer requests quickly, but you might only need them for a couple of minutes every few hours. Scale-to-zero dramatically improves costs and efficiency for inference tasks with intermittent traffic, without infrastructure management.
- Dedicated Services for Multi-Tenant SaaS and Platforms: Scale-to-zero allows you to deploy dedicated and isolated services per tenant with controlled performance and costs. Operate fleets with thousands of services, paying only for real usage.
- Infinite Development Environments: Software engineering teams need environments identical to production to run integration tests. Creating dozens of services to replicate your production is now cost-effective thanks to scale-to-zero and our automation tools (API, CLI, Terraform, Pulumi). Every developer in your team can have a full replica of the production setup, billed per second of usage.
- Compute Efficiency: For apps with high CPU demands but intermittent traffic, scale-to-zero automatically optimizes your infrastructure and costs.
- Global Deployments: Multi-region deployment can quickly become expensive. With scale-to-zero you can deploy globally without incurring a base fee for each additional region that you add.
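To get a rough feel for the savings with intermittent traffic, the per-second billing mentioned above can be turned into simple arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, and the price passed in is a placeholder you must replace with your own instance price, not a real rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_cost(active_seconds: float, price_per_second: float) -> float:
    """Cost of one day when only active seconds are billed."""
    if not 0 <= active_seconds <= SECONDS_PER_DAY:
        raise ValueError("active_seconds must be between 0 and 86400")
    return active_seconds * price_per_second


def daily_savings(active_seconds: float, price_per_second: float) -> float:
    """Difference between an always-on instance and one billed only while active."""
    always_on = daily_cost(SECONDS_PER_DAY, price_per_second)
    return always_on - daily_cost(active_seconds, price_per_second)
```

Calling `daily_savings(600, price)` with your own per-second price compares a Service that is active 10 minutes a day against the same Service running around the clock. The sketch ignores cold-start time and the idle period before the Service goes to sleep, both of which add to the active seconds.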
Limitations
- Inbound requests to a sleeping Service may be slower due to a cold start: creating a new dedicated virtual machine typically takes 1 to 5 seconds.
- Scale-to-zero works only for Services exposed to the Internet.
- HTTP/2 requests cannot be used to wake up a sleeping Service.
- You can wake a Service up using a WebSocket connection, but that connection may only live for a few minutes.
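Clients that call a Service which may be sleeping can allow for the cold start with a generous timeout and a small number of retries. The sketch below is one possible approach, not an official client: `fetch_url` and `call_with_retry` are hypothetical helpers, and the timeout, attempt count, and delay are assumptions chosen to sit comfortably above the 1 to 5 second cold start. It uses Python's standard `urllib.request`, which speaks HTTP/1.1 and so is not affected by the HTTP/2 limitation above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_url(url: str, timeout_seconds: float = 10.0) -> bytes:
    """Fetch a URL over HTTP/1.1 with a timeout longer than the expected cold start."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()


def call_with_retry(
    fetch: Callable[[], bytes],
    attempts: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(), retrying on network errors or timeouts while the Service wakes up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay_seconds)  # give the Service time to finish starting
    raise AssertionError("unreachable")
```

A caller would wrap its request as `call_with_retry(lambda: fetch_url(service_url))`, where `service_url` is the public URL of your Service. The `sleep` parameter exists only so the delay can be replaced in tests.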