We want to use spot instances for batch work but they keep getting interrupted

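Spot capacity is typically 60-90 percent cheaper than on-demand. The trade-off is that the provider can reclaim an instance at any time: AWS gives a 2-minute interruption notice, Azure and GCP about 30 seconds. That is fine for stateless batch work, but not for stateful systems.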
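On AWS, the interruption notice is published through the instance metadata service, so a worker can poll for it and wind down cleanly. The following is a minimal sketch, not a definitive implementation: it assumes IMDSv2 is reachable from the instance, and the function names `fetch_instance_action` and `interruption_pending` are made up for illustration.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw spot/instance-action document, or None if nothing is scheduled."""
    # IMDSv2 requires a short-lived session token before any metadata read.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption is pending
            return None
        raise


def interruption_pending(
    fetch: Callable[[], Optional[str]] = fetch_instance_action,
) -> bool:
    """True once AWS has scheduled a terminate, stop, or hibernate for this instance."""
    body = fetch()
    if body is None:
        return False
    action = json.loads(body).get("action")
    return action in {"terminate", "stop", "hibernate"}
```

A worker would call `interruption_pending()` every few seconds and, once it returns True, stop taking new work and flush its progress. The `fetch` parameter exists only so the logic can be exercised off-instance with a stub.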

Try this first

  1. Architect the workload so an instance can stop mid-job without data loss. Write progress to S3, not local disk.
  2. Use AWS Batch, an EC2 Auto Scaling group with a mixed instances policy, or Spot Fleet. Don't launch Spot instances with hand-written RunInstances calls. A mixed instances policy spreads capacity across instance types and Spot pools, so one pool running dry doesn't take out the whole job.
  3. On Azure: Spot VMs in a Virtual Machine Scale Set with the eviction policy set to Deallocate. On GCP: Spot VMs (or legacy Preemptible VMs) in a managed instance group (MIG).
  4. For long-running batch (hours): run the bulk on Spot and keep a few on-demand instances as insurance. An 80/20 Spot/on-demand mix limits the impact of a mass eviction.
  5. For latency-critical web services or databases: never use Spot. The savings aren't worth the operational pain.
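As an illustration of steps 2 and 4, here is a sketch of a mixed instances policy with an 80/20 Spot/on-demand split, built as a plain dictionary in the shape the EC2 Auto Scaling API expects. The helper `mixed_instances_policy`, the launch template name `batch-worker`, and the instance types are assumptions chosen for the example, not recommendations for your workload.

```python
from typing import Any, Dict, List


def mixed_instances_policy(
    launch_template_name: str,
    instance_types: List[str],
    on_demand_base: int = 0,
    on_demand_percent_above_base: int = 20,
) -> Dict[str, Any]:
    """Build a MixedInstancesPolicy dict; 20 percent on-demand leaves 80 percent Spot."""
    if not 0 <= on_demand_percent_above_base <= 100:
        raise ValueError("on_demand_percent_above_base must be between 0 and 100")
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several instance types means several Spot pools to draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_percent_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


# Hypothetical template name and instance types, for illustration only.
policy = mixed_instances_policy(
    "batch-worker",
    ["m5.xlarge", "m5a.xlarge", "m6i.xlarge", "c5.2xlarge"],
)
```

With boto3 installed and credentials configured, the result can be passed as `MixedInstancesPolicy=policy` to the Auto Scaling client's `create_auto_scaling_group` call, alongside the group name, size limits, and subnets. Building the dictionary separately keeps the split easy to review and to test without touching an AWS account.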

When to bring us in

For ML training jobs that run 12+ hours on Spot, checkpointing isn't trivial. A short review of your SageMaker or Vertex AI configuration can save days of lost work.
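For orientation, the basic shape of a resumable job is small; the hard part in real training is deciding what belongs in the saved state (model weights, optimizer state, data position, random seeds). The sketch below is a simplified illustration under stated assumptions: `InMemoryStore` is a hypothetical stand-in for durable storage such as an S3 object, and `run_job` with its `step`/`acc` state is a toy workload, not a training loop.

```python
import json
from typing import Callable, Dict, Optional


class InMemoryStore:
    """Stand-in for durable storage (for example an S3 object); testing only."""

    def __init__(self) -> None:
        self._data: Optional[bytes] = None

    def load(self) -> Optional[bytes]:
        return self._data

    def save(self, data: bytes) -> None:
        self._data = data


def run_job(
    store: InMemoryStore,
    total_steps: int,
    checkpoint_every: int,
    should_stop: Callable[[], bool] = lambda: False,
) -> Dict[str, int]:
    """Resume from the last checkpoint, work step by step, checkpoint periodically."""
    raw = store.load()
    state = json.loads(raw) if raw else {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if should_stop():
            # Interruption notice seen: flush progress and exit early.
            store.save(json.dumps(state).encode())
            return state
        state["acc"] += state["step"]  # placeholder for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            store.save(json.dumps(state).encode())
    store.save(json.dumps(state).encode())
    return state


# Simulate an interruption on the sixth check, then a fresh instance resuming.
checks = {"n": 0}


def interrupt_on_sixth_check() -> bool:
    checks["n"] += 1
    return checks["n"] > 5


store = InMemoryStore()
first = dict(run_job(store, total_steps=10, checkpoint_every=2,
                     should_stop=interrupt_on_sixth_check))
final = run_job(store, total_steps=10, checkpoint_every=2)
```

The periodic save matters as much as the save on interruption: if the instance disappears without the notice being handled, the work lost is bounded by `checkpoint_every` steps.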

Or skip the DIY entirely

Our Managed IT clients do not look these things up. One point of contact, a fixed monthly price, resolved within working hours.