We want to use spot instances for batch work but they keep getting interrupted
Spot capacity is typically 60-90 percent cheaper than on-demand. The price is that an instance can be reclaimed at any time with only a short warning (2 minutes on AWS). That is fine for stateless batch work, not fine for stateful systems.
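On AWS, the 2-minute warning shows up in the instance metadata service under `spot/instance-action`: the path returns 404 until an interruption is scheduled, then a small JSON document. The sketch below shows one way a worker could check for it using only the standard library. The function names and the 2-second timeout are illustrative assumptions, and `fetch_instance_action` only works when run on an EC2 instance.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(status, body):
    """Return the interruption notice as a dict, or None if none is pending.

    The metadata service answers 404 while no interruption is scheduled and
    200 with a JSON document (keys "action" and "time") once one is.
    """
    if status != 200:
        return None
    return json.loads(body)


def fetch_instance_action(timeout=2.0):
    """Query IMDSv2 for spot/instance-action. Only meaningful on EC2."""
    # IMDSv2 requires a session token obtained with a PUT request.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise
```

A worker would typically call `fetch_instance_action()` every few seconds between units of work and, once it returns a notice, stop taking new work and flush its progress.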
Try this first
1. Architect the workload so an instance can stop mid-job without data loss. Write progress to S3, not local disk.
2. Use AWS Batch, EC2 Auto Scaling with a mixed instances policy, or Spot Fleet. Don't launch Spot instances by hand with RunInstances. A mixed instances policy spreads capacity across instance types and Spot pools, so one pool running dry doesn't take out the whole job.
3. On Azure: Spot VMs in a Virtual Machine Scale Set with eviction policy Deallocate. On GCP: Spot VMs (or the older preemptible VMs) in a managed instance group (MIG).
4. For long-running batch (hours): run the bulk on Spot and keep a few on-demand instances as 'insurance'. An 80/20 Spot/on-demand mix limits the impact of a mass eviction.
5. For latency-critical web servers or databases: never use Spot. There the savings aren't worth the operational pain.
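As a rough illustration of the Auto Scaling option and the 80/20 mix on AWS, the helper below builds the `MixedInstancesPolicy` structure that the EC2 Auto Scaling `CreateAutoScalingGroup` API accepts. The field names are the API's own. The helper name, the launch template name `batch-worker`, and the instance types are hypothetical placeholders to adapt to your account.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_percent=20, on_demand_base=0):
    """Build a MixedInstancesPolicy dict for an Auto Scaling group.

    on_demand_percent=20 gives roughly an 80/20 Spot/on-demand split for
    capacity above the on-demand base.
    """
    if not 0 <= on_demand_percent <= 100:
        raise ValueError("on_demand_percent must be between 0 and 100")
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several types means several Spot pools to draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


# Hypothetical names; replace with your own launch template and types.
policy = mixed_instances_policy(
    "batch-worker", ["m5.large", "m5a.large", "m6i.large"]
)
# With boto3 installed and credentials configured, this dict would be passed as
# MixedInstancesPolicy=policy to
# boto3.client("autoscaling").create_auto_scaling_group(...).
```

Listing several similarly sized instance types is what makes the policy useful: the more pools the group can draw from, the less likely a single eviction wave removes all Spot capacity at once.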
When to bring us in
For ML training jobs of 12+ hours on spot, checkpointing isn't trivial. A short review of your SageMaker or Vertex config can save days of work.
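In its simplest form, checkpointing means recording how far a job got in durable storage and resuming from there after an interruption. The sketch below shows that pattern for a plain list of work items. `CheckpointStore`, `InMemoryStore`, and `run_resumable` are hypothetical names for illustration. In practice the store would be backed by something durable, such as an S3 object written with boto3's `put_object` and read with `get_object`. Real training jobs also have to persist model and optimizer state, which is where it stops being trivial.

```python
import json
from typing import Any, Callable, Optional, Protocol, Sequence


class CheckpointStore(Protocol):
    def load(self) -> Optional[str]: ...
    def save(self, data: str) -> None: ...


class InMemoryStore:
    """Stand-in for durable storage so the example runs anywhere."""

    def __init__(self) -> None:
        self._data: Optional[str] = None

    def load(self) -> Optional[str]:
        return self._data

    def save(self, data: str) -> None:
        self._data = data


def run_resumable(items: Sequence[Any],
                  process: Callable[[Any], None],
                  store: CheckpointStore,
                  should_stop: Callable[[], bool] = lambda: False) -> bool:
    """Process items in order, checkpointing after each one.

    Returns True when every item is done, False when stopped early.
    """
    raw = store.load()
    next_index = json.loads(raw)["next_index"] if raw else 0
    for index in range(next_index, len(items)):
        if should_stop():
            return False  # the next run resumes at this index
        process(items[index])
        store.save(json.dumps({"next_index": index + 1}))
    return True


# Simulate an interruption before the third item, then a restart.
store = InMemoryStore()
seen = []
stop_signals = iter([False, False, True])
run_resumable(["a", "b", "c", "d"], seen.append, store,
              should_stop=lambda: next(stop_signals))
run_resumable(["a", "b", "c", "d"], seen.append, store)
```

If the instance dies between `process` and `save`, that item is processed again on restart, so `process` should be safe to repeat.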
See also
- Everyone logs in with the AWS root account: Root is for emergencies and billing. Day-to-day work belongs in IAM users or SSO.
- Every developer has AdministratorAccess: AdministratorAccess everywhere is convenient now, painful later. Start with role-based policies.
- Everyone has individual IAM users with their own password: Identity Center (formerly AWS SSO) links to your IdP and issues temporary credentials per session.