In the four years since AWS ECS became generally available, I’ve had the opportunity to manage four major ECS cluster deployments.
Across these deployments, I’ve built up knowledge and tooling to make them safer, more reliable, and cheaper to run. This article collects the tips and tricks I’ve learned along the way.
Note that most of these tips are rendered useless if you use Fargate! I usually use Fargate these days, but there are still valid reasons for managing your own cluster.
Spot Instances

ECS clusters are a great place to use spot instances, especially when managed by a Spot Fleet. As long as you handle the “spot instance is about to be terminated” event and set the container instance to DRAINING status, it works pretty well. When ECS is told to drain a container instance, it cleanly stops the tasks running on that instance and reschedules them elsewhere in the cluster. I’ve made the source code for this Lambda function available on GitHub.
Just make sure your application can shut down cleanly and be rescheduled on another instance within 2 minutes (the warning time you get before a spot instance is terminated). I’ve seen overall savings of around 60% with a cluster comprised exclusively of spot instances (EBS is not discounted).
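The drain-on-interruption handler boils down to something like this. This is a minimal sketch, not the published Lambda: the cluster name and helper names are mine, and the EventBridge rule that routes the “EC2 Spot Instance Interruption Warning” event to the function is assumed.

```python
# Sketch: on a spot interruption warning, find the matching ECS container
# instance and set it to DRAINING. Cluster name and helpers are illustrative.

CLUSTER = "my-ecs-cluster"  # assumed cluster name

def instance_id_from_event(event):
    """Extract the EC2 instance id from a spot interruption warning event."""
    return event["detail"]["instance-id"]

def container_instance_arn(instances, ec2_instance_id):
    """Find the container instance ARN matching an EC2 instance id.

    `instances` mirrors the shape returned by ecs.describe_container_instances.
    """
    for inst in instances:
        if inst["ec2InstanceId"] == ec2_instance_id:
            return inst["containerInstanceArn"]
    return None

def handler(event, context):
    import boto3  # imported lazily so the pure helpers above work offline

    ecs = boto3.client("ecs")
    ec2_id = instance_id_from_event(event)
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if not arns:
        return
    described = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]
    target = container_instance_arn(described, ec2_id)
    if target:
        # DRAINING makes ECS stop the tasks cleanly and reschedule them
        ecs.update_container_instances_state(
            cluster=CLUSTER, containerInstances=[target], status="DRAINING"
        )
```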
Autoscaling Group Lifecycle Hooks
If you need to use on-demand instances for your ECS cluster, or you’re using a mixed spot/on-demand cluster, I recommend using an Autoscaling Group to manage your cluster instances.
To prevent the ASG from stopping instances with tasks currently running, you have to write your own integration. AWS provides some sample code, which I’ve modified and published on GitHub.
The basic gist of this integration is:
1. When an instance is scheduled for termination, the Autoscaling Group sends a message to an SNS topic.
2. A Lambda function subscribed to this topic receives the message.
3. The Lambda tells the ECS API to drain the instance scheduled for termination.
4. If the instance has zero running tasks, the Lambda tells the Autoscaling Group to continue with the termination, and the Autoscaling Group terminates the instance.
5. If the instance still has running tasks, the Lambda waits for some time, then re-sends the same message to the topic, returning to step (2).
By default, I set the timeout for this operation to 15 minutes; the right value depends on the application. If your applications need more than 15 minutes to shut down cleanly and relocate to another container instance, raise the timeout accordingly. (You’ll also have to raise the ECS agent’s default StopTask SIGTERM timeout; look for the “ECS_CONTAINER_STOP_TIMEOUT” environment variable.)
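The wait-and-retry loop above reduces to a small decision function. A sketch: the 15-minute default matches the description, but treating a timeout as an ABANDON lifecycle action is my assumption, not necessarily what the published code does.

```python
# Decision logic for the lifecycle-hook Lambda (sketch). Given how many tasks
# are still running and how long we've been waiting, decide whether to let the
# ASG terminate the instance, keep waiting, or give up.

TIMEOUT_SECONDS = 15 * 60  # default timeout described above

def lifecycle_decision(running_tasks, elapsed_seconds, timeout=TIMEOUT_SECONDS):
    if running_tasks == 0:
        return "CONTINUE"   # complete the lifecycle action; ASG terminates
    if elapsed_seconds >= timeout:
        return "ABANDON"    # tasks didn't drain in time; stop retrying (assumed)
    return "REPUBLISH"      # re-send the SNS message and check again later
```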
Cluster Instance Scaling
Cluster instance scale-out is pretty easy: set CloudWatch alarms on the ECS CPUReservation and MemoryReservation metrics and scale out based on them. Scaling in is trickier.
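As a sketch, the scale-out alarm on CPUReservation can be built like this. The threshold, periods, cluster name, and scaling policy ARN are placeholders, not recommendations.

```python
# Sketch: a CloudWatch alarm on the cluster's CPUReservation metric that
# fires a scale-out policy. All concrete values here are placeholders.

def cpu_reservation_alarm(cluster, policy_arn, threshold=75):
    """Build the put_metric_alarm kwargs for a scale-out alarm."""
    return {
        "AlarmName": f"{cluster}-cpu-reservation-high",
        "Namespace": "AWS/ECS",
        "MetricName": "CPUReservation",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [policy_arn],
    }

# e.g. boto3.client("cloudwatch").put_metric_alarm(**cpu_reservation_alarm(...))
```

A matching alarm on MemoryReservation works the same way.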
I originally used those same metrics to scale in. Now, I use a Lambda script that runs every 30 minutes, cleaning up unused resources until a certain threshold of available CPU and memory is reached. This technique further reduces service disruption. I’ll post this on GitHub sometime in the near future.
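That script isn’t on GitHub yet, but its selection logic is roughly the following. This is a sketch under assumed field names; a real implementation would also drain the selected instances before terminating them.

```python
# Sketch of the scale-in selection: pick instances with zero running tasks
# for removal, but only while the cluster keeps a minimum amount of free CPU
# and memory. Field names and numbers are illustrative.

def instances_to_drain(instances, min_free_cpu, min_free_mem):
    """Pick idle instances to remove, preserving free-capacity headroom.

    Each instance is a dict with runningTasks, freeCpu and freeMemory.
    """
    free_cpu = sum(i["freeCpu"] for i in instances)
    free_mem = sum(i["freeMemory"] for i in instances)
    selected = []
    # Consider the emptiest instances first
    for inst in sorted(instances, key=lambda i: i["freeCpu"], reverse=True):
        if inst["runningTasks"] != 0:
            continue
        if (free_cpu - inst["freeCpu"] >= min_free_cpu
                and free_mem - inst["freeMemory"] >= min_free_mem):
            selected.append(inst)
            free_cpu -= inst["freeCpu"]
            free_mem -= inst["freeMemory"]
    return selected
```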
Application Deployment

I’ve gone through a few application deployment strategies.
- Hosted CI + Deploy Shell Script
- Pros: simple.
- Cons: you need somewhere to run it, and it easily becomes a mess. Shell scripts are a pain to debug and test.
- Hosted CI + Deploy Python Script (I might put this on GitHub sometime)
- Pros: powerful, easier to test than using a bunch of shell scripts.
- Cons: be careful about extending the script. It can quickly become spaghetti code.
- Jenkins
- Pros: powerful.
- Cons: Jenkins.
- CodeBuild + CodePipeline
- Pros: simple; ECS deployment was recently added; can be managed with Terraform.
- Cons: subject to the limitations of CodePipeline (which are significant). In our use case, the sticking point is not being able to deploy an arbitrary Git branch (you can only deploy the branch specified in the CodePipeline definition).
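Whatever tool runs it, the core of an ECS deploy script is small. A sketch under assumed names: a production script must also carry over the rest of the task definition (roles, volumes, cpu/memory) and wait for the deployment to stabilize.

```python
# Sketch of the core of an ECS deploy: copy the current task definition with
# a new image, register the revision, and point the service at it.

def with_new_image(container_defs, container_name, image):
    """Return container definitions with `container_name` switched to `image`."""
    updated = []
    for cd in container_defs:
        cd = dict(cd)  # don't mutate the caller's definitions
        if cd["name"] == container_name:
            cd["image"] = image
        updated.append(cd)
    return updated

def deploy(cluster, service, family, container_name, image):
    import boto3  # lazy import so the helper above stays testable offline

    ecs = boto3.client("ecs")
    current = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
    new = ecs.register_task_definition(
        family=family,
        containerDefinitions=with_new_image(
            current["containerDefinitions"], container_name, image
        ),
        # NOTE: a real script must also copy cpu/memory/roles/volumes etc.
    )["taskDefinition"]
    ecs.update_service(
        cluster=cluster, service=service, taskDefinition=new["taskDefinitionArn"]
    )
```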
Other tips and tricks
- Docker stdout logging is not cheap (also, performance is highly variable across log drivers — I recently had a major problem with the fluentd driver blocking all writes). If your application blocks on logging (looking at you, Ruby), performance will suffer.
- A few large instances yield better performance than many small instances (with the added benefit of a warm layer cache when performing deploys).
- The default placement strategy should be: binpack on the resource that is most important to your application (CPU or memory), spread across AZs.
- Applications that can’t shut down safely in under 1 minute don’t work well with Spot instances. Use a placement constraint to keep these tasks off Spot instances (you’ll have to set the attribute yourself, probably via the EC2 user data).
- Spot Fleet + ECS = ❤️
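On the placement strategy tip: expressed as the placementStrategy list that create_service/update_service accepts, a memory-bound service would look like this (a sketch; swap memory for cpu if CPU matters more to you).

```python
# Placement strategy for a memory-bound service: spread across AZs first,
# then binpack on memory within each AZ.
placement_strategy = [
    {"type": "spread", "field": "attribute:ecs.availability-zone"},
    {"type": "binpack", "field": "memory"},
]
# e.g. ecs.create_service(..., placementStrategy=placement_strategy)
```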
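And on keeping slow-to-stop tasks off Spot instances, a sketch of the attribute-plus-constraint approach. The attribute name and value (lifecycle=on-demand) are a convention I’m assuming here, not anything built into ECS.

```python
# Sketch: tag on-demand instances with a custom attribute via the ECS agent
# config, then pin long-shutdown services to them with a placement constraint.

# In the on-demand launch template's user data (shell):
#   echo 'ECS_INSTANCE_ATTRIBUTES={"lifecycle":"on-demand"}' >> /etc/ecs/ecs.config

# In the service definition:
placement_constraints = [
    {"type": "memberOf", "expression": "attribute:lifecycle == on-demand"}
]
# e.g. ecs.create_service(..., placementConstraints=placement_constraints)
```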