AWS Application Auto Scaling for ECS with Terraform

Update: Target tracking scaling is now available for ECS services.

I’ve been setting up auto scaling for ECS services recently; here are a couple of notes from managing it with Terraform.

Creating multiple scheduled actions at once

Suppose you define two scheduled actions for the same ECS service and apply them in a single run:

Terraform will perform the following actions:

  + aws_appautoscaling_scheduled_action.green_evening
      id:                                    
      arn:                                   
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "2"
      schedule:                              "cron(0 15 * * ? *)"
      service_namespace:                     "ecs"

  + aws_appautoscaling_scheduled_action.wapi_green_morning
      id:                                    
      arn:                                   
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "3"
      schedule:                              "cron(0 23 * * ? *)"
      service_namespace:                     "ecs"

This fails with:


* aws_appautoscaling_scheduled_action.green_evening: ConcurrentUpdateException: You already have a pending update to an Auto Scaling resource.

To fix this, the scheduled actions need to be created serially, which an explicit depends_on enforces. (As an aside, the schedule cron expressions are evaluated in UTC.)


resource "aws_appautoscaling_scheduled_action" "green_morning" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 23 * * ? *)"

  scalable_target_action {
    min_capacity = 3
    max_capacity = 20
  }
}

resource "aws_appautoscaling_scheduled_action" "green_evening" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 15 * * ? *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 20
  }

  # Application AutoScaling actions need to be executed serially
  depends_on = ["aws_appautoscaling_scheduled_action.green_morning"]
}
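
Following up on the update at the top: with target tracking now supported for ECS, a single policy can handle load-based scaling alongside (or instead of) scheduled actions. A minimal sketch reusing the same module outputs (the 50% CPU target is an illustrative value, not something from this setup):

resource "aws_appautoscaling_policy" "green_cpu" {
  name               = "green-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"

  target_tracking_scaling_policy_configuration {
    # Keep average service CPU around 50%
    target_value = 50

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}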

ECS ChatOps with CodePipeline and Slack

I’m currently working on migrating a Rails application to ECS at work. The current system uses a heavily customized Capistrano setup that’s showing its age, especially when deploying to more than 10 instances at once.

While patiently waiting for EKS, I decided to use ECS rather than manage my own Kubernetes cluster on AWS with something like kops. I was initially planning on using Lambda to create the required task definitions and update ECS services, but native CodePipeline deploy support for ECS was announced right before I started planning the project, which greatly simplified the deploy step.

The setup we have now is: a few Lambda functions linking CodePipeline and Slack together, two CodePipeline pipelines per service (one for production and one for staging), and the associated ECS resources.

First, a deploy is triggered by saying “deploy [environment] [service]” in the deploy channel. Slack sends an event to Lambda (via API Gateway), and Lambda starts an execution of the CodePipeline pipeline if one is not already in progress (because of the way the CodePipeline API operations work, multiple concurrent runs are hard to manage). This Lambda function also records some basic state in DynamoDB, namely the Slack channel, user, and timestamp. This information is used to determine which channel to send replies to, and which user to mention if something in the deploy process goes awry.
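
The DynamoDB side of this is tiny. Here’s a sketch of what such a state table could look like in Terraform (the table and attribute names are my own illustration; the post only says channel, user, and timestamp are recorded):

resource "aws_dynamodb_table" "deploy_state" {
  name           = "deploy-state"
  read_capacity  = 1
  write_capacity = 1
  hash_key       = "pipeline_name"

  attribute {
    name = "pipeline_name"
    type = "S"
  }

  # Items also carry the Slack channel, user, and timestamp,
  # so replies and failure mentions can be routed correctly.
}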

CodePipeline then starts CodeBuild, which is configured to build the Docker image(s) and emit a simple JSON file (imagedefinitions.json, an array of container name / image URI pairs) that tells CodePipeline’s ECS integration which image tags the task definition should be updated with.

When CodeBuild is finished, a “manual approval” action is used to request human approval before continuing with the deploy. In the example here, I have it turned on for staging environments, but it’s usually only used in production. In production, we normally have three stages in the release cycle: the first canary deployment, followed by 25%, then the remaining 75%.
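
The approval step is just another stage in the pipeline definition. A minimal Terraform sketch (resource and stage names are illustrative):

resource "aws_codepipeline" "production" {
  # ... name, role_arn, artifact_store, Source and Build stages ...

  stage {
    name = "Approval"

    action {
      name     = "ApproveDeploy"
      category = "Approval"
      owner    = "AWS"
      provider = "Manual"
      version  = "1"
    }
  }
}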

The rest is relatively straightforward — just CodePipeline telling ECS to deploy images. If errors are detected along the way, a “rollback” command is used to manually roll back changes.
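
The deploy stage uses the built-in ECS action, pointed at the JSON file CodeBuild produced. A sketch of the stage as it would appear inside the same aws_codepipeline resource (cluster, service, and artifact names are assumptions):

stage {
  name = "Deploy"

  action {
    name            = "Deploy"
    category        = "Deploy"
    owner           = "AWS"
    provider        = "ECS"
    version         = "1"
    input_artifacts = ["build_output"]

    configuration {
      ClusterName = "default-production"
      ServiceName = "green"
      FileName    = "imagedefinitions.json"
    }
  }
}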

When the deploy is finished, a Lambda function is used to send a message to the deploy channel.
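
The post doesn’t depend on how that Lambda is invoked, but a CloudWatch Events rule on pipeline state changes is one natural way to wire it up (a sketch; the Lambda function name is an assumption, and you’d also need an aws_lambda_permission granting events.amazonaws.com invoke access):

resource "aws_cloudwatch_event_rule" "pipeline_state" {
  name = "codepipeline-state-change"

  event_pattern = <<PATTERN
{
  "source": ["aws.codepipeline"],
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "detail": { "state": ["SUCCEEDED", "FAILED"] }
}
PATTERN
}

resource "aws_cloudwatch_event_target" "notify_slack" {
  rule = "${aws_cloudwatch_event_rule.pipeline_state.name}"
  arn  = "${aws_lambda_function.notify_slack.arn}"
}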

If you’re interested in building systems like these, we’re looking for infrastructure / DevOps engineers (on-site in Tokyo; Japanese required). Get in touch via the contact form for details.

IAM Policy for KMS-Encrypted Remote Terraform State in S3

Here’s the IAM policy I ended up with:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket name>/*",
                "arn:aws:s3:::<bucket name>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": [
                "<arn of KMS key>"
            ]
        }
    ]
}

Don’t forget to update the KMS key policy, too. I spent a bit of time trying to figure out why it wasn’t working, until CloudTrail helpfully told me that the kms:GenerateDataKey permission was also required. Turn CloudTrail on today, even if you don’t need the auditing: it’s an excellent permissions-debugging tool.
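
For reference, the matching remote state configuration looks something like this (a sketch; the state key and region are illustrative, and the placeholders match the policy above):

terraform {
  backend "s3" {
    bucket     = "<bucket name>"
    key        = "production/terraform.tfstate"
    region     = "ap-northeast-1"
    encrypt    = true
    kms_key_id = "<arn of KMS key>"
  }
}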

Convox

I stumbled upon Convox a couple weeks ago, and found it pretty interesting. It’s led by a few people formerly from Heroku, and it certainly feels like it. A simple command-line interface to manage your applications on AWS, with almost no AWS-specific configuration required.

An example of how simple it is to deploy a new application:

$ cd ~/my-new-application
$ convox apps create
$ convox apps info
Name       my-new-application
Status     creating
Release    (none)
Processes  (none)
Endpoints  
$ convox deploy
Deploying my-new-application
Creating tarball... OK
Uploading... 911 B / 911 B  100.00 % 0       
RUNNING: tar xz
...
... wait 5-10 minutes for the ELB to be registered ...
$ convox apps info
Name       my-new-application
Status     running
Release    RIIDWNBBXKL
Processes  web
Endpoints  my-new-application-web-L7URZLD-XXXXXXX.ap-northeast-1.elb.amazonaws.com:80 (web)

Now you can access your application at the ELB address shown in the “Endpoints” section.

I haven’t used Convox with more complex applications, but it definitely looks interesting. It uses a little more infrastructure than I’d like for small personal projects (a dedicated ELB for the service manager, for example). However, when you’re managing multiple large deploys of complex applications, the time saved by having Convox do the infrastructure work for you seems like it would pay for itself.

The philosophy behind Convox:

The Convox team and the Rack project have a strong philosophy about how to manage cloud services. Some choices we frequently consider:

  • Open over Closed
  • Integration over Invention
  • Services over Software
  • Robots over Humans
  • Shared Expertise vs Bespoke
  • Porcelain over Plumbing

I want to focus on “Robots over Humans” here. One of AWS’s greatest strengths is that almost every single thing can be automated via an API. However, I feel its greatest weakness is that those APIs are not very user-friendly: they’re disjointed and inconsistent between services. The AWS-provided GUI service dashboards are packed with features, and you can see some similarity in UI elements, but it basically stops there. Look at the Route 53, ElastiCache, and EC2 dashboards: they’re completely different.

Convox, in my limited experience, abstracts all of this unfriendliness away and presents you with a simple command line interface to allow you to focus on your core competency — being an application developer.

I, personally, am an application / infrastructure developer (some may call me DevOps, but I’m not particularly attached to that title), and Convox excites me because it has the potential to eliminate half of the work necessary to get a secure, private application cluster running on AWS.

Playing around with AWS Certificate Manager

I’m a big Let’s Encrypt fan. They provide free SSL certificates for your web servers so you can protect the traffic from prying eyes. In fact, the connection between your web browser and my blog server is made private thanks to Let’s Encrypt.

Using Let’s Encrypt requires some setup and automation on your part if you want to use it in the AWS cloud, but AWS recently launched something called the AWS Certificate Manager or “ACM”. ACM takes care of issuing, renewing, and provisioning certificates for you — which is great because uploading SSL certificates to CloudFront and Elastic Load Balancers is not the most fun thing to do. I would pay for this, but Amazon has decided to give it to everyone for free. 🙂

As with anything AWS, this has a couple catches, but if you run your cloud resources in AWS you probably won’t be worried about them:

  • You don’t have access to the private key, which means you can’t use the same certificate elsewhere.
  • ACM is currently available in all major AWS regions.1
  • You can’t use ACM certificates across regions. The exception is CloudFront, which isn’t tied to a region; note that ACM certificates used with CloudFront must be issued in the us-east-1 region (see the sketch below).
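
In Terraform terms, attaching an ACM certificate to a distribution is a small fragment of the aws_cloudfront_distribution resource (a sketch; the ARN is a placeholder):

resource "aws_cloudfront_distribution" "blog" {
  # ... origins, behaviors, etc. ...

  viewer_certificate {
    # The certificate must live in us-east-1 for CloudFront
    acm_certificate_arn = "<arn of ACM certificate>"
    ssl_support_method  = "sni-only"
  }
}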

So, I decided to test it out on a silly little CloudFront distribution I made for a blog post demonstrating how you can use S3 and CloudFront to serve a single-page JavaScript app with “proper” URLs.

Test it out for yourself:

https://single-page-test.kkob.us


  1. As of this update (2016/06/07), ACM is available in all regions except AWS GovCloud (US) and China (Beijing).

Hosting a Single-Page App on S3, with proper URLs

Amazon S3 is a great place to store static files. You might want to even serve a single-page application (SPA) written in JavaScript there.

When you’re writing a single-page app, there are a couple ways to handle URLs:

A) http://example.com/#!/path/of/resource
B) http://example.com/path/of/resource

A is easy to serve from S3. The fragment (everything after the #!) is never sent to the server, so the server only sees http://example.com/ and serves the same file to everyone.

B, however, is a little tricky. Single-page apps usually use pushState or replaceState to change the current URL without reloading, but once you reload (or give the URL to someone else) — BAM! You’ll get presented with a 404 Not Found error.

So why don’t we just use A? B has quite a few advantages beyond just being more elegant than putting that pesky #! in there. In my opinion, the biggest is that you’ll be able to make backend changes in the future without having to redirect URLs. For example, as your app gets bigger, you may want to render some (or all) components server-side (see Isomorphic or Universal JavaScript).

To implement the B strategy, we need to serve the same index.html file to any URL requested by the client. As I mentioned earlier, we can’t do this with S3 itself, so we’ll enlist the help of CloudFront.

First, create a CloudFront distribution for the S3 bucket. Since CloudFront caches items for quite a long time, you might want to either set Cache-Control headers on your S3 files, or set the default TTL to something short, like a few seconds, in the CloudFront distribution settings. Once everything is set up (and you can access index.html by itself), click the “Error Pages” tab.

[Screenshot: the “Error Pages” tab of the CloudFront distribution]

Click the big blue button, “Create Custom Error Response”:

[Screenshot: the “Create Custom Error Response” button]

I think you can tell what I’m up to now. Enabling “Customize Error Response” allows you to turn a 404 from the backend (in this case, S3) into a 200! Note that S3 will return a 403 response instead if you use the “S3 Origin” option rather than the S3 website-hosted origin; if you’re getting 403 errors from S3, customize the 403 error as well.

[Screenshot: the custom error response settings, mapping 404 to /index.html with a 200 response code]
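
If you manage the distribution with Terraform, the same trick is a pair of custom_error_response blocks on the aws_cloudfront_distribution resource (a fragment; the short error-caching TTL is an illustrative choice):

resource "aws_cloudfront_distribution" "spa" {
  # ... origin, default_cache_behavior, etc. ...

  # Serve index.html with a 200 for any path S3 doesn't know about
  custom_error_response {
    error_code            = 404
    response_code         = 200
    response_page_path    = "/index.html"
    error_caching_min_ttl = 5
  }

  # With the S3 REST origin (rather than the website endpoint),
  # missing keys come back as 403, so map that one too
  custom_error_response {
    error_code            = 403
    response_code         = 200
    response_page_path    = "/index.html"
    error_caching_min_ttl = 5
  }
}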

You can try out this setup below:

https://d3qxx6yxxvp94v.cloudfront.net/
https://d3qxx6yxxvp94v.cloudfront.net/test
https://d3qxx6yxxvp94v.cloudfront.net/l87v3

These all serve the same index.html. If you inspect the headers, the first link should return X-Cache: RefreshHit from cloudfront or Miss from cloudfront, while the other requests return X-Cache: Error from cloudfront. The status, however, is 200, just as we wanted.

Any questions? Contact me or leave a comment in the box below.

Heroku + SSL = Expensive?

Note: This blog post covers the legacy SSL Endpoint. Heroku now recommends the use of Heroku SSL, which can provide you with a free certificate and HTTPS (provided you are using the Hobby tier or higher).


If you use Heroku, you probably know a couple things:

  1. You can’t use an apex domain for your site (unless you use a DNS service that emulates ALIAS / ANAME records).
  2. Using your own SSL certificate costs $20/month.

I’m going to kill both of these birds with one stone: AWS CloudFront.

AWS CloudFront is a content delivery network — think of it as a proxy, distributed around the globe. It’s usually used for static content that can be cached for long periods of time, but we’re going to use it for dynamic (or semi-dynamic) content today.

Point CloudFront to your *.herokuapp.com domain, then assign CloudFront the domain of your choice (in the “Alternate Domain / CNAME” field). You can then upload your SSL certificate. CloudFront supports SSL Server Name Indication (SNI). Remember the days when each server serving a separate SSL certificate had to be on a different IP address? No more: SNI lets the same server on the same IP serve multiple SSL certificates at the same time.

Now, add the “Host” header to “Whitelist Headers”, and you’re ready to go.
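
If you’d rather codify this than click through the console, here’s a rough Terraform sketch of such a distribution (the domains are placeholders; note that whitelisting Host means Heroku sees your custom domain, so add it as a custom domain on the Heroku side too):

resource "aws_cloudfront_distribution" "heroku" {
  enabled = true
  aliases = ["www.example.com"]

  origin {
    domain_name = "my-app.herokuapp.com"
    origin_id   = "heroku"

    custom_origin_config {
      http_port  = 80
      https_port = 443
      # Edge-to-origin over HTTP; use https-only if your Heroku
      # certificate covers the forwarded Host
      origin_protocol_policy = "http-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  default_cache_behavior {
    target_origin_id       = "heroku"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD", "OPTIONS", "PUT", "POST", "PATCH", "DELETE"]
    cached_methods         = ["GET", "HEAD"]

    forwarded_values {
      query_string = true
      # Forward the original Host header so Heroku routes
      # requests for your custom domain
      headers = ["Host"]

      cookies {
        forward = "all"
      }
    }
  }

  # ... viewer_certificate, restrictions, etc. ...
}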

Wait. How do you wire CloudFront, which uses dynamic IPs, up to the apex domain? Fortunately, AWS has you covered. Route 53, the DNS service that AWS provides, has an ALIAS feature for resources in your AWS account (CloudFront, Elastic Load Balancer, S3, etc).
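
In Terraform, the ALIAS record looks like this (the zone and domain are placeholders; Z2FDTNDATAQYW2 is CloudFront’s fixed hosted zone ID):

resource "aws_route53_record" "apex" {
  zone_id = "${var.zone_id}"
  name    = "example.com"
  type    = "A"

  alias {
    name                   = "${aws_cloudfront_distribution.heroku.domain_name}"
    zone_id                = "Z2FDTNDATAQYW2"
    evaluate_target_health = false
  }
}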

For low-traffic sites, this will almost always work out to be less than $20/month.

My preliminary experience with this setup has been quite good for a mostly-static site, but performance suffered with user sessions. CloudFront excels at serving cached content; it’s not so fast on cache misses. YMMV.