AWS Application Auto Scaling for ECS with Terraform

Update: Target tracking scaling is now available for ECS services.
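
For reference, a target tracking policy can now be written directly in Terraform as an aws_appautoscaling_policy. Here is a minimal sketch against the same service used in the examples below, assuming it is already registered as a scalable target; the policy name, target value, and cooldowns are placeholders rather than recommendations:

resource "aws_appautoscaling_policy" "green_cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = "ecs"
  resource_id        = "service/default-production/green"
  scalable_dimension = "ecs:service:DesiredCount"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }

    # Placeholder values: keep average CPU around 50%, scale out quickly,
    # scale in slowly.
    target_value       = 50
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}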

I’ve been setting up auto scaling for ECS services recently, and here are a couple of notes from managing it with Terraform.

Creating multiple scheduled actions at once


Creating two scheduled actions for the same ECS service in a single run plans cleanly:

Terraform will perform the following actions:

  + aws_appautoscaling_scheduled_action.green_evening
      id:                                    <computed>
      arn:                                   <computed>
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "2"
      schedule:                              "cron(0 15 * * ? *)"
      service_namespace:                     "ecs"

  + aws_appautoscaling_scheduled_action.green_morning
      id:                                    <computed>
      arn:                                   <computed>
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "3"
      schedule:                              "cron(0 23 * * ? *)"
      service_namespace:                     "ecs"

This fails with:


* aws_appautoscaling_scheduled_action.green_evening: ConcurrentUpdateException: You already have a pending update to an Auto Scaling resource.

To fix this, the scheduled actions need to be created serially. An explicit depends_on forces Terraform to apply them one at a time:


resource "aws_appautoscaling_scheduled_action" "green_morning" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 23 * * ? *)"

  scalable_target_action {
    min_capacity = 3
    max_capacity = 20
  }
}

resource "aws_appautoscaling_scheduled_action" "green_evening" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 15 * * ? *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 20
  }

  # Application AutoScaling actions need to be executed serially
  depends_on = ["aws_appautoscaling_scheduled_action.green_morning"]
}

ECS ChatOps Operations with CodePipeline and Slack

This article is a Japanese translation of ECS ChatOps with CodePipeline and Slack.

I’m currently working on migrating a Rails application to ECS. The current system deploys with Capistrano, but it’s starting to reach its limits.

While waiting for EKS to become available, I decided to adopt ECS rather than manage a Kubernetes cluster myself. I was initially planning to use Lambda functions to create the required task definitions and update the ECS services, but CodePipeline support for ECS deployments was announced right before I started designing the project 🎉.

The current resources are: a few Lambda functions to link CodePipeline and Slack, two CodePipeline pipelines per service (one for production and one for staging), and the associated ECS resources.

First, a deploy is triggered by saying “deploy [environment] [service]” in the Slack channel. Slack sends an event to Lambda (via API Gateway), and Lambda starts an execution of the CodePipeline pipeline (because of how the CodePipeline API works, multiple concurrent executions are hard to handle). This Lambda function records some basic state in DynamoDB (the Slack channel, user, and timestamp). This information is used to determine which channel to send replies to, and who to notify if part of the deploy process changes status or fails.

CodePipeline first starts CodeBuild, which builds the Docker image and generates a simple JSON file containing the new Docker image tag used to update the task definition.

When CodeBuild finishes, a “manual approval” action is used to request human approval before continuing with the deploy. In this example it’s enabled for the staging environment, but it’s normally only used in production. In production, we release in three stages: a one-instance canary deploy first, then 25%, then the remaining 75%.

The rest is relatively simple: CodePipeline tells ECS to deploy the images. If errors are detected along the way, the rollback command is used to manually roll back the changes.

When the deploy is finished, a Lambda function sends a message to the deploy channel.

If you’re interested in building systems like this, we’re looking for infrastructure and development engineers (in Tokyo). Please get in touch via the contact form if you’re interested. Company details are here.

ECS ChatOps with CodePipeline and Slack

I’m currently working on migrating a Rails application to ECS at work. The current system uses a heavily customized Capistrano setup that’s showing its age, especially when deploying to more than 10 instances at once.

While patiently waiting for EKS, I decided to use ECS rather than managing my own Kubernetes cluster on AWS with something like kops. I was initially planning on using Lambda to create the required task definitions and update ECS services, but native CodePipeline deploy support for ECS was announced right before I started planning the project, which greatly simplified the deploy step.

The current setup is: a few Lambda functions to link CodePipeline and Slack together, two CodePipeline pipelines per service (one for production and one for staging), and the associated ECS resources.

First, a deploy is triggered by saying “deploy [environment] [service]” in the deploy channel. Slack sends an event to Lambda (via API Gateway), and Lambda starts an execution of the CodePipeline pipeline if it is not already in progress (because of the way CodePipeline API operations work, it’s hard to work with multiple concurrent runs). This Lambda function also records some basic state in DynamoDB — namely, the Slack channel, user, and timestamp. This information is used to determine what channel to send replies to, and what user to mention if something in the deploy process goes awry.
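
Most of the supporting pieces are plain Terraform. As a rough sketch (the table name, key, and exact permissions are illustrative, not our production configuration), the state table and the permissions the deploy function needs look something like this:

# State recorded per deploy: the Slack channel, user, and timestamp are stored
# as item attributes; only the hash key needs to be declared.
resource "aws_dynamodb_table" "deploy_state" {
  name           = "deploy-state"
  hash_key       = "pipeline_name"
  read_capacity  = 1
  write_capacity = 1

  attribute {
    name = "pipeline_name"
    type = "S"
  }
}

# Minimal permissions for the deploy Lambda function: check whether the
# pipeline is already running, start an execution, and read/write the state.
data "aws_iam_policy_document" "deploy_lambda" {
  statement {
    actions = [
      "codepipeline:GetPipelineState",
      "codepipeline:StartPipelineExecution",
    ]

    resources = ["*"]
  }

  statement {
    actions = [
      "dynamodb:GetItem",
      "dynamodb:PutItem",
    ]

    resources = ["${aws_dynamodb_table.deploy_state.arn}"]
  }
}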

CodePipeline then starts CodeBuild, which is configured to build the Docker image(s) and produce a simple JSON file that tells CodePipeline’s ECS integration which image tags the task definition should be updated with.

When CodeBuild is finished, a “manual approval” action is used to request human approval before continuing with the deploy. In the example here, I have it turned on for the staging environment, but it’s usually only used in production. In production, we normally have three stages in the release cycle: a single-instance canary deployment first, followed by 25%, then the remaining 75%.
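
Both of these steps map onto stages of an aws_codepipeline resource. The fragment below sketches only the approval and ECS deploy stages; the surrounding Source and Build stages are omitted, and the cluster, service, and artifact names are placeholders. FileName points at the JSON file CodeBuild produces (imagedefinitions.json is the integration’s default name):

stage {
  name = "Approve"

  action {
    name     = "ManualApproval"
    category = "Approval"
    owner    = "AWS"
    provider = "Manual"
    version  = "1"
  }
}

stage {
  name = "Deploy"

  action {
    name            = "Deploy"
    category        = "Deploy"
    owner           = "AWS"
    provider        = "ECS"
    version         = "1"
    input_artifacts = ["build"]

    configuration {
      ClusterName = "default-production"
      ServiceName = "green"

      # The JSON file produced by CodeBuild with the new image tags.
      FileName = "imagedefinitions.json"
    }
  }
}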

The rest is relatively straightforward — just CodePipeline telling ECS to deploy images. If errors are detected along the way, a “rollback” command is used to manually roll back changes.

When the deploy is finished, a Lambda function is used to send a message to the deploy channel.

If you’re interested in making systems like these, we’re looking for infrastructure / DevOps engineers (on-site in Tokyo; Japanese required). Get in touch with the contact form if you’re interested. Follow this link for details.

2017: A year in review

A few years ago when I was doing client work, we would regularly host clients’ sites and apps for them. During this time, I was responsible both for development and for keeping everything up and running as much as possible. Since most of the money was in new development, it was difficult to prioritize improving the operations of existing applications. In this period, I wanted an “operations person” to teach me how to build new applications that would need minimal operations support from the beginning. Failing that, I decided to become “the operations person” myself.

Following that decision, I found myself working at BizReach at the end of 2016, on the infrastructure team for HRMOS, a Software-as-a-Service product focused on applicant tracking for medium to large enterprises. After that, I moved to a small startup, dely, as a Site Reliability Engineer for their flagship product Kurashiru, a recipe video app for iOS and Android.

This is the first full year I’ve worked full-time as a dedicated infrastructure / operations / SRE / DevOps engineer, and I feel like I’ve grown a lot. On the technical side, I was able to lead the migration of complex legacy monolith systems to scalable and resilient independent systems. On the not-so-technical side, I’ve experienced different types of company cultures and managerial styles, and I’ve gotten accustomed to working with teams of engineers; the experience I had until this year was mostly in extremely small teams.

While I do have a passion for making, maintaining, and improving services, I am also very interested in company culture, what makes it and what breaks it, especially when it comes to remote work. I believe most technical engineering work can be done just as efficiently (if not more so) remotely, but there are definite challenges that need to be addressed before I can start leading that change in any position I’m in.

Here’s to 2018! 🎉

Shipping Events from Fluentd to Elasticsearch

We use fluentd to process and route log events from our various applications. It’s simple, safe, and flexible. With at-least-once delivery by default, log events are buffered at every step before they’re sent off to the various storage backends. However, there are some caveats with using Elasticsearch as a backend.

Currently, our setup looks something like this:

The general flow of data is from the application, to the fluentd aggregators, then to the backends, mainly Elasticsearch and S3. If a log event warrants a notification, it’s published to an SNS topic, which in turn triggers a Lambda function that sends the notification to Slack.

The fluentd aggregators are managed by an auto-scaling group, but are not behind a load balancer. Instead, a Lambda function connected to the auto-scaling group’s lifecycle notifications updates a DNS round-robin entry with the private IP addresses of the fluentd aggregator instances.
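
One plausible way to wire that up in Terraform is sketched below. The auto-scaling group, SNS topic, and Lambda function are assumed to be defined elsewhere, and this may differ from the exact mechanism we use:

# The aggregator auto-scaling group publishes launch/terminate events to an
# SNS topic...
resource "aws_autoscaling_notification" "fluentd_aggregators" {
  group_names = ["${aws_autoscaling_group.fluentd_aggregator.name}"]

  notifications = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_TERMINATE",
  ]

  topic_arn = "${aws_sns_topic.fluentd_asg_events.arn}"
}

# ...and the topic invokes the Lambda function that rewrites the round-robin
# record with the current set of private IP addresses.
resource "aws_sns_topic_subscription" "update_fluentd_dns" {
  topic_arn = "${aws_sns_topic.fluentd_asg_events.arn}"
  protocol  = "lambda"
  endpoint  = "${aws_lambda_function.update_fluentd_dns.arn}"
}

resource "aws_lambda_permission" "allow_sns_invoke" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = "${aws_lambda_function.update_fluentd_dns.function_name}"
  principal     = "sns.amazonaws.com"
  source_arn    = "${aws_sns_topic.fluentd_asg_events.arn}"
}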

We use the fluent-plugin-elasticsearch plugin to output log events to Elasticsearch. However, because this plugin uses the bulk insert API and does not validate whether events were actually inserted into the cluster successfully, it is dangerous to rely on it exclusively (hence the S3 backup).

IAM Policy for KMS-Encrypted Remote Terraform State in S3

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket name>/*",
                "arn:aws:s3:::<bucket name>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": [
                "<arn of KMS key>"
            ]
        }
    ]
}
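
For context, this policy is attached to the user or role running Terraform against a backend configuration along these lines (the bucket, key path, region, and KMS key ARN are placeholders):

terraform {
  backend "s3" {
    bucket     = "<bucket name>"
    key        = "production/terraform.tfstate"
    region     = "ap-northeast-1"
    encrypt    = true
    kms_key_id = "<arn of KMS key>"
  }
}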

Don’t forget to update the KMS key policy, too. I spent a bit of time trying to figure out why it wasn’t working, until CloudTrail helpfully told me that the kms:GenerateDataKey permission was also required. Turn CloudTrail on today, even if you don’t need the auditing; it’s an excellent permissions debugging tool.