AWS Application Auto-scaling for ECS with Terraform

Update: Target tracking scaling is now available for ECS services.

I’ve been setting up auto-scaling for ECS services recently; here are a couple of notes from managing it with Terraform.

Creating multiple scheduled actions at once

Terraform creates independent resources in parallel by default, so a plan that adds two scheduled actions for the same scalable target looks harmless at first:

Terraform will perform the following actions:

  + aws_appautoscaling_scheduled_action.green_evening
      id:                                    
      arn:                                   
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "2"
      schedule:                              "cron(0 15 * * ? *)"
      service_namespace:                     "ecs"

  + aws_appautoscaling_scheduled_action.wapi_green_morning
      id:                                    
      arn:                                   
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "3"
      schedule:                              "cron(0 23 * * ? *)"
      service_namespace:                     "ecs"

This fails with:


* aws_appautoscaling_scheduled_action.green_evening: ConcurrentUpdateException: You already have a pending update to an Auto Scaling resource.

To fix this, the scheduled actions need to be created one at a time. An explicit depends_on between them forces Terraform to serialize the API calls:


resource "aws_appautoscaling_scheduled_action" "green_morning" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 23 * * ? *)"

  scalable_target_action {
    min_capacity = 3
    max_capacity = 20
  }
}

resource "aws_appautoscaling_scheduled_action" "green_evening" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 15 * * ? *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 20
  }

  # Application AutoScaling actions need to be executed serially
  depends_on = ["aws_appautoscaling_scheduled_action.green_morning"]
}

ECS ChatOps with CodePipeline and Slack

I’m currently working on migrating a Rails application to ECS at work. The current system uses a heavily customized Capistrano setup that’s showing its age, especially when deploying to more than 10 instances at once.

While patiently waiting for EKS, I decided to use ECS over managing my own Kubernetes cluster on AWS with something like kops. I was initially planning on using Lambda to create the required task definitions and update ECS services, but native CodePipeline deploy support for ECS was announced right before I started planning the project, which greatly simplified the deploy step.

The current setup is: a few Lambda functions to link CodePipeline and Slack together, two CodePipeline pipelines per service (one for production and one for staging), and the associated ECS resources.

First, a deploy is triggered by saying “deploy [environment] [service]” in the deploy channel. Slack sends an event to Lambda (via API Gateway), and Lambda starts an execution of the CodePipeline pipeline if it is not already in progress (because of the way CodePipeline API operations work, it’s hard to work with multiple concurrent runs). This Lambda function also records some basic state in DynamoDB — namely, the Slack channel, user, and timestamp. This information is used to determine what channel to send replies to, and what user to mention if something in the deploy process goes awry.
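The post doesn’t include the function’s source, but a minimal sketch of that trigger Lambda might look like the following. The pipeline naming convention, table name, and event fields are hypothetical, not the actual implementation.

# Sketch of the Slack-triggered deploy Lambda. The event is assumed to be the
# already-parsed Slack command passed through API Gateway; names are hypothetical.
import time

import boto3

codepipeline = boto3.client("codepipeline")
state_table = boto3.resource("dynamodb").Table("deploy-state")  # hypothetical table


def handler(event, context):
    environment = event["environment"]      # e.g. "production"
    service = event["service"]              # e.g. "green"
    pipeline = "{}-{}".format(service, environment)  # hypothetical naming scheme

    # The CodePipeline API makes concurrent runs hard to track, so refuse to
    # start a new execution while one is still in progress.
    summaries = codepipeline.list_pipeline_executions(
        pipelineName=pipeline, maxResults=1
    )["pipelineExecutionSummaries"]
    if summaries and summaries[0]["status"] == "InProgress":
        return {"text": "A deploy of {} is already in progress".format(pipeline)}

    execution = codepipeline.start_pipeline_execution(name=pipeline)

    # Record basic state so later notifications know where to send replies.
    state_table.put_item(Item={
        "execution_id": execution["pipelineExecutionId"],
        "channel": event["channel_id"],
        "user": event["user_id"],
        "timestamp": int(time.time()),
    })
    return {"text": "Started deploy of {} to {}".format(service, environment)}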

CodePipeline then starts CodeBuild, which is configured to create Docker image(s) and a simple JSON file that is used to tell CodePipeline’s ECS integration the image tags the task definition should be updated with.
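That file is the image definitions file read by CodePipeline’s ECS deploy action (imagedefinitions.json by default): a JSON array of container name / image URI pairs. A post-build step along these lines could generate it; the container name and environment variables here are hypothetical.

# Sketch of a CodeBuild post_build step that writes the image definitions file.
# The "web" container name and the environment variables are hypothetical.
import json
import os

image_uri = "{}:{}".format(os.environ["ECR_REPOSITORY_URI"], os.environ["IMAGE_TAG"])

with open("imagedefinitions.json", "w") as f:
    json.dump([{"name": "web", "imageUri": image_uri}], f)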

When CodeBuild is finished, a “manual approval” action is used to request human approval before continuing with the deploy. In the example here, I have it turned on for staging environments, but it’s usually only used in production. In production, we normally have 3 stages in the release cycle — the first canary deployment, followed by 25%, then the remaining 75%.
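The approval itself can be granted from the CodePipeline console, but it can also be done programmatically, which is handy if you want to wire an “approve” command into the same Slack integration. This is only a sketch; the stage and action names are hypothetical.

# Hedged sketch: approving a pending manual approval action via the API.
# Pipeline, stage, and action names are hypothetical.
import boto3

codepipeline = boto3.client("codepipeline")


def approve(pipeline, stage="Approval", action="ManualApproval", user="someone"):
    # Look up the token of the pending approval in the pipeline's current state.
    state = codepipeline.get_pipeline_state(name=pipeline)
    for stage_state in state["stageStates"]:
        if stage_state["stageName"] != stage:
            continue
        for action_state in stage_state["actionStates"]:
            token = action_state.get("latestExecution", {}).get("token")
            if action_state["actionName"] == action and token:
                codepipeline.put_approval_result(
                    pipelineName=pipeline,
                    stageName=stage,
                    actionName=action,
                    result={"summary": "Approved via Slack by " + user,
                            "status": "Approved"},
                    token=token,
                )
                return True
    return False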

The rest is relatively straightforward — just CodePipeline telling ECS to deploy images. If errors are detected along the way, a “rollback” command is used to manually roll back changes.

When the deploy is finished, a Lambda function is used to send a message to the deploy channel.
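The post doesn’t show this function either; here is a minimal sketch, assuming (purely for illustration) a CloudWatch Events rule for CodePipeline execution state changes, a Slack incoming webhook, and the same hypothetical DynamoDB table as above.

# Sketch of a deploy-finished notifier. The wiring (CloudWatch Events rule),
# webhook URL, and table name are assumptions, not the original implementation.
import json
import os
import urllib.request

import boto3

state_table = boto3.resource("dynamodb").Table("deploy-state")  # hypothetical


def handler(event, context):
    detail = event["detail"]
    if detail["state"] not in ("SUCCEEDED", "FAILED"):
        return

    # Look up which Slack channel/user started this execution.
    item = state_table.get_item(
        Key={"execution_id": detail["execution-id"]}
    ).get("Item", {})

    text = "Deploy of {} finished: {}".format(detail["pipeline"], detail["state"])
    if detail["state"] == "FAILED" and "user" in item:
        text += " <@{}>".format(item["user"])

    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps({"channel": item.get("channel"), "text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)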

2017: A year in review

A few years ago when I was doing client work, we would regularly host clients’ sites and apps for them. During this time, I was responsible both for development and for keeping everything up and running as much as possible. With most of the money in new development, it was difficult to prioritize improving the operations of existing applications. In this period, I wanted an “operations person” to teach me how to build new applications that would need minimal operations support from the beginning. Failing that, I decided to become “the operations person” myself.

Following that decision, I found myself working at BizReach on the infrastructure team for HRMOS, a Software-as-a-Service product focused on applicant tracking for medium to large enterprises, at the end of 2016. After that, I moved to a small startup, dely, as a Site Reliability Engineer for their flagship product Kurashiru, a recipe video app for iOS and Android.

This is the first full year I’ve worked full-time as a dedicated infrastructure / operations / SRE / DevOps engineer, and I feel like I’ve grown a lot. On the technical side, I was able to lead the migration of complex legacy monolith systems to scalable and resilient independent systems. On the not-so-technical side, I’ve experienced different types of company cultures and managerial styles, and I’ve gotten accustomed to working with teams of engineers — the experience I’d had up until this year was mostly in extremely small teams.

While I do have a passion for making, maintaining, and improving services, I am also very interested in company culture — what makes it and what breaks it — especially when it comes to remote work. I believe most technical engineering work can be done just as efficiently (if not more so) remotely, but there are definite challenges that need to be addressed before I can start leading a change in any position I’m in.

Here’s to 2018! 🎉

Shipping Events from Fluentd to Elasticsearch

We use fluentd to process and route log events from our various applications. It’s simple, safe, and flexible. With at-least-once delivery by default, log events are buffered at every step before they’re sent off to the various storage backends. However, there are some caveats with using Elasticsearch as a backend.

Currently, our setup looks something like this:

The general flow of data is from the application, to the fluentd aggregators, then to the backends — mainly Elasticsearch and S3. If a log event warrants a notification, it’s published to a SNS topic, which in turn triggers a Lambda function that sends the notification to Slack.

The fluentd aggregators are managed by an auto-scaling group, but are not behind a load balancer. Instead, a Lambda function hooked up to the auto-scaling group’s lifecycle notifications updates a DNS round-robin record with the private IP addresses of the fluentd aggregator instances.
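The original function isn’t included here, but a rough sketch of the idea looks like this; the group name, hosted zone, and record name are hypothetical, and completing the lifecycle hook is omitted for brevity.

# Sketch of the DNS round-robin updater triggered by auto-scaling lifecycle
# notifications. Names and IDs are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")
route53 = boto3.client("route53")

ASG_NAME = "fluentd-aggregators"               # hypothetical
HOSTED_ZONE_ID = "Z123EXAMPLE"                 # hypothetical
RECORD_NAME = "fluentd.internal.example.com."  # hypothetical


def handler(event, context):
    # Collect the in-service instances of the aggregator auto-scaling group.
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    instance_ids = [i["InstanceId"] for i in group["Instances"]
                    if i["LifecycleState"] == "InService"]

    # Resolve their private IP addresses.
    reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    ips = [instance["PrivateIpAddress"]
           for r in reservations for instance in r["Instances"]]

    # Publish all of the IPs as a single round-robin A record.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": ip} for ip in ips],
            },
        }]},
    )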

We use the fluent-plugin-elasticsearch plugin to output log events to Elasticsearch. However, because this plugin uses the bulk insert API and does not validate whether events have actually been successfully inserted into the cluster, it is dangerous to rely on it exclusively (hence the S3 backup).

IAM Policy for KMS-Encrypted Remote Terraform State in S3

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket name>/*",
                "arn:aws:s3:::<bucket name>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": [
                "<arn of KMS key>"
            ]
        }
    ]
}

Don’t forget to update the KMS Key Policy, too. I spent a bit of time trying to figure out why it wasn’t working until CloudTrail helpfully told me that the kms:GenerateDataKey permission was also required. Turn CloudTrail on today, even if you don’t need the auditing; it’s an excellent permissions debugging tool.

Authenticating Linux logins via LDAP (Samba / Active Directory)

I’ve been working on the infrastructure for a fleet of a few dozen Amazon EC2 instances for the past week, and with a rapidly growing team, we decided it was time to set up a central authentication / authorization service.

So, that meant setting up some sort of LDAP server.

I was a bit intimidated at first (the closest I’d come before was watching people manage and complain about Active Directory), but I finally got it set up. Here are the components:

  • AWS Directory Service (Simple Directory) is used as the directory server.
  • A t2.large Windows Server instance used to administer the directory (usually kept stopped).
  • A bunch of VPC settings to make the directory service the default DNS resolver of the VPC.
  • An Ansible play I made to:
    • Join the instance to the directory.
    • Configure sshd to pull public keys from the directory.
    • Add an access filter to allow access for users who are members of the appropriate groups.
    • Update sudoers to allow sudo for users who are members of the appropriate groups.

The first three weren’t too hard — Amazon has good documentation and tutorials that cover this well. I recommend reading them in this order:

  1. Tutorial: Create a Simple AD Directory.
  2. Create a DHCP Options Set.
  3. Joining a Windows Instance to an AWS Directory Service Domain — read the limitations and prerequisites (you’ll need a special EC2 IAM Role) — then skip to “Joining a Domain Using the Amazon EC2 Launch Wizard”.
  4. Delegating Directory Join Privileges — this is important for security.
  5. Manually Add a Linux Instance.

On step 5, the realm join command will prompt for a password. I spent a few days trying to figure out the best way to automate this — I tried creating a Kerberos keytab and using that for authentication, but I wasn’t getting consistent results (for some reason that is probably clear to someone who knows a lot about Kerberos, the realm join would work, but after a realm leave, Kerberos would complain that the join account no longer existed — even though I couldn’t find any differences from the AD admin tools). I eventually decided to encrypt the directory join account password in an Ansible vault and use the Ansible expect module to automate the password entry.
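For what it’s worth, the same idea can be expressed outside Ansible; purely as an illustration, a pexpect version of the password automation might look like this (domain and account names are hypothetical, and the real setup uses the Ansible expect module instead).

# Illustration only: automating the realm join password prompt with pexpect.
# The post's actual setup uses Ansible's expect module; names are hypothetical.
import pexpect


def join_domain(domain="corp.example.com", user="join-account", password="secret"):
    child = pexpect.spawn("realm join --user={} {}".format(user, domain), timeout=120)
    child.expect("Password for .*:")  # realmd prompts for the join account's password
    child.sendline(password)
    child.expect(pexpect.EOF)
    child.close()
    return child.exitstatus == 0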

To do

I’m currently using the Active Directory “Users & Groups” administration tool to administer users, but this involves booting a Windows instance every time a change to the directory is made — ideally, I want a simple web-based tool to add/remove/change users, their SSH public keys, and groups. There are a few web-based tools out there already, but the ones I’ve come across are either too complicated or don’t manage SSH keys as well.

Upgrading PostgreSQL on Ubuntu

I recently started using Ubuntu Linux on my main development machine. That means that my PostgreSQL database is running under Ubuntu, as well. I’ve written guides to upgrading PostgreSQL using Homebrew in the past, but the upgrade process under Ubuntu was much smoother.

These steps assume that you are running Ubuntu 16.04 LTS, that your existing data lives in a PostgreSQL 9.5 cluster, and that PostgreSQL 9.6 has already been installed via apt (which creates a fresh 9.6 cluster alongside it).

  1. Stop the postgresql service.
    $ sudo service postgresql stop
    
  2. Move the newly-created PostgreSQL 9.6 cluster elsewhere.
    $ sudo pg_renamecluster 9.6 main main_pristine
    
  3. Upgrade the 9.5 cluster.
    $ sudo pg_upgradecluster 9.5 main
    
  4. Start the postgresql service.
    $ sudo service postgresql start
    

Now, when running pg_lsclusters, you should see something like the following:

9.5 main          5434 online postgres /var/lib/postgresql/9.5/main          /var/log/postgresql/postgresql-9.5-main.log
9.6 main          5432 online postgres /var/lib/postgresql/9.6/main          /var/log/postgresql/postgresql-9.6-main.log
9.6 main_pristine 5433 online postgres /var/lib/postgresql/9.6/main_pristine /var/log/postgresql/postgresql-9.6-main_pristine.log

Verify everything is working as expected, then feel free to remove the 9.5/main and 9.6/main_pristine clusters (pg_dropcluster).
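For example, once you’ve confirmed your data is all present in the upgraded 9.6/main cluster:

  $ sudo pg_dropcluster 9.5 main --stop
  $ sudo pg_dropcluster 9.6 main_pristine --stop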

These cluster commands may be available in other distros, but I haven’t been able to check them. YMMV. Good luck!

macOS Sierra

Here are the headline features of Sierra, and my thoughts about them.

Siri

Don’t use it on the phone, won’t use it on the Mac. It would be nice if I could use it by typing, though.

Universal Clipboard

This seems like a massive security risk. It works via iCloud, so if someone (like my daughter) is using my iPad, then it will sync the clipboard to that, too. How do I turn it off?

Auto Unlock

I can’t use it, because “two-factor authentication” is apparently completely different from “two-step verification”, and the two are incompatible. I can switch to 2FA, but that means I have to log out of and back in to everything, right? Will that help my iMessages that are already not syncing between my Mac and my iPhone?

(Side rant: I can’t use any of the iMessage features because I’m completely on Telegram now, because of the aforementioned troubles with iMessages being unreliable. Sometimes they come to the Mac. Sometimes the iPhone. Sometimes both. Who knows! I need some kind of cloud forecast app to know. Or learn how to read the stars. Or something. Is this the kind of stuff they put in horoscopes?)

Apple Pay

Not available in Japan yet (but will be coming soon, to the “major” credit card brands: JCB and MasterCard. VISA isn’t “major”, I suppose).

Photos

For some reason, the photos themselves sync between machines, but the facial recognition metadata doesn’t? Progress means reducing the number of things we have to babysit, not adding more. I don’t want to babysit my photos by correcting facial recognition guesses on both machines, across 5,000+ photos.

Messages

Hmm.

iCloud Drive

Is it better than Dropbox yet?

Optimized Storage

All of a sudden, the voice of one of my podcast panelists simply vanished from the mix. I quit and re-launched Logic, only to be told that the file in question was missing. Sure enough, a visit to Finder revealed that Sierra had “optimized” my storage and removed that file from my local drive.

macOS Sierra Review by Jason Snell at sixcolors.com

No thanks.

Picture in Picture

Okay, this might be useful sometimes.

Tabs

“Finally”.

Conclusion

2/10

Look, I don’t mean to belittle the excellent work of the team of engineers responsible for macOS at Apple. They’re doing a good job. The upgrade process to Sierra was smooth. There are just a lot of features in this release that don’t excite me. I wish they had fixed other things, like that silly facial recognition sync issue. Maybe they will in the future. I really hope so.

Also: Don’t say “then just use Linux”, because I already do. It’s not as elegant as macOS, but it’s reliable for the work I do (web programming) and when something does go wrong, I know how to fix it.

If you want to get into an argument with me about how I’m wrong, please make an App.net account and contact me there; I’m @keita.