Categories
AWS

Working with DynamoDB Global Tables

Just some stuff I’ve picked up while working with DynamoDB Global Tables. This was my first time using them; I used global tables to move a few tables from one region to another without downtime.

When deleting replica tables…

Note that this operation will delete the replica table and is non-reversible. This replica table cannot be re-added later to the global table.

This warning message is a little misleading — the replica table will be deleted, but it’s possible to create a new replica table in that region later.

Replica cannot be deleted because it has acted as a source region for new replica(s) being added to the table in the last 24 hours.

You have to wait 24 hours after a region has acted as a source for a new replica before you can delete that region.
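
For reference, adding and removing replicas both go through the UpdateTable API. A sketch with the AWS CLI, assuming the current (2019.11.21) version of global tables; the table and region names are placeholders:

# Add a replica in eu-west-1
$ aws dynamodb update-table --table-name my-table \
    --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'

# Delete the us-east-1 replica. This fails with the error quoted above if
# us-east-1 acted as a source for a new replica within the last 24 hours.
$ aws dynamodb update-table --table-name my-table --region eu-west-1 \
    --replica-updates '[{"Delete": {"RegionName": "us-east-1"}}]'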

Other stuff:

  • If you create a GSI in one region, it will automatically be created in all other regions as well.
  • If you delete a table from the list of tables (instead of the “Global Tables” tab), it will delete normally. All of the other tables in other regions will be unaffected.
Categories
AWS English

My Brief Thoughts on the AWS Kinesis Outage

There have been multiple analyses of the recent (2020/11/25) outage of AWS Kinesis and its cascading failure mode, which took a chunk of AWS services with it — including seemingly unrelated Cognito — due to dependencies hidden from the user. If you haven’t read the official postmortem statement by AWS yet, go read it now.

There is an infinite number of arguments that can be made about cascading failure; I’m not here to talk about that today. I’m here to talk about a time a few years ago when I was evaluating a few systems for event logging. Naturally, Kinesis was in consideration, and our team interviewed an AWS Solutions Architect about potential design patterns we could implement, what problems they would solve, what hiccups we might encounter along the way, et cetera.

At the time, I didn’t think much of it, but in hindsight it should have been a red flag.

ME: “So, what happens when Kinesis goes down? What kind of recovery processes do you think we need to have in place?”

SA: “Don’t worry about that, Kinesis doesn’t go down.”

The reason I didn’t think much of it at that time was that our workload would have been relatively trivial for Kinesis, and I mentally translated the reply to “don’t worry about Kinesis going down for this particular use case”.

We decided to not go with Kinesis for other reasons (I believe we went with Fluentd).

Maybe my mental translation was correct? Maybe this SA had an image of Kinesis as a system that was impervious to fault? Maybe it was representative of a larger problem inside AWS of people overestimating the reliability of Kinesis? I don’t know. This is just a single data point — it’s not even that strong. A vague memory from “a few years ago”? I’d immediately dismiss it if I heard it.

The point of this post is not to disparage this SA, nor to disparage the Kinesis system as a whole (it is extremely reliable!), but to serve as a reminder (mostly to myself) that one should be immediately suspicious when someone says “never” or “always”.

Categories
AWS WordPress Japanese

Running WordPress on AWS Lambda

I previously posted an article (in English) about running WordPress on AWS Lambda, but I wrote it before EFS support was available. Being able to use EFS completely changes the environment WordPress runs in inside AWS Lambda, so I wrote a new article.

This time, I chose Terraform over SAM. There are a few reasons, but mainly, almost all of the infrastructure I manage is managed with Terraform, so it integrates better with my existing environments.

It’s published as a Terraform module. The source code is available on GitHub.

How well does it actually work?

It’s decent. It’s running at this link. It’s not blazing fast, but it’s not too slow, either. Caching static assets with CloudFront and tuning opcache sped it up considerably.

Normally, when Lambda functions run concurrently, each instance executes in isolation, but with EFS the file system is synchronized across the different Lambda instances. Because of this, WordPress updates, theme and plugin installs, uploads, and so on all work as usual.

What You’ll Need

In this tutorial, the Lambda source code contains only the environment needed to run PHP. The WordPress files are installed on the EFS volume, so you’ll need to prepare a separate EC2 instance for that.

Here’s a list of exactly what you’ll need to prepare:

  1. A valid AWS account.
  2. A private subnet with Internet access. Using EFS requires Lambda to run inside a VPC, but then it can’t reach the Internet without a NAT gateway or NAT instance (comparison).
  3. Terraform 0.12 or later.
  4. A MySQL database.
  5. An EC2 instance for the initial install of the WordPress files.

A list of the resources Terraform creates is available here.

Steps

This tutorial uses the Terraform module standalone; if you’re embedding it in an existing Terraform environment, adjust accordingly.

If you’re following this tutorial as-is, use the us-west-2 region. Lambda doesn’t provide PHP out of the box, so it runs on a custom Lambda layer. The layer I published is currently only available in us-west-2. I’m working on publishing it in other regions, but in the meantime you can build your own from my fork of php-lambda-layer in the region of your choice.

1. Launch the EC2 instance.

I chose a t3a.nano this time. I recommend installing the amazon-efs-utils package ahead of time.
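
On Amazon Linux 2, for example, that looks like this:

$ sudo yum install -y amazon-efs-utils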

While you’re in the console, note down the ID of a security group that can access the database and the IDs of the private subnets where Lambda will run.

2. Set up the Terraform environment.
$ git clone https://github.com/KotobaMedia/terraform-aws-wordpress-on-lambda-efs
$ cd ./terraform-aws-wordpress-on-lambda-efs

Create a file called local.auto.tfvars in that directory, with the following contents:

# An array with the security group ID you noted in step 1
security_group_ids = ["sg-XXX"]

# An array with the subnet IDs you noted in step 1
subnet_ids = ["subnet-XXX", "subnet-XXX", "subnet-XXX"]

If you want to use a custom domain instead of the default ~~.cloudfront.net domain, also set the acm_certificate_arn and domain_name variables.

Run an apply, and Terraform will create the resources for you.

$ terraform apply

If you’re asked for AWS credentials, abort (Ctrl-C) and set your credentials via environment variables. In my case, since I manage multiple AWS environments, I often use the AWS_PROFILE environment variable.

Before Terraform actually creates anything, it shows a diff against the infrastructure that’s actually running. Everything here should be newly created, so everything should show up as an addition. Review the plan, then answer yes.

Because a CloudFront distribution is included, the apply takes a while (in my case, about 5 minutes total). When it finishes, several output variables are displayed. To print them again later, use the terraform output command.
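
For example, to print a single output value again:

$ terraform output efs_file_system_id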

Keep that terminal open; you’ll need it later.

3. Mount the EFS file system on the EC2 instance.

The EC2 instance needs a security group assigned so it can access EFS. The Terraform run in step 2 created a security group that allows access to EFS, so use that: attach the security group in the efs_security_group_id output variable to your EC2 instance.

Next, log in to the EC2 instance and mount EFS. In the commands below, replace fs-XXXXX with the value of the efs_file_system_id output variable.

$ sudo -s
# mkdir /mnt/efs
# mount -t efs fs-XXXXX:/ /mnt/efs

If you run into problems, check the User Guide, which lists common issues.

4. Install WordPress.

With the file system mounted, we can finally install the WordPress files. Terraform created a new directory with a random suffix (/mnt/efs/roots/wp-lambda-$RANDOM_STRING), so cd there.

Download the latest version of WordPress and extract it there.
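
For example (the directory suffix here is a placeholder; check the actual name under /mnt/efs/roots):

$ cd /mnt/efs/roots/wp-lambda-XXXXX
$ curl -LO https://wordpress.org/latest.tar.gz
$ tar xzf latest.tar.gz --strip-components=1
$ rm latest.tar.gz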

From here, you should be able to proceed with the usual WordPress install. If you’re not using a custom domain, the site is accessible at the domain in the cloudfront_distribution_domain_name output. If you are using a custom domain, point a CNAME at the CloudFront distribution’s domain, and the site should be accessible at the domain you specified.

Where to go from here

Performance isn’t the best as-is, but the following optimizations could be added:

  • Upload files to S3 instead of EFS. I often use Humanmade’s S3 Uploads plugin.
  • Tune the opcache settings in src/php.ini.
  • Serve static assets (JS / CSS, etc.) from a lightweight nginx server.
  • Adjust handler.php to add Cache-Control headers, so CloudFront’s cache can be used more effectively.

Limitations

There are hard limits imposed by the technical constraints of the AWS services involved.

This style of operation also brings other limitations:

  • There’s no FTP or interactive SSH access. You’ll have to manage files through an EC2 instance.
  • Lambda, a platform with effectively unlimited concurrency, connects to MySQL, which has a finite number of connections. If connections become a problem, Aurora Serverless is worth a try.

Thanks for reading to the end!

If you have questions, parts that were hard to follow, suggestions for improvement, or other comments, please reach out via the comment form below or on Twitter.

Categories
AWS English WordPress

WordPress on AWS Lambda (EFS Edition)

I previously wrote a post about running WordPress on AWS Lambda, but it was before EFS support was announced (EFS is a managed network file system AWS provides). Being able to use EFS completely changes the way WordPress works in Lambda (for the better!), so I felt it warranted a new blog post.

In addition, this time I’m using Terraform instead of SAM. This matches the existing infrastructure-as-code setup I use when I deploy infrastructure for clients. Here’s the Terraform module (source code).

Summary

It works. It’s OK. Check it out, it’s running here. It’s not the best, but it isn’t bad, either. The biggest performance bottleneck is the EFS filesystem, and there’s no getting around that. PHP is serving static assets bundled with WordPress as well, which adds to some latency (in this configuration, CloudFront is caching most of these files, however). Tuning opcache to cache files in memory longer helped a lot.

Because EFS is synchronized across all the instances of Lambda, online updates, installs, and uploads work as expected.

What You’ll Need

In this setup, Lambda is only used for running PHP — installing the initial WordPress files is done on an EC2 instance that has the EFS volume mounted. This is a list of what you’ll need.

  1. An AWS account.
  2. A VPC with Internet access through a NAT gateway or instance (comparison). This is important because EFS connectivity requires Lambda to be set up in a VPC, but it won’t have Internet access by default.
  3. Terraform (the module uses v0.12 syntax, so you’ll need to use v0.12.)
  4. A MySQL database (I’m using MySQL on RDS using the smallest instance available)
  5. An EC2 instance to perform the initial setup and install of WordPress.

For a list of the resources that Terraform will provision, take a look at the Resources page here.

Steps

These steps assume you’re running this Terraform module standalone — if you want to run it in the context of an existing Terraform setup, prepare to adjust accordingly.

If you’re following this step-by-step, be sure to choose the us-west-2 region. The Lambda layer that I’m using is only published in the us-west-2 region. I’m working on getting the layer published in other regions, but in the meantime, use my fork of the php-lambda-layer to create your own in the region of your choosing.

1. Start the EC2 instance.

(If it isn’t already running)

I’m using a t3a.nano instance. Install the amazon-efs-utils package to get ready for mounting the EFS volume.

Also, while you’re in the console, note down the ID of a Security Group that allows access to RDS and the IDs of the private subnets to launch Lambda in.

2. Get Terraform up and running.
$ git clone https://github.com/KotobaMedia/terraform-aws-wordpress-on-lambda-efs
$ cd ./terraform-aws-wordpress-on-lambda-efs

Create a file called local.auto.tfvars, and put the following contents into it:

# An array of the Security Group IDs you listed in step 1.
security_group_ids = ["sg-XXX"]

# An array of the Subnet IDs you listed in step 1.
subnet_ids = ["subnet-XXX", "subnet-XXX", "subnet-XXX"]

If you want to use a custom domain name (instead of the default randomly-generated CloudFront domain name), set the acm_certificate_arn and domain_name variables as well.
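
For example, appended to local.auto.tfvars (the ARN and domain are placeholders; note that certificates used with CloudFront must be issued in us-east-1):

$ cat >> local.auto.tfvars <<'EOF'
acm_certificate_arn = "arn:aws:acm:us-east-1:123456789012:certificate/aaaabbbb-cccc-dddd-eeee-ffff00001111"
domain_name         = "blog.example.com"
EOF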

Now, you’re ready to create the resources.

$ terraform apply

If you’re asked for your AWS credentials, Ctrl-C and try setting the authentication information via environment variables. I manage a lot of AWS accounts, so I use the AWS_PROFILE environment variable.

Terraform will ask you if you want to go ahead with the apply or not — look over the changes (the initial apply should not have any modifications or deletions), then respond yes.

When the apply has finished, you should see some outputs. If you don’t (or you already closed the window), you can always run terraform output. Keep this window open, you’ll need it in the next step.

3. Mount EFS on the EC2 instance.

First, we need to give the EC2 instance access to the EFS filesystem. Terraform created a security group for us (it’s in the efs_security_group_id output), so attach that to your EC2 instance.

Log in to your EC2 server, then mount the EFS filesystem (replace fs-XXXXX with the value of the efs_file_system_id output):

$ sudo -s
# mkdir /mnt/efs
# mount -t efs fs-XXXXX:/ /mnt/efs

If you’re having trouble mounting the filesystem, double check the security groups and take a look at the User Guide.
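
As a quick sanity check, the EFS mount should show up in df once it has succeeded:

$ df -h /mnt/efs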

4. Install WordPress.

Now that the filesystem is mounted, we can finally proceed to install WordPress. Terraform automatically created a directory in the EFS filesystem (/mnt/efs/roots/wp-lambda-$RANDOM_STRING), so cd to there first. Download the latest version of WordPress, then extract the files there.

Now, you can go ahead with the famous five-minute install like you would with any other WordPress site! If you didn’t set a custom domain name, your site should be accessible at the domain name outputted at cloudfront_distribution_domain_name. If you did set a custom domain, then set a CNAME or alias to the CloudFront distribution domain name, then you should be able to access the site there.

Where to go from here

Here are some ideas for performance improvements that I haven’t tried, but should have some potential.

  • Upload files to S3 instead of EFS. I use this plugin by Human Made: humanmade/S3-Uploads.
  • Experiment with adjusting the opcache settings in src/php.ini.
  • Use a lightweight nginx server to serve static assets from EFS to CloudFront.
  • Experiment with setting Cache-Control headers in handler.php for static files.
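
A quick way to check whether CloudFront is actually caching a given asset is the X-Cache response header (the URL here is a placeholder):

$ curl -sI https://dxxxxxxxxxxxx.cloudfront.net/wp-includes/css/dashicons.min.css | grep -iE 'cache-control|x-cache'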

Limitations

There are a couple hard limits imposed by AWS due to the technical limitations of the infrastructure.

Here are some other limitations that you’ll have to keep in mind.

  • No FTP / SSH access — you’ll need to manage an EC2 instance if you need command line or direct file access.
  • All the considerations of accessing a connection-oriented database from Lambda. You can try using Aurora Serverless if you run in to connection problems. RDS Proxy may also be able to provide you with a solution.

Thanks!

Thanks for reading! If you have any questions or comments, please don’t hesitate to leave a comment or send me a tweet.

Categories
AWS

Rails on AWS: Do you need nginx between Puma and ALB?

When I set up Rails on AWS, I usually use the following pattern:

(CloudFront) → ALB → Puma

I was wondering: Is it always necessary to put nginx between the ALB and Puma server?

My theory behind not using nginx is this: nginx has its own request queue (whereas the Classic Load Balancer had a very limited “surge queue”, the ALB has no such queue), so while it helps get responses back to the user (at the cost of increased latency), it hinders the metrics used for autoscaling and for choosing which backend to route a request to (such as Rejected Connection Count).

I couldn’t find any in-depth articles about this, so I decided to prove my theory (in)correct by myself.

In this test, the application servers will be running using ECS on Fargate (platform version 1.4.0). It’s a very simple “hello world” app, but I’ll give it a bit of room to breathe with each instance having 1 vCPU and 2GB of RAM. I’ll be using Gatling on a single c5n.large instance (“up to 25 gigabits” should be enough for this test).

In this test, I wanted to try out a few configurations that mimic characteristics of applications I’ve worked on: short and long requests, usually IO-bound. A short request just renders a simple HTML template; a long request takes 300ms to respond. The requests are ramped from 1 request/sec to 1000 requests/sec over 5 minutes.

Response Time Percentiles over Time (OK responses), simple render — 4 instances, 20 threads each, connected directly to the ALB.
Response Time Percentiles over Time (OK responses), simple render — 4 instances, 20 threads each, using Nginx.

As you can see, for the simple render scenario, Puma connected directly to the ALB and Puma behind Nginx performed mostly the same. As load approached 1000 requests/sec, latency started to get worse, but all requests were completed with an OK status.

The 300ms scenario was a little more grim.

Number of responses per second (green OK, red error), 300ms response — 4 instances, 20 threads each, connected directly to the ALB.
Number of responses per second (green OK, red error), 300ms response — 4 instances, 20 threads each, using Nginx.

My theory that Puma will fail fast and give error status to the ALB when reaching capacity was right. The theoretical maximum throughput is 4 instances * 20 threads * (1000ms in 1 second / 300ms) = 266 requests/sec. Puma handles about 200 requests/sec before returning errors; Nginx starts returning error status at around 275 requests/sec, but at that point requests are already queueing and the response time is spiking.

Remember, these results are for this specific use case, and results for a test specific to your use case probably will be different, so it’s always important to do load testing tailored to your environment, especially for performance critical areas.

Categories
AWS English

Hosting a Single Page Application with an API with CloudFront and S3

I’ve written about how to host a single page application (SPA) on AWS using CloudFront and S3 before, using the CloudFront “rewrite not found errors as a 200 response with index.html” trick.

Recently, working on a few serverless apps, I’ve realized that this trick, while quick, isn’t perfect. The specific case where it broke down was when the API is configured as a behavior on CloudFront (I usually scope the API to /api on the same domain as the frontend, so CORS and OPTIONS requests aren’t necessary). If the API returned a 404 Not Found response, CloudFront would rewrite it to 200 OK index.html, and the front-end application would get confused. Unfortunately, CloudFront doesn’t support customized error responses per behavior, so the only way to fix this was to use Lambda@Edge instead.

Here’s the code for the Lambda function:

'use strict'

const path = require('path')

exports.handler = (evt, context, cb) => {
  const { request } = evt.Records[0].cf

  const uriParts = request.uri.split("/")

  if (
    // Root resource with a file extension.
    (
      uriParts.length === 2 && path.extname(uriParts[1]) !== ""
    ) ||
    // Anything inside the "static" directory.
    uriParts[1] === "static"
  ) {
    // serve the original request to S3
  } else {
    // change the request to index.html
    request.uri = '/index.html'
  }

  cb(null, request)
}

This code assumes that any root-level request with a file extension, and anything inside the /static/ directory, is a static file that should be served from S3. All other requests are rewritten to index.html. These defaults match create-react-app, but you’ll probably need to change them to meet your requirements. (Remember, Lambda@Edge functions need to be created in us-east-1.)

Attach this Lambda function to the CloudFront behavior responsible for serving from the S3 origin as an origin-request trigger, and you should be good to go. Don’t forget to remove the 404-to-200 rewrite.
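
One wrinkle: CloudFront associates a specific published version of the function, not $LATEST. With a hypothetical function name:

# Lambda@Edge functions live in us-east-1; publish a version first
$ aws lambda publish-version --function-name spa-index-rewrite --region us-east-1
# Then associate the returned version ARN with the behavior's origin-request event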

Categories
AWS English WordPress

Serverless WordPress on AWS Lambda

Update 2020/07/29: AWS recently announced EFS support for Lambda, which makes running WordPress in Lambda easier, with fewer limitations. Here’s the new article about how to run WordPress in Lambda using EFS.

There are a few ways to run WordPress “serverless” on AWS. I’m going to talk about running WordPress on Lambda for this article. If you’re interested in how you can run WordPress serverless-ly on Fargate, I’m working on a post about that too.

Keep in mind that while it is possible to do this, it’s not for everyone. It’s probably not for me. Probably not for you. Use at your own risk!

Before we start, there is a core feature of Lambda that makes running WordPress in Lambda quite troublesome: the read-only file system. WordPress expects a writable, persistent, local file system. We’ll be using the S3 Uploads plugin by Human Made to handle media uploads. However, core and plugin updates will not work. There’s no workaround for this, so to install or update files, we’ll need to make a new Lambda deployment.

So: let’s go! First, you’ll want to clone my boilerplate repository. I’ve prepared a WordPress installation and a simple glue script to actually boot WordPress.

$ git clone https://github.com/keichan34/wordpress-on-lambda

My plan of attack is: run WordPress in the Lambda function using a PHP custom runtime, make uploads work with S3 instead of the local filesystem, and wire up the database. In the repository above, I’ve configured static assets to be served from S3 as well.

Now, let’s prepare the database. Lambda has two networking modes: public and VPC mode. In public mode, the Lambda function has default access to the public internet, but nothing else. In VPC mode, the Lambda function is booted inside the VPC and doesn’t have public internet access by default. Because WordPress requires public internet access, we have to either run it in public mode, or run it in VPC mode and provision a NAT gateway (about $30 to $50 a month, depending on the region). If Lambda runs in public mode, the database must also be publicly accessible — something that is frowned upon from a security standpoint. You should choose the option that fits your risk and price profile. In my case, I’m going with the NAT gateway route.

Now that we’ve got the messy stuff out of the way, we’ll have to assemble the Lambda runtime. AWS has an article on their blog detailing how to make a PHP custom runtime, but Stackery provides a batteries-included PHP layer. It includes everything you need to make a PHP application that assumes it’s running in a traditional server environment run in AWS Lambda.

# Replace "km-wordpress-on-lambda-deployment-201906" with something that makes sense for you. It's globally unique, so copying and pasting this will result in an error.
# Make sure you're in the same region as your database!

$ DEPLOY_BUCKET="km-wordpress-on-lambda-deployment-201906"
$ aws s3 mb "s3://$DEPLOY_BUCKET"
$ cd <the directory you cloned the GitHub repository to>

Now, it’s time to install WordPress! We’ll add the WordPress files to the deployment package. As usual, copy wp-config-example.php to wp-config.php. Enter your database details. If you have a hostname that you’re going to use with CloudFront, enter it now. If not, you’ll have to wait until after the CloudFront distribution is created, then try again.
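
That step boils down to:

$ cp wp-config-example.php wp-config.php
# then edit wp-config.php with your database details (and hostname, if known)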

Now, let’s deploy. This will create a new CloudFront distribution and S3 bucket for public assets, so maybe it’s a good time to make a cup of coffee. If you haven’t installed the SAM CLI, do that before the next block.

$ sam package --template-file template.yaml --output-template-file serverless-output.yaml --s3-bucket "$DEPLOY_BUCKET"
$ sam deploy --template-file serverless-output.yaml --stack-name wordpress-on-lambda --capabilities CAPABILITY_IAM
$ aws s3 sync ./src/php s3://deploy-bucket-XXXXX --exclude "*.php" --exclude "*.ini"

I’ll be using the default CloudFront domain for this demo. If you’re going to be using your own domain, you need to modify the template.yaml file to add an alias to the CloudFront distribution. Use the following command to show the CloudFront domain name.

$ aws cloudformation describe-stacks --stack-name wordpress-on-lambda | jq '.Stacks[0].Outputs'

OK! Now, you should be able to access the CloudFront URL, and you’ll get redirected to the friendly WordPress installer! If you’ve set up your wp-config.php correctly, the installation should go smoothly.

The site I set up for this post is available here: https://dskhgdbzphjkm.cloudfront.net/

Lessons Learned

This is for almost no-one. I think the only valid use case (in this current form) for running WordPress in AWS Lambda is a site that gets periodic, unpredictable spikes of intense traffic — a use case where Lambda’s scalability and price model pays off. This is also a use case where, presumably, the benefits of the scalability trump the inconvenience of not being able to use the online updaters and installers (also, I’m assuming the database will be able to keep up with the load).

However, if updating and installing themes or plugins could be managed outside of the Lambda environment (say, with wp-cli), with deployments automated… Then, it may be a little more applicable to a larger audience.

If you’re looking for a cheap solution to host your personal blog (like me!), you might just want to bite the bullet and check out any of the hosted WordPress solutions out there.

If you liked this post, or you’d like to provide some input, please do so in the comments. My favorite AWS service is Lambda, and I like pushing it a bit, so look forward to similar posts in the future. If you find bugs in the boilerplate, or you can make improvements, please open an issue or PR!

Miscellaneous Tidbits

  • Aurora Serverless sounds like it would be the best match for this setup. It probably is. Just keep in mind that Aurora Serverless doesn’t support publicly accessible clusters. To use it, you’ll need to go the Lambda-in-VPC, NAT gateway route.
  • Regarding public / private access and NAT gateways, if you’re like me and believe in the future of IPv6 and think that you can just use an egress-only internet gateway – you’re wrong! Lambda doesn’t seem to support IPv6 at this time.
  • You can actually use a NAT instance if the NAT gateway is overkill. However, I would recommend using the NAT gateway if you can. It comes with automatic scalability and redundancy, so you don’t have to babysit your NAT instance. (If you need more than one NAT instance, use the gateway. Seriously.)
  • At time of writing, my patches to php-lambda-layer haven’t been merged yet, so you can use my patched version (the boilerplate repository has this applied already).
  • If you’re really going all-in, consider using an Application Load Balancer rather than API Gateway to save money. API Gateway has zero fixed costs, but there is a point where ALB will become cheaper than API Gateway.
  • Doing some crude calculations, you should be able to handle an average of a few hundred users per day under the perpetual free tier. Your highest bill may be data transfer to the user.
Categories
AWS English

Managing ECS clusters, 4 years in.

Throughout these past 4 years since AWS ECS became generally available, I’ve had the opportunity to manage 4 major ECS cluster deployments.

Across these deployments, I’ve built up knowledge and tools to help manage them, make them safer, more reliable, and cheaper to run. This article has a bunch of tips and tricks I’ve learned along the way.

Note that most of these tips are rendered useless if you use Fargate! I usually use Fargate these days, but there are still valid reasons for managing your own cluster.

Spot Instances

ECS clusters are great places to use spot instances, especially when managed by a Spot Fleet. As long as you handle the “spot instance is about to be terminated” event, and set the container instance to draining status, it works pretty well. When ECS is told to drain a container instance, it will stop the tasks cleanly on the instance and run them somewhere else. I’ve made the source code for this Lambda function available on GitHub.

Just make sure your app is able to stop itself and boot another instance in 2 minutes (the warning time you have before the spot instance is terminated). I’ve experienced overall savings of around 60% when using a cluster exclusively comprised of spot instances (EBS is not discounted).
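
Whether you poll the instance metadata service or subscribe to the interruption event, the key call is setting the container instance to DRAINING. A sketch with placeholder names (my actual Lambda function is linked above):

# On the instance: IMDS returns an interruption notice once termination is scheduled
$ curl -s http://169.254.169.254/latest/meta-data/spot/instance-action

# Then mark the container instance as DRAINING so ECS relocates its tasks
$ aws ecs update-container-instances-state \
    --cluster my-cluster \
    --container-instances "$CONTAINER_INSTANCE_ARN" \
    --status DRAINING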

Autoscaling Group Lifecycle Hooks

If you need to use on-demand instances for your ECS cluster, or you’re using a mixed spot/on-demand cluster, I recommend using an Autoscaling Group to manage your cluster instances.

To prevent the ASG from stopping instances with tasks currently running, you have to write your own integration. AWS provides some sample code, which I’ve modified and published on GitHub.

The basic gist of this integration is:

  1. When an instance is scheduled for termination, the Autoscaling Group sends a message to an SNS topic.
  2. Lambda is subscribed to this topic, and receives the message.
  3. Lambda tells the ECS API to drain the instance that is scheduled to be terminated.
  4. If the instance has zero running tasks, Lambda tells the Autoscaling Group to continue with termination. The Autoscaling Group terminates the instance at this point.
  5. If the instance has more than zero running tasks, Lambda waits for some time and sends the same message to the topic, returning to step (2).

By default, I set the timeout for this operation to 15 minutes. This value depends on the specific application. If your applications require more than 15 minutes to cleanly shut down and relocate to another container instance, you’ll have to set this value accordingly. (Also, you’ll have to change the default ECS StopTask SIGTERM timeout — look for the “ECS_CONTAINER_STOP_TIMEOUT” environment variable)
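
For reference, that stop timeout is an ECS agent setting on the container instance; for example, in the instance user data (the value here is illustrative):

$ echo 'ECS_CONTAINER_STOP_TIMEOUT=10m' >> /etc/ecs/ecs.config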

Cluster Instance Scaling

Cluster instance scale-out is pretty easy. Set some CloudWatch alarms on the ECS CPUReservation and MemoryReservation metrics, and you can scale out according to those. Scaling in is a little more tricky.
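
As a sketch, a scale-out alarm on the cluster’s CPU reservation might look like this with the AWS CLI (names, thresholds, and the policy ARN are placeholders):

$ aws cloudwatch put-metric-alarm \
    --alarm-name "my-cluster-cpu-reservation-high" \
    --namespace "AWS/ECS" \
    --metric-name "CPUReservation" \
    --dimensions "Name=ClusterName,Value=my-cluster" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$SCALE_OUT_POLICY_ARN"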

I originally used those same metrics to scale in. Now, I use a Lambda script that runs every 30 minutes, cleaning up unused resources until a certain threshold of available CPU and memory is reached. This technique further reduces service disruption. I’ll post this on GitHub sometime in the near future.

Application Deployment

I’ve gone through a few application deployment strategies.

  1. Hosted CI + Deploy Shell Script
    • Pros: simple.
    • Cons: you need somewhere to run it, easily becomes a mess. Shell scripts are a pain to debug and test.
  2. Hosted CI + Deploy Python Script (I might put this on GitHub sometime)
    • Pros: powerful, easier to test than using a bunch of shell scripts.
    • Cons: be careful about extending the script. It can quickly become spaghetti code.
  3. Jenkins
    • Pros: powerful.
    • Cons: Jenkins.
  4. CodeBuild + CodePipeline
    • Pros: simple; ECS deployment was recently added; can be managed with Terraform.
    • Cons: Subject to limitations of CodePipeline (pretty limited). In our use case, the sticking points are not being able to deploy an arbitrary Git branch (you have to deploy the branch specified in the CodePipeline definition).

Grab-bag

Other tips and tricks

  • Docker stdout logging is not cheap (also, performance is highly variable across log drivers — I recently had a major problem with the fluentd driver blocking all writes). If your application blocks on logging (looking at you, Ruby), performance will suffer.
  • Having a few large instances yields more performance than many small instances (with the added benefit of having the layer cache when performing deploys).
  • The default placement strategy should be: binpack on the resource that is most important to your application (CPU or memory), AZ-balanced
  • Applications that can’t be safely shut down in less than 1 minute do not work well with Spot instances. Use a placement constraint to make sure these tasks don’t get scheduled on a Spot instance; you’ll have to set the attribute yourself, probably using the EC2 user data (see the sketch after this list)
  • Spot Fleet + ECS = ❤️
  • aws ecs update-service help for service administration commands. I use --force-new-deployment and --desired-count quite often.
  • If you manage your own EC2 instances with Auto Scaling Groups: aws autoscaling terminate-instance-in-auto-scaling-group --instance-id "i-XXX" --no-should-decrement-desired-capacity will start a new EC2 instance and perform termination lifecycle hooks on it. This is what I use to switch out old EC2 instances with new launch configurations.
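
Here’s the spot placement sketch mentioned above: tag spot instances with a custom attribute via the ECS agent config, then keep slow-stopping services off them with a placement constraint (the attribute name and value are my own convention):

# In the spot instance's EC2 user data:
$ echo 'ECS_INSTANCE_ATTRIBUTES={"lifecycle": "spot"}' >> /etc/ecs/ecs.config

# In the service's placement constraints (type "memberOf"):
#   attribute:lifecycle != spot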
Categories
AWS English

AWS Application Auto-scaling for ECS with Terraform

Update: Target tracking scaling is now available for ECS services.

I’ve been working on setting up autoscaling settings for ECS services recently, and here are a couple notes from managing auto-scaling for ECS services using Terraform.

Creating multiple scheduled actions at once


Terraform will perform the following actions:

  + aws_appautoscaling_scheduled_action.green_evening
      id:                                    <computed>
      arn:                                   <computed>
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "2"
      schedule:                              "cron(0 15 * * ? *)"
      service_namespace:                     "ecs"

  + aws_appautoscaling_scheduled_action.wapi_green_morning
      id:                                    <computed>
      arn:                                   <computed>
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "3"
      schedule:                              "cron(0 23 * * ? *)"
      service_namespace:                     "ecs"

This fails with:


* aws_appautoscaling_scheduled_action.green_evening: ConcurrentUpdateException: You already have a pending update to an Auto Scaling resource.

To fix, the scheduled actions need to be executed serially.


resource "aws_appautoscaling_scheduled_action" "green_morning" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 23 * * ? *)"

  scalable_target_action {
    min_capacity = 3
    max_capacity = 20
  }
}

resource "aws_appautoscaling_scheduled_action" "green_evening" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 15 * * ? *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 20
  }

  # Application AutoScaling actions need to be executed serially
  depends_on = ["aws_appautoscaling_scheduled_action.green_morning"]
}
Categories
AWS English

ECS ChatOps with CodePipeline and Slack

I’m currently working on migrating a Rails application to ECS at work. The current system uses a heavily customized Capistrano setup that’s showing its age, especially when deploying to more than 10 instances at once.

While patiently waiting for EKS, I decided to use ECS over managing my own Kubernetes cluster on AWS using something like kops. I was initially planning on using Lambda to create the required task definitions and update ECS services, but native CodePipeline deploy support for ECS was announced right before I started planning the project, which greatly simplified the deploy step.

The current setup we have now is: a few Lambda functions to link CodePipeline and Slack together, two CodePipeline pipelines per service (one for production and one for staging), and the associated ECS resources.

First, a deploy is triggered by saying “deploy [environment] [service]” in the deploy channel. Slack sends an event to Lambda (via API Gateway), and Lambda starts an execution of the CodePipeline pipeline if it is not already in progress (because of the way CodePipeline API operations work, it’s hard to work with multiple concurrent runs). This Lambda function also records some basic state in DynamoDB — namely, the Slack channel, user, and timestamp. This information is used to determine what channel to send replies to, and what user to mention if something in the deploy process goes awry.
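
The CodePipeline side of that Lambda function boils down to two API calls, sketched here with the CLI and a placeholder pipeline name:

# Check whether an execution is already in progress
$ aws codepipeline get-pipeline-state --name "myservice-production"

# If not, kick one off
$ aws codepipeline start-pipeline-execution --name "myservice-production"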

CodePipeline then starts CodeBuild, which is configured to create Docker image(s) and a simple JSON file that is used to tell CodePipeline’s ECS integration the image tags the task definition should be updated with.
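
That JSON file is the imagedefinitions.json that CodePipeline’s ECS deploy action expects: a list mapping container names in the task definition to image URIs. A hypothetical example, written at the end of the CodeBuild buildspec:

$ cat > imagedefinitions.json <<EOF
[{"name": "web", "imageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:$IMAGE_TAG"}]
EOF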

When CodeBuild is finished, a “manual approve” action is used to request human approval before continuing with the deploy. In the example here, I have it turned on for staging environments, but it’s usually only used in production. In production, we normally have 3 stages in the release cycle — the first canary deployment, followed by 25%, then the remaining 75%.

The rest is relatively straightforward — just CodePipeline telling ECS to deploy images. If errors are detected along the way, a “rollback” command is used to manually roll back changes.

When the deploy is finished, a Lambda function is used to send a message to the deploy channel.