Categories
English Tools Useful Utilities

“Logging in” to AWS ECS Fargate

I’m a big fan of AWS ECS Fargate. I’ve written in the past about managing ECS clusters, and with Fargate — all of that work disappears and is managed by AWS instead. I like to refer to this as quasi-serverless. Sorta-serverless? Almost-serverless? I’m open to better suggestions. 😂

There are a few limitations of running in Fargate, and this blog post will focus on working around one limitation: there’s easy way to get an interactive command line shell within a running Fargate container.

The way I’m going to establish an interactive session inside Fargate is similar to how CircleCI or Heroku does this: start a SSH server in the container. This requires two components: the SSH server itself, which will be running in Fargate, and a tool to automate launching the SSH server. Most of this blog post will be about the tool to automate launching the server, called ecs-fargate-login.

If you want to skip to the code, I’ve made it available on GitHub using the MIT license, so feel free to use it as you wish.

How it works

This is what ecs-fargate-login does for you, in order:

  1. Generate a temporary SSH key pair.
  2. Use the ECS API to start a one-time task, setting the public key as an environment variable.
    • When the SSH server boots, it reads this environment variable and adds it to the list of authorized keys.
  3. Poll the ECS API for the IP address of the running task. ecs-fargate-login supports both public and private IPs.
  4. Start the ssh command and connect to the server.

When the SSH session finishes, ecs-fargate-login will make sure the ECS task is stopping.

Use cases

Most of my clients use Rails, and Rails provides an interactive REPL (read-eval-print loop) within the Rails environment. This REPL is useful for running one-off commands like creating new users or fixing some data in the database, checking and/or clearing cache items, to mention a few common tasks. Rails developers are accustomed to using the REPL, so while not entirely necessary (in the past, I usually recommended fixing data using direct database access or with one-time scripts in the application repository), it is a nice-to-have feature.

In conclusion

I don’t use this tool daily, but probably a few times a week. A few clients of mine use it as well, and they’re generally happy with how it works. However, if you have any recommendations about how it could be improved, or how the way the tool itself is architected could be improved, I’m always open to discussion. This was my first serious attempt at writing Golang code, so there are probably quite a few beginner mistakes in the code, but it should work as expected.

Categories
日本語

Serverless Meetup Tokyo #13 に参加してみました

Serverless Meetup Tokyo 第13回 に参加してみました。

会場は Speee Lounge

ServerlessDays Tokyo 2019

というイベントの啓発(参加者、登壇者)ありました。私も参加しようと思っています。登壇は、、検討します(笑)

Azure Serverless 2019 Summer Edition

三宅 和之(株式会社ゼンアーキテクツ

自分は普段AWSの世界に浸かられてるのでAzureは新鮮でした。Azure Functions ではが C# と Node.js が主流だが最近 Java が最近サポートし始めてる。Runtime は全てOSSらしい。

Azure Functions v2 はオススメ!v1と違って、v2はgRPCを利用して基盤となる.NET Coreとワーカーを分離することによって軽量化できたらしい。なるほど。

Premium Planを使えば “Pre-warmed instances” という機能は使える – これはお金払ってもAWS Lambdaも提供してない。(起動時間はいつも改善しようとしてるらしいけど・・・

TypeScript正式利用!これはすごいね。デフォルトで?どこでコンパイルされるんだろう?

Durable Functions: ステートフルファンクション。 AWSならStep Functions?概要読んだところ、コード(C#, F#, JavaScript)で定義できるのは面白いですね。

KEDA “Kubernetes-based Event Driven Autoscaling”.

Azure Cosmos DB – DynamoDBみたいなやつ。もっと機能があるように見える。Change Feed DynamoDB Streams。Change Feed から Azure Function を起動することができる。SignalR というものと組み合わせれば、WebSocketsにパブリッシュできそう。SignalRはDynamoDB Streams + Lambda + API Gateway WebSocket よりかなり簡易的に実装できそう!

SQL DB Serverless。AWS Aurora Serverlessと同じような挙動してるっぽい。いつか完全にリクエスト課金のSQLデータベースできるといいね。。

Azure 世界は OSS プロダクトが多いのでコントリビュートできる。

でも正直なところ、あまりAWSから移行する!という感じはしなかった。個人的にはTypeScript興味ありますが、C#など全然興味ありません(C#を勉強する前に、Rustをちょっと深掘りしたい・・)。ただ、冒頭の通りあまりAzureのこと触れないので、今回はとてもいい機会でした。

営業職から見たサーバーレス

「既存の開発メニューにはまらない」

「どうやってサーバレスをクライアントに売る?」

私は個人的には、「サーバレス」を直接うるんじゃなくて、サーバレスで運用が楽になった分、たとえ運用費を同じ額をもらってると特になると思う。(エンドユーザーからみたら、全く同じ方式)ただ、サーバレスで以前できなかったことを実現できれば付加価値として請求できると思う。例えば、サーバレスは急なバーストなどに耐えられる設計しやすいので、「安易に急なバーストを耐えられるアプリを作る」という提案はできます。それ以外は、あくまで最適化の手段の一つだと思う。

LambdaとDynamoDBでつくるIoTバックエンド

岡本 忠浩(株式会社MMM

AWS SAM を使ってる。SAMで管理しきれないのは別レポジトリーのCloudFormationが。

100個以上のLambda関数。多い!6ヶ月をかけてエンジニア2人で作ってるみたい。なんて複雑なプロジェクト。。

DynamoDBのテーブル設計

始まりがだいたい良さそう。RDBのER図を書いて、アクセスパターンを列挙して、DynamoDB を設計する。「Serverlessを極めるためにDynamoDBデータモデリングを極めよう」という資料を参考になります。

まあ、ベストプラクティスを従ってると問題ない、という感じだね。

変わってるところがあれば、 Go で定義を落とし込んでる。

DynamoDBトランザクションが10個に制限されてる。「簡易に超過する」いいえ、これはDynamoDBをRBDMSっぽく使うとそうなるけど、本当にNoSQLファーストな設計じゃないとだめ。(先ほどのServerlessを極めるために・・)

やっぱり、DynamoDBと限らないNoSQLは、「大量なデータに適した技術」なので、まだRBDMSを使った方が適切な場合は多いと思います。特に私がいるスタートアップ界隈は条件などが急に変わったりすると、アクセスパターンがかなり変わる。。アクセスパターンを更新するたびにデータベース設計も変えないといけなかったら、かなり厳しい感じします。DynamoDBを意識しない仕組みを作るのがベストというが、これは絶対にしちゃいけないこと。そうすると慣れてるRBDMSっぽいことをしようとする。

AWS Step Functions を使ってるみたい。私も使いたいと思って、以前自分で Lambda -> SQS -> Lambda -> SQS で簡易的に対応したけど、もっと複雑な課題があったら Step Functions の方が適切っぽい。

FaaS上のコードをもっとシンプルに書くためのトランスパイラ

木村 功作(富士通研究所

https://github.com/fujitsulaboratories/escapin

正直、これって async await とどう違うのか?という疑問。技術自体はかなり凄いと思う!いつかトランスパイラー作ると面白いかもしれない。

まとめ

楽しかった!普段触れない技術や話に色々触れたので、非常に勉強になりました。

Categories
AWS English

Hosting a Single Page Application with an API with CloudFront and S3

I’ve written about how to host a single page application (SPA) on AWS using CloudFront and S3 before, using the CloudFront “rewrite not found errors as a 200 response with index.html” trick.

Recently, working on a few serverless apps, I’ve realized that this trick, while quick, isn’t perfect. The specific case where it broke down was when the API is configured as a behavior on CloudFront (I usually scope the API to /api on the same domain as the frontend, so CORS and OPTIONS requests aren’t necessary). If the API returned a 404 Not Found response, CloudFront would rewrite it to 200 OK index.html, and the front-end application would get confused. Unfortunately, CloudFront doesn’t support customized error responses per behavior, so the only way to fix this was to use Lambda@Edge instead.

Here’s the code for the Lambda function:

'use strict'

const path = require('path')

exports.handler = (evt, context, cb) => {
  const { request } = evt.Records[0].cf

  const uriParts = request.uri.split("/")

  if (
    // Root resource with a file extension.
    (
      uriParts.length === 2 && path.extname(uriParts[1]) !== ""
    ) ||
    // Anything inside the "static" directory.
    uriParts[1] === "static"
  ) {
    // serve the original request to S3
  } else {
    // change the request to index.html
    request.uri = '/index.html'
  }

  cb(null, request)
}

This code assumes all requests to a root request with a file extension, or anything in the /static/ directory is a static file that should be served from S3. All other requests will be rewritten to index.html. These are the defaults for create-react-app, but you’ll probably need to change them to meet your requirements. (Remember, Lambda@Edge functions need to be created in us-east-1)

Attach this Lambda function to the CloudFront behavior responsible for serving from the S3 origin as origin-request, and you should be good to go. Don’t forget to remove the 404-to-200 rewrite.

Categories
AWS English WordPress

Serverless WordPress on AWS Lambda

Update 2020/07/29: AWS recently announced EFS support for Lambda, which makes running WordPress in Lambda easier, with fewer limitations. Here’s the new article about how to run WordPress in Lambda using EFS.

There are a few ways to run WordPress “serverless” on AWS. I’m going to talk about running WordPress on Lambda for this article. If you’re interested in how you can run WordPress serverless-ly on Fargate, I’m working on a post about that too.

Keep in mind that while it is possible to do this, it’s not for everyone. It’s probably not for me. Probably not for you. Use at your own risk!

Before we start, there is a core feature of Lambda that make running WordPress in Lambda quite troublesome: Read-only file system. WordPress expects a writable, persistent, local file system. We’ll be using the S3 Uploads plugin by Human Made to handle media uploads. However, core and plugin updates will not work. There’s no workaround for this, so to install / update files, we’ll need to make a new Lambda deployment.

So: let’s go! First, you’ll want to clone my boilerplate repository. I’ve prepared a WordPress installation and a simple glue script to actually boot WordPress.

$ git clone https://github.com/keichan34/wordpress-on-lambda

My plan of attack is: run WordPress in the Lambda function using a PHP custom runtime, make uploads work with S3 instead of the local filesystem, and wire up the database. In the repository above, I’ve configured static assets to be served from S3 as well.

Now, let’s prepare the database. Lambda has two networking modes: public and VPC mode. In public mode, the Lambda has default access to the public internet, but nothing else. In VPC mode, the Lambda is booted inside the VPC, and doesn’t have public internet access by default. Because WordPress requires public internet access we have to either run it in public mode, or run it in VPC mode and prepare a NAT gateway (about $30 to $50 a month, depending on the region). If Lambda runs in public mode, the database must also be publicly accessible — something that is frowned upon from a security standpoint. You should choose the option that fits your risk and price profile. In my case, I’m going with the NAT gateway route.

Now we’ve got the messy stuff out of the way, we’ll have to assemble the Lambda runtime. AWS has an article on their blog detailing how to make a PHP custom runtime, but Stackery provides a batteries-included PHP layer. It includes everything you need to make a PHP application that assumes it’s running in a traditional server environment run in AWS Lambda.

# Replace "km-wordpress-on-lambda-deployment-201906" with something that makes sense for you. It's globally unique, so copying and pasting this will result in an error.
# Make sure you're in the same region as your database!

$ DEPLOY_BUCKET="km-wordpress-on-lambda-deployment-201906"
$ aws s3 mb "s3://$DEPLOY_BUCKET"
$ cd <the directory you cloned the GitHub repository to>

Now, it’s time to install WordPress! We’ll add the WordPress files to the deployment package. As usual, copy wp-config-example.php to wp-config.php. Enter your database details. If you have a hostname that you’re going to use with CloudFront, enter it now. If not, you’ll have to wait until after the CloudFront distribution is created, then try again.

Now, let’s deploy. This will create a new CloudFront distribution and S3 bucket for public assets, so maybe it’s a good time to make a cup of coffee. If you haven’t installed the SAM CLI, do that before the next block.

$ sam package --template-file template.yaml --output-template-file serverless-output.yaml --s3-bucket "$DEPLOY_BUCKET"
$ sam deploy --template-file serverless-output.yaml --stack-name wordpress-on-lambda --capabilities CAPABILITY_IAM
$ aws s3 sync ./src/php s3://deploy-bucket-XXXXX --exclude "*.php" --exclude "*.ini"

I’ll be using the default CloudFront domain for this demo. If you’re going to be using your own domain, you need to modify the template.yaml file to add the an alias to the CloudFront distribution. Use the following command to show the CloudFront domain name.

$ aws cloudformation describe-stacks --stack-name wordpress-on-lambda | jq '.Stacks[0].Outputs'

OK! Now, you should be able to access the CloudFront URL, and you’ll get redirected to the friendly WordPress installer! If you’ve set up your wp-config.php correctly, the installation should go smoothly.

The site I set up for this post is available here: https://dskhgdbzphjkm.cloudfront.net/

Lessons Learned

This is for almost no-one. I think the only valid use case (in this current form) for running WordPress in AWS Lambda is a site that gets periodic, unpredictable spikes of intense traffic — a use case where Lambda’s scalability and price model pays off. This is also a use case where, presumably, the benefits of the scalability trumps the inconvenience of not being able to use the online updaters and installers (also, I’m assuming the database will be able to keep up with the load).

However, if updating and installing themes or plugins could be managed outside of the Lambda environment (say, with wp-cli), with deployments automated… Then, it may be a little more applicable to a larger audience.

If you’re looking for a cheap solution to host your personal blog (like me!), you might just want to bite the bullet and check out any of the hosted WordPress solutions out there.

If you liked this post, or you’d like to provide some input, please do so in the comments. My favorite AWS service is Lambda, and I like pushing it a bit, so look forward to similar posts in the future. If you find bugs in the boilerplate, or you can make improvements, please open an issue or PR!

Miscellaneous Tidbits

  • Aurora Serverless sounds like it would be the best match for this setup. It probably is. Just keep in mind that Aurora Serverless doesn’t support publicly accessible clusters. To use it, you’ll need to go the Lambda-in-VPC, NAT gateway route.
  • Regarding public / private access and NAT gateways, if you’re like me and believe in the future of IPv6 and think that you can just use an egress-only internet gateway – you’re wrong! Lambda doesn’t seem to support IPv6 at this time.
  • You can actually use a NAT instance if the NAT gateway is overkill. However, I would recommend using the NAT gateway if you can. It comes with automatic scalability and redundancy, so you don’t have to babysit your NAT instance. (If you need more than one NAT instance, use the gateway. Seriously.)
  • At time of writing, my patches to php-lambda-layer haven’t been merged yet, so you can use my patched version (the boilerplate repository has this applied already).
  • If you’re really going all-in, consider using an Application Load Balancer rather than API Gateway to save money. API Gateway has zero fixed costs, but there is a point where ALB will become cheaper than API Gateway.
  • Doing some crude calculations, you should be able to handle an average of a few hundred users per day under the perpetual free tier. Your highest bill may be data transfer to the user.
Categories
AWS English

Managing ECS clusters, 4 years in.

Throughout these past 4 years since AWS ECS became generally available, I’ve had the opportunity to manage 4 major ECS cluster deployments.

Across these deployments, I’ve built up knowledge and tools to help manage them, make them safer, more reliable, and cheaper to run. This article has a bunch of tips and tricks I’ve learned along the way.

Note that most of these tips are rendered useless if you use Fargate! I usually use Fargate these days, but there are still valid reasons for managing your own cluster.

Spot Instances

ECS clusters are great places to use spot instances, especially when managed by a Spot Fleet. As long as you handle the “spot instance is about to be terminated” event, and set the container instance to draining status, it works pretty well. When ECS is told to drain a container instance, it will stop the tasks cleanly on the instance and run them somewhere else. I’ve made the source code for this Lambda function available on GitHub.

Just make sure your app is able to stop itself and boot another instance in 2 minutes (the warning time you have before the spot instance is terminated). I’ve experienced overall savings of around 60% when using a cluster exclusively comprised of spot instances (EBS is not discounted).

Autoscaling Group Lifecycle Hooks

If you need to use on-demand instances for your ECS cluster, or you’re using a mixed spot/on-demand cluster, I recommend using an Autoscaling Group to manage your cluster instances.

To prevent the ASG from stopping instances with tasks currently running, you have to write your own integration. AWS provides some sample code, which I’ve modified and published on GitHub.

The basic gist of this integration is:

  1. When an instance is scheduled for termination, the Autoscaling Group sends a message to an SNS topic.
  2. Lambda is subscribed to this topic, and receives the message.
  3. Lambda tells the ECS API to drain the instance that is scheduled to be terminated.
  4. If the instance has zero running tasks, Lambda tells the Autoscaling Group to continue with termination. The Autoscaling Group terminates the instance at this point.
  5. If the instance has more than zero running tasks, Lambda waits for some time and sends the same message to the topic, returning to step (2).

By default, I set the timeout for this operation to 15 minutes. This value depends on the specific application. If your applications require more than 15 minutes to cleanly shut down and relocate to another container instance, you’ll have to set this value accordingly. (Also, you’ll have to change the default ECS StopTask SIGTERM timeout — look for the “ECS_CONTAINER_STOP_TIMEOUT” environment variable)

Cluster Instance Scaling

Cluster instance scale-out is pretty easy. Set some CloudWatch alarms on the ECS CPUReservation and MemoryReservation metrics, and you can scale out according to those. Scaling in is a little more tricky.

I originally used those same metrics to scale in. Now, I use a Lambda script that runs every 30 minutes, cleaning up unused resources until a certain threshold of available CPU and memory is reached. This technique further reduces service disruption. I’ll post this on GitHub sometime in the near future.

Application Deployment

I’ve gone through a few application deployment strategies.

  1. Hosted CI + Deploy Shell Script
    • Pros: simple.
    • Cons: you need somewhere to run it, easily becomes a mess. Shell scripts are a pain to debug and test.
  2. Hosted CI + Deploy Python Script (I might put this on GitHub sometime)
    • Pros: powerful, easier to test than using a bunch of shell scripts.
    • Cons: be careful about extending the script. It can quickly become spaghetti code.
  3. Jenkins
    • Pros: powerful.
    • Cons: Jenkins.
  4. CodeBuild + CodePipeline
    • Pros: simple; ECS deployment was recently added; can be managed with Terraform.
    • Cons: Subject to limitations of CodePipeline (pretty limited). In our use case, the sticking points are not being able to deploy an arbitrary Git branch (you have to deploy the branch specified in the CodePipeline definition).

Grab-bag

Other tips and tricks

  • Docker stdout logging is not cheap (also, performance is highly variable across log drivers — I recently had a major problem with the fluentd driver blocking all writes). If your application blocks on logging (looking at you, Ruby), performance will suffer.
  • Having a few large instances yields more performance than many small instances (with the added benefit of having the layer cache when performing deploys).
  • The default placing strategy should be: binpack on the resource that is most important to your application (CPU or memory), AZ-balanced
  • Applications that can’t be safely shut down in less than 1 minute do not work well with Spot instances. Use a placement constraint to make sure these tasks don’t get scheduled on a Spot instance (you’ll have to set the attribute yourself, probably using the EC2 user data)
  • Spot Fleet + ECS = ❤️
  • aws update-service help for service administration commands. I use --force-new-deployment and --desired-count quite often.
  • If you manage your own EC2 instances with Auto Scaling Groups: aws autoscaling terminate-instance-in-auto-scaling-group --instance-id "i-XXX" --no-should-decrement-desired-capacity will start a new EC2 instance and perform termination lifecycle hooks on it. This is what I use to switch out old EC2 instances with new launch configurations.
Categories
English

“Truth in bots”

The bots should announce, “I’m not a person, or if I am, I’m not allowed to act like one.”

Or, if there’s no room or time for that sentence, perhaps a simple bot at the top of the conversation. That way, we can save our human emotions for the humans who will appreciate them.

Truth in bots | Seth’s Blog

“If you can’t tell the difference, does it matter?”

Quacking like ducks, et cetera.

The point of the post is a bit different (it’s predicated on there being able to tell the difference — “… only a minute or two into the interaction that you realize you’re being fooled by an AI, not a caring human”), but what happens when you can’t tell the difference? Should AIs always announce themselves as AIs if they are indistinguishable from a human? Why?

Categories
AWS English

AWS Application Auto-scaling for ECS with Terraform

Update: Target tracking scaling is now available for ECS services.

I’ve been working on setting up autoscaling settings for ECS services recently, and here are a couple notes from managing auto-scaling for ECS services using Terraform.

Creating multiple scheduled actions at once


Terraform will perform the following actions:

  + aws_appautoscaling_scheduled_action.green_evening
      id:                                    
      arn:                                   
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "2"
      schedule:                              "cron(0 15 * * ? *)"
      service_namespace:                     "ecs"

  + aws_appautoscaling_scheduled_action.wapi_green_morning
      id:                                    
      arn:                                   
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "3"
      schedule:                              "cron(0 23 * * ? *)"
      service_namespace:                     "ecs"

This fails with:


* aws_appautoscaling_scheduled_action.green_evening: ConcurrentUpdateException: You already have a pending update to an Auto Scaling resource.

To fix, the scheduled actions need to be executed serially.


resource "aws_appautoscaling_scheduled_action" "green_morning" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 23 * * ? *)"

  scalable_target_action {
    min_capacity = 3
    max_capacity = 20
  }
}

resource "aws_appautoscaling_scheduled_action" "green_evening" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 15 * * ? *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 20
  }

  # Application AutoScaling actions need to be executed serially
  depends_on = ["aws_appautoscaling_scheduled_action.green_morning"]
}
Categories
日本語

CodePipeline と Slack による ECS ChatOps 運用

この記事は ECS ChatOps with CodePipeline and Slack の日本語訳です

現在、Rails アプリケーションを ECS に移行する作業を進めています。現在のシステムでは Capistrano でデプロイを行っていますが、そろそろ限界が見えてきました。

EKS が使えるようになるのを待っている間、自分で Kubernetes のクラスターを管理するより ECS を採用することに決めました。当初、Lambda 関数を使って必要なタスク定義を作り、ECS サービスを更新する予定でしたが、プロジェクトの設計を開始する直前に CodePipeline の ECS デプロイの対応が発表されました 🎉。

現在のリソースは、CodePipeline と Slack を連携するためのいくつかの Lambda 関数、サービスごとに2つのCodePipeline パイプライン(本番用と検証用)、関連 ECS リソースです。

まず、デプロイするには、Slack チャンネルに deploy [environment] [service] と発言すると起動します。Slack は(API Gateway 経由で)Lambda にイベントを送信し、Lambda は CodePipeline パイプラインの実行を開始します(CodePipeline API の仕組み上、複数の同時実行は扱いにくい)。この Lambda 関数は、DynamoDB のいくつかの基本的な状態(Slack チャンネル、ユーザー、タイムスタンプ)を記録します。この情報は、返信を送信するチャンネルと、デプロイプロセスの何かがステータスが変わったり失敗した場合に通知するために使われます。

CodePipeline がまず、Docker イメージを作成する CodeBuild を起動し、タスク定義を更新するための新しい Docker イメージのタグが入った単純な JSON ファイルを生成します。

CodeBuild が終了すると、デプロイを続行する前に人の承認を要求するために「手動承認」アクションが使用されます。ここの例では検証環境用に有効にしていますが、通常は本番環境でのみ使用されます。本番環境では、まず1台のカナリアデプロイ、次に25%、残りの75%という三段階で運用しています。

残りの部分は比較的単純です。CodePipeline がイメージをデプロイするように ECS に指示します。途中でエラーが検出された場合、rollback コマンドを使用して手動で変更をロールバックします。

デプロイが完了すると、Lambda 関数がデプロイ用チャンネルにメッセージを送信します。

Categories
AWS English

ECS ChatOps with CodePipeline and Slack

I’m currently working on migrating a Rails application to ECS at work. The current system uses a heavily customized Capistrano setup that’s showing its signs, especially when deploying to more than 10 instances at once.

While patiently waiting for EKS, I decided to use ECS over manage my own Kubernetes cluster on AWS using something like kops. I was initially planning on using Lambda to create the required task definitions and update ECS services, but native CodePipeline deploy support for ECS was announced right before I started planning the project, which greatly simplified the deploy step.

The current setup we have now is: a few Lambda functions to link CodePipeline and Slack together, two CodePipeline pipelines per service (one for production and one for staging), and the associated ECS resources.

First, a deploy is triggered by saying “deploy [environment] [service]” in the deploy channel. Slack sends an event to Lambda (via API Gateway), and Lambda starts an execution of the CodePipeline pipeline if it is not already in progress (because of the way CodePipeline API operations work, it’s hard to work with multiple concurrent runs). This Lambda function also records some basic state in DynamoDB — namely, the Slack channel, user, and timestamp. This information is used to determine what channel to send replies to, and what user to mention if something in the deploy process goes awry.

CodePipeline then starts CodeBuild, which is configured to create Docker image(s) and a simple JSON file that is used to tell CodePipeline’s ECS integration the image tags the task definition should be updated with.

When CodeBuild is finished, a “manual approve” action is used to request human approval before continuing with the deploy. In the example here, I have it turned on for staging environments, but it’s usually only used in production. In production, we normally have 3 stages in the release cycle — the first canary deployment, followed by 25%, then the remaining 75%.

The rest is relatively straightforward — just CodePipeline telling ECS to deploy images. If errors are detected along the way, a “rollback” command is used to manually roll back changes.

When the deploy is finished, a Lambda function is used to send a message to the deploy channel.

Categories
English

2017: A year in review

A few years ago when I was doing client work, we would regularly host clients’ sites and apps for them. During this time, I was responsible for both development and keeping them up and running as much as possible. Most of the money being in new development, it was difficult to assign priority to improving the operations of existing applications. In this period, I wanted an “operations person” to teach me how to make new applications that would need minimal operations support from the beginning. Failing this, I decided to become “the operations person” myself.

Following that decision, I found myself working at BizReach on the infrastructure team for HRMOS, a Software-as-a-Service product focused on applicant tracking for medium to large enterprises, in the end of 2016. Following that job, I then went to a small startup, dely, as a Site Reliability Engineer for their flagship product Kurashiru, a recipe video app for iOS and Android.

This is the first full year I’ve been working full-time as a dedicated infrastructure / operations / SRE / DevOps engineer, and I feel like I’ve grown a lot. On the technical side, I was able to lead the migration of complex legacy monolith systems to scalable and resilient independent systems. On the not-so-technical side, I’ve experienced different types of company cultures, managerial styles, and I’ve gotten accustomed to working with teams of engineers — the experience I’ve had up until this year was mostly working in extremely small teams.

While I do have a passion for making, maintaining, and improving services, I am also very interested in company culture — what makes it and what breaks it — especially when it comes to remote work. I believe most technical engineering work can be done as efficiently (if not more) remote, but there are definite challenges that need to be addressed before I can start leading a change in any position I’m in.

Here’s to 2018! 🎉