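I attended Serverless Meetup Tokyo #13

I attended the 13th Serverless Meetup Tokyo.

The venue was Speee Lounge.

ServerlessDays Tokyo 2019

There was an announcement for this event, calling for both attendees and speakers. I’m planning to attend. As for speaking... I’ll think about it (lol).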

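Azure Serverless 2019 Summer Edition

Speaker: 三宅 和之 (株式会社ゼンアーキテクツ)

I’m usually immersed in the AWS world, so Azure felt fresh. On Azure Functions, C# and Node.js are the mainstream languages, but Java support has started recently. Apparently the runtimes are all open source.

Azure Functions v2 is recommended! Unlike v1, v2 apparently became more lightweight by using gRPC to separate the underlying .NET Core host from the language workers. I see.

With the Premium plan you can use a feature called “Pre-warmed instances”. AWS Lambda doesn’t offer this, even if you pay for it. (Apparently they’re always trying to improve startup times, though...)

Official TypeScript support! That’s great. Is it on by default? I wonder where the code gets compiled.

Durable Functions: stateful functions. Would the AWS equivalent be Step Functions? From reading the overview, it’s interesting that workflows can be defined in code (C#, F#, JavaScript).

KEDA: “Kubernetes-based Event Driven Autoscaling”.

Azure Cosmos DB: something like DynamoDB, though it looks like it has more features. Change Feed is roughly DynamoDB Streams. You can trigger an Azure Function from the Change Feed. Combined with something called SignalR, it looks like you could publish to WebSockets. SignalR looks considerably simpler to implement than DynamoDB Streams + Lambda + API Gateway WebSocket!

SQL DB Serverless. It seems to behave much like AWS Aurora Serverless. It would be nice to have a SQL database billed purely per request someday.

The Azure world has a lot of OSS products, so you can contribute to them.

Honestly, though, it didn’t make me want to migrate away from AWS. Personally I’m interested in TypeScript, but I have no interest at all in C# (before learning C#, I’d like to dig a little deeper into Rust...). Still, as I said at the start, I rarely touch Azure, so this was a great opportunity.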

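Serverless from a salesperson’s perspective

“It doesn’t fit into our existing development menu.”

“How do you sell serverless to clients?”

Personally, I think that rather than selling “serverless” directly, you come out ahead by charging the same operations fee as before, because serverless makes operations easier. (From the end user’s point of view, the arrangement is exactly the same.) However, if serverless lets you do something that wasn’t possible before, I think you can charge for that as added value. For example, serverless makes it easy to design for sudden bursts of traffic, so you can pitch “an app that easily withstands sudden bursts”. Beyond that, I think it is just one means of optimization.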

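An IoT backend built with Lambda and DynamoDB

Speaker: 岡本 忠浩 (株式会社MMM)

They use AWS SAM. Whatever SAM can’t manage is handled by CloudFormation in a separate repository.

More than 100 Lambda functions. That’s a lot! Apparently two engineers built it over six months. What a complex project...

DynamoDB table design

The starting point looks about right: draw an RDB-style ER diagram, enumerate the access patterns, then design the DynamoDB tables. The slide deck 「Serverlessを極めるためにDynamoDBデータモデリングを極めよう」 (“Master DynamoDB data modeling to master serverless”) is a useful reference.

Well, it seems that if you follow best practices you’ll be fine.

If there is anything unusual, it’s that they encode the table definitions in Go.

DynamoDB transactions are limited to 10 items. “It’s easy to exceed that”, they said. No: that happens when you use DynamoDB like an RDBMS; the design really has to be NoSQL-first. (See the deck mentioned above.)

After all, NoSQL in general, not just DynamoDB, is “a technology suited to large volumes of data”, so I think there are still many cases where an RDBMS is the better fit. In the startup world where I work in particular, requirements change suddenly and the access patterns change considerably with them. Having to change the database design every time the access patterns change seems pretty tough. They said the best approach is to build a layer that lets you forget you’re using DynamoDB, but that is something you must never do. If you do, people will try to do the RDBMS-style things they’re used to.

Apparently they use AWS Step Functions. I’ve wanted to use it too; previously I made do with a simple Lambda -> SQS -> Lambda -> SQS chain of my own, but for more complex problems Step Functions seems more appropriate.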

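A transpiler for writing simpler code on FaaS

Speaker: 木村 功作 (富士通研究所)

https://github.com/fujitsulaboratories/escapin

Honestly, my question was: how is this different from async/await? The technology itself is very impressive, though! It might be fun to write a transpiler someday.

Summary

It was fun! I was exposed to a lot of technologies and topics I don’t normally touch, so I learned a great deal.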

Hosting a Single Page Application with an API with CloudFront and S3

I’ve written about how to host a single page application (SPA) on AWS using CloudFront and S3 before, using the CloudFront “rewrite not found errors as a 200 response with index.html” trick.

Recently, working on a few serverless apps, I’ve realized that this trick, while quick, isn’t perfect. The specific case where it broke down was when the API is configured as a behavior on CloudFront (I usually scope the API to /api on the same domain as the frontend, so CORS and OPTIONS requests aren’t necessary). If the API returned a 404 Not Found response, CloudFront would rewrite it to 200 OK index.html, and the front-end application would get confused. Unfortunately, CloudFront doesn’t support customized error responses per behavior, so the only way to fix this was to use Lambda@Edge instead.

Here’s the code for the Lambda function:

'use strict'

const path = require('path')

exports.handler = (evt, context, cb) => {
  const { request } = evt.Records[0].cf

  const uriParts = request.uri.split("/")

  if (
    // Root resource with a file extension.
    (
      uriParts.length === 2 && path.extname(uriParts[1]) !== ""
    ) ||
    // Anything inside the "static" directory.
    uriParts[1] === "static"
  ) {
    // serve the original request to S3
  } else {
    // change the request to index.html
    request.uri = '/index.html'
  }

  cb(null, request)
}

This code assumes that any request for a root-level resource with a file extension, or for anything in the /static/ directory, is for a static file that should be served from S3. All other requests are rewritten to index.html. These are the defaults for create-react-app, but you’ll probably need to change them to meet your requirements. (Remember, Lambda@Edge functions need to be created in us-east-1.)

Attach this Lambda function to the CloudFront behavior responsible for serving from the S3 origin as origin-request, and you should be good to go. Don’t forget to remove the 404-to-200 rewrite.
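To make the routing rule easier to check before deploying, here is a small sketch that pulls the same condition out of the handler into a pure function and runs it against a few sample paths. The name rewriteUri and the sample paths are made up for illustration; the condition itself is the one from the handler above.

```javascript
const path = require('path')

// Same condition as the Lambda@Edge handler, as a pure function.
// (rewriteUri is a hypothetical helper name used only in this sketch.)
function rewriteUri(uri) {
  const uriParts = uri.split('/')
  // Root resource with a file extension, e.g. /favicon.ico
  const isRootFile = uriParts.length === 2 && path.extname(uriParts[1]) !== ''
  // Anything inside the "static" directory.
  const isStatic = uriParts[1] === 'static'
  return isRootFile || isStatic ? uri : '/index.html'
}

for (const uri of ['/favicon.ico', '/static/js/main.js', '/users/42', '/']) {
  console.log(uri, '->', rewriteUri(uri))
}
```

Client-side routes such as /users/42 and the bare / fall through to /index.html, while root-level files and everything under /static/ are passed to S3 unchanged.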

Serverless WordPress on AWS Lambda

There are a few ways to run WordPress “serverless” on AWS. I’m going to talk about running WordPress on Lambda for this article. If you’re interested in how you can run WordPress serverless-ly on Fargate, I’m working on a post about that too.

Keep in mind that while it is possible to do this, it’s not for everyone. It’s probably not for me. Probably not for you. Use at your own risk!

Before we start, there is a core feature of Lambda that makes running WordPress in Lambda quite troublesome: a read-only file system. The deployed code is read-only, and the only writable location, /tmp, is not persistent. WordPress expects a writable, persistent, local file system. We’ll be using the S3 Uploads plugin by Human Made to handle media uploads. However, core and plugin updates will not work. There’s no workaround for this, so to install or update files, we’ll need to make a new Lambda deployment.

So: let’s go! First, you’ll want to clone my boilerplate repository. I’ve prepared a WordPress installation and a simple glue script to actually boot WordPress.

$ git clone https://github.com/keichan34/wordpress-on-lambda

My plan of attack is: run WordPress in the Lambda function using a PHP custom runtime, make uploads work with S3 instead of the local filesystem, and wire up the database. In the repository above, I’ve configured static assets to be served from S3 as well.

Now, let’s prepare the database. Lambda has two networking modes: public and VPC mode. In public mode, the Lambda has default access to the public internet, but nothing else. In VPC mode, the Lambda is booted inside the VPC, and doesn’t have public internet access by default. Because WordPress requires public internet access, we have to either run it in public mode, or run it in VPC mode and prepare a NAT gateway (about $30 to $50 a month, depending on the region). If Lambda runs in public mode, the database must also be publicly accessible, which is frowned upon from a security standpoint. You should choose the option that fits your risk and price profile. In my case, I’m going with the NAT gateway route.

Now that we’ve got the messy stuff out of the way, we’ll have to assemble the Lambda runtime. AWS has an article on their blog detailing how to make a PHP custom runtime, but Stackery provides a batteries-included PHP layer. It includes everything you need to run a PHP application that assumes a traditional server environment in AWS Lambda.

# Replace "km-wordpress-on-lambda-deployment-201906" with something that makes sense for you. It's globally unique, so copying and pasting this will result in an error.
# Make sure you're in the same region as your database!

$ DEPLOY_BUCKET="km-wordpress-on-lambda-deployment-201906"
$ aws s3 mb "s3://$DEPLOY_BUCKET"
$ cd <the directory you cloned the GitHub repository to>

Now, it’s time to install WordPress! We’ll add the WordPress files to the deployment package. As usual, copy wp-config-example.php to wp-config.php. Enter your database details. If you have a hostname that you’re going to use with CloudFront, enter it now. If not, you’ll have to wait until after the CloudFront distribution is created, then try again.

Now, let’s deploy. This will create a new CloudFront distribution and S3 bucket for public assets, so maybe it’s a good time to make a cup of coffee. If you haven’t installed the SAM CLI, do that before the next block.

$ sam package --template-file template.yaml --output-template-file serverless-output.yaml --s3-bucket "$DEPLOY_BUCKET"
$ sam deploy --template-file serverless-output.yaml --stack-name wordpress-on-lambda --capabilities CAPABILITY_IAM
$ aws s3 sync ./src/php s3://deploy-bucket-XXXXX --exclude "*.php" --exclude "*.ini"

I’ll be using the default CloudFront domain for this demo. If you’re going to be using your own domain, you need to modify the template.yaml file to add an alias to the CloudFront distribution. Use the following command to show the CloudFront domain name.

$ aws cloudformation describe-stacks --stack-name wordpress-on-lambda | jq '.Stacks[0].Outputs'

OK! Now, you should be able to access the CloudFront URL, and you’ll get redirected to the friendly WordPress installer! If you’ve set up your wp-config.php correctly, the installation should go smoothly.

The site I set up for this post is available here: https://dskhgdbzphjkm.cloudfront.net/

Lessons Learned

This is for almost no one. I think the only valid use case (in its current form) for running WordPress in AWS Lambda is a site that gets periodic, unpredictable spikes of intense traffic: a use case where Lambda’s scalability and pricing model pay off. This is also a use case where, presumably, the benefits of the scalability outweigh the inconvenience of not being able to use the online updaters and installers (I’m also assuming the database will be able to keep up with the load).

However, if updating and installing themes or plugins could be managed outside of the Lambda environment (say, with wp-cli), with deployments automated… Then, it may be a little more applicable to a larger audience.

If you’re looking for a cheap solution to host your personal blog (like me!), you might just want to bite the bullet and check out any of the hosted WordPress solutions out there.

If you liked this post, or you’d like to provide some input, please do so in the comments. My favorite AWS service is Lambda, and I like pushing it a bit, so look forward to similar posts in the future. If you find bugs in the boilerplate, or you can make improvements, please open an issue or PR!

Miscellaneous Tidbits

  • Aurora Serverless sounds like it would be the best match for this setup. It probably is. Just keep in mind that Aurora Serverless doesn’t support publicly accessible clusters. To use it, you’ll need to go the Lambda-in-VPC, NAT gateway route.
  • Regarding public / private access and NAT gateways, if you’re like me and believe in the future of IPv6 and think that you can just use an egress-only internet gateway – you’re wrong! Lambda doesn’t seem to support IPv6 at this time.
  • You can actually use a NAT instance if the NAT gateway is overkill. However, I would recommend using the NAT gateway if you can. It comes with automatic scalability and redundancy, so you don’t have to babysit your NAT instance. (If you need more than one NAT instance, use the gateway. Seriously.)
  • At the time of writing, my patches to php-lambda-layer haven’t been merged yet, so you can use my patched version (the boilerplate repository has this applied already).
  • If you’re really going all-in, consider using an Application Load Balancer rather than API Gateway to save money. API Gateway has zero fixed costs, but there is a point where ALB will become cheaper than API Gateway.
  • Doing some crude calculations, you should be able to handle an average of a few hundred users per day under the perpetual free tier. Your highest bill may be data transfer to the user.

Managing ECS clusters, 4 years in.

Throughout these past 4 years since AWS ECS became generally available, I’ve had the opportunity to manage 4 major ECS cluster deployments.

Across these deployments, I’ve built up knowledge and tools to help manage them, make them safer, more reliable, and cheaper to run. This article has a bunch of tips and tricks I’ve learned along the way.

Note that most of these tips are rendered useless if you use Fargate! I usually use Fargate these days, but there are still valid reasons for managing your own cluster.

Spot Instances

ECS clusters are great places to use spot instances, especially when managed by a Spot Fleet. As long as you handle the “spot instance is about to be terminated” event, and set the container instance to draining status, it works pretty well. When ECS is told to drain a container instance, it will stop the tasks cleanly on the instance and run them somewhere else. I’ve made the source code for this Lambda function available on GitHub.

Just make sure your app is able to stop itself and boot another instance in 2 minutes (the warning time you have before the spot instance is terminated). I’ve experienced overall savings of around 60% when using a cluster exclusively comprised of spot instances (EBS is not discounted).
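As a rough illustration of the drain-on-interruption handling described above, here is a minimal sketch of the handler logic. It is not the code from my GitHub repository. The ecs argument is a hypothetical thin wrapper whose methods return promises resolving to the responses of the ECS ListContainerInstances and UpdateContainerInstancesState operations, and the cluster name is a placeholder; the event shape is the “EC2 Spot Instance Interruption Warning” event.

```javascript
// Drains the ECS container instance named in a Spot interruption warning.
// `ecs` is a hypothetical promise-returning wrapper around the ECS API;
// `cluster` is a placeholder for your cluster name.
async function drainOnSpotWarning(ecs, cluster, event) {
  if (event['detail-type'] !== 'EC2 Spot Instance Interruption Warning') {
    return []
  }
  const instanceId = event.detail['instance-id']

  // Find the container instance registered for this EC2 instance.
  const { containerInstanceArns } = await ecs.listContainerInstances({
    cluster,
    filter: `ec2InstanceId == ${instanceId}`,
  })
  if (containerInstanceArns.length === 0) return []

  // DRAINING makes ECS stop the tasks and reschedule them elsewhere.
  await ecs.updateContainerInstancesState({
    cluster,
    containerInstances: containerInstanceArns,
    status: 'DRAINING',
  })
  return containerInstanceArns
}
```

Injecting the client keeps the function easy to exercise with a fake; in a real Lambda you would pass in whatever AWS SDK client wrapper you use.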

Autoscaling Group Lifecycle Hooks

If you need to use on-demand instances for your ECS cluster, or you’re using a mixed spot/on-demand cluster, I recommend using an Autoscaling Group to manage your cluster instances.

To prevent the ASG from stopping instances with tasks currently running, you have to write your own integration. AWS provides some sample code, which I’ve modified and published on GitHub.

The basic gist of this integration is:

  1. When an instance is scheduled for termination, the Autoscaling Group sends a message to an SNS topic.
  2. Lambda is subscribed to this topic, and receives the message.
  3. Lambda tells the ECS API to drain the instance that is scheduled to be terminated.
  4. If the instance has zero running tasks, Lambda tells the Autoscaling Group to continue with termination. The Autoscaling Group terminates the instance at this point.
  5. If the instance has more than zero running tasks, Lambda waits for some time and sends the same message to the topic, returning to step (2).

By default, I set the timeout for this operation to 15 minutes. This value depends on the specific application. If your applications require more than 15 minutes to cleanly shut down and relocate to another container instance, you’ll have to set this value accordingly. (Also, you’ll have to change the default ECS StopTask SIGTERM timeout — look for the “ECS_CONTAINER_STOP_TIMEOUT” environment variable)
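The five steps above can be sketched as a single handler pass. This is a simplified illustration, not the published code: deps bundles hypothetical promise-returning wrappers around the ECS, Auto Scaling, and SNS APIs, and cluster, topicArn, and waitMs are placeholder configuration values.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// One pass of the drain-and-wait loop for a termination lifecycle hook.
// `message` is the parsed lifecycle notification received from SNS.
async function handleTerminationMessage(deps, message) {
  const { ecs, autoscaling, sns, cluster, topicArn, waitMs } = deps
  if (message.LifecycleTransition !== 'autoscaling:EC2_INSTANCE_TERMINATING') {
    return 'ignored'
  }

  // Step 3: find the container instance and ask ECS to drain it.
  const { containerInstanceArns } = await ecs.listContainerInstances({
    cluster,
    filter: `ec2InstanceId == ${message.EC2InstanceId}`,
  })
  let runningTasks = 0
  if (containerInstanceArns.length > 0) {
    const { containerInstances } = await ecs.describeContainerInstances({
      cluster,
      containerInstances: containerInstanceArns,
    })
    const [instance] = containerInstances
    if (instance.status !== 'DRAINING') {
      await ecs.updateContainerInstancesState({
        cluster,
        containerInstances: containerInstanceArns,
        status: 'DRAINING',
      })
    }
    runningTasks = instance.runningTasksCount
  }

  // Step 4: nothing is running, so let the Auto Scaling Group terminate it.
  if (runningTasks === 0) {
    await autoscaling.completeLifecycleAction({
      AutoScalingGroupName: message.AutoScalingGroupName,
      LifecycleHookName: message.LifecycleHookName,
      LifecycleActionToken: message.LifecycleActionToken,
      LifecycleActionResult: 'CONTINUE',
    })
    return 'continued'
  }

  // Step 5: wait, then re-publish the same message to run this handler again.
  await sleep(waitMs)
  await sns.publish({ TopicArn: topicArn, Message: JSON.stringify(message) })
  return 'retrying'
}
```

The loop ends either when the instance has no running tasks or when the lifecycle hook’s own timeout expires, which is why that timeout has to be at least as long as your slowest shutdown.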

Cluster Instance Scaling

Cluster instance scale-out is pretty easy. Set some CloudWatch alarms on the ECS CPUReservation and MemoryReservation metrics, and you can scale out according to those. Scaling in is a little more tricky.

I originally used those same metrics to scale in. Now, I use a Lambda script that runs every 30 minutes, cleaning up unused resources until a certain threshold of available CPU and memory is reached. This technique further reduces service disruption. I’ll post this on GitHub sometime in the near future.
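Until I publish the real script, here is one possible reading of the scale-in idea as a sketch: remove instances that run no tasks, but stop as soon as removing another would push the cluster’s free CPU or memory below a threshold. The field names (remainingCpu, remainingMemory, runningTasksCount) and the threshold values are assumptions for illustration, not the actual script.

```javascript
// Picks idle instances to drain while keeping a minimum amount of free
// capacity in the cluster. All field names and thresholds are hypothetical.
function pickInstancesToDrain(instances, { minFreeCpu, minFreeMemory }) {
  let freeCpu = instances.reduce((sum, i) => sum + i.remainingCpu, 0)
  let freeMemory = instances.reduce((sum, i) => sum + i.remainingMemory, 0)
  const toDrain = []

  for (const instance of instances) {
    // Only consider instances with nothing running, so no task is disrupted.
    if (instance.runningTasksCount > 0) continue

    const cpuAfter = freeCpu - instance.remainingCpu
    const memoryAfter = freeMemory - instance.remainingMemory
    if (cpuAfter >= minFreeCpu && memoryAfter >= minFreeMemory) {
      toDrain.push(instance.id)
      freeCpu = cpuAfter
      freeMemory = memoryAfter
    }
  }
  return toDrain
}

const cluster = [
  { id: 'a', runningTasksCount: 2, remainingCpu: 512, remainingMemory: 1024 },
  { id: 'b', runningTasksCount: 0, remainingCpu: 2048, remainingMemory: 4096 },
  { id: 'c', runningTasksCount: 0, remainingCpu: 2048, remainingMemory: 4096 },
]
console.log(pickInstancesToDrain(cluster, { minFreeCpu: 2048, minFreeMemory: 4096 }))
```

In this example only one of the two idle instances is removed, because removing both would leave less headroom than the threshold allows.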

Application Deployment

I’ve gone through a few application deployment strategies.

  1. Hosted CI + Deploy Shell Script
    • Pros: simple.
    • Cons: you need somewhere to run it, easily becomes a mess. Shell scripts are a pain to debug and test.
  2. Hosted CI + Deploy Python Script (I might put this on GitHub sometime)
    • Pros: powerful, easier to test than using a bunch of shell scripts.
    • Cons: be careful about extending the script. It can quickly become spaghetti code.
  3. Jenkins
    • Pros: powerful.
    • Cons: Jenkins.
  4. CodeBuild + CodePipeline
    • Pros: simple; ECS deployment was recently added; can be managed with Terraform.
    • Cons: subject to the limitations of CodePipeline (which is pretty limited). In our use case, the sticking point is not being able to deploy an arbitrary Git branch (you have to deploy the branch specified in the CodePipeline definition).

Grab-bag

Other tips and tricks

  • Docker stdout logging is not cheap (also, performance is highly variable across log drivers — I recently had a major problem with the fluentd driver blocking all writes). If your application blocks on logging (looking at you, Ruby), performance will suffer.
  • Having a few large instances yields more performance than many small instances (with the added benefit of having the layer cache when performing deploys).
  • The default placement strategy should be: binpack on the resource that is most important to your application (CPU or memory), AZ-balanced.
  • Applications that can’t be safely shut down in less than 1 minute do not work well with Spot instances. Use a placement constraint to make sure these tasks don’t get scheduled on a Spot instance (you’ll have to set the attribute yourself, probably using the EC2 user data).
  • Spot Fleet + ECS = ❤️
  • Run aws ecs update-service help to see the service administration options. I use --force-new-deployment and --desired-count quite often.
  • If you manage your own EC2 instances with Auto Scaling Groups: aws autoscaling terminate-instance-in-auto-scaling-group --instance-id "i-XXX" --no-should-decrement-desired-capacity will run the termination lifecycle hooks on that instance, terminate it, and start a replacement. This is what I use to switch out old EC2 instances for ones with a new launch configuration.
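For reference, the placement strategy and constraint from the tips above might look like this when passed to the ECS CreateService API. This is a sketch: the custom attribute name lifecycle and its value on-demand are placeholders, and the constraint only matches container instances on which you have set that attribute yourself.

```javascript
// Placement settings in the shape accepted by the ECS CreateService API.
// "lifecycle" / "on-demand" are hypothetical custom attribute values.
const placement = {
  placementStrategy: [
    // Spread across availability zones first (AZ-balanced)...
    { type: 'spread', field: 'attribute:ecs.availability-zone' },
    // ...then pack tasks tightly on the resource that matters most.
    { type: 'binpack', field: 'memory' }, // or 'cpu'
  ],
  placementConstraints: [
    // Keep slow-to-stop tasks on instances explicitly marked as on-demand.
    { type: 'memberOf', expression: 'attribute:lifecycle == on-demand' },
  ],
}

console.log(JSON.stringify(placement, null, 2))
```

Matching on an explicit on-demand marker, rather than excluding a spot marker, means an instance that is missing the attribute is never selected by accident.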

“Truth in bots”

The bots should announce, “I’m not a person, or if I am, I’m not allowed to act like one.”

Or, if there’s no room or time for that sentence, perhaps a simple bot at the top of the conversation. That way, we can save our human emotions for the humans who will appreciate them.

Truth in bots | Seth’s Blog

“If you can’t tell the difference, does it matter?”

Quacking like ducks, et cetera.

The point of the post is a bit different (it’s predicated on being able to tell the difference: “… only a minute or two into the interaction that you realize you’re being fooled by an AI, not a caring human”), but what happens when you can’t tell the difference? Should AIs always announce themselves as AIs if they are indistinguishable from a human? Why?

AWS Application Auto-scaling for ECS with Terraform

Update: Target tracking scaling is now available for ECS services.

I’ve been setting up autoscaling for ECS services recently, and here are a couple of notes from managing it with Terraform.

Creating multiple scheduled actions at once


Terraform will perform the following actions:

  + aws_appautoscaling_scheduled_action.green_evening
      id:                                    
      arn:                                   
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "2"
      schedule:                              "cron(0 15 * * ? *)"
      service_namespace:                     "ecs"

  + aws_appautoscaling_scheduled_action.green_morning
      id:                                    
      arn:                                   
      name:                                  "ecs"
      resource_id:                           "service/default-production/green"
      scalable_dimension:                    "ecs:service:DesiredCount"
      scalable_target_action.#:              "1"
      scalable_target_action.0.max_capacity: "20"
      scalable_target_action.0.min_capacity: "3"
      schedule:                              "cron(0 23 * * ? *)"
      service_namespace:                     "ecs"

This fails with:


* aws_appautoscaling_scheduled_action.green_evening: ConcurrentUpdateException: You already have a pending update to an Auto Scaling resource.

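To fix this, the scheduled actions need to be applied serially. Terraform creates independent resources in parallel, so add an explicit depends_on to make one scheduled action wait for the other.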


resource "aws_appautoscaling_scheduled_action" "green_morning" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 23 * * ? *)"

  scalable_target_action {
    min_capacity = 3
    max_capacity = 20
  }
}

resource "aws_appautoscaling_scheduled_action" "green_evening" {
  name               = "ecs"
  service_namespace  = "${module.green-autoscaling.service_namespace}"
  resource_id        = "${module.green-autoscaling.resource_id}"
  scalable_dimension = "${module.green-autoscaling.scalable_dimension}"
  schedule           = "cron(0 15 * * ? *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 20
  }

  # Application AutoScaling actions need to be executed serially
  depends_on = ["aws_appautoscaling_scheduled_action.green_morning"]
}

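ECS ChatOps operations with CodePipeline and Slack

This article was originally published as the Japanese translation of ECS ChatOps with CodePipeline and Slack.

I’m currently working on migrating a Rails application to ECS. The current system is deployed with Capistrano, but it is starting to hit its limits.

While waiting for EKS to become available, I decided to adopt ECS rather than manage a Kubernetes cluster myself. I initially planned to use Lambda functions to create the required task definitions and update the ECS services, but CodePipeline support for ECS deployments was announced right before I started designing the project 🎉.

The current resources are: a few Lambda functions that link CodePipeline and Slack, two CodePipeline pipelines per service (one for production and one for staging), and the associated ECS resources.

First, a deploy is started by saying deploy [environment] [service] in the Slack channel. Slack sends an event to Lambda (via API Gateway), and Lambda starts an execution of the CodePipeline pipeline (because of the way the CodePipeline API works, multiple concurrent executions are hard to handle). This Lambda function records some basic state in DynamoDB (the Slack channel, user, and timestamp). This information is used to decide which channel to send replies to, and whom to notify when something in the deploy process changes status or fails.

CodePipeline first starts CodeBuild, which builds the Docker image and generates a simple JSON file containing the new Docker image tag used to update the task definition.

When CodeBuild finishes, a “manual approval” action is used to request human approval before the deploy continues. In the example here it is enabled for the staging environment, but normally it is only used in production. In production, we roll out in three stages: first a single canary, then 25%, then the remaining 75%.

The rest is relatively simple. CodePipeline instructs ECS to deploy the image. If an error is detected along the way, the rollback command is used to roll back the changes manually.

When the deploy completes, a Lambda function sends a message to the deploy channel.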

ECS ChatOps with CodePipeline and Slack

I’m currently working on migrating a Rails application to ECS at work. The current system uses a heavily customized Capistrano setup that’s showing its age, especially when deploying to more than 10 instances at once.

While patiently waiting for EKS, I decided to use ECS rather than manage my own Kubernetes cluster on AWS using something like kops. I was initially planning on using Lambda to create the required task definitions and update ECS services, but native CodePipeline deploy support for ECS was announced right before I started planning the project, which greatly simplified the deploy step.

Our current setup is: a few Lambda functions to link CodePipeline and Slack together, two CodePipeline pipelines per service (one for production and one for staging), and the associated ECS resources.

First, a deploy is triggered by saying “deploy [environment] [service]” in the deploy channel. Slack sends an event to Lambda (via API Gateway), and Lambda starts an execution of the CodePipeline pipeline if it is not already in progress (because of the way CodePipeline API operations work, it’s hard to work with multiple concurrent runs). This Lambda function also records some basic state in DynamoDB — namely, the Slack channel, user, and timestamp. This information is used to determine what channel to send replies to, and what user to mention if something in the deploy process goes awry.
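As a small illustration of the first step, here is a sketch of how the Lambda function might parse the chat command before starting a pipeline. The list of environments and the pipeline naming scheme are assumptions made up for this example, not our actual configuration.

```javascript
// Parses "deploy [environment] [service]" as typed in the deploy channel.
// ENVIRONMENTS and the pipelineName format are hypothetical.
const ENVIRONMENTS = ['production', 'staging']

function parseDeployCommand(text) {
  const match = /^deploy\s+(\S+)\s+(\S+)$/.exec(text.trim())
  if (!match) return null

  const [, environment, service] = match
  if (!ENVIRONMENTS.includes(environment)) return null

  return { environment, service, pipelineName: `${service}-${environment}` }
}

console.log(parseDeployCommand('deploy staging api'))
console.log(parseDeployCommand('deploy everything'))
```

Returning null for anything that doesn’t match keeps ordinary chatter in the channel from triggering a deploy; the caller can then check whether the pipeline is already running before starting an execution.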

CodePipeline then starts CodeBuild, which is configured to create Docker image(s) and a simple JSON file that is used to tell CodePipeline’s ECS integration the image tags the task definition should be updated with.
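The JSON file in question is the image definitions file (named imagedefinitions.json by default) that the ECS deploy action reads: an array of container names and image URIs. Here is a sketch of generating it; the container name, repository URI, and tag are placeholders.

```javascript
// Builds the contents of imagedefinitions.json for CodePipeline's ECS
// deploy action. Container names, repository URIs, and the tag are placeholders.
function buildImageDefinitions(containers, tag) {
  return containers.map(({ name, repositoryUri }) => ({
    name, // must match the container name in the task definition
    imageUri: `${repositoryUri}:${tag}`,
  }))
}

const definitions = buildImageDefinitions(
  [{ name: 'web', repositoryUri: '123456789012.dkr.ecr.us-east-1.amazonaws.com/web' }],
  'abc123'
)
console.log(JSON.stringify(definitions, null, 2))
```

In CodeBuild, the output of a script like this would be written to a file and declared as a build artifact so the deploy stage can pick it up.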

When CodeBuild is finished, a “manual approve” action is used to request human approval before continuing with the deploy. In the example here, I have it turned on for staging environments, but it’s usually only used in production. In production, we normally have 3 stages in the release cycle — the first canary deployment, followed by 25%, then the remaining 75%.

The rest is relatively straightforward — just CodePipeline telling ECS to deploy images. If errors are detected along the way, a “rollback” command is used to manually roll back changes.

When the deploy is finished, a Lambda function is used to send a message to the deploy channel.

2017: A year in review

A few years ago when I was doing client work, we would regularly host clients’ sites and apps for them. During this time, I was responsible for both development and keeping them up and running as much as possible. Most of the money being in new development, it was difficult to assign priority to improving the operations of existing applications. In this period, I wanted an “operations person” to teach me how to make new applications that would need minimal operations support from the beginning. Failing this, I decided to become “the operations person” myself.

Following that decision, at the end of 2016 I found myself working at BizReach on the infrastructure team for HRMOS, a Software-as-a-Service product focused on applicant tracking for medium to large enterprises. After that job, I went to a small startup, dely, as a Site Reliability Engineer for their flagship product Kurashiru, a recipe video app for iOS and Android.

This is the first full year I’ve been working full-time as a dedicated infrastructure / operations / SRE / DevOps engineer, and I feel like I’ve grown a lot. On the technical side, I was able to lead the migration of complex legacy monolith systems to scalable and resilient independent systems. On the not-so-technical side, I’ve experienced different types of company cultures, managerial styles, and I’ve gotten accustomed to working with teams of engineers — the experience I’ve had up until this year was mostly working in extremely small teams.

While I do have a passion for making, maintaining, and improving services, I am also very interested in company culture — what makes it and what breaks it — especially when it comes to remote work. I believe most technical engineering work can be done as efficiently (if not more) remote, but there are definite challenges that need to be addressed before I can start leading a change in any position I’m in.

Here’s to 2018! 🎉

Shipping Events from Fluentd to Elasticsearch

We use fluentd to process and route log events from our various applications. It’s simple, safe, and flexible. With at-least-once delivery by default, log events are buffered at every step before they’re sent off to the various storage backends. However, there are some caveats with using Elasticsearch as a backend.

Currently, our setup looks something like this:

The general flow of data is from the application, to the fluentd aggregators, then to the backends — mainly Elasticsearch and S3. If a log event warrants a notification, it’s published to a SNS topic, which in turn triggers a Lambda function that sends the notification to Slack.

The fluentd aggregators are managed by an auto-scaling group, but are not placed behind a load balancer. Instead, a Lambda function connected to the auto-scaling group lifecycle notifications updates a DNS round-robin entry with the private IP addresses of the fluentd aggregator instances.
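Assuming the DNS entry lives in Route 53 (the setup above doesn’t depend on it), the round-robin update boils down to upserting a single A record that lists every aggregator’s private IP. A sketch of building that request follows; the hosted zone ID, record name, and TTL are placeholders.

```javascript
// Builds the parameters for a Route 53 ChangeResourceRecordSets call that
// replaces the round-robin A record with the current aggregator IPs.
// The zone ID, record name, and TTL used below are placeholders.
function buildRoundRobinChange(hostedZoneId, recordName, privateIps, ttl = 60) {
  return {
    HostedZoneId: hostedZoneId,
    ChangeBatch: {
      Changes: [
        {
          Action: 'UPSERT',
          ResourceRecordSet: {
            Name: recordName,
            Type: 'A',
            TTL: ttl,
            // One value per aggregator; resolvers rotate through them.
            ResourceRecords: privateIps.map((ip) => ({ Value: ip })),
          },
        },
      ],
    },
  }
}

const change = buildRoundRobinChange('ZEXAMPLE', 'fluentd.internal.example.com.', ['10.0.1.10', '10.0.2.11'])
console.log(JSON.stringify(change, null, 2))
```

A short TTL matters here: clients only notice a terminated aggregator once their cached record expires.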

We use the fluent-plugin-elasticsearch plugin to output log events to Elasticsearch. However, because this plugin uses the bulk insert API and does not validate whether events have actually been successfully inserted into the cluster, it is dangerous to rely on it exclusively (thus the S3 backup).
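To show what that missing validation would involve: a bulk request can return HTTP 200 while individual items inside it fail, so the response body has to be inspected item by item. The sketch below is not how the plugin works; it is a standalone illustration of checking a bulk API response, with failedBulkItems being a made-up helper name.

```javascript
// Returns the items of an Elasticsearch bulk API response that failed.
// (failedBulkItems is a hypothetical helper, not part of any plugin.)
function failedBulkItems(bulkResponse) {
  // The top-level "errors" flag is true if at least one item failed.
  if (!bulkResponse.errors) return []

  const failures = []
  bulkResponse.items.forEach((item, position) => {
    // Each item has a single key naming the action: index, create, update, or delete.
    const [action] = Object.keys(item)
    const result = item[action]
    if (result.error) {
      failures.push({ position, action, status: result.status, error: result.error })
    }
  })
  return failures
}

const response = {
  took: 30,
  errors: true,
  items: [
    { index: { _id: '1', status: 201 } },
    { index: { _id: '2', status: 429, error: { type: 'es_rejected_execution_exception' } } },
  ],
}
console.log(failedBulkItems(response))
```

The position of each failure maps back to the event at the same position in the request, which is what you would need in order to retry only the rejected events.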