My Brief Thoughts on the AWS Kinesis Outage
There have been multiple analyses about the recent (2020/11/25) outage of AWS Kinesis and its cascading failure mode, taking a chunk of AWS services with it – including seemingly unrelated Cognito – due to dependencies hidden to the user. If you haven’t read the official postmortem statement by AWS yet, go read it now.
There are an infinite amount of arguments that can made about cascading failure; I’m not here to talk about that today. I’m here to talk about a time a few years ago I was evaluating a few systems to do event logging. Naturally, Kinesis was in consideration, and our team interviewed an AWS Solution Architect about potential design patterns we could implement, what problems they would solve, what hiccups we may encounter on the way, et cetera.
At the time, I didn’t think much of it, but in hindsight it should have been a red flag.
ME: “So, what happens when Kinesis goes down? What kind of recovery processes do you think we need to have in place?”
SA: “Don’t worry about that, Kinesis doesn’t go down.”
The reason I didn’t think much of it at that time was that our workload would have been relatively trivial for Kinesis, and I mentally translated the reply to “don’t worry about Kinesis going down for this particular use case”.
We decided to not go with Kinesis for other reasons (I believe we went with Fluentd).
Maybe my mental translation was correct? Maybe this SA had this image of Kinesis as a system that was impervious to fault? Maybe it was representative of larger problem inside AWS of people who overestimated the reliability of Kinesis? I don’t know. This is just a single data point – it’s not even that strong. A vague memory from “a few years ago”? I’d immediately dismiss it if I heard it.
The point of this post is not to disparage this SA, nor to disparage the Kinesis system as a whole (it is extremely reliable!), but to serve as a reminder (mostly to myself) that one should be immediately suspicious when someone says “never” or “always”.