Chris’s senior-level experience with AWS best practices fast-tracked our company's infrastructure development. His work was a crucial milestone that enabled us to scale our engineering teams and systems in step with our rapid growth.
Calm is really taking leaps forward to ensure their customers have the best possible user experience in terms of stability and performance. Moving to AWS EKS further enabled Calm to focus on product, velocity, and user experience without concerns for the operational overhead and complexities that Kubernetes can introduce.
The Challenge Calm Faced When an Unexpected Outage Brought Their System Down
Many companies are moving their IT solutions from virtualization to containers, which lets them abstract away differences in OS distributions and underlying infrastructure. Kubernetes is an open-source container management system that provides mechanisms for deploying, maintaining, and scaling containerized applications. It is the system Calm had put into place for its own operations, using the standard industry tools that existed at the time.
Calm had hired Christopher Stobie, a senior engineer, through Toptal’s AWS DevOps practice to supplement their existing resources, as they simply didn’t have enough people with the necessary skills to manage the systems they already had in place. On Chris’s second day on the job, what Calm subsequently referred to as the “Great and Terrible Outage” occurred: etcd corruption in the self-managed Kubernetes control plane rolled the system back to its legacy infrastructure, with catastrophic consequences. “Calm was running Kubernetes entirely by themselves, which is very hard to do,” notes Chris. “But on my second day at Calm, there was a two-day outage, a Kubernetes failure, and the control plane was corrupted and unrecoverable.”
Despite the dire situation, Chris was able to build a new, fully automated cluster managed by AWS rather than by Calm itself. He built the system to run on EKS, defining the entire networking layer as code in Terraform and making Calm fully functional again. Because of the ease of use of the AWS solution, the migration to EKS took only about three days.
An Immediate Beneficial Outcome
Though prompted by an unexpected emergency, the migration paid off right away. Control plane stability improved immediately, and the networking overhead within the cluster was significantly reduced. In addition, the source-controlled cluster configuration allowed for quick iterations, and the IAM authorization setup was extremely easy.
The metrics for success that most companies running an IT environment would use – uptime, resiliency, ability to depend on production environments – saw substantial improvement after the switch to EKS. Previously, the cost of downtime alone was significant, with each outage costing approximately $40K per hour as Calm was unable to subscribe users. In the six months since the EKS deployment, networking has become much more reliable, and the speed at which the server returns responses means that DevOps is no longer waiting for auto-complete to come back for suggested deployments.
Bold and Innovative Thinking Pays Off
While AWS EKS isn’t the only offering in the managed Kubernetes marketplace, it certainly showcases the depth and breadth of AWS technology and expertise. By selecting EKS as an early adopter, Calm displayed the forward thinking that is a hallmark of the best companies, which implement technologies that assure the most seamless client and customer experiences. In this case, Calm recognized early on that they needed extra assistance, and turned to Toptal knowing that Toptal would have the resources needed for such a monumental undertaking. A key lesson here for other companies is that technology itself is not enough: while this success would not have been possible without the AWS cloud and technology offerings, the “human talent cloud” with experience in implementation is imperative as well. This combination enabled Calm to institute a robust system that is in full production, a distinction that is relatively uncommon and gives them an advantage in the marketplace. Now, six months into the EKS rollout, Calm’s experience shows that their innovative path continues to pay dividends and will do so for some time to come.
Faster server responses, which save significant time previously spent waiting for auto-complete.
Reliable networking, which saves money by preventing unexpected outages.
Fully automated cluster managed by AWS, which allows Calm to focus on other priorities.