Cover image
Technology
11 minute read

Efficiency at Scale: A Tale of AWS Cost Optimization

Understanding total spend is a common challenge for cloud users, especially on projects with complex pricing models. This article explores the top AWS cost optimizations that will help you scale your platform effectively.

I recently launched a cryptocurrency analysis platform, expecting a small number of daily users. However, when some popular YouTubers found the site helpful and published a review, traffic grew so quickly that it overloaded the server, and the platform (Scalper.AI) became inaccessible. My original AWS EC2 environment needed extra support. After considering multiple solutions, I decided to use AWS Elastic Beanstalk to scale my application. Things were looking good and running smoothly, but I was taken aback by the costs in the billing dashboard.

This is not an uncommon issue. A survey from 2021 found that 82% of IT and cloud decision-makers have encountered unnecessary cloud costs, and 86% don’t feel they can get a comprehensive view of all their cloud spending. Though Amazon offers a detailed overview of additional expenses in its documentation, the pricing model is complex for a growing project. To make things easier to understand, I’ll break down a few relevant optimizations to reduce your cloud costs.

Why I Chose AWS

The goal of Scalper.AI is to collect information about cryptocurrency pairs (the assets swapped when trading on an exchange), run statistical analyses, and provide crypto traders with insights about the state of the market. The technical structure of the platform consists of three parts:

  • Data ingestion scripts
  • A web server
  • A database

The ingestion scripts gather data from different sources and load it to the database. I had experience working with AWS services, so I decided to deploy these scripts by setting up EC2 instances. EC2 offers many instance types and lets you choose an instance’s processor, storage, network, and operating system.

I chose Elastic Beanstalk for the remaining functionality because it promised smooth application management. The load balancer properly distributed the burden among my server’s instances, while the autoscaling feature handled adding new instances for an increased load. Deploying updates became very easy, taking just a few minutes.

Scalper.AI worked stably, and my users no longer faced downtime. Of course, I expected an increase in spending since I added extra services, but the numbers were much larger than I had predicted.

How I Could Have Reduced Cloud Costs

Looking back, there were many areas of complexity in my project’s use of AWS services. We’ll examine the budget optimizations I discovered while working with common AWS EC2 features: burstable performance instances, outbound data transfers, elastic IP addresses, and terminate and stop states.

Burstable Performance Instances

My first challenge was supporting CPU power consumption for my growing project. Scalper.AI’s data ingestion scripts provide users with real-time information analysis; the scripts run every few seconds and feed the platform with the most recent updates from crypto exchanges. Each iteration of this process generates hundreds of asynchronous jobs, so the site’s increased traffic necessitated more CPU power to decrease processing time.

The cheapest instance offered by AWS with four vCPUs, a1.xlarge, would have cost me ~$75 per month at the time. Instead, I decided to spread the load between two t3.micro instances with two vCPUs and 1GB of RAM each. The t3.micro instances offered enough speed and memory for the job I needed at one-fifth of the a1.xlarge’s cost. Nevertheless, my bill was still larger than I expected at the end of the month.

In an effort to understand why, I searched Amazon’s documentation and found the answer: When an instance’s CPU usage falls below a defined baseline, it collects credits, but when the instance bursts above baseline usage, it consumes the previously earned credits. If there are no credits available, the instance spends Amazon-provided “surplus credits.” This ability to earn and spend credits causes Amazon EC2 to average an instance’s CPU usage over 24 hours. If the average usage goes above the baseline, the instance is billed extra at a flat rate per vCPU-hour.

I monitored the data ingestion instances for multiple days and found that my CPU setup, which was intended to cut costs, did the opposite. Most of the time, my average CPU usage was higher than the baseline.

A chart has three drop-down selections chosen at the top of the screen. The first two, at the left, are "Feb 10, 2022 - Feb 19, 2022" and "Daily," followed by a small, circled "i." The third is on the right of the screen and says "Line" with a symbol for a line graph. Below the drop-down selections, the chart contains two line graphs with filters. At the top, the filter line reads "Group by: Usage Type" (the selected filter, which has a blue background) and then shows the unselected filters: "Service, Linked Account, Region, Instance Type, Resource, Cost Category, Tag, More," with "Resource" grayed out and the last three having drop-down arrows beside them. The top line graph has "Costs ($)" on the y-axis, ranging from 0.0 to 2.5. The bottom line graph has "Usage (vCPU-Hours)" on the y-axis, ranging from 0 to 40. Both line graphs share dates labeled on the x-axis, ranging from Feb-10 to Feb-19, and a key labeling their purple lines: "USE2-CPUCredits:t3." The top line graph has approximately eight points connected linearly and trends upward over time: one point around (Feb-10, $0.3), a second around (Feb-11, $0.6), a third around (Feb-12, $0.5), a fourth around (Feb-14, $2.1), a fifth around (Feb-15, $2.4), a sixth around (Feb-16, $2.3), a seventh around (Feb-18, $2.3), and an eighth around (Feb-19, $2.6). The bottom line graph also has approximately eight points connected linearly and trends upward over time: one point around (Feb-10, 5 vCPU-Hours), a second around (Feb-11, 15 vCPU-Hours), a third around (Feb-12, 10 vCPU-Hours), a fourth around (Feb-14, 40 vCPU-Hours), a fifth around (Feb-15, 50 vCPU-Hours), a sixth around (Feb-16, 45 vCPU-Hours), a seventh around (Feb-18, 45 vCPU-Hours), and an eighth around (Feb-19, 55 vCPU-Hours).
The above chart displays cost surges (top graph) and increasing CPU credit usage (bottom graph) during a period when CPU usage was above the baseline. The dollar cost is proportional to the surplus credits spent, since the instance is billed per vCPU-hour.

I had initially analyzed CPU usage for a few crypto pairs; the load was small, so I thought I had plenty of space for growth. (I used just one micro-instance for data ingestion since fewer crypto pairs did not require as much CPU power.) ​However, I realized the limitations of my original analysis once I decided to make my insights more comprehensive and support the ingestion of data for hundreds of crypto pairs—cloud service analysis means nothing unless performed at the correct scale.

Outbound Data Transfers

Another result of my site’s expansion was increased data transfers from my app due to a small bug. With traffic growing steadily and no more downtime, I needed to add features to capture and hold users’ attention as soon as possible. My newest update was an audio alert triggered when a crypto pair’s market conditions matched the user’s predefined parameters. Unfortunately, I made a mistake in the code, and audio files loaded into the user’s browser hundreds of times every few seconds.

The impact was huge. My bug generated audio downloads from my web servers, causing additional outbound data transfers. A tiny error in my code resulted in a bill almost five times larger than the previous ones. (This wasn’t the only consequence: The bug could cause a memory leak in the user’s computer, so many users stopped coming back.)

A chart similar to the previous one but with the first drop-down reading "Jan 06, 2022 - Jan 15, 2022," the top line graph's "Costs ($)" ranging from 0 to 30, and the bottom line graph having "Usage (GB)" on the y-axis, ranging from 0 to 300. Both line graphs share dates labeled on the x-axis, ranging from Jan-06 to Jan-15, and a key labeling their purple lines: "USE2-DataTransfer-Out-Bytes." The top line graph has approximately eight points connected linearly and trends upward over time: one point around (Jan-06, $2), a second around (Jan-08, $4), a third around (Jan-09, $7), a fourth around (Jan-10, $6), a fifth around (Jan-12, $15), a sixth around (Jan-13, $25), a seventh around (Jan-14, $24), and an eighth around (Jan-15, $29). The bottom line graph also has approximately eight points connected linearly and trends upward over time: one point around (Jan-06, 10 GB), a second around (Jan-08, 50 GB), a third around (Jan-09, 80 GB), a fourth around (Jan-10, 70 GB), a fifth around (Jan-12, 160 GB), a sixth around (Jan-13, 270 GB), a seventh around (Jan-14, 260 GB), and an eighth around (Jan-15, 320 GB).
The above chart displays cost surges (top graph) and increasing outbound data transfers (bottom graph). Because outbound data transfers are billed per GB, the dollar cost is proportional to the outbound data usage.

Data transfer costs can account for upward of 30% of AWS price surges. EC2 inbound transfer is free, but outbound transfer charges are billed per GB ($0.09 per GB when I built Scalper.AI). As I learned the hard way, it is important to be cautious with code affecting outbound data; reducing downloads or file loading where possible (or carefully monitoring these areas) will protect you from higher fees. These pennies add up quickly since charges for transferring data from EC2 to the internet depend on the workload and AWS Region-specific rates. A final caveat unknown to many new AWS customers: Data transfer becomes more expensive between different locations. However, using private IP addresses can prevent extra data transfer costs between different availability zones of the same region.

Elastic IP Addresses

Even when using public addresses such as Elastic IP addresses (EIPs), it is possible to lower your EC2 costs. EIPs are static IPv4 addresses used for dynamic cloud computing. The “elastic” part means that you can assign an EIP to any EC2 instance and use it until you choose to stop. These addresses let you seamlessly swap unhealthy instances with healthy ones by remapping the address to a different instance in your account. You can also use EIPs to specify a DNS record for a domain so that it points to an EC2 instance.

AWS provides only five EIPs per account per region, making them a limited resource and costly with inefficient use. AWS charges a low hourly rate for each additional EIP and bills extra if you remap an EIP more than 100 times in a month; staying under these limits will lower costs.

Terminate and Stop States

AWS provides two options for managing the state of running EC2 instances: terminate or stop. Terminating will shut down the instance, and the virtual machine provisioned for it will no longer be available. Any attached Elastic Block Store (EBS) volumes will be detached and deleted, and all data stored locally in the instance will be lost. You will no longer be charged for the instance.

Stopping an instance is similar, with one small difference. The attached EBS volumes are not deleted, so their data is preserved, and you can restart the instance at any time. In both cases, Amazon no longer charges for using the instance, but if you opt for stopping instead of terminating, the EBS volumes will generate a cost as long as they exist. AWS recommends stopping an instance only if you expect to reactivate it soon.

But there’s a feature that can enlarge your AWS bill at the end of the month even if you terminated an instance instead of stopping it: EBS snapshots. These are incremental backups of your EBS volumes stored in Amazon’s Simple Storage Service (S3). Each snapshot holds the information you need to create a new EBS volume with your previous data. If you terminate an instance, its associated EBS volumes will be deleted automatically, but its snapshots will remain. As S3 charges by the volume of data stored, I recommend that you delete these snapshots if you won’t use them shortly. AWS features the ability to monitor per-volume storage activity using the CloudWatch service:

  1. While logged into the AWS Console, from the top-left Services menu, find and open the CloudWatch service.
  2. On the left side of the page, under the Metrics collapsible menu, click on All Metrics.
  3. The page shows a list of services with metrics available, including EBS, EC2, S3, and more. Click on EBS and then on Per-volume Metrics. (Note: The EBS option will be visible only if you have EBS volumes configured on your account.)
  4. Click on the Query tab. In the Editor view, copy and paste the command SELECT AVG(VolumeReadBytes) FROM "AWS/EBS" GROUP BY VolumeId and then click Run. (Note: CloudWatch uses a dialect of SQL with a unique syntax.)

A webpage appears with a dark blue header menu on top of the page, which from left to right includes the aws logo, a "Services" drop-down, a search bar, a code icon, a bell icon, a question mark icon, a drop-down text reading "N. Virginia", and a drop-down text reading "Toptal Test Account." Underneath, we see the main part of the webpage in white. The left of the page has a scrolling menu with title "CloudWatch" and an "X." Underneath, from top to bottom, the menu reads: "Favorites and recents," "Dashboards," "Alarms" (in bold, with three drop-downs: "In alarm," "All alarms," and "Billing"), "Logs" (in bold, with two drop-downs: "Log groups" and "Log insights"), "Metrics" (in bold, with three drop-downs: "All metrics," which is highlighted in bright orange, "Explorer," and "Streams"), "X-Ray traces" (in bold), "Events" (in bold), and "Application monitoring" (in bold). All bolded text in the menu has a drop-down menu triangle to the left of the text. The middle of the webpage displays a graph on the top half of the page, and an editor on the bottom half of the page. The graph has two lines of headers. The first line reads "CloudWatch > Metrics" on the left (with "CloudWatch" text in blue) and "Switch to your original interface" in blue on the right. The second line reads "Untitled graph" with an edit icon on the left, and displays options on the right: "1h, 3h, 12h, 1d, 3d, 1w, Custom" (with "3h" in blue, and "Custom" having a calendar icon), "Line" (with a drop-down icon), "Actions" (with a drop-down icon), and a refresh button with a drop-down icon. The graph itself has text in its center reading: "Your CloudWatch graph is empty. Select some metrics to appear here." The editor also has two lines of headers. The first line reads "Browse," "Query" (highlighted in orange), "Graphed metrics," "Options," and "Source" on the left, and "Add math" (with a drop-down icon) and "Add query" (with a drop-down icon) on the right. The second line reads "Metrics Insights - query editor" and "Info" (in blue) on the left, and "Builder" and "Editor" (with "Editor" selected) on the right. Below the editor is an orange "Run" button on the left, and the text "Use Ctrl + Enter to run query, Ctrl + Space to autocomplete" on the right. The right of the webpage has a white menu, which reads "Queries" and "Help" from top to bottom.
An overview of the CloudWatch monitoring setup described above (shown with empty data and no metrics selected). If you have existing EBS, EC2, or S3 instances on your account, these will show up as metric options and will populate your CloudWatch graph.

CloudWatch offers a variety of visualization formats for analyzing storage activity, such as pie charts, lines, bars, stacked area charts, and numbers. Using CloudWatch to identify inactive EBS volumes and snapshots is an easy step toward optimizing cloud costs.

Extra Money in Your Pocket

Though AWS tools such as CloudWatch offer decent solutions for cloud cost monitoring, various external platforms integrate with AWS for more comprehensive analysis. For example, cloud management platforms like VMWare’s CloudHealth show a detailed breakdown of top spending areas that can be used for trend analysis, anomaly detection, and cost and performance monitoring. I also recommend that you set up a CloudWatch billing alarm to detect any surges in charges before they become excessive.

Amazon provides many great cloud services that can help you delegate the maintenance work of servers, databases, and hardware to the AWS team. Though cloud platform costs can easily grow due to bugs or user errors, AWS monitoring tools equip developers with the knowledge to defend themselves from additional expenses.

With these cost optimizations in mind, you’re ready to get your project off the ground—and save hundreds of dollars in the process.

The AWS logo with the word "PARTNER" and the text "Advanced Tier Services" below that.
As an Advanced Consulting Partner in the Amazon Partner Network (APN), Toptal offers companies access to AWS-certified experts, on demand, anywhere in the world.

Understanding the basics

AWS burstable instances (also called burstable performance instances) allow short-term CPU use above a baseline level. In contrast, traditional instances offer fixed CPU resources.

The latest generation of burstable instances includes types T4g, T3a, and T3. The previous generation T2 type also offers burstable performance.

The terminated state occurs once an AWS EC2 instance is shut down and no longer has a virtual machine. In this case, its EBS volumes are deleted, and all of the instance’s local data is permanently lost.

When an instance is stopped, the attached EBS volumes’ data persists, and the instance can be restarted at any time. Though Amazon doesn’t charge for stopped instances, the EBS volumes can generate charges as long as they exist.

For inbound EC2 data transfers, there is no cost. However, outbound data transfers are charged based on the AWS Regions involved.

Best practices for AWS cost optimization include: purchasing reserved EC2 instances and savings plans, deleting unattached EBS volumes, deleting obsolete EBS snapshots, and releasing unattached Elastic IP addresses.

Amazon CloudWatch and AWS Cost Explorer can help optimize AWS costs. CloudHealth by VMWare is an external option that offers AWS support.