Using Docker as an Admin tool on Windows

Have you ever tried to download OpenSSL on Windows? You need to convert a cert, or just do some crypto work, so you Google “openssl windows” and find the SourceForge entry. After a few minutes of confused scrolling you finally accept that the page doesn’t have a release more recent than several years ago.

So you go back to Google and click on the link for openssl.org, and realize that they don’t distribute any binaries at all (Windows or otherwise).
You scroll a few entries further down, still looking for an executable or a guide to get OpenSSL on Windows, and you click on a promising article heading. Perusing it tells you that it’s actually just a guide for Cygwin (and it would work, but then you have Cygwin sitting on your machine, and you’ll probably never use it again). You think to yourself, “There has to be an executable somewhere.”
Next you jump to page two of the Google results (personally, it’s the first time I’ve made it to page two in years) and find more of the same: Linux fanatics using Cygwin, source code you could compile yourself, and obscure religious wars like SChannel vs. every other cryptography provider.
All you really want is to go from a .pfx to a .pem, and you’re running in circles looking for the most popular tool in the world to do it.
Enter Docker.
At work a number of our services are deployed on Docker, so I already have Docker Desktop installed, and it’s usually in Linux container mode on my workstation, so it was only a couple of commands to get to openssl in an Alpine container.
Here are my commands for reference:
PS C:\Users\bolson\Documents> docker run -v "$((pwd).path)/keys:/keys" -it alpine
/ # cd keys/
/keys # ls -l | grep corp.pem
-rwxr-xr-x 1 root root 1692 Jul 6 17:09 corp.pem
/keys # apk add openssl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
(1/1) Installing openssl (1.1.1c-r0)
Executing busybox-1.30.1-r2.trigger
OK: 6 MiB in 15 packages
/keys # openssl
OpenSSL>
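
From there, the conversion itself is a single command back at the shell prompt (the file names below are placeholders, and -nodes leaves the private key unencrypted in the resulting .pem):

/keys # openssl pkcs12 -in mycert.pfx -out mycert.pem -nodes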

I realize there are plenty of other options for getting a Linux environment up and running on Windows. You could grab VirtualBox and an Ubuntu ISO, you could open a Cloud9 environment on AWS and get to an Amazon Linux instance there, you could use the Windows Subsystem for Linux, you could dual boot your Windows laptop with some Linux distro, and the list goes on.

Those approaches are fine and would all work, but they either take time, cost money, or are focused on one specific scenario and wouldn’t have much utility outside of getting you into openssl to convert your cert. If I realize that one of the certs I need to convert is a JKS store instead of a .pfx, I can flip over to a Docker image with the Java keytool installed pretty easily.
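
For example, something along these lines would do it (the openjdk:8 image and the keystore file names here are just illustrative; the first command runs on your workstation, the second inside the container):

docker run -v "$((pwd).path)/keys:/keys" -it openjdk:8 bash
keytool -importkeystore -srckeystore /keys/mystore.jks -destkeystore /keys/mystore.p12 -deststoretype PKCS12

From the resulting .p12 you’re back in openssl territory for any further conversion.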

Cleanup is easy with a few PowerShell commands:

# Remove all containers (skip the header row of the docker ps output)
$containers = docker ps -a | Select-Object -Skip 1
foreach ($container in $containers) {$id = ($container -split "[ ]+")[0]; docker rm $id}

# Remove all images (skip the header row of the docker images output)
$images = docker images | Select-Object -Skip 1
foreach ($image in $images) {$id = ($image -split "[ ]+")[2]; docker rmi $id}
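
These days you could also lean on Docker’s built-in prune commands, which do the same thing with less typing (be careful: -f skips the confirmation prompt, and these remove every stopped container and every image not used by a container):

docker container prune -f
docker image prune -a -f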

And that’s why, as a Windows user, I love Docker. You get simple, easy access to Linux environments for utilities, and it’s straightforward to map directories on your Windows machine into the Linux containers.

Nowadays you can use the Windows Subsystem for Linux for easy command line SSH access from Windows, but before that went GA on Windows 10 I used Docker for an easy SSH client (I know that plink exists, so this time you can accuse me of forcing Docker to be a solution).

You can create a simple Dockerfile that adds the OpenSSH client to an Alpine container like so:

FROM alpine
RUN apk add openssh-client

And then build and run it with:

docker build -t ssh .
docker run -v "$($env:userprofile)/documents/keys:/keys" -it ssh sh

And you’re up and running with an SSH client. Simple!
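
From there, using it looks something like this (the host, user, and key file names are placeholders; note that ssh is picky about private key permissions, and files mounted in from Windows usually come through world-readable, so you may need to copy the key and tighten it up first):

/ # cp /keys/my-key.pem /tmp/my-key.pem
/ # chmod 600 /tmp/my-key.pem
/ # ssh -i /tmp/my-key.pem ec2-user@server.example.com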

Again, there are other ways of accomplishing all of these tasks. But if your organization is investing in Docker, using it for a few simple management tasks can give you some familiarity with the mechanics, and make it easier for you to support it across all kinds of development platforms.

Using Unit Tests for Communication

Expressing software requirements is hard. Software is abstract and intangible; you can’t touch it, see it, or feel it, so talking about it can become very difficult.

There are plenty of tools and techniques that people use to make communicating about software easier, like UML diagrams or state diagrams. Today I’d like to talk about a software development technique my team has been using for communication: Test Driven Development.
Test Driven Development (TDD) is a very involved software development approach, and I won’t go into it in depth in this post, but here’s a quick summary:
  1. Write tests before code, expecting them to fail
  2. Build out code that adds the behavior the test looks for
  3. Rerun the test
On my team we’ve started trying to use unit tests as a communication tool. Here’s an example.
At my company we use the concept of a unique customer ID that consists of the two letter code of the state the customer’s headquarters is in, and a sequential three digit number. At least, most of the time. The Site ID concept grew organically, and like any standard that starts organically it has exceptions. Here are a few that I’m aware of:
  1. One very large customer uses an abbreviation of their company name, instead of state code
  2. Most systems pad with zeros, some do not (e.g. TX001 in some systems is TX1 in others)
  3. Most systems use a 3 digit number, some use four (e.g. TX001 vs TX0001)
After we added “site id converters” to a few modules in our config management code we decided it was time to centralize that functionality and build out a single “site id converter” function. When I was writing the card it was clear there was enough variation that it was going to take a fair amount of verbiage to spell out what I wanted. Let’s give it a try for fun.

Please build a function that takes a site id (the site id can be a two letter state code followed by 1, 3, or 4 digits, or a three letter customer code followed by 1, 3, or 4 digits). The function should also take a “target domain”, which is where the site id will be used (for example “citrix” or “bi”). The function should convert the site id into the right format for the target domain. For example, if TX0001 and “citrix” are passed in, the function should convert to “TX001”. If TX001 and “bi” are passed in, the function should convert to “TX0001”.

It’s not terrible, but it gets more complex when you start unpacking it and notice the details I may have forgotten to add. What if the function gets passed an invalid domain? What should the signature look like? What module should the function go into?

And it gets clarified when we add a simple unit test with the kind of details that feel a little awkward to put in the card.

Describe "Get SiteID by domain tests" {
    InModuleScope CommonFunctions {
        It "converts long id to short for citrix" {
            Get-SiteIdByDomain -siteid "TX0001" -domain "citrix" | Should be "TX001"
        }
    }
}

I’m not suggesting you stop writing cards or user stories; the right answer usually seems to be a little of both. Some verbiage to tell the background, give some motivation for the card, and open a dialog with the engineer doing the work. Some unit tests to give more explicit requirements on the interface to the function.

This approach also leaves the engineer free to pick their own implementation details. They could use a few nested “if” statements, they could use PowerShell script blocks, or anything else they can think of (that will pass a peer review). As long as it meets the requirements of the unit test, the specifics are wide open.
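
For illustration, here’s one hypothetical sketch of Get-SiteIdByDomain that would satisfy the test above (the regex, the padding rules, and the error handling are my own guesses; only the signature and the citrix/bi behavior come from the test and the card):

function Get-SiteIdByDomain([string]$siteid, [string]$domain) {
    # Split the site id into its alpha prefix (state or customer code) and its numeric part
    if ($siteid -notmatch '^([A-Za-z]+)(\d+)$') {
        throw "Unrecognized site id format: $siteid"
    }
    $prefix = $Matches[1].ToUpper()
    $number = [int]$Matches[2]

    # Each target domain expects a different amount of zero padding
    switch ($domain.ToLower()) {
        "citrix" { return "{0}{1:d3}" -f $prefix, $number }
        "bi"     { return "{0}{1:d4}" -f $prefix, $number }
        default  { throw "Unknown target domain: $domain" }
    }
}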

A note about TDD

Please note, I’m not a TDD fanatic. TDD is a really useful tool that drives you to write testable code that keeps units small, makes your functions reasonable, and lets you refactor with confidence. As long as your code still passes your unit test suite, you can make any changes you want.
But it’s not a magic hammer. In my experience it’s not a good fit to start with TDD when
  • You’re using an unfamiliar SDK or framework, TDD will slow down your exploration
  • You have a small script that is truly one time use, TDD isn’t worth the overhead
  • You have a team that is unfamiliar with TDD, introducing it can be good, using it as a hard and fast rule will demoralize and delay

Docker Windows container for Pester Tests

I recently wrote an intro to unit testing your PowerShell modules with Pester, and I wanted to give a walkthrough of our method for running those unit tests inside a Docker for Windows container.

Before we get started, I’d like to acknowledge this post is obviously filled with trendy buzzwords (CICD, Docker, Config Management, *game of thrones, docker-compose, you get the picture). All of the components we’re going to talk through today add concrete value to our business, and we didn’t do any resume driven development.

Why?

Here’s a quick run through of our motivation for each of the pieces I’ll cover in this post.
  1. Docker image for running unit tests 
    1. Gives engineers a consistent way to run the unit tests. On your workstation you might need different versions of SDKs and tools, but a Docker container lets you pin versions of things like the AWS PowerShell Tools
    2. Makes all pathing consistent – you can set up your laptop any way you like, but the paths inside of the container are consistent
  2. Docker-compose
    1. Provides a way to customize unit test runs per project
    2. Provides a consistent way for engineers to map drives into the container
  3. Code coverage metrics
    1. At my company we don’t put too much stock in code coverage metrics, but they offer some context for how thorough an engineer has been with unit tests
    2. We keep a loose goal of 60%
  4. Unit test passing count
    1. Code with a failing unit test does not go to production; a failing unit test is a strong signal that a change could cause a production outage

How!

The first step is to set up Docker Desktop for Windows. The biggest struggle I’ve seen people have getting Docker running on Windows is getting virtualization enabled, so pay extra attention to that step.
Once you have Docker installed you’ll need to create an image you can use to run your unit tests, a script to execute them, and a docker-compose file. The whole structure will look like this:

  • /
    • docker-compose.yml
    • /pestertester
      • Dockerfile
      • Run-AllUnitTests.ps1

We call our image “pestertester” (I’m more proud of that name than I should be).

There are two files inside of the pestertester folder: a Dockerfile that defines the image, and a script called Run-AllUnitTests.ps1.
Here’s a simple example of the Dockerfile. For more detail on how to write one, you should explore the Dockerfile reference.

FROM mcr.microsoft.com/windows/servercore
RUN "powershell Install-PackageProvider -Name NuGet -MinimumVersion 2.8.5.201 -Force"
RUN "powershell Install-Module -Scope CurrentUser -Name AWSPowerShell -Force;"
COPY ./Run-AllUnitTests.ps1 c:/scripts/Run-AllUnitTests.ps1

All we need for these unit tests is the AWS PowerShell Tools, and we install the NuGet package provider so we can use PowerShell’s Install-Module.

We played around with several different docker images before we picked mcr.microsoft.com/windows/servercore.

  1. We moved away from any of the .NET containers because we didn’t need the dependencies they added, and they were very large
  2. We moved away from Nano Server images because some of our PowerShell modules call functions outside of .NET Core
Next we have the script Run-AllUnitTests.ps1. The main requirement for this script to work is that your tests be stored with this file structure
  • /ConfigModule
    • ConfigModule.psm1
    • /tests
      • ConfigModule.tests.ps1
  • /ConfigModule2
    • ConfigModule2.psm1
    • /tests
      • ConfigModule2.tests.ps1
The script isn’t too complicated
$results = @();
gci -recurse -include tests -directory | ? {$_.FullName -notlike "*dsc*"} | % {
    set-location $_.FullName;
    $tests = gci;
    foreach ($test in $tests) {
        $module = $test.Name.Replace("tests.ps1","psm1")
        $result = invoke-pester ".\$test" -CodeCoverage "..\$module" -passthru -quiet;
        $results += @{
            Module = $module;
            Total = $result.TotalCount;
            passed = $result.PassedCount;
            failed = $result.FailedCount;
            codecoverage = [math]::round(($result.CodeCoverage.NumberOfCommandsExecuted / $result.CodeCoverage.NumberOfCommandsAnalyzed) * 100,2)
        }
    }
}

foreach ($result in $results) {
    write-host -foregroundcolor Magenta "module: $($result['Module'])";
    write-host "Total tests: $($result['total'])";
    write-host -ForegroundColor Green "Passed tests: $($result['passed'])";
    if($result['failed'] -gt 0) {
        $color = "Red";
    } else {
        $color = "Green";
    }
    write-host -foregroundcolor $color "Failed tests: $($result['failed'])";
    if($result['codecoverage'] -gt 60) {
        $color = "Green";
    } elseif($result['codecoverage'] -gt 30) {
        $color = "Yellow";
    } else {
        $color = "Red";
    }
    write-host -ForegroundColor $color "CodeCoverage: $($result['codecoverage'])";
}

The script iterates through any subdirectories named “tests”, and executes the unit tests it finds there, running code coverage metrics for each module.

The last piece that ties all of this together is a docker-compose file. The docker-compose file handles:

  1. Mapping the windows drives into the container
  2. Executing the script that runs the unit tests
The docker-compose file is pretty straightforward too
version: '3.7'

services:
  pestertester:
    build: ./pestertester
    volumes:
      - c:\users\bolson\documents\github\dt-infra-citrix-management\ssm:c:\ssm
    stdin_open: true
    tty: true
    command: powershell "cd ssm;C:\scripts\Run-AllUnitTests.ps1"

Once you’ve got all of this set up, you can run your unit tests with:

docker-compose run pestertester

Once the container starts up you’ll see your test results.

Experience

We’ve been running Linux containers in production for a couple of years now, but we’re just starting to pilot Windows containers. According to the documentation, Docker Desktop for Windows is still aimed at development rather than production:

Docker is a full development platform for creating containerized apps, and Docker Desktop for Windows is the best way to get started with Docker on Windows.

Running our unit tests inside of Windows containers has been a good way to get some experience with them without risking production impact.

A couple final thoughts

Windows containers are large; even Server Core and Nano Server images are gigabytes.

The image we landed on is 11 GB.

If you need to run Windows containers, and you can’t stick to .NET Core and get onto Nano Server, you’re going to be stuck with pretty large images.

Startup times for Windows containers will be a few minutes

Especially the first time on a machine while resources are getting loaded.

Versatile Pattern

This pattern of unit testing inside of a container is pretty versatile. You can use it with any unit testing framework, and any operating system you can run inside a container.

*no actual game of thrones references will be in this blog post

AWS S3 Lifecycle Policies – Prep for Deep Archive

AWS recently released a new S3 storage class called Deep Archive. It’s an archival data service with pretty low cost for data you need to hold onto, but don’t access very often.

Deep Archive is about half the cost of Glacier at $0.00099 per GB per month, but you sacrifice the option to get your data back in minutes; the only retrieval options take hours.

I work for a health care company so we hold onto patient data for years. There are plenty of reasons we might need to retrieve data from years ago, but few of them would have a time limit of less than several weeks. That makes Deep Archive a great fit for our long term data retention.

Setting it up is as simple as changing an existing lifecycle transition to Deep Archive, or creating a new S3 lifecycle rule that transitions to Deep Archive.
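
If you’re using the AWS PowerShell Tools, adding a Deep Archive transition rule might look roughly like this (the bucket name, rule id, prefix, and day count are placeholders, and you should double check the Write-S3LifecycleConfiguration parameter shapes against your module version):

# Note: Write-S3LifecycleConfiguration replaces the bucket's whole lifecycle configuration,
# so include any existing rules you want to keep
$deepArchiveRule = [Amazon.S3.Model.LifecycleRule] @{
    Id     = "archive-old-data"
    Status = "Enabled"
    Filter = @{
        LifecycleFilterPredicate = [Amazon.S3.Model.LifecyclePrefixPredicate] @{ Prefix = "archive/" }
    }
    Transitions = @(
        @{ Days = 90; StorageClass = "DEEP_ARCHIVE" }
    )
}

Write-S3LifecycleConfiguration -BucketName "my-retention-bucket" -Configuration_Rule $deepArchiveRule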

We put together a quick script to find the lifecycle transition rules in our S3 buckets that already move data to Glacier:

$buckets = get-s3bucket;

# Iterate through buckets in the current account
foreach ($bucket in $buckets) {
    write-host -foregroundcolor Green "Bucket: $($bucket.BucketName)";

    # Get the lifecycle configuration for each bucket
    $lifecycle = Get-S3LifecycleConfiguration -BucketName $bucket.BucketName;

    # Print a warning if there are no lifecycles for this bucket
    if(!$lifecycle) {
        write-host -foregroundcolor Yellow "$($bucket.BucketName) has no life cycle policies";
    } else {
        # Iterate the transition rules in this lifecycle
        foreach ($rule in $lifecycle.Rules) {
            write-host -foregroundcolor Magenta "$($rule.Id) with prefix: $($rule.Filter.LifecycleFilterPredicate.Prefix)";
            # Print a warning if there are no transitions
            if(!($rule.Transitions)) {
                write-host -foregroundcolor Yellow "No lifecycle transitions";
            }

            # Iterate the transitions and print the rules
            foreach ($transition in $rule.Transitions) {
                if($transition.StorageClass -eq "GLACIER") {
                    $color = "Yellow";
                } else {
                    $color = "White";
                }
                write-host -foregroundcolor $color "After $($transition.Days) days transition to $($transition.StorageClass)";
            }
        }
    }
}

To run this script you’ll need the AWS PowerShell Tools installed, IAM credentials set up, and a default region initialized.

When you run the script it will print out your current S3 buckets, the lifecycle rules, and the transitions in each of them, highlighting the transitions to Glacier in yellow.

Unit Testing PowerShell Modules with Pester

Pester is a unit testing framework for PowerShell. There are some good tutorials for it on the project’s GitHub page, and a few other places, but I’d like to pull together some of the key motivating use cases I’ve found and a couple of the gotchas.

Let’s start with a very simple example.
This is the contents of a simple utility module named Util.psm1
function Get-Sum([int]$number1, [int]$number2) {
    $result = $number1 + $number2;
    write-host "Result is: $($result)";
    return $result;
}

And this is the content of a simple unit test file named UtilTest.ps1

Import-Module .\Util.psm1

Describe "Util Function Tests" {
    It "Get-Sum Adds two numbers" {
        Get-Sum 2 2 | Should be 4;
    }
}

We can run these tests using “Invoke-Pester .\UtilTest.ps1”.

And already there’s a gotcha here that wasn’t obvious to me from the examples online. Let’s say I change my function to say “Sum is:” instead of “Result is:” and save the file. When I re-run my Pester tests I still see “Result is:” printed out.

What’s also interesting is that the second run took 122 ms, while the first took 407 ms.

It turns out both of these are results of the same fact – once the module you are testing is loaded into memory it will stay there until you remove it. That means any changes you make trying to fix your unit tests won’t take effect until you’ve refreshed the module. The fix is simple:

Import-Module .\Util.psm1

Describe "Util Function Tests" {
    It "Get-Sum Adds two numbers" {
        Get-Sum 2 2 | Should be 4;
    }
}
Remove-Module Util;

Removing the module after running your tests makes PowerShell pull a fresh copy into memory on the next run, so you can see your changes.
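
Alternatively, in my experience you can get the same refresh by forcing the import at the top of the test file:

Import-Module .\Util.psm1 -Force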

The next gotcha is using the Mock keyword. Let’s say I want to hide the write-host output in my function so it doesn’t clutter up my unit tests. The obvious way is to use the “Mock” keyword to create a new version of write-host that doesn’t actually write anything. My first attempt looked like this

Import-Module .\Util.psm1

Describe "Util Function Tests" {
    It "Get-Sum Adds two numbers" {
        Mock write-host;
        Get-Sum 2 2 | Should be 4;
    }
}
Remove-Module Util;

But I still see the write-host output in my unit test results.

It turns out the reason is that the Mock keyword creates mocks in the current scope, instead of in the scope of the module being tested. There are two ways of fixing this: the InModuleScope block, or the -ModuleName parameter on Mock. Here’s an example of the first option:

Import-Module .\Util.psm1

InModuleScope Util {
    Describe "Util Function Tests" {
        It "Get-Sum Adds two numbers" {
            Mock write-host;
            Get-Sum 2 2 | Should be 4;
        }
    }
}
Remove-Module Util;

And just like that the output goes away!
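
For completeness, here’s roughly what the second option (the -ModuleName parameter on Mock) looks like:

Import-Module .\Util.psm1

Describe "Util Function Tests" {
    It "Get-Sum Adds two numbers" {
        Mock write-host -ModuleName Util;
        Get-Sum 2 2 | Should be 4;
    }
}
Remove-Module Util;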

Graph Your RI Commitment Over Time (subtitle: HOW LONG AM I PAYING FOR THIS?!?!?)

In my last post I talked about distributing your committed RI spend over time. The goal is to avoid buying too many 1 year RIs (front loading your spend) and missing out on the savings of committing to 3 years, while also not buying too many 3 year RIs (back loading your spend) and risking a bill you have to foot if your organization goes through major changes.

Our solution for balancing this is a PowerShell snippet that graphs our RI commitment over time.

# Get the active RI entries from AWS
$ri_entries = Get-EC2ReservedInstance -filter @(@{Name="state";Value="active"});

# Array to hold the relevant RI data
$ri_data = @();

# Calculate monthly cost for RIs
foreach ($ri_entry in $ri_entries) {
    $ri = @{};
    $hourly = $ri_entry.RecurringCharges.Amount;
    $monthly = $hourly * 24 * 30 * $ri_entry.InstanceCount;
    $ri.monthly = $monthly;
    $ri.End = $ri_entry.End;
    $ri_data += $ri;
}

# Three years into the future (maximum duration of RIs as of 1.22.2019)
$three_years_out = (get-date).addyears(3);

# Our current date iterator
$current = (get-date);

# Array to hold the commit by month
$monthly_commit = @();

# CSV file name to save output
$csv_name = "ri_commitment-$((get-date).tostring('ddMMyyyy')).csv";

# Remove the CSV if it already exists
if(test-path $csv_name) {
    remove-item -force $csv_name;
}

# Insert CSV headers
"date,commitment" | out-file $csv_name -append -encoding ascii;

# Iterate from today to three years in the future
while($current -lt $three_years_out) {

    # Find the sum of the RIs that are active on this date
    # all RI data -> RIs that have expirations after current -> select the monthly measure -> get the sum -> select the sum
    $commit = ($ri_data | ? {$_.End -gt $current} | % {$_.monthly} | measure -sum).sum;

    # Build a row of the CSV
    $output = "$($current),$($commit)";

    # Print the output to standard out for quick review
    write-host $output;

    # Write out to the CSV for deeper analysis
    $output | out-file $csv_name -append -encoding ascii;

    # Increment to the next month and repeat
    $current = $current.addmonths(1);
}

OK, “snippet” isn’t the right word. It’s a little lengthy, but at the end it kicks out a CSV in your working directory with each month and your RI commitment for it.

From there it’s easy to create a graph that shows your RI spend commit over time.

That gives you an idea of how much spend you’ve committed to, and for how long.

Our RI Purchase Guidelines

I’ve talked about it a couple of times, but AWS’s recommendation engine is free and borderline magic.

It’s part of AWS Cost Explorer; it ingests your AWS usage data and spits back Reserved Instance recommendations.

At first glance it feels a little suspect that a vendor has a built-in engine to help you figure out how to save money, but Amazon is playing the long game. If your use of AWS is more efficient (and you’re committing to spend with them) you’re more likely to stay a customer and spend more in the long haul.
The recommendation engine has a few parameters you can tweak. They default to the settings that will save you the most money (and have you commit to the longest term spend with Amazon), but that may not be the right fit for you.
For example, our total workload fluctuates depending on new features that get released, performance improvements for our databases, etc., so we typically buy convertible RIs so we have the option of changing instance type, size, and OS if we need to.
As you click around in these options you’ll notice the total percent savings flies around like a kite. Depending on the options you select, your savings can go up and down quite a bit.
Paying upfront, and standard vs. convertible, can give you a 2-3% difference (based on what I’ve seen), but buying a 3 year RI instead of a 1 year doubles your savings. That can be a big difference if you’re willing to commit to the spend.
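
If you want to compare those options without clicking around the console, the same recommendations are available over the Cost Explorer API. Here’s a rough sketch with the AWS PowerShell Tools; I’m going from memory on the cmdlet and parameter names, so treat them as assumptions and verify them against your module’s documentation:

# Pull EC2 RI recommendations for a three year, no upfront term
# (cmdlet and parameter names are assumptions; check your AWSPowerShell version)
$rec = Get-CEReservationPurchaseRecommendation `
    -Service "Amazon Elastic Compute Cloud - Compute" `
    -TermInYears THREE_YEARS `
    -PaymentOption NO_UPFRONT `
    -LookbackPeriodInDays SIXTY_DAYS

# Each recommendation carries a summary with the estimated monthly savings for this combination of options
$rec.Recommendations.RecommendationSummary
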
Now, three years is almost forever in Amazon. Keep in mind Amazon releases a new instance type or family about every year, so a 3 year standard RI feels a little risky to me. Here are the guidelines we’re trying out:
  • Buy mostly convertible (the exception is services that will not change)
  • Stay below ~70% RI coverage (we have a couple efficiency projects in the works that will reduce our EC2 running hours)
  • Distribute your spend commit
My next post will cover how we distribute our spend commit.

Getting Started with AWS Reserved Instances

If you’ve been using AWS for a while you’ve probably built up some excess spend. The “pay as you go” nature of AWS is a double edged sword: it’s easy to get a PoC up and running, but you can wind up with waste if you aren’t disciplined with your cleanup.
That’s the situation we found ourselves in recently. My company has been running production workloads in AWS for 3+ years, and we’ve had 100% of customer facing workload in Amazon for over a year.
Over those 3-plus years we’ve redesigned our app delivery environment, released several new products, rebuilt our BI environment, and reworked our CICD process. We subscribe to the “fail as fast as you can” methodology, so we’ve also started several PoCs that never went live.
All of that is to say we’ve done a lot in Amazon and we’ve tried a lot of new services. Despite our best efforts, we’ve had some wasted spend build up in our AWS bill. The whole engineering team was aware of it, but how do you start cleaning up waste, especially if your bill is large?

Sell the Marathon to Execs

Pitching a big, expensive cost saving project to execs is hard. Pitching a slow and steady approach is a lot easier. Rather than try to block off a week for “cost savings” exercises, we asked management for a 1 hour working meeting a week. No follow-ups outside of the meeting, no third party reporting tools, and only low/no risk changes.
The risk with a dramatic cost savings project is that executives will think of it as a purchase rather than a continual effort. If they spend a lot of money on cost savings, they’ll expect costs to automatically stay lower forever. If they get used to the idea of a small effort for a long time, it will be easier to keep up with it.

Start Small, Cautious, and Skeptical

Most of the struggle is finding the waste. Tools like Trusted Advisor are useful, but their recommendations are, I hate to say, somewhat simplistic. It’s never quite as straightforward to turn off services as we might like.
For example, when Trusted Advisor finds an under-utilized instance you have a slew of questions to answer before you can turn it off: “Is it low use, but important?” “Is it used at specific times, like month end?” “Is it using a resource Trusted Advisor doesn’t check?”
Instead of taking these recommendations at face value, pull a small coalition of knowledgeable people into your 1 hour cost savings meeting. We started with:
  • A DBA – someone who knows about big, expensive systems who will be very cautious about damaging them
  • An IT engineer – the team with permissions to create new servers who support customer environments (also very cautious)
  • A DevOps engineer – someone with the ability to write scripts to cross data sets like EBS usage and CPU usage
With those three roles we had the people in the room who would get called when something breaks, meaning they would be very careful not to impact production.

Avoid Analysis as Long as Possible

Avoid getting bogged down with analysis until there are no more easy cost savings. With our cost savings team of a DBA, an IT Engineer, and a DevOps engineer we started with cost savings options that we all agreed on within 5 minutes. If we debated a plan for more than 5 minutes we tabled it for when we’d run out of easy options.
That approach let us show value for the 1 hour/week cost savings meetings quickly, and convince execs we should keep having them.
When you start to run out of easy ideas start doing more analysis to think through how much money you’ll save with a change, and how much time it will take to do it. That will let you prioritize the harder stuff.

Document, Document, Document

Documenting your cost saving efforts well early on will lend itself to building out a recurring/automated process later. If you save the scripts you use to find unused instances, you can re-run them later and eventually build them into Lambda functions or scheduled jobs.
It will also make it easier to demonstrate value to execs. If you have good documentation of estimated and actual cost savings, it will be much easier to show your executives what the effort is worth.
That’s our high level approach, see my blog post on getting started with AWS Cost Explorer to start diving into your bill details!

Diving Into (and reducing) Your AWS Costs!

AWS uses a “pay as you go” model for most of its services. You can start using them at any time, you often get a runway of free usage to get up to speed on the service, and then they charge you for what you use. No contract negotiations, no figuring out bulk discounts, and you don’t have to provision for max capacity.

This model is a double edged sword. It’s great when you’re:

  • First getting started
  • Working with a predictable workload
  • Working with a modern technology stack (i.e. most of your resources are stateless and can be ephemeral)
But it has some challenges when
  • Your workload is unpredictable
  • Your stack is not stateless (i.e. you have to provision for max capacity)
  • Your environment is complex with a lot of services being used by different teams

It’s easy to have your AWS costs run away from you, and you can suddenly find yourself paying much more than you need or want to. We recently found ourselves in that scenario. Obviously I can’t show you our actual account costs, but I’ll walk you through the process we used to start digging into (and reducing) our costs, using one of my personal accounts.

Step 1: AWS Cost Explorer

Cost Explorer is your first stop for understanding your AWS bill. You’ll navigate to your AWS Billing Dashboard, and then launch Cost Explorer. If you haven’t been in Cost Explorer it doesn’t hurt to look at some of the alerts on the home page, but the really interesting data is in Costs and Usage.
My preference is to switch to “stack view”
I find this helps to view your costs in context. If you’re looking to cut costs, the obvious place to start is the service that takes up the largest section of the bar. For this account it’s ElastiCache.
ElastiCache is pretty straightforward to cut costs for – you either cut the number of nodes or the node size – so let’s pick a more interesting service like S3.
Once you’ve picked a service to cut costs for, add a service filter on the right hand side and group by usage type.
Right away we can see that most of our costs are TimedStorage-ByteHrs, which translates to S3 Standard storage, so we’ll focus our cost savings on that storage class.

Next we’ll go to CloudWatch to see where our storage in this class is. Open up CloudWatch, open up Metrics, and select S3.


Inside of S3, click on Storage Metrics, search for “StandardStorage”, and select all buckets.


Then change your time window to something pretty long (say, 6 months) and your view type to Number

This will give you a view of specific buckets and how much they’re storing. You can quickly skim through to find the buckets storing the most data.
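
If you’d rather script that than click through the console, a rough sketch with the AWS PowerShell Tools looks like this (the bucket name is a placeholder, and the Get-CWMetricStatistics parameter names are from memory, so verify them against your module version):

# S3 publishes BucketSizeBytes to CloudWatch once a day per bucket and storage type
$dimensions = @(
    (New-Object Amazon.CloudWatch.Model.Dimension -Property @{ Name = "BucketName"; Value = "my-big-bucket" }),
    (New-Object Amazon.CloudWatch.Model.Dimension -Property @{ Name = "StorageType"; Value = "StandardStorage" })
)

$stats = Get-CWMetricStatistics -Namespace "AWS/S3" -MetricName "BucketSizeBytes" `
    -Dimension $dimensions -StartTime (Get-Date).AddDays(-2) -EndTime (Get-Date) `
    -Period 86400 -Statistic Average

# Convert the most recent datapoint from bytes to GB for a quick read
($stats.Datapoints | Sort-Object Timestamp | Select-Object -Last 1).Average / 1GB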

Once you have your largest storage points you can clean them up or apply S3 lifecycle policies to transition them to cheaper storage classes.

After you’re done with your largest cost areas, you rinse and repeat on other services.
This is a good exercise to do regularly. Even if you have good discipline around cleaning up old AWS resources costs can still crop up.
Happy cost savings!