Moby/Docker in Production: An Update


The previous article Docker in Production: A History of Failure was quite a hit.

After long discussions, hundreds of feedbacks, thousands of comments, meetings with various individuals and major players, more experimentation and more failures, it’s time for an update on the situation.

We’ll go over the lessons learned from all the recent interactions and articles, but first, a reminder and a bit of context.

Disclaimer: Intended Audience

The large amount of comments made it clear that the world is divided in 10 kind of people:

1) The Amateur

Running mostly test and side projects with no real users. May think that using Ubuntu beta is the norm and call anything “stable” obsolete.

I dont always make workin code but when I do it works on my machine
Can’t blame him. It worked on his machine.

2) The Professional

Running critical systems for a real business with real users, definitely accountable, probably get a phone call when shit hits the fan.

one-does-not-simply-say-well-it-worked-on-my-machine.jpg
Didn’t work on the machine that served his 586 million customers.

What Audience Are You?

There is a fine line between these worlds and they clash pretty hard when they ever meet. Obviously, they have very different standards and expectations.

One of the reason I love finance is because that it has a great culture of risk. It doesn’t mean to be risk-averse contrary to a popular belief. It means to evaluate potential risks and potential gains and weight them against each other.

You should take a minute to think about your standards. What do you expect to achieve with Docker? What do you have to lose if it crashes all systems it’s running on and corrupt the mounted volumes? These are important factor to drive your decisions.

What pushed me to publish the last article was a conversation with a guy from a random finance company, just asking my thoughts about Docker, because he was considering to consider it. Among other things, this company -and this guy in particular- manages systems that handle trillions of dollars, including the pensions of millions of Americans.

Docker is nowhere ready to handle my mother’s pension, how could anyone ever think that??? Well, it seemed the Docker experience wasn’t documented enough.

What Do You Need to Run Docker?

As you should be aware by know, Docker is highly sensitive to the kernel, the host and the filesystem it’s using. Pick the wrong combination and you’re talking kernel panic, filesystem corruption, Docker daemon lock down, etc…

I had time to collect feedback on various operating conditions and test a couple more myself.

We’ll go over the results of the research, what has been registered to work, not work, experience intermittent failures, or blow up entirely in epic proportions.

Spoiler Alert: There is nothing with or around Docker that’s guaranteed to work.

Disclaimer: Understand the Risks and the Consequences

I am biased toward my own standards (as a professional who has to handle real money) and following the feedback I got (with a bias toward reliable sources known for operating real world systems).

For instance, if a combination of operating system and filesystem is marked as “no-go: registered catastrophic filesystem failure with full volume data loss“. It is not production ready (for me) but it is good enough for a student who has to do a one-off exercise in a vagrant virtual machine.

You may or may not experience the issues mentioned. Either way, they are mentioned because they are certified to be present in the wild as confirmed by the people who hit them. If you try an environment that is similar enough, you are on the right path to become the next witness.

The worst that can -and usually- happen with Docker is that it seems okay during the proof of concepts and you’ll only begin to notice and understand issues far down the line, when you cannot easily move away from it.

CoreOS

CoreOS is an operating that can only run containers and is exclusively intended to run containers.

Last article, the conclusion was that it might be the only operating system that may be able to run Docker. This may or may not be accurate.

We abandoned the idea of running CoreOS.

First, the main benefit of Docker is to unify dev and production. Having a separate OS in production only for containers totally ruins this point.

Second, Debian (we were on Debian) announced the next major release for Q1 2017. It takes a lot of effort to understand and migrate everything to CoreOS, with no guarantee of success. It’s wiser to just wait for the next Debian.

CentOS/RHEL

CentOS/RHEL 6

Docker on CentOS/RHEL 6 is no-go: known filesystem failures, full volume data loss

  1. Various known issues with the devicemapper driver.
  2. Critical issues with LVM volumes in combination with devicemapper causing data corruption, container crash, and docker daemon freeze requiring hard reboot to fix.
  3. The Docker packages are not maintained on this distribution. There are numerous critical bug fixes that were released in the CentOS/RHEL 7 packages but were not back ported to the CentOS/RHEL 6 packages.
ship crash shipt it revert
The only sane way to migrate to Docker in a big company still running on RHEL 6 => Don’t do it!

CentOS/RHEL 7

Originally running the kernel 3, RedHat has been back porting the kernel 4 features into it, which is mandatory for running Docker.

It caused problems at time because Docker failed to detect the custom kernel version and the available features on it, thus it cannot set proper system settings and fails in various mysterious ways. Every time this happens, this can only be resolved by Docker publishing a fix on feature detection for specific kernels, which is neither a timely nor systematic process..

There are various issues with the usage of LVM volumes, depends on the version.

Otherwise, it’s a mixed bag. Your mileage may vary.

As of CentOS 7.0, RedHat recommended some settings but I can’t find the page on their website anymore. Anyway, there are a tons of critical bugfixes in later version so you MUST update to the latest version.

As of CentOS 7.2, RedHat recommends and supports exclusively XFS and they give special flags for the configuration. AUFS doesn’t exist, OverlayFS is officially considered unstable, BTRFS is beta (technology preview).

The RedHat employees are admitting themselves that they struggle pretty hard to get docker working in proper conditions, which is a major problem because they gotta resell it as part of their OpenShift offering. Try making a product on an unstable core.

If you like playing with fire, it looks like that’s the OS of choice.

Note that for once, it is a case where you surely wants to have RHEL and not CentOS, meaning timely updates and helpful support at your disposal.

Debian

Debian 8 jessie (stable)

A major cause of the issues we experienced was because our production OS was Debian stable, as explained in the previous article.

Basically, Debian froze the kernel to a version that doesn’t support anything Docker needs and the few components that are present are rigged with bugs.

Docker on Debian is major no-go: There is a wide range of bugs in the AUFS driver (but not only), usually crashing the host, potentially corrupting the data, and that’s just the tip of the iceberg.

Docker is 100% guaranteed suicide on Debian 8 and it’s been since the inception of Docker a few years ago. It’s killing me no one ever documented this earlier.

I wanted to show you a graph of AWS instances going down like dominoes but I didn’t have a good monitoring and drawing tool to do that, so instead I’ll illustrate with a piano chart that looks the same.

docker-crash-illustrated
Typical docker cascade failure in our test systems.

Typical Docker cascading failure on our test systems. A test slave crashes… the next one retries two minutes later… and dies too. This specific cascade took 6 tries to go past the bug, slightly more than usual, but nothing fancy.

You should have CloudWatch alarms to restart dead hosts automatically and send a crash notifications.

Fancy: You can also have a CloudWatch alarm to automatically send a customized issue report to your regulator whenever there is an issue persisting more than 5 minutes.

Not to brag but we got quite good at containing Docker. Forget about Chaos Monkey, that’s child play, try running trading systems handling billions of dollars on Docker [1].

[1] Please don’t do that. That’s a terrible idea.

Debian 9 stretch

Debian stretch is planned to become the stable edition in 2017. (Note: might be released as I write and edit this article).

It will feature the kernel 4.10 which is the latest LTS, published simultaneously.

At the time of release, Debian Stretch will be the most up to date stable operating system and it will allegedly have all the shiny things necessary to run Docker (until the Docker requirements change again).

It may resolve a lot of the issues and it may make a tons of new ones. We’ll see how it goes.

Ubuntu

Ubuntu has always been more up to date than the regular server distributions.

Sadly, I am not aware of any serious companies than run on Ubuntu. This has been a source of much misunderstanding in the docker community because dev and amateur bloggers try things on the latest Ubuntu (not even the LTS [1]) yet it’s utterly non representative of production systems in the real world (RHEL, CentOS, Debian or one of the exotic Unix/BSD/Solaris).

I cannot comment on the LTS 16 as I do not use it. It’s the only distribution to have Overlay2 and ZFS available, that gives some more options to be tried and maybe find something working?

The LTS 14 is a definitive no-go: Too old, don’t have the required components.

[1] I received quite a few comments and unfriendly emails of people saying to “just” use the latest Ubuntu beta. As if migrating all live systems, changing distribution and running on a beta platform that didn’t even exist at the time was an actual solution.


Update: I said I’m never coming back to Docker and certainly not to spend an hour on digging up references but I guess I have to now that they are handed to me in spectacular ways.

I received a quite insulting email from a guy who is clearly in the amateur league to say that “any idiot can run Docker on Ubuntu” then proceed to give a list of software packages and advanced system tweaks that are mandatory to run Docker on Ubuntu, that allegedly “anyone could have found in 5 seconds with Google“.

At the heart of his mail is this bug report, which is indeed the first Google result for “Ubuntu docker not working” and “Ubuntu docker crash: Ubuntu 16.04 install for 1.11.2 hangs.

This bug report, published on June 2016 highlights that the Ubuntu installer simply doesn’t work at all because it doesn’t install some dependencies which are required by Docker to run, then it’s a see of comments, user workarounds and not-giving-a-fuck #WONTFIX by Docker developers.

The last answer is given by an employee 5 months later to say that the Ubuntu installer will never be fixed, however the next major version of Docker may use something completely different that won’t be affected by this issue.

A new major version (v1.13) just got released (8 months since the report), it is not confirmed whether it is affected by the bug or not (but it is confirmed to come with breaking changes).

It’s fairly typical of what to expect from Docker. Checklist:

  • Is everything broken to the point Docker can’t run at all? YES.
  • Is it broken for all users, of say a major distribution? YES.
  • Is there a timely reply to acknowledge the issue? NO.
  • Is it confirmed that the issue is present and how severe it is? NO.
  • Is there any fix planned? NO.
  • Is there a ton of workarounds of various danger and complexity? YES.
  • Will it ever be fixed? Who knows.
  • Will the fix, if it ever comes, be backported? NEVER.
  • Is the ultimate answer to everything to just update to latest? Of course.

AWS Container Service

AWS has an AMI dedicated to running Docker. It is based on an Ubuntu.

As confirmed by internal sources, they experienced massive troubles to get Docker working in any decent condition

Ultimately, they released am AMI for it, running a custom OS with a custom docker package with custom bug fixes and custom backports. They went and are still going through extensive efforts and testing to keep things together.

If you are locked-in on Docker and running on AWS, your only salvation might be to let AWS handles it for you.

Google Container Service

Google offers containers as a service. Google merely exposes a Docker interface, the containers are run on internal google containerization technologies, that cannot possibly suffer from all the Docker implementation flaws.

Don’t get me wrong. Containers are great as a concept, the problem is not the theoretical aspect, it’s the practical implementation and tooling we have (i.e. Docker) which are experimental at best.

If you really want to play with Docker (or containers) and you are not operating on AWS, that leaves Google as the single strongest choice, better yet, it comes with Kubernetes for orchestration, making it a league of its own.

That should still be considered experimental and playing with fire. It just happens that it’s the only thing that may deliver the promises and also the only thing that comes with containers AND orchestration.

OpenShift

It’s not possible to build a stable product on a broken core, yet RedHat is trying.

From the feedback I had, they are both struggling pretty hard to mitigate the Docker issues, with variable success. Your mileage may vary.

Considering that they both appeal to large companies, who have quite a lot to lose, I’d really question the choice of going for that route (i.e. anything build on top of Docker).

You should try the regular clouds instead: AWS or Google or Azure. Using virtual machines and some of the hosted services will achieve 90% of what Docker does, 90% of what Docker doesn’t do, and it’s dependable. It’s also a better long-term strategy.

Chances are that you want to do OpenShift because you can’t do public cloud. Well, that’s a tough spot to be in. (Good luck with that. Please write a blog in reply to talk about your experience).

Summary

  • CentOS/RHEL: Russian roulette
  • Debian: Jumping off a plane naked
  • Ubuntu: Not sure Update: LOL.
  • CoreOS: Not worth the effort
  • AWS Containers: Your only salvation if you are locked-in with Docker and on AWS
  • Google Containers: The only practical way to run Docker that is not entirely insane.
  • OpenShift: Not sure. Depends how good the support and engineers can manage?

A Business Perspective

Docker has no business model and no way to monetize. It’s fair to say that they are releasing to all platforms (Mac/Windows) and integrating all kind of features (Swarm) as a desperate move to 1) not let any competitor have any distinctive feature 2) get everyone to use docker and docker tools 3) lock customers completely in their ecosystem 4) publish a ton of news, articles and releases in the process, increasing hype 5) justify their valuation.

It is extremely tough to execute an expansion both horizontally and vertically to multiple products and markets. (Ignoring whether that is an appropriate or sustainable business decision, which is a different aspect).

In the meantime, the competitors, namely Amazon, Microsoft, Google, Pivotal and RedHat all compete in various ways and make more money on containers than Docker does, while CoreOS is working an OS (CoreOS) and competing containerization technology (Rocket).

That’s a lot of big names with a lot of firepower directed to compete intensively and decisively against Docker. They have zero interest whatsoever to let Docker locks anyone. If anything, they individually and collectively have an interest in killing Docker and replacing it with something else.

Let’s call that the war of containers. We’ll see how it plays out.

Currently, Google is leading the way, they are replacing Docker and they are the only one to provide out of the box orchestration (Kubernetes).

Conclusion

Did I say that Docker is an unstable toy project?

Invariably some people will say that the issues are not real or in the past. They are not in the past, the challenges and the issues are very current and very real. There is definite proof and documentation that Docker has suffered from critical bugs making it plain unusable on ALL major distributions, bugs that ran rampant for years, some still present as of today.

If you look for any combination of “docker + version + filesystem + OS” on Google, you’ll find a trail of issues with various impact going back all the way to docker birth. It’s a mystery how something could fail that bad for that long and no one writes about it. (Actually, there are a few articles, they were just lost under the mass of advertisement and quick evaluations). The last software to achieve that level of expectation with that level of failure was MongoDB.

I didn’t manage to find anyone on the planet using Docker seriously AND successfully AND without major hassle. The experiences mentioned in this article were acquired by blood, the blood of employees and companies who learned Docker the hard way while every second of downtime was a $1000 loss.

Hopefully, you can learn from our past, as to not repeat it.

mistake - it could be that the purpose of your life is only to serve as a warning to others

If you were wondering whether you should have adopted docker years ago => The answer is hell no, you dodged a bullet. You can tell that to your boss. (It’s still not that much useful today if you don’t proper have orchestration around it, which is itself an experimental subject).

If you are wondering whether you should adopt it now… while what you run is satisfactory and you have any considerations for quality => The reasonable answer is to wait until RHEL 8 and Debian 10. No rush. Things need to mature and the packages ain’t gonna move faster than the distributions you’ll run them on.

If you like to play with fire => Full-on Google Container Engine on Google Cloud. Definitive high risk, probable high reward.

Would this article have more credibility if I linked numerous bug reports, screenshots of kernel panics, personal charts of system failures over the day, relevant forum posts and disclosed private conversations? Probably.

Do I want to spend yet-another hundred hours to dig that off, once again? Nope. I’d rather spend my evening on Tinder than Docker. Bye bye Docker.

Moving On

Back to me. My action plan to lead the way on Containers and Clouds had a major flaw I missed out, the average tenure in tech companies is still not counted in yearS, thus the year 2017 began by being poached.

Bad news: No more cloud and no more Docker where I am going. Meaning no more groundbreaking news. you are on your own to figure it out.

Good news: No more toying around with billions dollars of other people’s money… since I am moving up by at least 3 orders of magnitude! I am moderately confident that my new immediate playground may include the pensions of a few millions of Americans, including a lot of people who read this blog.

docker your pension fund 100% certified not dockeri
Rest assured: Your pension is in good hands! =D
Advertisements

Google Cloud is 50% cheaper than AWS


Let’s revisit Google and Amazon pricing since the AWS November 2016 Price Reduction.

We’ll analyse instance costs, for various workloads and usages. All prices are given in dollars per month (720 hours) for servers located in Europe (eu-west-1).

Shared CPU Instances

Shared CPU instances give only a bit of CPU. The physical processor is over allocated and shared with many other instances running on the same host. A shared CPU instance may burst to 100% CPU usage for short periods but it may also be starved of CPU and paused. Note that these instances are cheap but they are not reliable for non-negligible continuous workloads.

google cloud vs aws pricing shared CPU instances

The smallest instances on both cloud is 500MB and a few percent of CPU. That’s the cheapest instance. It’s usable for testing and minimal needs (can’t do much with only 5% of CPU and 500MB).

The infamous t2.small and it’s rival the g1-small are usually the most common instance types in use. They come with 2GB of memory and a bit of CPU. It’s cheap and good enough for many use cases (excluding production and time critical processing, which need dedicated CPU time).

The Cheapest Production Instances

Production instances are all the instances with dedicated CPU time (i.e. everything but the shared CPU instances).

Most services will just run on the cheapest production instance available. That instance is very important because it determines the entry price and the specifications for everything.

google cloud vs aws pricing cheapest production instances

The cheapest production instance on Google Cloud is the n1-standard-1 which gives 1 CPU and 4 GB of memory.

AWS is more complex. The m3.medium is 1 CPU and 4 GB of memory. The c4.large is 2 CPU and 4 GB of memory.

m3/c3 are the previous family generation (pre-2015), using older hardware and an ancient virtualisation technology. c4/m4 are the current generation, it has enhanced networking and reserved bandwidth for EBS, among other system improvements.

Either way, the Google entry-level instance is significantly cheaper than both AWS entry-level instances. There will be a lot of these running, expect massive costs savings by using Google cloud.


I’m a believer that one should optimize for manageability and not raw costs. That means adopting c4/m4 as the standard for deployments (instead of c3/m3).

Given this decision, the smallest production instance on AWS is the c4.large (2 CPU, 4GB memory), a rather big instance when compared to the n1-standard-1 (1 CPU, 4GB memory). Why are we forced to pay for two CPUs as the minimal choice on AWS? That does set a high base price.

Not only Google is cheaper because it’s more competitive but it also offers more tailored options. The result is a massive 68% discount on the most commonly used production instance.

Personal Note: I would criticize the choice of AWS to discontinue the line of m4.medium instance type (1 CPU).


Instances by usage

A server has 3 dimensions of specifications: CPU performances, memory size and network speed.

Most applications only have a hard requirement in a single dimension. We’ll analyse the pricing separately for each usage pattern.

google cloud vs aws pricing instances by usage

Network Heavy

Typical Consumers: load balancers, file transfers, uploads/downloads, backups and generally speaking everything that uses the network.

What should we order to have  1Gbps and how much will it be?

  • The minimum on Google Cloud is the n1-highcpu-4 instance (4 CPU, 4 GB memory).
  • The minimum on AWS is the c4.4xlarge instance (16 CPU, 30 GB memory).

AWS bandwidth allowance is limited and correlated to the instance size. The big instances -with decent bandwidth- are incredibly expensive.

To give a point of comparison, the c4|m4|r3.large instances have a hard cap at 220 Mbits/s of network traffic (Note: It also applies internally within a VPC).

figure2_7001
Source: Network and cloud storage benchmark in 2015

All Google instances have significantly faster network than the equivalent [and even bigger] AWS instances, to the point where they’re not even playing in the same league.

Google has been designing networks and manufacturing their own equipment for decades. It’s fair to assume than AWS doesn’t have the technology to compete.

CPU

Typical Consumers: web servers, data analysis, simulations, processing and computations.

Google is cheaper per CPU.

Google CPU instances have half the memory of AWS CPU instances[1]. While that could have justified a 10% difference, it doesn’t justify double[2].

Note: The performances per CPU are equivalent on both cloud (though the CPU models and serial numbers may vary).

[1] A sane design decision. Most CPU bound workloads don’t need much memory. (Note: if they do, they can be run on “standard” instances).

[2] Pricing is mostly linked to CPU count. Additional memory is cheap.

Memory

Typical Consumers: database, caches and in-memory workloads.

Google is cheaper per GB of memory.

Google memory instances have 15% less memory than AWS CPU instances. While that could have justified a few percent difference, it sure as hell doesn’t justify double[2].

[2] Pricing is mostly linked to CPU count. Additional memory is cheap.

Local SSD and Scaling Up

There are software that can only scale up, typically SQL databases. A database holding tons of data will require fast local disks and truckloads of memory to operate non-sluggishly.

Scaling up is the most typical use case for beefy dedicated servers, but we’re not gonna rent a single server in another place just for one application. The cloud provider will have to accommodate that need.

Google allows to attach local 400GB SSDs to any instance type ($85 a month per disk).

Some AWS instances comes with small local SSD (16-160GB), you’re out of luck if you need more space than that. The only option to get big local SSD is the special i2 instances family, they have specifications in powers of 800GB local SSD + 4 CPU + 15 GB RAM (for $655 a month).

The Google SSD model is superior. It’s significantly more modular and cheaper (and more performant but that’s a different topic).

aws-vs-gce-pricing-instances-with-local-ssd
The requirements to fulfil are between parenthesis.

Disk Intensive Load: A job that requires high volume fast disks (i.e. local SSD) but not much memory.

AWS forces you to buy a big instance (i2.xlarge) to get enough SSD space whereas Google allows you to attach a SSD to a small instance (n1-highcpu-4). The lack of flexibility from AWS has a measurable impact, the AWS setup is 406% the costs of the Google setup to achieve the same need.

Database: A typical database. Fast storage and sizeable memory.

Bigger Database: Sometimes there is no choice but to scale up, to whatever resources are commanded by the application.

On AWS (i2.8xlarge) 32 cores, 244GB memory, 2 x 800 GB local SSD in RAID1 (+ 6 SSD unused yet gotta pay for it).

On Google Cloud (n1-highmem-32): 32 cores, 208 GB memory, 4 x 375 GB local SSD in RAID10.

This last number is meant to show that the lack of flexibility of AWS can (and will) snowball quickly. Only a very particular instance can fulfil the requirements, it comes with many cores and 4800 GB of unnecessary local SSD. The AWS bill is $4k (273%) higher than the equivalent setup on Google Cloud.

Custom Instances

Google offers custom machine types. You can pick how much CPU and memory you want, you’ll get that exact instance with a tailored pricing.

It is quite flexible. For instance, we could recreate any instance from AWS on Google Cloud.

Of course, there are physical bounds inherent to hardware (e.g. you can’t have a single core with 100 GB of memory).

Reserved Instances

Reserved Instances are bullshit!

Reserving capacity is a dangerous and antiquated pricing model that belongs to the era of the datacenter.

The numbers given in this article do not account for any AWS reservation. However, they all account for Google sustained use discount (30% automatic discount on instances that ran for the entire month).

If your infrastructure is so small that you can reserve all your 4 instances upfront, you should reconsider why you use AWS in the first place. There are more appropriate and cheaper options available.

If your infrastructure is big enough that you have dozens of servers (or thousands), you should already be aware that:

  1. Long term commitment is a huge risk. Most people underestimate it.
  2. Predictions are always off. Most people are overconfident in their predictions.
  3. You are no exception to most people.
  4. Reservation is a mess when having many AWS accounts (dev, staging, prod).
  5. Anything that is testing/transient is too short-lived to be reserved.
  6. Less than 50% of reservable stuff can actually be reserved (margin for change/error).

Most people managers are stubborn. If you your manager is stubborn and really insists on reserving instances, you should bet exclusively on “1 year full upfront“.

fishing with gr
Safety Warning: There is no confirmation button when you purchase reserved instances. You can absolutely spend $73185 without seeing nor confirming an invoice.

Conclusion

google cloud vs aws pricing summary relative costs

AWS was the first generation of cloud, Google is the second. The second generation is always better because it can learn from the mistakes of the first and it doesn’t have the old legacy to support.

2016 should be remembered as the year Google became a better choice than AWS. If 50% cheaper is not a solid argument, I don’t know what is.


References:

Cloud Storage Performance, a benchmark with graphs on network performance.

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, A Google Research Paper, the story on what powers their internal network.

Amazon does everything wrong, and Google does everything right, A message by an employee from Amazon than Google, not directly relevant but still a good read.

Moby/Docker in Production: A History of Failure


April 2017: Updated the title. The Docker CEO decided to rename Docker to Moby overnight, without notice. I’ve tried to warn the world about unexpected Docker changes and breakages. That’s the order of magnitude you have to be prepared for.

Introduction

My first encounter with docker goes back to early 2015. Docker was experimented with to find out whether it could benefit us. At the time it wasn’t possible to run a container [in the background] and there wasn’t any command to see what was running, debug or ssh into the container. The experiment was quick, Docker was useless and closer to an alpha prototype than a release.

Fast forward to 2016. New job, new company and docker hype is growing like mad. Developers here have pushed docker into production projects, we’re stuck with it. On the bright side, the run command finally works, we can start, stop and see containers. It is functional.

We have 12 dockerized applications running in production as we write this article, spread over 31 hosts on AWS (1 docker app per host [note: keep reading to know why]).

The following article narrates our journey with Docker, an adventure full of dangers and unexpected turns.

so it begins, the greatest fuck up of our time

Production Issues with Docker

Docker Issue: Breaking changes and regressions

We ran all these versions (or tried to):

1.6 => 1.7 => 1.8 => 1.9 => 1.10 => 1.11 => 1.12

Each new version came with breaking changes. We started on docker 1.6 early this year to run a single application.

We updated 3 months later because we needed a fix only available in later versions. The 1.6 branch was already abandoned.

The versions 1.7 and 1.8 couldn’t run. We moved to the 1.9 only to find a critical bug on it two weeks later, so we upgraded (again!) to the 1.10.

There are all kind of subtle regressions between Docker versions. It’s constantly breaking unpredictable stuff in unexpected ways.

The most tricky regressions we had to debug were network related. Docker is entirely abstracting the host networking. It’s a big mess of port redirection, DNS tricks and virtual networks.

Bonus: Docker was removed from the official Debian repository last year, then the package got renamed from docker.io to docker-engine. Documentation and resources predating this change are obsolete.

Docker Issue: Can’t clean old images

The most requested and most lacking feature in Docker is a command to clean older images (older than X days or not used for X days, whatever). Space is a critical issue given that images are renewed frequently and they may take more than 1GB each.

The only way to clean space is to run this hack, preferably in cron every day:

docker images -q -a | xargs --no-run-if-empty docker rmi

It enumerates all images and remove them. The ones currently in use by running containers cannot be removed (it gives an error). It is dirty but it gets the job done.

The docker journey begins with a clean up script. It is an initiation rite every organization has to go through.

Many attempts can be found on the internet, none of which works well. There is no API to list images with dates, sometimes there are but they are deprecated within 6 months. One common strategy is to read date attribute from image files and call ‘docker rmi‘ but it fails when the naming changes. Another strategy is to read date attributes and delete files directly but it causes corruption if not done perfectly, and it cannot be done perfectly except by Docker itself.

Docker Issue: Kernel support (or lack thereof)

There are endless issues related to the interactions between the kernel, the distribution, docker and the filesystem

We are using Debian stable with backports, in production. We started running on Debian Jessie 3.16.7-ckt20-1 (released November 2015). This one suffers from a major critical bug that crashes hosts erratically (every few hours in average).

Linux 3.x: Unstable storage drivers

Docker has various storage drivers. The only one (allegedly) wildly supported is AUFS.

The AUFS driver is unstable. It suffers from critical bugs provoking kernel panics and corrupting data.

It’s broken on [at least] all “linux-3.16.x” kernel. There is no cure.

We follow Debian and kernel updates very closely. Debian published special patches outside the regular cycle. There was one major bugfix to AUFS around March 2016. We thought it was THE TRUE ONE FIX but it turned out that it wasn’t. The kernel panics happened less frequently afterwards (every week, instead of every day) but they were still loud and present.

Once during this summer there was a regression among a major update, that brought back a previous critical issue. It started killing CI servers one by one, with 2 hours in average between murders. An emergency patch was quickly released to fix the regression.

There were multiple fixes to AUFS published along the year 2016. Some critical issues were fixed but there are many more still left. AUFS is unstable on [at least] all “linux-3.16.x” kernels.

  • Debian stable is stuck on kernel 3.16. It’s unstable. There is nothing to do about it except switching to Debian testing (which can use the kernel 4).
  • Ubuntu LTS is running kernel 3.19. There is no guarantee that this latest update fixes the issue. Changing our main OS would be a major disruption but we were so desperate that we considered it for a while.
  • RHEL/CentOS-6 is on kernel 2.x and RHEL/CentoS-7 is on kernel 3.10 (with many later backports done by RedHat).

Linux 4.x: The kernel officially dropped docker support

It is well-known that AUFS has endless issues and it’s regarded as dead weight by the developers. As a long-standing goal, the AUFS filesystem was finally dropped in kernel version 4.

There is no unofficial patch to support it, there is no optional module, there is no backport whatsoever, nothing. AUFS is entirely gone.

[dramatic pause]

.

.

.

How does docker work without AUFS then? Well, it doesn’t.

[dramatic pause]

.

.

.

So, the docker guys wrote a new filesystem, called overlay.

OverlayFS is a modern union filesystem that is similar to AUFS. In comparison to AUFS, OverlayFS has a simpler design, has been in the mainline Linux kernel since version 3.18 and is potentially faster.” — Docker OverlayFS driver

Note that it’s not backported to existing distributions. Docker never cared about [backward] compatibility.

Update after comments: Overlay is the name of both the kernel module to support it (developed by linux maintainers) and the docker storage driver to use it (part of docker, developed by docker). They are two different components [with a possible overlap of history and developers]. The issues seem mostly related to the docker storage driver, not the filesystem itself.

The debacle of Overlay

A filesystem driver is a complex piece of software and it requires a very high level of reliability. The long time readers will remember the Linux migration from ext3 to ext4. It took time to write, more time to debug and an eternity to be shipped as the default filesystem in popular distributions.

Making a new filesystem in 1 year is an impossible mission. It’s actually laughable when considering that the task is assigned to Docker, they have a track record of unstability and disastrous breaking changes, exactly what we don’t want in a filesystem.

Long story short. That did not go well. You can still find horror stories with Google.

Overlay development was abandoned within 1 year of its initial release.

[dramatic pause]

.

.

.

Then comes Overlay2.

The overlay2 driver addresses overlay limitations, but is only compatible with Linux kernel 4.0 [or later] and docker 1.12” — Overlay vs Overlay2 storage drivers

Making a new filesystem in 1 year is still an impossible mission. Docker just tried and failed. Yet they’re trying again! We’ll see how it turns out in a few years.

Right now it’s not supported on any systems we run. We can’t use it, we can’t even test it.

Lesson learnt: As you can see with Overlay then Overlay2. No backport. No patch. No retro compatibility. Docker only moves forward and breaks things. If you want to adopt Docker, you’ll have to move forward as well, following the releases from docker, the kernel, the distribution, the filesystems and some dependencies.

Bonus: The worldwide docker outage

On 02 June 2016, at approximately 9am (London Time). New repository keys are pushed to the docker public repository.

As a direct consequence, any run of “apt-get update” (or equivalent) on a system configured with the broken repo will fail with an error “Error https://apt.dockerproject.org/ Hash Sum mismatch

This issue is worldwide. It affects ALL systems on the planet configured with the docker repository. It is confirmed on all Debian and ubuntu versions, independent of OS and docker versions.

All CI pipelines in the world which rely on docker setup/update or a system setup/update are broken. It is impossible to run a system update or upgrade on an existing system. It’s impossible to create a new system and install docker on it.

After a while. We get an update from a docker employee: “To give an update; I raised this issue internally, but the people needed to fix this are in the San Francisco timezone [8 hours difference with London], so they’re not present yet.

I personally announce that internally to our developers. Today, there is no Docker CI and we can’t create new systems nor update existing systems which have a dependency on docker. All our hope lies on a dude in San Francisco, currently sleeping.

[pause waiting for the fix, that’s when free food and drinks come in handy]

An update is posted from a Docker guy in Florida at around 3pm (London Time). He’s awake, he’s found out the issue and he’s working on the fix.

Keys and packages are republished later.

We try and confirm the fix at around 5pm (London Time).

That was a 7 hours interplanetary outage because of Docker. All that’s left from the outage is a few messages on a GitHub issue. There was no postmortem. It had little (none?) tech news or press coverage, in spite of the catastrophic failure.

Docker Registry

The docker registry is storing and serving docker images.

Automatic CI build  ===> (on success) push the image to ===> docker registry
Deploy command <=== pull the image from <=== docker registry

There is a public registry operated by docker. As an organization, we also run our own internal docker registry. It’s a docker image running inside docker on a docker host (that’s quite meta). The docker registry is the most used docker image.

There are 3 versions of the docker registry. The client can pull indifferently from any:

Docker Registry Issue: Abandon and Extinguish

The docker registry v2 is as a full rewrite. The registry v1 was retired soon after the v2 release.

We had to install a new thing (again!) just to keep docker working. They changed the configuration, the URLs, the paths, the endpoints.

The transition to the registry v2 was not seamless. We had to fix our setup, our builds and our deploy scripts.

Lesson learnt: Do not trust on any docker tool or API. They are constantly abandoned  and extinguished.

One of the goal of the registry v2 is to bring a better API. It’s documented here, a documentation that we don’t remember existed 9 months ago.

Docker Registry Issue: Can’t clean images

It’s impossible to remove images from the docker registry. There is no garbage collection either, the doc mentions one but it’s not real. (The images do have compression and de-duplication but that’s a different matter).

The registry just grows forever. Our registry can grow by 50 GB per week.

We can’t have a server with an unlimited amount of storage. Our registry ran out of space a few times, unleashing hell in our build pipeline, then we moved the image storage to S3.

Lesson learnt: Use S3 to store images (it’s supported out-of-the-box).

We performed a manual clean-up 3 times in total. In all cases we had to stop the registry, erase all the storage and start a new registry container. (Luckily, we can re-build the latest docker images with our CI).

Lesson learnt: Deleting any file or folder manually from the docker registry storage WILL corrupt it.

To this day, it’s not possible to remove an image from the docker registry. There is no API either. (One of the point of the v2 was to have a better API. Mission failed).

Docker Issue: The release cycle

The docker release cycle is the only constant in the Docker ecosystem:

  1. Abandon whatever exists
  2. Make new stuff and release
  3. Ignore existing users and retro compatibility

The release cycle applies but is not limited to: docker versions, features, filesystems, the docker registry, all API…

Judging by the past history of Docker, we can approximate that anything made by Docker has a half-life of about 1 year, meaning that half of what exist now will be abandoned [and extinguished] in 1 year. There will usually be a replacement available, that is not fully compatible with what it’s supposed to replace, and may or may not run on the same ecosystem (if at all).

We make software not for people to use but because we like to make new stuff.” — Future Docker Epitaph

The current status-quo on Docker in our organization

Growing in web and micro services

Docker first came in through a web application. At the time, it was an easy way for the developers to package and deploy it. They tried it and adopted it quickly. Then it spread to some micro services, as we started to adopt a micro services architecture.

Web applications and micro services are similar. They are stateless applications, they can be started, stopped, killed, restarted without thinking. All the hard stuff is delegated to external systems (databases and backend systems).

The docker adoption started with minor new services. At first, everything worked fine in dev, in testing and in production. The kernel panics slowly began to happen as more web services and web applications were dockerized. The stability issues became more prominent and impactful as we grew.

A few patches and regressions were published over the year. We’ve been playing catchup & workaround with Docker for a while now. It is a pain but it doesn’t seem to discourage people from adopting Docker. Support and demand is still growing inside the organisation.

Note: None of the failures ever affected any customer or funds. We are quite successful at containing Docker.

Banned from the core

We have some critical applications running in Erlang, managed by a few guys in the ‘core’ team.

They tried to run some of their applications in Docker. It didn’t work. For some reasons, Erlang applications and docker didn’t go along.

It was done a long time ago and we don’t remember all the details. Erlang has particular ideas about how the system/networking should behave and the expected load was in thousands of requests per second. Any unstability or incompatibility could justify an outstanding failure. (We know for sure now that the versions used during the trial suffered from multiple major unstability issues).

The trial raised a red flag. Docker is not ready for anything critical. It was the right call. The later crashes and issues managed to confirm it.

We only use Erlang for critical applications. For example, the core guys are responsible for a payment system that handled $96,544,800 in transaction this month. It includes a couple of applications and databases, all of which are under their responsibilities.

Docker is a dangerous liability that could put millions at risk. It is banned from all core systems.

Banned from the DBA

Docker is meant to be stateless. Containers have no permanent disk storage, whatever happens is ephemeral and is gone when the container stops. Containers are not meant to store data. Actually, they are meant by design to NOT store data. Any attempt to go against this philosophy is bound to disaster.

Moreover. Docker is locking away processes and files through its abstraction, they are unreachable as if they didn’t exist. It prevents from doing any sort of recovery if something goes wrong.

Long story short. Docker SHALL NOT run databases in production, by design.

It gets worse than that. Remember the ongoing kernel panics with docker?

A crash would destroy the database and affect all systems connecting to it. It is an erratic bug, triggered more frequently under intensive usage. A database is the ultimate IO intensive load, that’s a guaranteed kernel panic. Plus, there is another bug that can corrupt the docker mount (destroying all data) and possibly the system filesystem as well (if they’re on the same disk).

Nightmare scenario: The host is crashed and the disk gets corrupted, destroying the host system and all data in the process.

Conclusion: Docker MUST NOT run any databases in production, EVER.

Every once in a while, someone will come and ask “why don’t we put these databases into docker?” and we’ll tell some of our numerous war stories, so far, no-one asked twice.

Note: We started going over our Docker history as an integral part of our on boarding process. That’s the new damage control philosophy, kill the very idea of docker before it gets any chance to grow and kill us.

A Personal Opinion

Docker is gaining momentum, there is some crazy fanatic support out there. The docker hype is not only a technological liability any more, it has evolved into a sociological problem as well.

The perimeter is controlled at the moment, limited to some stateless web applications and micro services. It’s unimportant stuff, they can be dockerized and crash once a day, I do not care.

So far, all people who wanted to use docker for important stuff have stopped after a quick discussion. My biggest fear is that one day, a docker fanatic will not listen to reason and keep pushing. I’ll be forced to barrage him and it might not be pretty.

Nightmare scenario: The future accounting cluster revamp, currently holding $23M in customer funds (the M is for million dollars). There is already one guy who genuinely asked the architect “why don’t you put these databases into docker?“, there is no word to describe the face of the architect.

My duty is to customers. Protecting them and their money.

Surviving Docker in Production

gif-what-docker-pretends-to-be
What docker pretends to be.
gif-what-docker-really-is
What docker really is.

Follow releases and change logs

Track versions and change logs closely for kernel, OS, distributions, docker and everything in between. Look for bugs, hope for patches, read everything with attention.

ansible '*' -m shell -a "uname -a"

Let docker crash

Let docker crash. self-explanatory.

Once in a while, we look at which servers are dead and we force reboot them.

Have 3 instances of everything

High availability require to have at least 2 instances per service, to survive one instance failure.

When using docker for anything remotely important, we should have 3 instances of it. Docker die all the time, we need a margin of error to support 2 crashes in a raw to the same service.

Most of the time, it’s CI or test instances that crash. (They run lots of intensive tests, the issues are particularly outstanding). We’ve got a lot of these. Sometimes there are 3 of them crashing in a row in an afternoon.

Don’t put data in Docker

Services which store data cannot be dockerized.

Docker is designed to NOT store data. Don’t go against it, it’s a recipe for disaster.

On top, there are current issues killing the server and potentially destroying the data so that’s really a big no-go.

Don’t run anything important in Docker

Docker WILL crash. Docker WILL destroy everything it touches.

It must be limited to applications which can crash without causing downtime. That means mostly stateless applications, that can just be restarted somewhere else.

Put docker in auto scaling groups

Docker applications should be run in auto-scaling groups. (Note: We’re not fully there yet).

Whenever an instance is crashed, it’s automatically replaced within 5 minutes. No manual action required. Self healing.

Future roadmap

Docker

The impossible challenge with Docker is to come with a working combination of kernel + distribution + docker version + filesystem.

Right now. We don’t know of ANY combination that is stable (Maybe there isn’t any?). We actively look for one, constantly testing new systems and patches.

Goal: Find a stable ecosystem to run docker.

It takes 5 years to make a good and stable software, Docker v1.0 is only 28 months old, it didn’t have time to mature.

The hardware renewal cycle is 3 years, the distribution release cycle is 18-36 months. Docker didn’t exist in the previous cycle so systems couldn’t consider compatibility with it. To make matters worse, it depends on many advanced system internals that are relatively new and didn’t have time to mature either, nor reach the distributions.

That could be a decent software in 5 years. Wait and see.

Goal: Wait for things to get better. Try to not go bankrupt in the meantime.

Use auto scaling groups

Docker is limited to stateless applications. If an application can be packaged as a Docker Image, it can be packaged as an AMI. If an application can run in Docker, it can run in an auto scaling group.

Most people ignore it but Docker is useless on AWS and it is actually a step back.

First, the point of containers is to save resources by running many containers on the same [big] host. (Let’s ignore for a minute the current docker bug that is crashing the host [and all running containers on it], forcing us to run only 1 container per host for reliability).

Thus containers are useless on cloud providers. There is always an instance of the right size. Just create one with appropriate memory/CPU for the application. (The minimum on AWS is t2.nano which is $5 per month for 512MB and 5% of a CPU).

Second, the biggest gain of containers is when there is a complete orchestration system around them to automatically manage creation/stop/start/rolling-update/canary-release/blue-green-deployment. The orchestration systems to achieve that currently do not exist. (That’s where Nomad/Mesos/Kubernetes will eventually come in, there are not good enough in their present state).

AWS has auto scaling groups to manage the orchestration and life cycle of instances. It’s a tool completely unrelated to the Docker ecosystem yet it can achieve a better result with none of the drawbacks and fuck-ups.

Create an auto-scaling group per service and build an AMI per version (tip: use Packer to build AMI). People are already familiar with managing AMI and instances if operations are on AWS, there isn’t much more to learn and there is no trap. The resulting deployment is golden and fully automated. A setup with auto scaling groups is 3 years ahead of the Docker ecosystem.

Goal: Put docker services in auto scaling groups to have failures automatically handled.

CoreOS

Update after comments: Docker and CoreOS are made by separate companies.

To give some slack to Docker for once, it requires and depends on a lot of new advanced system internals. A classic distribution cannot upgrade system internals outside of major releases, even if it wanted to.

It makes sense for docker to have (or be?) a special purpose OS with an appropriate update cycle. It may be the only way to have a working bundle of kernel and operating system able to run Docker.

Goal: Trial the CoreOS ecosystem and assess stability.

In the grand scheme of operations, it’s doable to separate servers for running containers (on CoreOS) from normal servers (on Debian). Containers are not supposed to know (or care) about what operating systems they are running.

The hassle will be to manage the new OS family (setup, provisioning, upgrade, user accounts, logging, monitoring). No clue how we’ll do that or how much work it might be.

Goal: Deploy CoreOS at large.

Kubernetes

One of the [future] major breakthrough is the ability to manage fleets of containers abstracted away from the machines they end up running on, with automatic start/stop/rolling-update and capacity adjustment,

The issue with Docker is that it doesn’t do any of that. It’s just a dumb container system. It has the drawbacks of containers without the benefits.

There are currently no good, battle tested, production ready orchestration system in existence.

  • Mesos is not meant for Docker
  • Docker Swarm is not trustworthy
  • Nomad has only the most basic features
  • Kubernetes is new and experimental

Kubernetes is the only project that intends to solve the hard problems [around containers]. It is backed by resources that none of the other projects have (i.e. Google have a long experience of running containers at scale, they have Googley amount of resources at their disposal and they know how to write working software).

Right now, Kubernetes is young & experimental and it’s lacking documentation. The barrier to entry is painful and it’s far from perfection. Nonetheless, it is [somewhat] working and already benefiting a handful of people.

In the long-term, Kubernetes is the future. It’s a major breakthrough (or to be accurate, it’s the final brick that is missing for containers to be a major [r]evolution in infrastructure management).

The question is not whether to adopt Kubernetes, the question is when to adopt it?

Goal: Keep an eye on Kubernetes.

Note: Kubernetes needs docker to run. It’s gonna be affected by all docker issues. (For example, do not try Kubernetes on anything else than CoreOS).

Google Cloud: Google Container Engine

As we said before, there is no known stable combination of OS + kernel + distribution + docker version, thus there is no stable ecosystem to run Kubernetes on. That’s a problem.

There is a potential workaround: Google Container Engine. It is a hosted Kubernetes (and Docker) as a service, part of Google Cloud.

Google gotta solve the Docker issues to offer what they are offering, there is no alternative. Incidentally, they might be the only guys who can find a stable ecosystem around Docker, fix the bugs, and sell that ready-to-use as a cloud managed service. We might have a shared goal for once.

They already offer the service so that should mean that they already worked around the Docker issues. Thus the simplest way to have containers working in production (or at-all) may be to use Google Container Engine.

Goal: Move to Google Cloud, starting with our subsidiaries not locked in on AWS. Ignore the rest of the roadmap as it’s made irrelevant.

Google Container Engine: One more reason why Google Cloud is the future and AWS is the past (on top of 33% cheaper instances with 3 times the network speed and IOPS, in average).


Why docker is not yet succeeding in production, July 2015, from the Lead Production Engineer at Shopify.

Docker is not ready for primetime, August 2016.

Docker in Production: A retort, November 2016, a response to this article.

How to deploy an application with Docker… and without Docker, An introduction to application deployment, The HFT Guy.


Disclaimer (please read before you comment)

A bit of context missing from the article. We are a small shop with a few hundreds servers. At core, we’re running a financial system moving around multi-million dollars per day (or billions per year).

It’s fair to say that we have higher expectations than average and we take production issues rather (too?) seriously.

Overall, it’s “normal” that you didn’t experience all of these issues if you’re not using docker at scale in production and/or if you didn’t use it for long.

I’d like to point out that these are issues and workarounds happening over a period of [more than] a year, summarized all together in a 10 minutes read. It does amplify the dramatic and painful aspect.

Anyway, whatever happened in the past is already in the past. The most important section is the Roadmap. That’s what you need to know to run Docker (or use auto scaling groups instead).

GCE vs AWS in 2016: Why you shouldn’t use Amazon


Foreword

This story relates my experience at a typical web startup. We are running hundreds of instances on AWS, and we’ve been doing so for some time, growing at a sustained pace.

Our full operation is in the cloud: webservers, databases, micro-services, git, wiki, BI tools, monitoring… That includes everything a typical tech company needs to operate.

We have a few switches and a router left in the office to provide internet access and that’s all, no servers on-site.

The following highlights many issues encountered day to day on AWS so that [hopefully] you don’t do the same mistakes we’ve done by picking AWS.

What does the cloud provide?

There are a lot of clouds: GCE, AWS, Azure, Digital Ocean, RackSpace, SoftLayer, OVH, GoDaddy… Check out our article Choosing a Cloud Provider: AWS vs GCE vs SoftLayer vs DigitalOcean vs …

We’ll focus only on GCE and AWS in this article. They are the two majors, fully featured, shared infrastructure, IaaS offerings.

They both provide everything needed in a typical datacenter.

Infrastructure and Hardware:

  • Get servers with various hardware specifications
  • In multiple datacenters across the planet
  • Remote and local storage
  • Networking (VPC, subnets, firewalls)
  • Start, stop, delete anything in a few clicks
  • Pay as you go

Additional Managed Services (optional):

  • SQL Database (RDS, Cloud SQL)
  • NoSQL Database (DynamoDB, Big Table)
  • CDN (CloudFront, Google CDN)
  • Load balancer (ELB, Google Load Balancer)
  • Long term storage (S3, Google Storage)

Things you must know about Amazon

GCE vs AWS pricing: Good vs Evil

Real costs on the AWS side:

  • Base instance plus storage cost
  • Add provisioned IOPS for databases (normal EBS IO are not reliable enough)
  • Add local SSD (675$ per 800 GB + 4 CPU + 30 GB. ALWAYS ALL together)
  • Add 10% on top of everything for Premium Support (mandatory)
  • Add 10% for dedicated instances or dedicated hosts (if subject to regulations)

Real costs on the GCE side:

  • Base instance plus storage cost
  • Enjoy fast and dependable IOPS out-of-the-box on remote SSD volumes
  • Add local SSD (82$ per 375 GB, attachable to any existing instance)
  • Enjoy automatic discount for sustained usage (~30% for instances running 24/7)

AWS IO are expensive and inconsistent

EBS SSD volumes: IOPS, and P-IOPS

We are forced to pay for Provisioned-IOPS whenever we need dependable IO.

The P-IOPS are NOT really faster. They are slightly faster but most importantly they have a lower variance (i.e. 90%-99.9% latency). This is critical for some workload (e.g. databases) because normal IOPS are too inconsistent.

Overall, P-IOPS can get very expensive and they are pathetic compared to what any drive can do nowadays (720$/month for 10k P-IOPS, in addition to $0.14 per GB).

Local SSD storage

Local SSD storage is only available via the i2 instances family which are the most expensive instances on AWS (and over all clouds).

There is no granularity possible. CPU, memory and SSD storage amount all DOUBLE between the few i2.xxx instance types available. They grow in powers of 4CPU + 30GB memory + 800 GB SSD and the multiplier is $765/month.

These limitations make local SSD storage expensive to use and special to manage.

AWS Premium Support is mandatory

The premium support is +10% on top of the total AWS bill (i.e. EC2 instances + EBS volumes + S3 storage + traffic fees + everything).

Handling spikes in traffic

ELB cannot handle sudden spikes in traffic. They need to be scaled manually by support beforehand.

An unplanned event is a guaranteed 5 minutes of unreachable site with 503 errors.

Handling limits

All resources are artificially limited by a hardcoded quota, which is very low by default. Limits can only be increased manually, one by one, by sending a ticket to the support.

I cannot fully express the frustration when trying to spawn two c4.large instances (we already got 15) only to fail because “limit exhaustion: 15 c4.large in eu-central region“. Message support and wait for one day of back and forth email. Then try again and fail again because “limit exhaustion: 5TB of EBS GP2 in eu-central region“.

This circus goes on every few weeks, sometimes hitting 3 limits in a row. There are limits for all resources, by region, by availability zone, by resource types and by resource specifics criteria.

Paying guarantees a 24h SLA to get a reply to a limit ticket. The free tiers might have to wait for a week (maybe more), being unable to work in the meantime. It is an absurd yet very real reason pushing for premium support.

Handling failures on the AWS side

There is NO log and NO indication of what’s going on in the infrastructure. The support is required whenever something wrong happens.

For example. An ELB started dropping requests erratically. After contacting support, they acknowledged to have no idea what’s going on and took action “Thank you for your request. One of the ELB was acting weird, we stopped it and replaced it with a new one“.

The issue was fixed. Sadly, they don’t provide any insight or meaningful information. This is a strong pain point for debugging and planning future failures.

Note: We are barraging further managed service from being introduced in our stack. At first they were tried because they were easy to setup (read: limited human time and a bit of curiosity). They soon proved to be causing periodic issues while being impossible to debug and troubleshoot.

ELB are unsuitable to many workloads

[updated paragraph after comments on HN]

ELB are only accessible with a hostname. The underlying IPs have a TTL of 60s and can change at any minute.

This makes ELB unsuitable for all services requiring a fixed IP and all services resolving the IP only once at startup.

ELB are impossible to debug when they fail (they do fail), they can’t handle sudden spike and the CloudWatch graphs are terrible. (Truth be told. We are paying Datadog $18/month per node to entirely replace CloudWatch).

Load balancing is a core aspect of high-availability and scalable design. Redundant load balancing is the next one. ELB are not up to the task.

The alternative to ELB is to deploy our own HAProxy in pairs with VRRP/keepalived. It takes multiple weeks to setup properly and deploy in production.

By comparison, we can achieve that with google load balancers in a few hours. A Google load balancer can have a single fixed IP. That IP can go from 1k/s to 10k/s requests instantly without loosing traffic. It just works.

Note: Today, we’ve seen one service in production go from 500 requests/s to 15000 requests/s in less than 3 seconds. We don’t trust an ELB to be in the middle of that

Dedicated Instances

Dedicated instances are Amazon EC2 instances that run in a virtual private cloud (VPC) on hardware that’s dedicated to a single customer. Your Dedicated instances are physically isolated at the host hardware level from your instances that aren’t Dedicated instances and from instances that belong to other AWS accounts.

Dedicated instances/hosts may be mandatory for some services because of legal compliance, regulatory requirements and not-having-neighbours.

We have to comply to a few regulations so we have a few dedicated options here and there. It’s 10% on top of the instance price (plus a $1500 fixed monthly fee per region).

Note: Amazon doesn’t explain in great details what dedicated entails and doesn’t commit to anything clear. Strangely, no regulators pointed that out so far.

Answer to HN comments: Google doesn’t provide “GCE dedicated instances”. There is no need for it. The trick is that regulators and engineers don’t complain about not having something which is non-existent, they just live without it and our operations get simpler.

Reserved Instances are bullshit

A reservation is attached to a specific region, an availability zone, an instance type, a tenancy, and more. In theory the reservation can be edited, in practice that depends on what to change. Some combinations of parameters are editable, most are not.

Plan carefully and get it right on the first try, there is no room for errors. Every hour of a reservation will be paid along the year, no matter whether the instance is running or not.

For the most common instance types, it takes 8-10 months to break even on a yearly reservation. Think of it as gambling game in a casino. A right reservation is -20% and a wrong reservation is +80% on the bill. You have to be right MORE than 4/5 times to save any money.

Keep in mind that the reserved instances will NOT benefit from the regular price drop happening every 6-12 months. If there is a price drop early on, you’re automatically loosing money.

Critical Safety Notice: 3 years reservation is the most dramatic way to loose money on AWS. We’re talking potential 5 digits loss here, per click. Do not go this route. Do not let your co-workers go this route without a warning. 

What GCE does by comparison is a PURELY AWESOME MONTHLY AUTOMATIC DISCOUNT. Instances hours are counted at the end of every month and discount is applied automatically (e.g. 30% for instances running 24/7). The algorithm also accounts for multiple started/stopped/renewed instances, in a way that is STRONGLY in your favour.

Reserving capacity does not belong to the age of Cloud, it belongs to the age of data centers.

AWS Networking is sub-par

Network bandwidth allowance is correlated with the instance size.

The 1-2 cores instances peak around 100-200 Mbps. This is very little in a world more and more connected where so many things rely on the network.

Typical things experiencing slow down because of the rate limited networking:

  • Instance provisioning, OS install and upgrade
  • Docker/Vagrant image deployment
  • sync/sftp/ftp file copying
  • Backups and snapshots
  • Load balancers and gateways
  • General disk read/writes (EBS is network storage)

Our most important backup takes 97 seconds to be copied from the production host to another site location. Half time is saturating the network bandwidth (130 Mbps bandwidth cap), half time is saturating the EBS volume on the receiving host (file is buffered in memory during initial transfer then 100% iowait, EBS bandwidth cap).

The same backup operation would only take 10-20 seconds on GCE with the same hardware.

Cost Comparison

This post wouldn’t be complete without an instance to instance price comparison.

In fact, it is so important that it was split to dedicated article: Google Cloud is 50% cheaper than AWS.

Hidden fees everywhere + unreliable capabilities = human time wasted in workarounds

Capacity planning and day to day operations

Capacity planning is unnecessary hard with the not-scalable resources, unreliable performances capabilities, insufficient granularity, and hidden constraints everywhere. Planning cost is a nightmare.

Every time we have to add an instance. We have to read the instances page, pricing page, EBS page again. There are way too many choices, some of which being hard to change latter. That could be printed on papers and cover a4x7 feet table. By comparison it takes only 1 page both-sided to pick an appropriate instance from Google.

Optimizing usage is doomed to fail

The time taken to optimizing reserved instance is a similar cost to the savings done.

Between CPU count, memory size, EBS volume size, IOPS, P-IOPS. Everything is over-provisioned on AWS. Partly because there are too many things to follow and optimize for a human being, partly as workaround against the inconsistent capabilities, partly because it is hard to fix later for some instances live in production.

All these issues are directly related to the underlying AWS platform itself, being not neat and unable to scale horizontal cleanly, neither in hardware options, nor in hardware capabilities nor money-wise.

Every time we think about changing something to reduce costs, it is usually more expensive than NOT doing anything (when accounting for engineering time).

Conclusion

AWS has a lot of hidden costs and limitations. System capabilities are unsatisfying and cannot scale consistently. Choosing AWS was a mistake. GCE is always a better choice.

GCE is systematically 20% to 50% cheaper for the equivalent infrastructure, without having to do any thinking or optimization. Last but not least it is also faster, more reliable and easier to use day-to-day.

The future of our company

Unfortunately, our infrastructure on AWS is working and migrating is a serious undertaking.

I learned recently that we are a profitable company, more so than I thought. Looking at the top 10 companies by revenue per employee, we’d be in the top 10. We are stuck with AWS in the near future and the issues will have to be worked around with lots of money. The company is able to cover the expenses and cost optimisation ain’t a top priority at the moment.

There’s a saying “throwing money at a problem“. We shall say “throwing houses at the problem” from now on as it better represents the status quo.

If we get to keep growing at the current pace, we’ll have to scale vertically, and by that we mean “throwing buildings at Amazon” 😀

burning money
The official AWS answer to all their issues: “Get bigger instances”