The previous article Docker in Production: A History of Failure was quite a hit.
After long discussions, hundreds of pieces of feedback, thousands of comments, meetings with various individuals and major players, more experimentation and more failures, it’s time for an update on the situation.
We’ll go over the lessons learned from all the recent interactions and articles, but first, a reminder and a bit of context.
Disclaimer: Intended Audience
The large amount of comments made it clear that the world is divided into 10 kinds of people:
1) The Amateur
Running mostly tests and side projects with no real users. May think that running the Ubuntu beta is the norm and call anything “stable” obsolete.
2) The Professional
Running critical systems for a real business with real users, definitely accountable, and probably gets a phone call when shit hits the fan.
What Audience Are You?
There is a fine line between these two worlds and they clash pretty hard whenever they meet. Obviously, they have very different standards and expectations.
One of the reasons I love finance is that it has a great culture of risk. Contrary to popular belief, that doesn’t mean being risk-averse. It means evaluating potential risks and potential gains and weighing them against each other.
You should take a minute to think about your standards. What do you expect to achieve with Docker? What do you have to lose if it crashes every system it’s running on and corrupts the mounted volumes? These are important factors to drive your decisions.
What pushed me to publish the last article was a conversation with a guy from a random finance company, just asking my thoughts about Docker, because he was considering it. Among other things, this company -and this guy in particular- manages systems that handle trillions of dollars, including the pensions of millions of Americans.
Docker is nowhere near ready to handle my mother’s pension, how could anyone ever think that??? Well, it seemed the Docker experience wasn’t documented enough.
What Do You Need to Run Docker?
As you should be aware by now, Docker is highly sensitive to the kernel, the host and the filesystem it’s using. Pick the wrong combination and you’re talking kernel panics, filesystem corruption, Docker daemon lockups, etc.
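Before trusting a host, it’s worth checking exactly which combination you’re on. A minimal sketch (the last line assumes Docker is already installed and running; it degrades to a notice otherwise):

```shell
# Kernel version: a 3.x kernel without heavy backports is a red flag for Docker.
uname -r
# Which container-friendly filesystems this kernel has registered, if any.
grep -E 'overlay|aufs|btrfs' /proc/filesystems || echo "no overlay/aufs/btrfs registered"
# What storage driver the daemon actually picked, if Docker is available.
command -v docker >/dev/null \
  && docker info --format '{{.Driver}} on kernel {{.KernelVersion}}' \
  || echo "docker not installed or daemon not running"
```

None of this guarantees anything; it only tells you which of the failure modes below you should read most carefully.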
I had time to collect feedback on various operating conditions and test a couple more myself.
We’ll go over the results of this research: what has been reported to work, to fail, to experience intermittent failures, or to blow up entirely in epic proportions.
Spoiler Alert: There is nothing with or around Docker that’s guaranteed to work.
Disclaimer: Understand the Risks and the Consequences
I am biased toward my own standards (as a professional who has to handle real money) and following the feedback I got (with a bias toward reliable sources known for operating real world systems).
For instance, if a combination of operating system and filesystem is marked as “no-go: registered catastrophic filesystem failure with full volume data loss”, it is not production ready (for me), but it may be good enough for a student who has to do a one-off exercise in a Vagrant virtual machine.
You may or may not experience the issues mentioned. Either way, they are mentioned because they are confirmed to exist in the wild by the people who hit them. If you try a similar enough environment, you are on the right path to becoming the next witness.
The worst that can happen with Docker -and usually does- is that everything seems okay during the proof of concept, and you only begin to notice and understand the issues far down the line, when you cannot easily move away from it.
CoreOS
CoreOS is an operating system that can only run containers and is exclusively intended to run containers.
Last article, the conclusion was that it might be the only operating system that may be able to run Docker. This may or may not be accurate.
We abandoned the idea of running CoreOS.
First, the main benefit of Docker is to unify dev and production. Having a separate OS in production just for containers defeats that point entirely.
Second, Debian (we were running Debian) announced the next major release for Q1 2017. It would take a lot of effort to understand and migrate everything to CoreOS, with no guarantee of success. It was wiser to just wait for the next Debian.
CentOS/RHEL 6
Docker on CentOS/RHEL 6 is a no-go: known filesystem failures, full volume data loss.
- Various known issues with the devicemapper driver.
- Critical issues with LVM volumes in combination with devicemapper, causing data corruption, container crashes, and Docker daemon freezes that require a hard reboot to fix.
- The Docker packages are not maintained on this distribution. There are numerous critical bug fixes that were released in the CentOS/RHEL 7 packages but were never backported to the CentOS/RHEL 6 packages.
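If you inherit such a box, you can at least check whether you’re on the notorious loopback-backed devicemapper setup (a sketch; it assumes Docker is installed and prints a notice otherwise):

```shell
# devicemapper backed by loopback files under /var/lib/docker is the
# configuration most associated with the data-loss reports; a direct-lvm
# block device is the lesser evil.
command -v docker >/dev/null \
  && { docker info 2>/dev/null | grep -E 'Storage Driver|Data loop file' \
       || echo "no storage driver info reported (daemon down?)"; } \
  || echo "docker not installed"
```

If the output mentions a “Data loop file”, you are running the setup these reports came from.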
CentOS/RHEL 7
Originally shipping with kernel 3.10, RedHat has been backporting kernel 4 features into it, which are mandatory for running Docker.
This causes problems at times because Docker fails to detect the custom kernel version and the features available on it, so it cannot apply the proper system settings and fails in various mysterious ways. Every time this happens, it can only be resolved by Docker publishing a fix to its feature detection for the specific kernel, which is neither a timely nor a systematic process.
There are various issues with the usage of LVM volumes, depending on the version.
Otherwise, it’s a mixed bag. Your mileage may vary.
As of CentOS 7.0, RedHat recommended some settings, but I can’t find the page on their website anymore. Anyway, there are tons of critical bugfixes in later versions, so you MUST update to the latest version.
As of CentOS 7.2, RedHat recommends and supports exclusively XFS and they give special flags for the configuration. AUFS doesn’t exist, OverlayFS is officially considered unstable, BTRFS is beta (technology preview).
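The special flag in question, as far as I can tell, is XFS’s ftype=1 (d_type support), which Docker’s storage drivers need to track files correctly. A sketch to check an existing mount (/var/lib/docker is the usual location; the mkfs line is for a hypothetical fresh device and is destructive):

```shell
# Report whether the Docker data directory sits on d_type-enabled XFS.
if command -v xfs_info >/dev/null && xfs_info /var/lib/docker >/dev/null 2>&1; then
  xfs_info /var/lib/docker | grep -o 'ftype=[01]'
else
  echo "ftype=unknown (not XFS, or xfs_info missing)"
fi
# Formatting a fresh volume the recommended way (destructive, hypothetical device):
#   mkfs.xfs -n ftype=1 /dev/xvdf
```

An XFS volume formatted with ftype=0 cannot be fixed in place; it has to be reformatted, which is exactly the kind of surprise you want to discover before the data is on it.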
RedHat employees admit themselves that they struggle pretty hard to get Docker working in proper conditions, which is a major problem because they have to resell it as part of their OpenShift offering. Try building a product on an unstable core.
If you like playing with fire, it looks like that’s the OS of choice.
Note that for once, this is a case where you surely want RHEL and not CentOS, meaning timely updates and helpful support at your disposal.
Debian 8 jessie (stable)
A major cause of the issues we experienced was because our production OS was Debian stable, as explained in the previous article.
Basically, Debian froze the kernel at a version that doesn’t support anything Docker needs, and the few components that are present are riddled with bugs.
Docker on Debian is a major no-go: there is a wide range of bugs in the AUFS driver (but not only there), usually crashing the host and potentially corrupting the data, and that’s just the tip of the iceberg.
Docker is 100% guaranteed suicide on Debian 8, and has been since the inception of Docker a few years ago. It kills me that no one ever documented this earlier.
I wanted to show you a graph of AWS instances going down like dominoes but I didn’t have a good monitoring and drawing tool to do that, so instead I’ll illustrate with a piano chart that looks the same.
Typical Docker cascading failure on our test systems: a test slave crashes… the next one retries two minutes later… and dies too. This specific cascade took 6 tries to get past the bug, slightly more than usual, but nothing fancy.
You should have CloudWatch alarms to restart dead hosts automatically and send crash notifications.
Fancy: You can also have a CloudWatch alarm to automatically send a customized issue report to your regulator whenever there is an issue persisting more than 5 minutes.
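The dead-host alarm can be as simple as EC2’s built-in recover action on the system status check. A sketch with the AWS CLI (the instance id and region are hypothetical; it requires configured credentials and degrades to a notice without them):

```shell
INSTANCE_ID="i-0123456789abcdef0"   # hypothetical instance
# Alarm on the EC2 system status check; the recover action restarts the
# instance on new hardware, which is what you want for a wedged Docker host.
aws cloudwatch put-metric-alarm \
  --alarm-name "docker-host-dead-$INSTANCE_ID" \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value="$INSTANCE_ID" \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions "arn:aws:automate:us-east-1:ec2:recover" \
  || echo "aws cli not configured; nothing created"
```

Add an SNS topic to --alarm-actions if you also want the crash notification (or the regulator report, if you’re fancy).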
Not to brag, but we got quite good at containing Docker. Forget about Chaos Monkey, that’s child’s play; try running trading systems handling billions of dollars on Docker.
Please don’t do that. That’s a terrible idea.
Debian 9 stretch
Debian stretch is planned to become the stable edition in 2017. (Note: might be released as I write and edit this article).
It will feature kernel 4.9, which is the latest LTS, published around the same time.
At the time of release, Debian Stretch will be the most up to date stable operating system and it will allegedly have all the shiny things necessary to run Docker (until the Docker requirements change again).
It may resolve a lot of the issues and it may create a ton of new ones. We’ll see how it goes.
Ubuntu
Ubuntu has always been more up to date than the usual server distributions. Sadly, I am not aware of any serious company that runs on Ubuntu. This has been a source of much misunderstanding in the Docker community, because devs and amateur bloggers try things on the latest Ubuntu (not even the LTS), yet it’s utterly unrepresentative of production systems in the real world (RHEL, CentOS, Debian, or one of the exotic Unix/BSD/Solaris).
I cannot comment on the LTS 16 as I do not use it. It’s the only distribution to have overlay2 and ZFS available, which gives a few more options to try and maybe find something that works?
The LTS 14 is a definitive no-go: too old, and it doesn’t have the required components.
I received quite a few comments and unfriendly emails from people saying to “just” use the latest Ubuntu beta. As if migrating all live systems, changing distribution, and running on a beta platform that didn’t even exist at the time were an actual solution.
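If you do try the LTS 16, overlay2 is the driver worth testing, and selecting it explicitly is a one-line daemon config. A sketch, assuming a 4.x kernel and a recent Docker; back up any existing /etc/docker/daemon.json first:

```json
{
  "storage-driver": "overlay2"
}
```

Then restart the daemon and confirm with docker info that it actually picked overlay2 rather than silently falling back to something else.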
Update: I said I was never coming back to Docker, and certainly not to spend an hour digging up references, but I guess I have to now that they are handed to me in spectacular ways.
I received a quite insulting email from a guy who is clearly in the amateur league, saying that “any idiot can run Docker on Ubuntu”, then proceeding to give a list of software packages and advanced system tweaks that are mandatory to run Docker on Ubuntu, which allegedly “anyone could have found in 5 seconds with Google”.
At the heart of his mail is this bug report, which is indeed the first Google result for “Ubuntu docker not working” and “Ubuntu docker crash“: Ubuntu 16.04 install for 1.11.2 hangs.
This bug report, published in June 2016, highlights that the Ubuntu installer simply doesn’t work at all because it doesn’t install some dependencies which Docker requires to run. Then it’s a sea of comments, user workarounds, and not-giving-a-fuck #WONTFIX from the Docker developers.
The last answer, given by an employee 5 months later, says that the Ubuntu installer will never be fixed; however, the next major version of Docker may use something completely different that won’t be affected by this issue.
A new major version (v1.13) just got released (8 months after the report). It is not confirmed whether it is affected by the bug or not (but it is confirmed to come with breaking changes).
It’s fairly typical of what to expect from Docker. Checklist:
- Is everything broken to the point Docker can’t run at all? YES.
- Is it broken for all users of, say, a major distribution? YES.
- Is there a timely reply to acknowledge the issue? NO.
- Is it confirmed that the issue is present and how severe it is? NO.
- Is there any fix planned? NO.
- Is there a ton of workarounds of various danger and complexity? YES.
- Will it ever be fixed? Who knows.
- Will the fix, if it ever comes, be backported? NEVER.
- Is the ultimate answer to everything to just update to latest? Of course.
AWS Container Service
AWS has an AMI dedicated to running Docker. It is based on an Ubuntu.
As confirmed by internal sources, they experienced massive trouble getting Docker to work in any decent condition.
Ultimately, they released an AMI for it, running a custom OS with a custom Docker package, custom bug fixes and custom backports. They went and are still going through extensive efforts and testing to keep things together.
If you are locked in on Docker and running on AWS, your only salvation might be to let AWS handle it for you.
Google Container Service
Google offers containers as a service. Google merely exposes a Docker interface; the containers are run on internal Google containerization technologies, which cannot possibly suffer from all of Docker’s implementation flaws.
Don’t get me wrong. Containers are great as a concept. The problem is not the theoretical aspect; it’s the practical implementation and tooling we have (i.e. Docker), which are experimental at best.
If you really want to play with Docker (or containers) and you are not operating on AWS, that leaves Google as the single strongest choice. Better yet, it comes with Kubernetes for orchestration, putting it in a league of its own.
It should still be considered experimental, and playing with fire. It just happens to be the only thing that may deliver on the promises, and also the only thing that comes with containers AND orchestration.
OpenShift
It’s not possible to build a stable product on a broken core, yet RedHat is trying.
From the feedback I had, they are both struggling pretty hard to mitigate the Docker issues, with variable success. Your mileage may vary.
Considering that they both appeal to large companies, who have quite a lot to lose, I’d really question the choice of going down that route (i.e. anything built on top of Docker).
You should try the regular clouds instead: AWS or Google or Azure. Using virtual machines and some of the hosted services will achieve 90% of what Docker does, plus 90% of what Docker doesn’t do, and it’s dependable. It’s also a better long-term strategy.
Chances are that you want to do OpenShift because you can’t do public cloud. Well, that’s a tough spot to be in. (Good luck with that. Please write a blog post in reply to share your experience.)
- CentOS/RHEL: Russian roulette
- Debian: Jumping off a plane naked
- Ubuntu: Not sure. Update: LOL.
- CoreOS: Not worth the effort
- AWS Containers: Your only salvation if you are locked-in with Docker and on AWS
- Google Containers: The only practical way to run Docker that is not entirely insane.
- OpenShift: Not sure. Depends on how well the support and engineers can manage.
A Business Perspective
Docker has no business model and no way to monetize. It’s fair to say that they are releasing on all platforms (Mac/Windows) and integrating all kinds of features (Swarm) as a desperate move to 1) not let any competitor have any distinctive feature, 2) get everyone to use Docker and Docker tools, 3) lock customers completely into their ecosystem, 4) publish a ton of news, articles and releases in the process, increasing hype, and 5) justify their valuation.
It is extremely tough to execute an expansion both horizontally and vertically to multiple products and markets. (Ignoring whether that is an appropriate or sustainable business decision, which is a different aspect).
In the meantime, the competitors, namely Amazon, Microsoft, Google, Pivotal and RedHat, all compete in various ways and make more money on containers than Docker does, while CoreOS is working on an OS (CoreOS) and a competing containerization technology (Rocket).
That’s a lot of big names with a lot of firepower directed at competing intensively and decisively against Docker. They have zero interest whatsoever in letting Docker lock anyone in. If anything, they individually and collectively have an interest in killing Docker and replacing it with something else.
Let’s call that the war of containers. We’ll see how it plays out.
Currently, Google is leading the way: they are replacing Docker, and they are the only one to provide out-of-the-box orchestration (Kubernetes).
Did I say that Docker is an unstable toy project?
Invariably, some people will say that the issues are not real or are in the past. They are not in the past; the challenges and the issues are very current and very real. There is definite proof and documentation that Docker has suffered from critical bugs making it plain unusable on ALL major distributions, bugs that ran rampant for years, some still present as of today.
If you look for any combination of “docker + version + filesystem + OS” on Google, you’ll find a trail of issues of varying impact going back all the way to Docker’s birth. It’s a mystery how something could fail that badly for that long with no one writing about it. (Actually, there are a few articles; they were just lost under the mass of advertisement and quick evaluations.) The last software to achieve that level of expectation with that level of failure was MongoDB.
I didn’t manage to find anyone on the planet using Docker seriously AND successfully AND without major hassle. The experiences mentioned in this article were paid for in blood, the blood of employees and companies who learned Docker the hard way while every second of downtime was a $1000 loss.
Hopefully, you can learn from our past, as to not repeat it.
If you were wondering whether you should have adopted Docker years ago => The answer is hell no, you dodged a bullet. You can tell that to your boss. (It’s still not that useful today if you don’t have proper orchestration around it, which is itself an experimental subject.)
If you are wondering whether you should adopt it now… while what you run is satisfactory and you have any consideration for quality => The reasonable answer is to wait for RHEL 8 and Debian 10. No rush. Things need to mature and the packages ain’t gonna move faster than the distributions you’ll run them on.
If you like to play with fire => Go full-on Google Container Engine on Google Cloud. Definite high risk, probable high reward.
Would this article have more credibility if I linked numerous bug reports, screenshots of kernel panics, personal charts of system failures over the day, relevant forum posts and disclosed private conversations? Probably.
Do I want to spend yet another hundred hours digging that up, once again? Nope. I’d rather spend my evenings on Tinder than on Docker. Bye bye, Docker.
Back to me. My action plan to lead the way on containers and clouds had a major flaw I missed: the average tenure in tech companies is still not counted in years, thus the year 2017 began with me being poached.
Bad news: No more cloud and no more Docker where I am going, meaning no more groundbreaking news. You are on your own to figure it out.
Good news: No more toying around with billions of dollars of other people’s money… since I am moving up by at least 3 orders of magnitude! I am moderately confident that my new immediate playground may include the pensions of a few million Americans, including a lot of people who read this blog.