Docker in Production: A History of Failure

1 November 2016 | 27 August 2017 | thehftguy | 139 Comments

Introduction

My first encounter with Docker goes back to early 2015. We experimented with it to find out whether it could benefit us. At the time it wasn’t possible to run a container [in the background] and there wasn’t any command to see what was running, debug or ssh into the container. The experiment was quick: Docker was useless, closer to an alpha prototype than a release.

Fast forward to 2016. New job, new company and docker hype is growing like mad. Developers here have pushed docker into production projects and we’re stuck with it. On the bright side, the run command finally works: we can start, stop and see containers. It is functional.

We have 12 dockerized applications running in production as we write this article, spread over 31 hosts on AWS (1 docker app per host [note: keep reading to know why]).

The following article narrates our journey with Docker, an adventure full of dangers and unexpected turns.

so it begins, the greatest fuck up of our time

Production Issues with Docker

Docker Issue: Breaking changes and regressions

We ran all these versions (or tried to):

1.6 => 1.7 => 1.8 => 1.9 => 1.10 => 1.11 => 1.12

Each new version came with breaking changes. We started on docker 1.6 early this year to run a single application.

We updated 3 months later because we needed a fix only available in later versions. The 1.6 branch was already abandoned.

Versions 1.7 and 1.8 couldn’t run. We moved to 1.9 only to find a critical bug in it two weeks later, so we upgraded (again!) to 1.10.

There are all kinds of subtle regressions between Docker versions. It’s constantly breaking unpredictable stuff in unexpected ways.

The trickiest regressions we had to debug were network related. Docker entirely abstracts the host networking. It’s a big mess of port redirection, DNS tricks and virtual networks.

Bonus: Docker was removed from the official Debian repository last year, then the package got renamed from docker.io to docker-engine. Documentation and resources predating this change are obsolete.

Docker Issue: Can’t clean old images

The most requested and most lacking feature in Docker is a command to clean older images (older than X days or not used for X days, whatever). Space is a critical issue given that images are renewed frequently and they may take more than 1GB each.

The only way to clean space is to run this hack, preferably in cron every day:

docker images -q -a | xargs --no-run-if-empty docker rmi

It enumerates all images and removes them. The ones currently in use by running containers cannot be removed (the command gives an error). It is dirty but it gets the job done.

The docker journey begins with a clean up script. It is an initiation rite every organization has to go through.

Many attempts can be found on the internet, none of which works well. There is no API to list images with dates; sometimes there is one, but it is deprecated within 6 months. One common strategy is to read the date attribute from image files and call ‘docker rmi‘, but it fails when the naming changes. Another strategy is to read date attributes and delete files directly, but it causes corruption if not done perfectly, and it cannot be done perfectly except by Docker itself.
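As a rough illustration of the "older than X days" clean-up wished for above, here is a minimal Python sketch. It assumes the image IDs and creation dates have already been collected by some other means; the `ImageRecord` type, the `stale_image_ids` function and the sample data are all hypothetical and not part of Docker. It only decides what to delete and never calls Docker itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical record: one image ID and its creation time."""
    image_id: str
    created: datetime
    in_use: bool = False  # True if a running container uses the image


def stale_image_ids(images, now, max_age_days):
    """Return IDs of images older than max_age_days that are not in use."""
    cutoff = now - timedelta(days=max_age_days)
    return [img.image_id for img in images
            if img.created < cutoff and not img.in_use]


# Made-up sample data, for illustration only.
now = datetime(2016, 11, 1)
images = [
    ImageRecord("aaa111", datetime(2016, 10, 30)),              # recent
    ImageRecord("bbb222", datetime(2016, 9, 1)),                # old, unused
    ImageRecord("ccc333", datetime(2016, 8, 15), in_use=True),  # old, in use
]
to_remove = stale_image_ids(images, now, max_age_days=7)
print(to_remove)
```

The resulting IDs could then be handed to ‘docker rmi‘. The hard part described above, getting reliable dates out of Docker in the first place, is exactly what this sketch leaves out.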

Docker Issue: Kernel support (or lack thereof)

There are endless issues related to the interactions between the kernel, the distribution, docker and the filesystem.

We are using Debian stable with backports, in production. We started running on Debian Jessie 3.16.7-ckt20-1 (released November 2015). This one suffers from a major critical bug that crashes hosts erratically (every few hours on average).

Linux 3.x: Unstable storage drivers

Docker has various storage drivers. The only one (allegedly) widely supported is AUFS.

The AUFS driver is unstable. It suffers from critical bugs provoking kernel panics and corrupting data.

It’s broken on [at least] all “linux-3.16.x” kernels. There is no cure.

We follow Debian and kernel updates very closely. Debian published special patches outside the regular cycle. There was one major bugfix to AUFS around March 2016. We thought it was THE TRUE ONE FIX but it turned out that it wasn’t. The kernel panics happened less frequently afterwards (every week, instead of every day) but they were still loud and present.

Once during this summer there was a regression in a major update that brought back a previous critical issue. It started killing CI servers one by one, with 2 hours on average between murders. An emergency patch was quickly released to fix the regression.

There were multiple fixes to AUFS published throughout 2016. Some critical issues were fixed but there are many more still left. AUFS is unstable on [at least] all “linux-3.16.x” kernels.

Debian stable is stuck on kernel 3.16. It’s unstable. There is nothing to do about it except switching to Debian testing (which can use the kernel 4).

Ubuntu LTS is running kernel 3.19. There is no guarantee that this latest update fixes the issue. Changing our main OS would be a major disruption but we were so desperate that we considered it for a while.

RHEL/CentOS-6 is on kernel 2.x and RHEL/CentOS-7 is on kernel 3.10 (with many later backports done by Red Hat).

Linux 4.x: The kernel officially dropped docker support

It is well-known that AUFS has endless issues and it’s regarded as dead weight by the developers. As a long-standing goal, the AUFS filesystem was finally dropped in kernel version 4.

There is no unofficial patch to support it, there is no optional module, there is no backport whatsoever, nothing. AUFS is entirely gone.

[dramatic pause]

How does docker work without AUFS then? Well, it doesn’t.

[dramatic pause]

So, the docker guys wrote a new filesystem, called overlay.

“OverlayFS is a modern union filesystem that is similar to AUFS. In comparison to AUFS, OverlayFS has a simpler design, has been in the mainline Linux kernel since version 3.18 and is potentially faster.” — Docker OverlayFS driver

Note that it’s not backported to existing distributions. Docker never cared about [backward] compatibility.

Update after comments: Overlay is the name of both the kernel module to support it (developed by linux maintainers) and the docker storage driver to use it (part of docker, developed by docker). They are two different components [with a possible overlap of history and developers]. The issues seem mostly related to the docker storage driver, not the filesystem itself.

The debacle of Overlay

A filesystem driver is a complex piece of software and it requires a very high level of reliability. The long time readers will remember the Linux migration from ext3 to ext4. It took time to write, more time to debug and an eternity to be shipped as the default filesystem in popular distributions.

Making a new filesystem in 1 year is an impossible mission. It’s actually laughable when considering that the task is assigned to Docker, which has a track record of instability and disastrous breaking changes, exactly what we don’t want in a filesystem.

Long story short: that did not go well. You can still find horror stories with Google.

Overlay development was abandoned within 1 year of its initial release.

[dramatic pause]

Then comes Overlay2.

“The overlay2 driver addresses overlay limitations, but is only compatible with Linux kernel 4.0 [or later] and docker 1.12” — Overlay vs Overlay2 storage drivers

Making a new filesystem in 1 year is still an impossible mission. Docker just tried and failed. Yet they’re trying again! We’ll see how it turns out in a few years.

Right now it’s not supported on any systems we run. We can’t use it, we can’t even test it.

Lesson learnt: As you can see with Overlay then Overlay2, there is no backport, no patch, no retro compatibility. Docker only moves forward and breaks things. If you want to adopt Docker, you’ll have to move forward as well, following the releases from docker, the kernel, the distribution, the filesystems and some dependencies.

Bonus: The worldwide docker outage

On 02 June 2016, at approximately 9am (London Time), new repository keys are pushed to the docker public repository.

As a direct consequence, any run of “apt-get update” (or equivalent) on a system configured with the broken repo will fail with an error “Error https://apt.dockerproject.org/ Hash Sum mismatch”

This issue is worldwide. It affects ALL systems on the planet configured with the docker repository. It is confirmed on all Debian and Ubuntu versions, independent of OS and docker versions.

All CI pipelines in the world which rely on docker setup/update or a system setup/update are broken. It is impossible to run a system update or upgrade on an existing system. It’s impossible to create a new system and install docker on it.

After a while, we get an update from a docker employee: “To give an update; I raised this issue internally, but the people needed to fix this are in the San Francisco timezone [8 hours difference with London], so they’re not present yet.”

I personally announce that internally to our developers. Today, there is no Docker CI and we can’t create new systems nor update existing systems which have a dependency on docker. All our hope lies on a dude in San Francisco, currently sleeping.

[pause waiting for the fix, that’s when free food and drinks come in handy]

An update is posted from a Docker guy in Florida at around 3pm (London Time). He’s awake, he’s found out the issue and he’s working on the fix.

Keys and packages are republished later.

We try and confirm the fix at around 5pm (London Time).

That was a 7-hour interplanetary outage because of Docker. All that’s left from the outage is a few messages on a GitHub issue. There was no postmortem. It had little (no?) tech news or press coverage, in spite of the catastrophic failure.

Docker Registry

The docker registry is storing and serving docker images.

Automatic CI build  ===> (on success) push the image to ===> docker registry

Deploy command <=== pull the image from <=== docker registry

There is a public registry operated by docker. As an organization, we also run our own internal docker registry. It’s a docker image running inside docker on a docker host (that’s quite meta). The docker registry is the most used docker image.

There are 3 versions of the docker registry. The client can pull indifferently from any:

The Registry v1, now deprecated and abandoned
The Registry v2, a full rewrite in Go, first released in April 2015
The Trusted Registry, a (paid?) service mentioned everywhere in the doc, not sure what it is, just ignore it

Docker Registry Issue: Abandon and Extinguish

The docker registry v2 is a full rewrite. The registry v1 was retired soon after the v2 release.

We had to install a new thing (again!) just to keep docker working. They changed the configuration, the URLs, the paths, the endpoints.

The transition to the registry v2 was not seamless. We had to fix our setup, our builds and our deploy scripts.

Lesson learnt: Do not rely on any docker tool or API. They are constantly abandoned and extinguished.

One of the goals of the registry v2 is to bring a better API. It’s documented here, in documentation that we don’t remember existing 9 months ago.

Docker Registry Issue: Can’t clean images

It’s impossible to remove images from the docker registry. There is no garbage collection either, the doc mentions one but it’s not real. (The images do have compression and de-duplication but that’s a different matter).

The registry just grows forever. Our registry can grow by 50 GB per week.

We can’t have a server with an unlimited amount of storage. Our registry ran out of space a few times, unleashing hell in our build pipeline, then we moved the image storage to S3.
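To put the 50 GB per week figure in perspective, here is a small back-of-the-envelope sketch in Python. Only the growth rate comes from the text; the disk capacity and current usage are made-up numbers, and the `weeks_until_full` helper is hypothetical. It also assumes growth stays constant, which is a simplification.

```python
GROWTH_GB_PER_WEEK = 50  # figure quoted in the article


def weeks_until_full(capacity_gb, used_gb, growth_gb_per_week=GROWTH_GB_PER_WEEK):
    """Weeks of headroom left on a disk, assuming constant linear growth."""
    if growth_gb_per_week <= 0:
        raise ValueError("growth must be positive")
    free_gb = max(capacity_gb - used_gb, 0)
    return free_gb / growth_gb_per_week


# Hypothetical 1000 GB volume that is already 400 GB full.
weeks = weeks_until_full(capacity_gb=1000, used_gb=400)
yearly_growth_gb = GROWTH_GB_PER_WEEK * 52

print(f"{weeks:.1f} weeks of headroom")
print(f"{yearly_growth_gb} GB added per year at this rate")
```

At this rate any fixed-size disk is only a temporary fix, which is why storage that grows on demand ends up being the practical answer.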

Lesson learnt: Use S3 to store images (it’s supported out-of-the-box).

We performed a manual clean-up 3 times in total. In all cases we had to stop the registry, erase all the storage and start a new registry container. (Luckily, we can re-build the latest docker images with our CI).

Lesson learnt: Deleting any file or folder manually from the docker registry storage WILL corrupt it.

To this day, it’s not possible to remove an image from the docker registry. There is no API either. (One of the points of the v2 was to have a better API. Mission failed).

Docker Issue: The release cycle

The docker release cycle is the only constant in the Docker ecosystem:

Abandon whatever exists
Make new stuff and release
Ignore existing users and retro compatibility

The release cycle applies to, but is not limited to: docker versions, features, filesystems, the docker registry, all APIs…

Judging by the past history of Docker, we can approximate that anything made by Docker has a half-life of about 1 year, meaning that half of what exists now will be abandoned [and extinguished] in 1 year. There will usually be a replacement available that is not fully compatible with what it’s supposed to replace, and may or may not run on the same ecosystem (if at all).
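Taken literally, a 1-year half-life compounds quickly. The short Python sketch below only does the arithmetic of exponential decay; the 1-year figure is the rough estimate from the paragraph above, not a measurement, and `surviving_fraction` is a hypothetical helper name.

```python
def surviving_fraction(years, half_life_years=1.0):
    """Fraction of today's tools/APIs still alive after `years`,
    assuming simple exponential decay with the given half-life."""
    return 0.5 ** (years / half_life_years)


for years in (1, 2, 3):
    print(f"after {years} year(s): {surviving_fraction(years):.1%} still supported")
```

Under that assumption, only about one eighth of what exists today would still be around after three years, roughly one hardware renewal cycle.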

“We make software not for people to use but because we like to make new stuff.” — Future Docker Epitaph

The current status quo on Docker in our organization

Growing in web and micro services

Docker first came in through a web application. At the time, it was an easy way for the developers to package and deploy it. They tried it and adopted it quickly. Then it spread to some micro services, as we started to adopt a micro services architecture.

Web applications and micro services are similar. They are stateless applications, they can be started, stopped, killed, restarted without thinking. All the hard stuff is delegated to external systems (databases and backend systems).

The docker adoption started with minor new services. At first, everything worked fine in dev, in testing and in production. The kernel panics slowly began to happen as more web services and web applications were dockerized. The stability issues became more prominent and impactful as we grew.

A few patches and regressions were published over the year. We’ve been playing catchup & workaround with Docker for a while now. It is a pain but it doesn’t seem to discourage people from adopting Docker. Support and demand are still growing inside the organisation.

Note: None of the failures ever affected any customer or funds. We are quite successful at containing Docker.

Banned from the core

We have some critical applications running in Erlang, managed by a few guys in the ‘core’ team.

They tried to run some of their applications in Docker. It didn’t work. For some reason, Erlang applications and docker didn’t get along.

It was done a long time ago and we don’t remember all the details. Erlang has particular ideas about how the system/networking should behave and the expected load was in thousands of requests per second. Any instability or incompatibility could explain a major failure. (We know for sure now that the versions used during the trial suffered from multiple major instability issues).

The trial raised a red flag. Docker is not ready for anything critical. It was the right call. The later crashes and issues managed to confirm it.

We only use Erlang for critical applications. For example, the core guys are responsible for a payment system that handled $96,544,800 in transactions this month. It includes a couple of applications and databases, all of which are under their responsibility.

Docker is a dangerous liability that could put millions at risk. It is banned from all core systems.

Banned from the DBA

Docker is meant to be stateless. Containers have no permanent disk storage; whatever happens is ephemeral and is gone when the container stops. Containers are not meant to store data. Actually, they are meant by design to NOT store data. Any attempt to go against this philosophy is bound for disaster.

Moreover, Docker locks away processes and files through its abstraction; they are unreachable, as if they didn’t exist. This prevents any sort of recovery if something goes wrong.

Long story short: Docker SHALL NOT run databases in production, by design.

It gets worse than that. Remember the ongoing kernel panics with docker?

A crash would destroy the database and affect all systems connecting to it. It is an erratic bug, triggered more frequently under intensive usage. A database is the ultimate IO intensive load, that’s a guaranteed kernel panic. Plus, there is another bug that can corrupt the docker mount (destroying all data) and possibly the system filesystem as well (if they’re on the same disk).

Nightmare scenario: The host crashes and the disk gets corrupted, destroying the host system and all data in the process.

Conclusion: Docker MUST NOT run any databases in production, EVER.

Every once in a while, someone will come and ask “why don’t we put these databases into docker?” and we’ll tell some of our numerous war stories. So far, no one has asked twice.

Note: We started going over our Docker history as an integral part of our onboarding process. That’s the new damage control philosophy: kill the very idea of docker before it gets any chance to grow and kill us.

A Personal Opinion

Docker is gaining momentum, there is some crazy fanatic support out there. The docker hype is not only a technological liability any more, it has evolved into a sociological problem as well.

The perimeter is controlled at the moment, limited to some stateless web applications and micro services. It’s unimportant stuff, they can be dockerized and crash once a day, I do not care.

So far, all people who wanted to use docker for important stuff have stopped after a quick discussion. My biggest fear is that one day, a docker fanatic will not listen to reason and keep pushing. I’ll be forced to block him and it might not be pretty.

Nightmare scenario: The future accounting cluster revamp, currently holding $23M in customer funds (the M is for million dollars). There is already one guy who genuinely asked the architect “why don’t you put these databases into docker?”; there are no words to describe the face of the architect.

My duty is to customers. Protecting them and their money.

Surviving Docker in Production

[Image: What docker pretends to be.]

[Image: What docker really is.]

Follow releases and change logs

Track versions and change logs closely for kernel, OS, distributions, docker and everything in between. Look for bugs, hope for patches, read everything with attention.

ansible '*' -m shell -a "uname -a"
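Once the `uname -a` output has been collected from every host, it helps to group the fleet by kernel version to see which machines are still on a known-bad series. The Python sketch below is one possible way to do that. It assumes the output has already been gathered into a host-to-line mapping (parsing Ansible's own output format is left out), and the host names and sample lines are made up.

```python
import re
from collections import defaultdict


def kernel_release(uname_a_line):
    """Extract the kernel release (third field) from a `uname -a` line."""
    return uname_a_line.split()[2]


def major_minor(release):
    """Turn a release such as '3.16.0-4-amd64' into the tuple (3, 16)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def hosts_by_kernel(uname_by_host):
    """Group host names by (major, minor) kernel version."""
    groups = defaultdict(list)
    for host, line in sorted(uname_by_host.items()):
        groups[major_minor(kernel_release(line))].append(host)
    return dict(groups)


# Hypothetical hosts and output, for illustration only.
sample = {
    "web-1": "Linux web-1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1 x86_64 GNU/Linux",
    "web-2": "Linux web-2 4.7.0-1-amd64 #1 SMP Debian 4.7.8-1 x86_64 GNU/Linux",
    "ci-1": "Linux ci-1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1 x86_64 GNU/Linux",
}
groups = hosts_by_kernel(sample)
for version, hosts in sorted(groups.items()):
    print(version, hosts)
```

Grouping by major and minor version only is a deliberate simplification; tracking the exact patch level matters too when a specific kernel fix is being rolled out.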

Let docker crash

Let docker crash. Self-explanatory.

Once in a while, we look at which servers are dead and we force reboot them.

Have 3 instances of everything

High availability requires at least 2 instances per service, to survive one instance failure.

When using docker for anything remotely important, we should have 3 instances of it. Docker dies all the time; we need a margin of error to survive 2 crashes in a row on the same service.

Most of the time, it’s CI or test instances that crash. (They run lots of intensive tests, so the issues are particularly pronounced). We’ve got a lot of these. Sometimes there are 3 of them crashing in a row in an afternoon.
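The case for a third instance can be made with simple arithmetic. The Python sketch below assumes each instance is down independently with the same probability at any given moment; both the independence and the 5% figure are assumptions chosen for illustration, not measurements from our systems, and `all_down_probability` is a hypothetical helper.

```python
def all_down_probability(p_down, instances):
    """Probability that every instance is down at the same time,
    assuming each is down independently with probability p_down."""
    if not 0.0 <= p_down <= 1.0:
        raise ValueError("p_down must be between 0 and 1")
    return p_down ** instances


P_DOWN = 0.05  # hypothetical: each instance is down 5% of the time

for n in (1, 2, 3):
    print(f"{n} instance(s): service fully down "
          f"{all_down_probability(P_DOWN, n):.4%} of the time")
```

In practice, crashes triggered by the same kernel bug under the same load are likely to be correlated, so the real gain from a third instance is smaller than this independent model suggests. That is an argument for more margin, not less.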

Don’t put data in Docker

Services which store data cannot be dockerized.

Docker is designed to NOT store data. Don’t go against it, it’s a recipe for disaster.

On top of that, there are current issues killing the server and potentially destroying the data, so that’s really a big no-go.

Don’t run anything important in Docker

Docker WILL crash. Docker WILL destroy everything it touches.

It must be limited to applications which can crash without causing downtime. That means mostly stateless applications, that can just be restarted somewhere else.

Put docker in auto scaling groups

Docker applications should be run in auto-scaling groups. (Note: We’re not fully there yet).

Whenever an instance crashes, it’s automatically replaced within 5 minutes. No manual action required. Self healing.

Future roadmap

Docker

The impossible challenge with Docker is to come up with a working combination of kernel + distribution + docker version + filesystem.

Right now, we don’t know of ANY combination that is stable (maybe there isn’t any?). We actively look for one, constantly testing new systems and patches.

Goal: Find a stable ecosystem to run docker.

It takes 5 years to make good, stable software. Docker v1.0 is only 28 months old; it hasn’t had time to mature.

The hardware renewal cycle is 3 years, the distribution release cycle is 18-36 months. Docker didn’t exist in the previous cycle so systems couldn’t consider compatibility with it. To make matters worse, it depends on many advanced system internals that are relatively new and didn’t have time to mature either, nor reach the distributions.

That could be decent software in 5 years. Wait and see.

Goal: Wait for things to get better. Try to not go bankrupt in the meantime.

Use auto scaling groups

Docker is limited to stateless applications. If an application can be packaged as a Docker Image, it can be packaged as an AMI. If an application can run in Docker, it can run in an auto scaling group.

Most people ignore it but Docker is useless on AWS and it is actually a step back.

First, the point of containers is to save resources by running many containers on the same [big] host. (Let’s ignore for a minute the current docker bug that is crashing the host [and all running containers on it], forcing us to run only 1 container per host for reliability).

Thus containers are useless on cloud providers. There is always an instance of the right size. Just create one with appropriate memory/CPU for the application. (The minimum on AWS is t2.nano which is $5 per month for 512MB and 5% of a CPU).

Second, the biggest gain of containers is when there is a complete orchestration system around them to automatically manage creation/stop/start/rolling-update/canary-release/blue-green-deployment. The orchestration systems to achieve that currently do not exist. (That’s where Nomad/Mesos/Kubernetes will eventually come in; they are not good enough in their present state).

AWS has auto scaling groups to manage the orchestration and life cycle of instances. It’s a tool completely unrelated to the Docker ecosystem yet it can achieve a better result with none of the drawbacks and fuck-ups.

Create an auto-scaling group per service and build an AMI per version (tip: use Packer to build AMIs). People are already familiar with managing AMIs and instances if operations are on AWS, so there isn’t much more to learn and there is no trap. The resulting deployment is golden and fully automated. A setup with auto scaling groups is 3 years ahead of the Docker ecosystem.

Goal: Put docker services in auto scaling groups to have failures automatically handled.

CoreOS

Update after comments: Docker and CoreOS are made by separate companies.

To give some slack to Docker for once, it requires and depends on a lot of new advanced system internals. A classic distribution cannot upgrade system internals outside of major releases, even if it wanted to.

It makes sense for docker to have (or be?) a special purpose OS with an appropriate update cycle. It may be the only way to have a working bundle of kernel and operating system able to run Docker.

Goal: Trial the CoreOS ecosystem and assess stability.

In the grand scheme of operations, it’s doable to separate servers for running containers (on CoreOS) from normal servers (on Debian). Containers are not supposed to know (or care) about what operating systems they are running.

The hassle will be to manage the new OS family (setup, provisioning, upgrade, user accounts, logging, monitoring). No clue how we’ll do that or how much work it might be.

Goal: Deploy CoreOS at large.

Kubernetes

One of the [future] major breakthroughs is the ability to manage fleets of containers abstracted away from the machines they end up running on, with automatic start/stop/rolling-update and capacity adjustment.

The issue with Docker is that it doesn’t do any of that. It’s just a dumb container system. It has the drawbacks of containers without the benefits.

There is currently no good, battle-tested, production-ready orchestration system in existence.

Mesos is not meant for Docker
Docker Swarm is not trustworthy
Nomad has only the most basic features
Kubernetes is new and experimental

Kubernetes is the only project that intends to solve the hard problems [around containers]. It is backed by resources that none of the other projects have (i.e. Google have a long experience of running containers at scale, they have Googley amount of resources at their disposal and they know how to write working software).

Right now, Kubernetes is young & experimental and it’s lacking documentation. The barrier to entry is painful and it’s far from perfection. Nonetheless, it is [somewhat] working and already benefiting a handful of people.

In the long-term, Kubernetes is the future. It’s a major breakthrough (or to be accurate, it’s the final brick that is missing for containers to be a major [r]evolution in infrastructure management).

The question is not whether to adopt Kubernetes, the question is when to adopt it?

Goal: Keep an eye on Kubernetes.

Note: Kubernetes needs docker to run. It’s gonna be affected by all docker issues. (For example, do not try Kubernetes on anything other than CoreOS).

Google Cloud: Google Container Engine

As we said before, there is no known stable combination of OS + kernel + distribution + docker version, thus there is no stable ecosystem to run Kubernetes on. That’s a problem.

There is a potential workaround: Google Container Engine. It is a hosted Kubernetes (and Docker) as a service, part of Google Cloud.

Google gotta solve the Docker issues to offer what they are offering, there is no alternative. Incidentally, they might be the only guys who can find a stable ecosystem around Docker, fix the bugs, and sell that ready-to-use as a cloud managed service. We might have a shared goal for once.

They already offer the service so that should mean that they already worked around the Docker issues. Thus the simplest way to have containers working in production (or at-all) may be to use Google Container Engine.

Goal: Move to Google Cloud, starting with our subsidiaries not locked in on AWS. Ignore the rest of the roadmap as it’s made irrelevant.

Google Container Engine: One more reason why Google Cloud is the future and AWS is the past (on top of 33% cheaper instances with 3 times the network speed and IOPS, on average).

Why docker is not yet succeeding in production, July 2015, from the Lead Production Engineer at Shopify.

Docker is not ready for primetime, August 2016.

Docker in Production: A retort, November 2016, a response to this article.

How to deploy an application with Docker… and without Docker, An introduction to application deployment, The HFT Guy.

Disclaimer (please read before you comment)

A bit of context missing from the article: we are a small shop with a few hundred servers. At the core, we’re running a financial system moving around millions of dollars per day (or billions per year).

It’s fair to say that we have higher expectations than average and we take production issues rather (too?) seriously.

Overall, it’s “normal” that you didn’t experience all of these issues if you’re not using docker at scale in production and/or if you didn’t use it for long.

I’d like to point out that these are issues and workarounds happening over a period of [more than] a year, summarized all together in a 10-minute read. It does amplify the dramatic and painful aspect.

Anyway, whatever happened in the past is already in the past. The most important section is the Roadmap. That’s what you need to know to run Docker (or use auto scaling groups instead).

139 thoughts on “Docker in Production: A History of Failure”

Josh Williams says:

4 November 2016 at 14:15

The amount of FUD in this article is laughable. Near the beginning of the article, you mention you’re running 1 docker container per server. While there is nothing wrong with this, it is almost certainly a bad choice for your architecture. Based on that, one has to wonder what other mistakes were made in the implementation.

I’ve been running Docker in production since 1.3 (circa 2014?). Other than the orphan image issue you mention, I haven’t seen any of the issues you discuss here. It may be that I’m just extremely lucky, but based on the multitudes of other success stories, I doubt it.

- thehftguy says:
  
  4 November 2016 at 18:00
  
  It’s possible that we have a different perspective or different usage.
  
  Would you mind sharing on what OS/filesystem/version you run? How many servers do you have in production? What kind of load are you running? How critical is it?
  
  I can tell you that we have a few hundred servers. At the core, is a financial system actively responsible for multi-million sums every day. It’s fair to say that we have higher expectations than average.
  
  - samu (@samux) says:
    
    4 November 2016 at 18:35
    
    I had some docker experience but I was able to run away from it. However, one thing I thought I understood: there’s no point in having 1 docker container on 1 VM.
    
    Yet in your article you state you have 12 applications on 31 servers, which fits with the idea of having at least 3 docker instances per app.
    
    May I ask you the reasons behind the 1 docker per host? I think it’s a bit overkill but I’d like to hear your point of view, if it’s possible and doesn’t violate any NDA.
    
    - thehftguy says:
      
      4 November 2016 at 19:23
      
      Docker is provoking kernel panics frequently. It kills the host, including all containers running on it. (The only way to bring back the host online is a hard reboot).
      
      We have to limit ourselves to 1 container per host to limit the impact of system crashes.
      
  - Jeff says:
    
    4 November 2016 at 19:27
    
    There is an immense amount of hyperbole in this post. I completely believe that you experienced these issues you describe, but the way you portray them is disingenuous. Your assertion that stateless applications are useless, and only stateful applications are worth anything made me laugh out loud.
    
    You don’t ever differentiate between Docker and LXC (the underlying container technology), so when you say Erlang can’t run in a container (which is 100% false by the way), it is hard to judge what problems you had. The team I previously worked with wrote their API gateway in Elixir which compiles to bytecode that runs on the Erlang VM. They ran thousands of requests per minute through it with zero problems.
    
    You mention AUFS and Overlay, but completely leave out DevMapper and btrfs, both of which worked fine for us in production. DevMapper did require some tuning, the defaults are fairly awful, but after that it is pretty simple to use. We ran Ubuntu 12.04 and later 14.04, with both ext4/devmapper and btrfs filesystems, and we never had any issues I would attribute solely to Docker.
    
    I agree that you can (and should) be judicious in your use of a new technology. Docker’s release pace is fast, and they have had a few missteps. It sounds like your team has also had some missteps, in your adoption of Docker. Everyone makes mistakes.
    
    1/10 Troll Post
    
    - thehftguy says:
      
      4 November 2016 at 20:12
      
      Well, the issues happened over a period of [more than] a year. It does have a dramatic feel when summarized in a 10-minute read.
      
      > They ran thousands of requests per minute through it with zero problems.
      
      I wasn’t present during the Erlang test. The feedback I had is that it failed miserably and they gave up quickly. Our applications are expected to take thousands of hits per second. That’s a lot of stress, no offense but it seems we’re not playing in the same league ^^
      
      Please share your setup OS/kernel/filesystem/docker-version and the tweaks you had to do. That will make it easier for future Docker users.
      
    - bmullan says:
      
      11 January 2017 at 18:16
      
      Docker hasn’t used LXC for at least a year or two!
      
      They moved to their own container technology which I believe is called libcontainer.
      
  - Jeff says:
    
    4 November 2016 at 21:32
    
    You are right. We might not be in the same league. However, the global hedge fund Citadel is in your league, if not bigger. They use Nomad (which you didn’t think was worth more than a bullet point) as well as Docker.
    
    They leverage Nomad to stand up Docker containers at scale. Maybe their workload is different, maybe you shouldn’t put your core application in Docker. I don’t know.
    
    What I do know, is that based on the stuff you have written, your Docker experience seems like you either have the worst luck, or you didn’t really bother trying to understand the issues.
    
    The problems you describe aren’t even scaling issues, so I have a hard time with your specious “we aren’t in the same league” rhetoric.
    
    > Please share your setup OS/kernel/filesystem/docker-version and the tweaks you had to do. That will make it easier for future Docker users.
    
    I am happy to have your company sign a consulting agreement to provide you with that information.
    
    - thehftguy says:
      
      4 November 2016 at 22:03
      
      If you’re that familiar with Citadel, the guys in charge will happily confirm to you the issues they also had to solve 😉
      
- Jessica Gadling (@jgadling) says:
  
  4 November 2016 at 19:38
  
  I’ve been running Docker since 2013. Yes, in production. Yes, at scale. Yes, there are even some war stories.
  
  Much of the information in this article is factually incorrect:
  “There is no unofficial patch to support it [aufs in linux 4], there is no optional module, there is no backport whatsoever, nothing. AUFS is entirely gone.”
  
  Operationally suspect:
  “Lesson learnt: Deleting any file or folder manually from the docker registry storage WILL corrupt it.”
  
  Or based on an incomplete understanding of docker’s functionality:
  “Docker is meant to be stateless. Containers have no permanent disk storage, whatever happens is ephemeral and is gone when the container stops.”
  
  … to point out just a few examples.
  
  While there are PLENTY of challenges running docker in production, please look elsewhere for advice. The docker weekly updates are great for keeping up with how the community at large is using Docker:
  https://blog.docker.com/docker-weekly-archives/
  
Raymond Rackiewicz (@rayrackiewicz) says:

4 November 2016 at 14:22

OMG, those last two gifs killed me!

Bernd Zeimetz says:

4 November 2016 at 15:06

“CoreOS is made by Docker for Docker.”
… If you tell that to a CoreOS guy he’ll jump in your face and kill you slowly. They have nothing to do with docker (the company), and they actually have the only useful replacement for docker, called rkt.

- thehftguy says:
  
  4 November 2016 at 17:31
  
  Good to know. Updated.
  
MeIr says:

4 November 2016 at 15:08

Thank you for sharing!

I’m currently on the Docker path (not in production yet) for core product + core DB and other services. But now I’m reevaluating my decision. Thanks once again.

Jeyanthan says:

4 November 2016 at 17:35

Reblogged this on Jeyanthan's Blog.

Ben says:

4 November 2016 at 18:12

The current Ubuntu LTS is not on kernel 3.x. It’s in 4.4. My current uname -r gives “4.4.0-45-generic”.

Sven Ehlert says:

4 November 2016 at 18:22

Instead of making such a fuss, how about a little research before crying doom? Some points.

Storage drivers
Yes, they are complicated. Yes, there are problems. Yes, you got a lot of it wrong, maybe you didn’t see the official documentation https://docs.docker.com/engine/userguide/storagedriver/selectadriver/
– docker chooses the best supported storage driver for each distribution during install. It selects AuFS for Ubuntu and DeviceMapper for Amazon Linux (based on RHEL), and so far this has gone smoothly for us. I don’t know what it selects for Debian.
– AuFS was never merged into the Linux kernel. As such, it was never dropped, not even for Linux 4.0. And of course, AuFS is still maintained for Kernel 4.x. -> http://aufs.sourceforge.net . Again, Ubuntu does package AuFS and it works well there.

Docker and DBA
– you acknowledge that Docker is not meant to run stateful servers. Yet you complain that it does not work with databases … why not complain that you cannot run games in Docker?

Docker and AWS
– you are aware that AWS has a fully managed docker service that manages your autoscaling issues and high availability issues -> https://aws.amazon.com/ecs ? Amazon provides an Amazon Linux image with the right Docker setup. Have you tried that? Works quite well for us.

Docker in Production
You are aware that Google is running ALL of their services in containers / Docker for the last 10 years? Could it be that it is feasible to run Docker in production? And calling Kubernetes young and immature misses the point. Kubernetes is just the third incarnation of Google’s orchestration tools (After Borg and Omega), so there is a lot of experience behind it.

The biggest issue I have with your article is your focus on the OS. Docker (or rkt or any other container system) is supposed to abstract away the Operating System for your services. The OS does not matter there anymore. Case in point – we run our servers all on Ubuntu, because that’s what we know. But for our container setup (in production) we don’t care. Amazon’s Linux image provides everything we need (a working container runtime, that is)

- Eric says:
  
  4 November 2016 at 19:32
  
  Nobody said containers are not production ready, heck systemd is a container application it just doesn’t run with much isolation. Docker OTOH has not been the platform that people have hoped for.
  
  - alexderzhi says:
    
    6 November 2016 at 02:19
    
    Then their hopes are based on ignorance.
    
    Docker is not perfect (and is indeed terrible for some things), but it is just a tool, no more, no less. It’s not as good at stateful containers (although as this guy runs it with 1 container per host, the fact that they are having problems shows a lack of understanding of Docker 101. I mean, just export the DB directory!?)
    
    They also seem to be addicted to Debian despite understanding that it’s not working for them. Switching to Ubuntu would have made 99% of their existing scripts and automation work and remove Docker issues too. Or, run an OS designed to run Docker like CoreOS or Atomic or others.
    
    - thehftguy says:
      
      6 November 2016 at 10:50
      
      It’s explained that we have to run 1 container per host because docker is unstable and crashes our hosts periodically.
      
      Given your comment and your erroneous thinking that an operation could change its OS overnight, I’m left to assume that you didn’t read the article and your usage is so irrelevant or your user base so small that you didn’t have to think hard about operational difficulties that every DBA/SRE gets into eventually.
      
- zetaops says:
  
  6 November 2016 at 11:49
  
  Just a small correction, Google started with LXC containers.
  
- Jose Fernando (@fisholito) says:
  
  17 November 2016 at 18:47
  
  Hi Sven Ehlert, about OS:
  “But for our container setup (in production) we don’t care”
  For some classes of services (think about a cloud provider or an organization with private clouds), the OS is important, e.g. for standardization of scripts. Otherwise, you need to maintain different versions for each distribution in the data center. Thus, the concern about the OS makes sense.
  
Eric says:

4 November 2016 at 19:26

It’s not as sexy as Docker, but LXD is a dirt simple container platform (system containers) which can be managed like most other hosts and actually works well. So take your existing system management platform and add lightweight micro servers. With a 4g i3 laptop I can run a dozen images and when I’m done testing I can push them to a server. It works shockingly well.

- bmullan says:
  
  11 January 2017 at 18:31
  
  LXD does work extremely well, lets you run distro of choice for any particular application best-fit requirement, supports vxlan, ipv6, ipv4, manages local/remote LXD containers easily, supports live migration/CRIU, can use btrfs or zfs, supports instant snapshot/restore and many people are innovating with concepts such as overlayfs (https://github.com/Jayfrown/container-overlay) … and you can even run Docker in LXD system containers. You can also now install Kubernetes in LXD (https://www.google.com/amp/blog.astokes.org/conjure-up-canonical-kubernetes-under-lxd-today/amp/), lots more examples on the LXD subreddit:
  https://www.reddit.com/r/LXD
  
Rha says:

4 November 2016 at 20:56

Mesos works perfectly with docker, https://mesosphere.github.io/marathon/docs/native-docker.html

jupp0r says:

4 November 2016 at 22:20

I think docker as a software has serious quality problems and docker as a business has still to find a business model that does not incentivize making it hard to use.

Nevertheless, most of your problems seem to be problems of the organization you are part of, namely:
* don’t add stuff (docker) to your stack unless it solves a concrete problem for you, otherwise your stack will get more complicated over time and you will accumulate single points of failure
* just putting your app inside docker and then treating it like a pet (as opposed to treating it as cattle) gives you the combined disadvantages of both worlds. If you are having problems cleaning up old container images, don’t write some cron job but think about infrastructure as code
* containers were invented to be more manageable by orchestrators like kubernetes, not the other way around. Netflix runs a perfectly fine infrastructure on AWS VMs. It’s not the technology, but the culture that makes the big difference.

- thehftguy says:
  
  4 November 2016 at 22:48
  
  > 1st paragraph
  
  Docker solved a concrete problem. It makes it quick & easy to package and deploy applications.
  
  The AUFS kernel panics are rare occurrences. They didn’t occur at first. It’s only after a while, with more systems and more load that they became an annoyance, then a critical issue. There were a few patches published along the year (and a few regressions) but it’s still happening.
  
  Same with the release cycle and the abandonment. All of this is the kind of discovery that can only happen given enough time and a large enough environment. Whatever we picked, there’d be backfire in production. (Well, maybe not that much).
  
  > 3rd paragraph
  
  I agree that containers need orchestration. A container without orchestration is like a car without a road.
  
  We’re in this weird place. Docker is present for a [short] while and advertised alone, but the orchestrators are still [relatively] new.
  
  We adopted too early (someone had to beta test, right?). We WILL have the OS updates, the bugfixes and the orchestrators working perfectly soon. Hopefully.
  
  - Mark Richards says:
    
    5 November 2016 at 04:47
    
    “We’re in this weird place. Docker is present for a [short] while and advertised alone, but the orchestrators are still [relatively] new.”
    
    _Which_ orchestrators? Because Mesos is practically ancient at this point and was long since updated to add container support. You have Kubernetes, Mesos, Swarm, Nomad, and a few others. They have high learning curves, but the rewards are many.
    
    What I don’t understand is that you talk about things like a kernel panic causing the loss of a container (or even host). My response is- so? Who cares? Not that we see it a lot but hosts and containers fail all the time. If you are lovingly tending to your containers- you’re doing it wrong.
    
    Containers and hosts are nothing more than grist for the mill. One dies and it gets replaced automatically and immediately. Hell- we use Simian Army (Netflix) to randomly blow stuff up intentionally- just to ensure that failures are correctly handled. I don’t even care why a host or container failed (unless it happens multiple times) because there is simply no other way to manage things like this at scale. If you are worried about a kernel panic- you’re doing something wrong (though the number of failures you are describing suggests you’re doing many things wrong).
    
    I’m not saying this as some kid fresh out of college who sees some shiny new toy and has to play with it- I’m saying this as someone who got his start working with SparcStation 20’s and Linux 0.95.
    
vlad says:

4 November 2016 at 22:43

And security issues with Docker were not even raised…

- John says:
  
  5 November 2016 at 06:42
  
  Could you expand on this? Any resources / blog posts you can suggest?
  
  - Anna says:
    
    9 November 2016 at 13:35
    
    Check this out:
    http://reventlov.com/advisories/using-the-docker-command-to-root-the-host
    
  - Victor Vess says:
    
    12 November 2016 at 19:08
    
    Docker 1.10+ has user namespacing, which maps root inside the container to a non-root uid outside of the container. It is not enabled by default and seems to be a bit of a headache if you are doing a lot of volume mounting from the host, but it’s still heading in the right direction.
    
    https://docs.docker.com/engine/security/security/
    
Rohi says:

5 November 2016 at 00:10

I might be old(ish) but I think that this is what happens when a whole new breed of developers who have been doused with the ‘release early and release often’ mantra grow up and play adults!
Imagine if this started to happen in hardware. Oh ya, it already is.
It’s lame to say it will be fixed in the next release. A connected world with the ability of endless updates has made it possible.
There should be a small but core set of features which should ABSOLUTELY work for the release to be possible.
Or at least have the decency to keep the version numbers pre-1.0

Christian Hudon says:

5 November 2016 at 00:13

Google Container Engine does not run on Docker. It uses Google’s own Linux container technology (of which an opensource version is here: https://github.com/google/lmctfy), with a Docker-compatible interface added. That’s why it’s more stable than Docker.

vanuan says:

5 November 2016 at 03:37

It isn’t clear why you started with version 1.6 when 1.9 and 1.10 were available at that time. I assume you were running an unsupported OS (maybe without systemd?). That would explain most of your problems.

You’re totally right regarding registry and an absence of the delete. But I assure you that it’s possible with a bit of work. Just use manifests instead of tags. Garbage collection is there. But it’s not easy to delete anything because Docker guys are obsessed with reproducibility. That’s its main use case, after all: provide developers a way to reproduce bugs in production.

The “docker outage” thing isn’t clear. If you’re regularly downloading stuff from the internet you should set up mirrors. It sucks that nobody has provided 3rd party docker mirrors, but it’s not specific to docker. You might’ve heard a similar story about npm.

Persistence. It’s widely known that docker doesn’t provide any data sharing or replication. You should use NFS or something similar. If you replicate dbms using docker, you should do it only for failover. Of course it will cause data corruption if you’re running many database instances pointing to the same location. Unless they support sharding.

From what I know, learning docker by running it in production isn’t a good idea. You should’ve started with adopting it for local development and CI first. That way you’d shave off one problem at a time, contributing to the community.

Docker 1.12 swarm mode is much more suitable for production, but I would wait for 1.13 or even 1.14. Or you can use K8s/CoreOS/GCE/etc.

Hey, and don’t forget that as with every open source company, Docker provides paid customer support.

Richard Vowles says:

5 November 2016 at 03:57

I’m not sure what you mean by “Kubernetes requires Docker”? Kubernetes supports the OCI – and particularly supports rkt from CoreOS.

- thehftguy says:
  
  5 November 2016 at 10:46
  
  Right, Kubernetes supports other systems than Docker. Our use case is only for Docker now.
  
Khalil Gibran says:

5 November 2016 at 06:06

For personal projects I depend upon the stable and good old BSD Jails, there are some tools to help get Docker kind of feel, CBSD and Tredly.

Well, what a disappointment: #Docker … – manuel.cillero.es says:

5 November 2016 at 10:18

[…] Well, what a disappointment: #Docker in Production: A History of Failure > https://thehftguy.wordpress.com/2016/11/01/docker-in-production-an-history-of-failure/ […]

Michael says:

5 November 2016 at 10:56

The “worldwide docker outage” would not be a problem if you followed best practices w.r.t. 3rd party dependencies. 1) Always pin specific versions. Promotion to a new version is done by a developer who validates the new version. 2) Maintain an internal repo for 3rd party dependencies and always build against that. You can’t get reliable builds without these in your processes, docker or no docker. I was working at an org where we hadn’t yet learned this (using maven for Java builds with public repos and “LATEST” version specified), and it screwed us repeatedly.
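
A minimal sketch of point 1 applied to container image references, assuming that “pinned” means an explicit tag other than `latest`, or a content digest. The function name `is_pinned` and the registry host used in the examples are hypothetical, not part of any Docker tooling.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if the image reference names an explicit version."""
    # A digest (name@sha256:...) identifies exact content, so treat it as pinned.
    if "@" in image_ref:
        return True
    # Only a ':' in the last path component introduces a tag;
    # a ':' earlier in the string belongs to a registry host:port.
    last_component = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return False  # no tag at all means the implicit "latest"
    tag = last_component.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    # Hypothetical references, for illustration only.
    for ref in (
        "debian:8.6",
        "debian",
        "registry.example.internal:5000/team/app",
        "registry.example.internal:5000/team/app:1.4.2",
    ):
        print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```

A check like this could run in CI against the image names a build uses, so an unpinned reference fails early instead of silently pulling whatever is newest.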

- thehftguy says:
  
  5 November 2016 at 11:19
  
  We have a mirror for debian repositories (official and third party). The mirror mirrored the docker repo after the new packages came out. It did not help.
  
  Versions are pinned down and we do testing before we update. It’s unrelated to the issue. “apt-get update” will break when any of the configured repos is unavailable or corrupted, there’s nothing to be done about it.
  
lerrigatto says:

5 November 2016 at 11:34

Thanks for this. One of the best not-so-rant articles on docker.

Cal Leeming says:

5 November 2016 at 12:30

Excellent write up, it’s provoked many of the same responses I got to my anti docker article on HN back in 2014. Have you considered using SmartOS / Triton for running containers? Check out “Run Containers on Bare Metal Already” on Youtube, Bryan Cantrill goes into a lot of detail about this, I’ve personally had great success running containers in production on SmartOS/Triton, along with a string of failures prior to this.

Christopher Najewicz (@tehsuck) says:

5 November 2016 at 12:32

I have encountered docker in production at my last two gigs and have been hit by breaking changes between minor version changes in Docker. I will say that in general, the people at Docker seem eager to fix things and make things right. Nothing in this article mentions Docker logging drivers, which I’ve personally had major issues with at scale – specifically when you’re not using the JSON driver, anything else seems to have major drawbacks.

- thehftguy says:
  
  5 November 2016 at 13:15
  
  Logging drivers could be a new section, but the article is already very long ^^
  
  We use the syslog driver. We need docker v1.12 to get fixes for it, but the update is too risky.
  
John Garcia (@bitbucketeer) says:

5 November 2016 at 13:02

I hear you buddy, Docker can be really fickle. My advice is as follows:

– Mesos (esp. DC/OS) are actually coming along rather well, and the dual-nature native c-group container or Docker container option allows the option of running apps direct to the host and direct in the host’s filesystem, which is really nice when it’s time to fetch the data files or logs from a failed app. Meanwhile, Docker users are placated.
– CoreOS is actually pretty fantastic if you don’t mind bleeding edge kernel and Docker _everything_; that said, I’ve run Mesos on CoreOS and launched native JVM no problemo. I just have to download and unpack a licensed JVM when I start. The tradeoff is that it comes with a consistent Docker package in the OS image, which means no worries about Docker and their GPG fumbles.
– The magical incantation for CoreOS is /usr/bin/toolbox. This runs a persistent Fedora 24 container that you can yum/dnf within.
– Yeah, databases != microservices. To quote a funny video, “You need 64gb and 16 cores to run the database. What else do you plan to put on the same machine when you get it in the Docker container, hmm?”

Good luck!

Ivo Ribeiro says:

5 November 2016 at 21:48

Just here for the comments.

ps: Good post (from a Docker enthusiast)

kuja says:

5 November 2016 at 22:23

We are actually trying to adapt our small set of applications to Docker. But only pure Docker, without a registry and with prebuilt images on the host systems. Our main goal is a clean dev cycle and an easy-to-use environment through pillars. And it works like a charm. But my expectations are not as high as those described in the article 🙂

jsosic says:

5 November 2016 at 22:54

Your CI was broken for a whole day because of a wrongly signed package? Sorry man, that isn’t a docker-caused outage, it’s completely your ops team’s fault. What happened to old fashioned exclude, versionlock or whatever mechanisms your distro offers to block a certain package from upgrading?

- thehftguy says:
  
  5 November 2016 at 23:46
  
  The packages are pinned down to specific versions and the docker repositories are served by our internal mirrors.
  
  It’s not a simple case of an upstream server being unavailable. Signature errors have dire consequences on apt and Debian systems.
  
  Your suggestion is not even relevant.
  
Luis Miguel Silva says:

6 November 2016 at 00:40

I don’t really see the problem with Docker, the problems described in there are NOT my experience at all…it just really depends on your use case and how you use it.
We use it for continuous integration / devops and it is a GODSEND. It allows us to have stable build systems that are self contained and for portability between our US and Russian “branches”…
To me, the main problems for using this in actual production are:
– lack of tenant support
– and of a better management layer / security
However, if you use something on top like Mesos, those problems kind of go away…
I think whoever wrote this had a very negative bias against Docker because they tried the technology at a very early stage and didn’t understand the usefulness of it.
And I think that is very clear by comments like “At the time it wasn’t possible to run a container [in the background] and there wasn’t any command to see what was running, debug or ssh into the container.”
Yeah, you can’t ssh into the container today too, unless your master process is an ssh server OR the master process spawns an ssh server. That’s the whole point of the technology!

palincuta says:

6 November 2016 at 08:41

There’s nothing wrong with running an application directly on a server. Why would someone -apart from lazy developers- use docker?

- thehftguy says:
  
  6 November 2016 at 10:53
  
  It makes it really quick & easy to package & deploy applications.
  
  It’s a godsend for dev & testing environment to run many applications on the same host without dependency/deployment hell.
  
SlaterX says:

6 November 2016 at 11:11

Good article. I have one suggestion for you: OpenShift (origin). It has some hype out there, I know, but trust me when I tell you that your storage issue and your registry issue would be natively handled if you had an openshift cluster instead of native docker/docker registry. Take a look, you will like what you will see.

mpeg says:

6 November 2016 at 11:26

Thank you for sharing this insightful article, I’m going through the “bringing Docker to production” myself and it is always useful to read war stories so to prevent running into already experienced pain.

You may agree with me that, by reading your article, we (the readers) might get the impression that your team made many fast and suboptimal decisions, blaming Docker for the unpredictable outcomes. I totally share with you the painful (but hell, exciting) experience of trying out a new technology, but we (my team) try not to blame the tools for our own failures, mostly driven by an inexperienced approach.

I would like to share here a personal thought that is being left unspoken by this article, and many others around the web. Docker is not only meant for huge scalability-dependent systems. Docker is there for small realities too.

1. by dockerizing an app you may achieve Zero Setup Time for new developers, speeding up on-boarding and cutting away a lot of unproductive time
2. by dockerizing an app you enhance the reliability of “it works on my machine” assertion. Let’s face the truth: me and you agree that we should have CI and build pipelines and all the Army of the Darkness to ensure quality… but sometimes reality just plays against this.
3. by dockerizing an app you reduce the knowledge that is required to run your company’s projects. You cut the crap to “app start”, “app stop”, “app seed”, “app …”. Again, reduce unproductive time.

In my company we run plenty of data un-intensive, highly customized, hybrid technology projects. We wrap those with Docker and we host (almost) all of them in a single VM host that had never crashed so far. We are fairly happy with it and we were rewarded with some vacation time due to the unproductive time that we managed to cut away.

We’re still learning and I am sure we will experience our own WTF moments, still we don’t feel we have enough knowledge to make strong assertions like “Docker is a blessing for humanity” or, like your article does, “Docker is just shit and you should avoid messing with it”.

On a last and less important note, I would like to share with you how you hurt my feelings by stating out loud that I do a meaningless job by taking care of meaningless apps. Yes, you guessed right: most of the time I’m a front-end engineer, and I like it. I know my stuff is stateless, but IT IS STILL THE ONLY INTERFACE YOUR PRECIOUS CUSTOMERS HAVE TO YOUR PRECIOUS STATEFUL BACKEND. You and I can get along very well once we finally understand that we are working together to pull an outstanding service out of barely good enough requirements. Let’s stop blaming each other for being more or less important, let us blame the economy, global warming and our managers all together 🙂

- thehftguy says:
  
  6 November 2016 at 16:11
  
  The article is listing decisions and actions one after the other. In the real world, they happened over an entire year.
  
  We couldn’t predict that the registry would be abandoned, that the only way to get “that critical bugfix” would be to update to docker v1.next, that a rare race condition would kill host from time to time once they get enough activity and so on…
  
  We could have been better prepared for cleaning images, for picking OS, for using custom non default filesystems. Or wait no, we couldn’t because there was no resources on the internet nor the official documentation to warn/advise about that.
  
  I’m pretty sure I never used the word ‘meaningless’ anywhere. And it was no accident to never say that 😉
  
  P.S. One final assertion: “You don’t mess with docker, docker messes with you” 😀
  
Gil Mendes says:

6 November 2016 at 11:49

Reblogged this on Gil Mendes and commented:
This seemed like an interesting article to share. I still don’t much like the idea of running services in containers, and this article has made me even less inclined to do it.

Sergei Egorov (@bsideup) says:

6 November 2016 at 16:08

> At the time it wasn’t possible to run a container [in the background]

http://stackoverflow.com/a/25268154/1826422
`docker run -d`, Aug 2014

> there wasn’t any command to see what was running, debug or ssh into the container

https://blog.docker.com/2014/10/docker-1-3-signed-images-process-injection-security-options-mac-shared-directories/

`docker exec`, October 2014

> The only way to clean space is to run this hack, preferably in cron every day:

WTF? Seems like a standard way to perform some cleanups, the Linux way

> AUFS is entirely gone

Yeah, this is why storage is pluggable
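
The two commands cited above, plus `docker ps` for listing what is running, can be sketched as argument lists. This is only an illustration under assumptions: the helper names, the container name `demo` and the image `debian:8.6` are hypothetical, and nothing is executed here, so no Docker daemon is needed to run the snippet.

```python
import shlex
from typing import List


def run_detached(image: str, name: str) -> List[str]:
    """Build the argv that starts a container in the background (-d)."""
    return ["docker", "run", "-d", "--name", name, image]


def exec_shell(name: str, shell: str = "sh") -> List[str]:
    """Build the argv that opens an interactive shell in a running container."""
    return ["docker", "exec", "-it", name, shell]


def list_running() -> List[str]:
    """Build the argv that lists running containers."""
    return ["docker", "ps"]


def render(argv: List[str]) -> str:
    """Quote the argv so it can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical container name and image, for illustration only.
    print(render(run_detached("debian:8.6", "demo")))
    print(render(list_running()))
    print(render(exec_shell("demo")))
```

Passing one of these lists to `subprocess.run` would execute it on a host where the `docker` client is installed.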

Javier Ramirez (@frjaraur) says:

6 November 2016 at 17:27

I am sorry, but your experience sounds unreal 😉 I have been working with Docker for more than a year and I found some issues and always get a workaround or a new version fixed, and yes… if you want a fully supported environment… you should pay as you will do with RedHat OSes and Openshift environment.
Docker is not perfect and is growing fast (sometimes too fast), but that’s what we want today… There are tons of user requests for new improvements….And of course…many changes means errors and wrong decisions…. you can always stay running 3.10 kernels 😉
I pay if I have a Production environment that could run commercially supported versions….And never use internet repositories for your Production applications :P, don’t update things without a Development and Q&A environments… it’s a simple rule that most people follow.

- mister says:
  
  7 November 2016 at 17:44
  
  Why is his experience “unreal”? Because you haven’t seen it for yourself? That’s a naive and short-sighted view.
  
  Your experience seems to work well despite some issues and workarounds however–how does your current workload stack up against a HFT environment like his??
  
  No really, how does it?
  
  This blog experience is about Docker use in Production. Your suggestion to pay for support doesn’t mean that you won’t have the same issues.
  
  Here are my suggestions for reading this blog: Check your feelings at the door. Sometimes war stories are better than documentation.
  
  - frjaraur says:
    
    7 November 2016 at 20:10
    
    I am sorry but being always at war is not the only option, there are best practices that will take you to a peaceful environment.
    From your post I have to understand that you haven’t had any bug or issue on any other application or software in production, you are the luckiest man I have known. I have been working in production for more than fifteen years… And yes there are many wars always… Not just with docker. Issues are not avoided with support (on every little thing in my datacenter) but response times are better… Don’t you think so😎?
    Keep calm and still dockering, and if we want better software, changes and errors have to be made… That’s the way we all learn…
    Javier R.
    
    - mister says:
      
      8 November 2016 at 00:18
      
      Not correct. Also, insisting on making an assumption won’t make your point any more valid either.
      
      Try sticking to the facts instead of trying to “calm” somebody down.
      
Links 6/11/2016: Vista 10 Plagued With Ads, Linux 4.9 Now in RC4 | Techrights says:

6 November 2016 at 21:58

[…] Docker in Production: A History of Failure […]

Mihael says:

6 November 2016 at 22:18

“It takes 5 years to make good and stable software…” It takes people using and testing software to make it good; time is as big a factor as complexity, use case and popularity. Take a look at BTRFS…

Desarrollador, pon contenedores en tu vida – Modesto San Juan says:

6 November 2016 at 23:10

[…] Much has been written on the subject and much more will be written. It has also been written how bad Docker is and how not-so-bad […]

Fachliches Ziel says:

6 November 2016 at 23:30

[…] which can host a wide variety of services (databases, web servers, etc.). For broad, stable operation, however, one will still have to wait a few years […]

bmullan says:

7 November 2016 at 02:32

Have you looked into what LXD might be able to provide you?
http://www.linuxcontainers.org

They are system containers rather than application containers, but the REST API and CLI enable provisioning and management of both local and remote containers. As of the 2.3 release a couple of months ago, VXLAN is supported.

And combined with Juju and OpenStack, there are strong management tools.

Check out some of the solutions popping up on the LXD subreddit for clustering etc.

develop2devop says:

7 November 2016 at 03:06

Docker in production here, and the stories mostly check out. No backports for critical bugfixes, paired with backwards-incompatible minor version bumps; an unreliable and honestly unworkable registry system (if you’re under the illusion that the ‘latest’ tag is just a silly but misguided quality-of-life feature, you should read the registry code around image lookup); and missing features due to form-over-function design. On that last note, ‘docker cp’ being introduced in version 1.8, begrudgingly, still boggles my mind.

After the investment Red Hat put into devicemapper and Red Hat Atomic, I believe the rumors about a possible hostile fork. And to be honest, at this point I would welcome it.

Liked by 1 person
oschad says:

7 November 2016 at 08:03

I agree with the general bottom line: it looks like the Docker guys have never written a daemon before, or any important system tool/library at all.

You’ve missed one problem: logging. The Docker daemon has an unbounded log buffer. Yes, it will allocate all your system memory if your log backend, or the Docker daemon itself, is too slow to deliver the log messages.

So the machine just dies at some point if a developer uses a wrong, or just log-intensive, log setting.

I’ve never seen such a crappy log implementation. So just don’t use the Docker daemon as a logging tool. It will break your production at some point.

Elka Y. says:

7 November 2016 at 08:42

Docker in privileged mode is the lord of instability when executed in parallel.

Steffen Yount says:

7 November 2016 at 08:42

@thehftguy – Have you tried running your Docker images using Joyent’s Triton?

Triton’s lightweight Docker containers rely on the Solaris Zones container implementation, so they should be truly isolated, unlike the containers currently provided by Linux-based solutions.

- thehftguy says:
  
  7 November 2016 at 18:51
  
  I’m afraid I can’t do that. We’re on Linux, gotta live with Docker.
  
  None of the issues are related to the concept of containers, but to the specific implementation (docker). It sure would be a different experience on Solaris/BSD with zones/jails.
  
  - Steffen Yount says:
    
    9 November 2016 at 01:13
    
    My point was that Triton should provide a stable ecosystem for running Docker images.
    
    And it could provide that winning combination of kernel + distribution + docker version + filesystem you’ve been looking for.
    
    The cool thing about it is that Joyent has built out a complete ABI compatibility layer for Linux binaries, and they’ve also implemented the Docker Remote APIs.
    
    So with Triton, you deploy the same Docker images you’ve already built for Linux, using the same container scheduler (e.g. Docker Compose, Marathon, Kubernetes), calling the same Docker Remote APIs, and as far as your build and deploy infrastructure is concerned you are effectively running “Docker on Linux”.
    
    But it’s better than “Docker on Linux” since you’re actually running on a mature and robust tech stack with:
    1. truly isolated bare metal containers (Zones)
    2. isolated network stacks (Zones)
    3. mature integrated filesystem support (Zones + ZFS)
    4. live and post-mortem debugging (DTrace)
    
    The upshot is that Docker instances hosted in Triton can’t destroy everything they touch, and when they do crash you have the ability to debug what’s gone wrong.
    
    You may want to give it a look.
    
jillesvangurp says:

7 November 2016 at 09:56

Docker works alright if you don’t insist on running out-of-date kernels and you set it up right. I’ve experienced kernel panics with older kernels (e.g. CentOS 6.5 ran some ancient kernel) but not recently with Ubuntu 14.04 and 16.04.

Given the recent attention paid to fixing kernel panics resulting from using Docker and other stuff, I’d recommend using whatever has a more recent kernel, since the kernel devs do tend to fix a ton of issues between releases. Personally, I’ve been pretty happy with Ubuntu 14.04 and 16.04. That OS is popular enough that there are a lot of people using it with Docker.

Running Docker on a 3.x kernel is IMHO asking for trouble, and doing that on an OS that is relatively unpopular doubly so. Debian is nice if you run a simple LAMP stack, but for anything else you are basically on your own doing the testing and integration. Docker is evolving fast and so is the kernel. Updating one without the other is pointless. If you are going to run Docker, there shouldn’t be much else running on the box, and you have essentially no excuse for not picking something more optimal.

- thehftguy says:
  
  7 November 2016 at 19:03
  
  Yep. I’d go even further and advise to _only_ use Docker with CoreOS. It’s the only OS reasonably kept up to date for that purpose (in the long term).
  
  I think it’s unfair to call CentOS 6 and Debian 8 “old OS”. That’s what’s used in most production systems. If Docker can’t run on that, it should be marked as an “experimental-only tool” and have warning messages. [I personally started experimenting with Docker when CentOS was the latest release of CentOS.]
  
mrg says:

7 November 2016 at 14:53

Enterprise running over 100 different services and over 1000 containers, including Cassandra, MongoDB, Redis, a full CI pipeline and a dozen other services requiring persistent storage, since Docker 0.6.2. We did encounter some of the issues you mentioned, but we simply fixed the problems and moved along. We would never go back to the pre-Docker world.

First learn how your tool works, study its design, prove to yourself that it works, and you’ll be happy. This rule applies to everything that goes into production.

- thehftguy says:
  
  7 November 2016 at 19:06
  
  Two questions for you then:
  What OS/kernel/Docker versions are you using?
  Did you personally set up and fix the Docker issues, or was it done by another group/team in your organisation?
  
- Harris says:
  
  11 January 2017 at 13:46
  
  Why can’t you be a little more specific, for other people who might be interested in the solutions you found?
  
Jagger Foo says:

7 November 2016 at 18:41

Hello TheHFTGuy,

You should be cheered for this article, the discussion it has spawned, and your post-article comments.

The idea behind Docker is promising, but the experience in the real world may be disappointing for some. Your article has some really significant takeaways for anyone just entering the Docker world, like myself.

Thanks

Liked by 1 person
javier ramirez says:

8 November 2016 at 20:22

As you mention that’s “your history”…. and after reading all comments and having my own experience with docker … that’s all …

Glenn West says:

14 November 2016 at 05:45

Actually, I’m not sure at all that it’s “Docker” that’s the problem. Half of this is working very close to upstream and not having something that is “supported”. Your storage issues are solved by the storage management of Kubernetes. Also, a lot of your backporting etc. would have been solved by using RHEL. You’re generating close to a billion a year, yet you spend zero on having stable systems? Expecting upstream to support you within hours? Really hard to believe, coming from a brokerage background. In every “Linux” tree, there’s a reliability tree. This is true of the Debian/Ubuntu family and of the Fedora/CentOS/RHEL family. Docker is fast-changing and has a lot of innovation, yes, but it can run in production, and I’ve seen lots of people doing it in a fully supportable fashion.

Bruce Lee says:

14 November 2016 at 05:53

Very insightful article.

I’m running Docker 1.8 on top of devicemapper and I have to run an image deletion script ALL THE TIME. I was hoping that upgrading the infrastructure to Ubuntu 16 / overlay2 would fix it, but after reading your blog, I’m having second thoughts. I’m having moderate success with some Ubuntu 14 machines on Docker 1.9 that provide an LVM device to Docker directly.

I also can’t stand all the replies suggesting a switch to completely new technologies, or claiming to have no issues with Docker without saying what their environment is.

Marius says:

15 November 2016 at 08:42

CentOS 7 + Docker 1.9 + thinpool (direct LVM) + kernel 3.10.0-327: stable enough.
CentOS 7 + Docker 1.12 + overlay on ext4 + kernel 4.2 or higher: stable, I’d say.
The thing is, you need to find the right combination for your workload.
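
The kernel half of such a combination is easy to check mechanically. Below is a minimal Python sketch that compares a kernel release string against a minimum version. The 4.2 threshold comes from the combination listed above; `kernel_version` and `meets_minimum` are hypothetical helper names, and the check says nothing about the Docker version or the storage driver.

```python
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '3.10.0-327.el7.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release: str, minimum: Tuple[int, int] = (4, 2)) -> bool:
    """Return True when the kernel is at least `minimum` (4.2 by default, an assumption)."""
    return kernel_version(release) >= minimum


# Sample release strings; on a real host you could pass platform.release() instead.
for sample in ("3.10.0-327.el7.x86_64", "4.4.0-45-generic"):
    print(sample, meets_minimum(sample))
```

On a Linux host, `platform.release()` from the standard library returns the running kernel's release string, which can be passed to `meets_minimum` directly.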

Roman says:

17 November 2016 at 08:35

Haven’t used it myself, but this may be of interest for solving the zombie container problem: https://github.com/spotify/docker-gc

Liked by 1 person
- thehftguy says:
  
  17 November 2016 at 18:31
  
  We know this project very well, and we ignore it on purpose.
  
  It parses the docker text output [which is meant to be human readable, not programmatically readable] with complex regex & awk. It’s a very dirty solution that has proven to be unreliable at various points in time. There is no guarantee that it works now, nor that it will keep working in the future.
  
  We prefer the single-line hack we gave. It’s infinitely simpler and it’s clear that it’s a hack.
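
One way to avoid scraping the human-readable table is to ask `docker ps` for bare container IDs with `--quiet`, filtered by status. The Python sketch below is neither the one-line hack mentioned above nor anything taken from docker-gc; `parse_ids` and `exited_container_ids` are hypothetical names, and it assumes the `docker` CLI is on the PATH.

```python
import subprocess
from typing import List


def parse_ids(output: str) -> List[str]:
    """Split `docker ps --quiet` output into container IDs, one per non-empty line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def exited_container_ids() -> List[str]:
    """Return the IDs of exited containers. Requires a working docker CLI."""
    result = subprocess.run(
        ["docker", "ps", "--all", "--quiet", "--filter", "status=exited"],
        check=True,           # raise if docker exits with a non-zero status
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout as str
    )
    return parse_ids(result.stdout)
```

Because `--quiet` prints nothing but IDs, there are no column layouts or headers to parse, so the only text handling left is splitting on newlines.
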
  
Harold Naparst says:

21 November 2016 at 23:13

You asked about 47 times for people to give you a stable setup, but no one did. I have experienced a lot of the problems you mentioned for the reasons you mentioned, and it is all because of trying to run an old kernel, or the wrong distro, or the wrong storage manager. I have tried CoreOS (which is oriented towards Fleet), Ubuntu, CentOS (out of date), as well as Devicemapper (avoid).

I am not running a large shop, and I expect to get flamed for saying this, but I run the latest Gentoo kernel (OK to use stable or later), with Overlay2. I use Docker 1.12 or later.

The Gentoo developers will tell you if your kernel options are appropriate for the version of docker you want to run. People may scoff, but I cannot tell you the number of times I have tried to use other distributions and failed for the sorts of reasons you discuss. With Gentoo, it just works. New AMIs are released about twice a month.

Thanks for the post.
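
As a rough illustration of what a kernel option check involves, here is a minimal Python sketch that scans kernel config text for options that are neither built in (`y`) nor modules (`m`). This is not Gentoo's actual check. The three option names are real kernel options but are chosen here only as an illustrative assumption, and `parse_kernel_config` and `missing_options` are hypothetical helpers.

```python
from typing import Dict, Iterable, List


def parse_kernel_config(text: str) -> Dict[str, str]:
    """Parse 'CONFIG_NAME=value' lines, skipping blanks and '#' comment lines."""
    options: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def missing_options(text: str, required: Iterable[str]) -> List[str]:
    """Return the required options that are not enabled as 'y' or 'm'."""
    options = parse_kernel_config(text)
    return [name for name in required if options.get(name) not in ("y", "m")]


# Illustrative sample config and option list (assumptions, not an official requirement list).
SAMPLE = """\
CONFIG_NAMESPACES=y
CONFIG_CGROUPS=y
# CONFIG_OVERLAY_FS is not set
"""
REQUIRED = ("CONFIG_NAMESPACES", "CONFIG_CGROUPS", "CONFIG_OVERLAY_FS")

print(missing_options(SAMPLE, REQUIRED))
```

With the sample text, the only option reported as missing is `CONFIG_OVERLAY_FS`, since a "is not set" line is a comment and never enables the option.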
