Docker in Production: An Update


The previous article, Docker in Production: A History of Failure, was quite a hit.

After long discussions, hundreds of pieces of feedback, thousands of comments, meetings with various individuals and major players, more experimentation and more failures, it’s time for an update on the situation.

We’ll go over the lessons learned from all the recent interactions and articles, but first, a reminder and a bit of context.

Disclaimer: Intended Audience

The large number of comments made it clear that the world is divided into 10 kinds of people:

1) The Amateur

Running mostly test and side projects with no real users. May think that using Ubuntu beta is the norm and call anything “stable” obsolete.

Can’t blame him. It worked on his machine.

2) The Professional

Running critical systems for a real business with real users, definitely accountable, probably gets a phone call when shit hits the fan.

Didn’t work on the machine that served his 586 million customers.

Other) The Aerospace Guy

For the record: I was in aerospace before I was in finance.

Forgot a semi-colon? 100 people died.

What Audience Are You?

There is a fine line between these worlds, and they clash pretty hard whenever they meet. Obviously, they have very different standards and expectations.

One of the reasons I love finance is that it has a great culture of risk. Contrary to popular belief, that doesn’t mean being risk-averse. It means evaluating potential risks and potential gains and weighing them against each other.

You should take a minute to think about your standards. What do you expect to achieve with Docker? What do you have to lose if it crashes all the systems it’s running on and corrupts the mounted volumes? These are important factors to drive your decisions.

What pushed me to publish the last article was a conversation with a guy from a random finance company, who just asked for my thoughts about Docker because he was considering considering it. Among other things, this company (and this guy in particular) manages systems that handle trillions of dollars, including the pensions of millions of Americans.

Docker is nowhere near ready to handle my mother’s pension; how could anyone ever think that??? Well, it seemed the Docker experience wasn’t documented enough.

What Do You Need to Run Docker?

As you should be aware by now, Docker is highly sensitive to the kernel, the host and the filesystem it’s using. Pick the wrong combination and you’re talking kernel panics, filesystem corruption, Docker daemon lockups, etc…

I had time to collect feedback on various operating conditions and test a couple myself.

We’ll go over the results of the research: what has been reported to work, to not work, to experience intermittent failures, or to blow up entirely in epic proportions.

Spoiler Alert: There is nothing that’s guaranteed to work.

Disclaimer: Understand the Risks and the Consequences

I am biased toward my own standards (as a professional who has to handle real money) and toward the feedback I got (with a bias toward reliable sources known for operating real-world systems).

For instance, if a combination of operating system and filesystem is marked as “no-go: registered catastrophic filesystem failure with full volume data loss”, it is not production-ready (for me), but it might be production-ready for a student who has to do a one-off exercise in a Vagrant virtual machine.

You may or may not experience the issues mentioned. They are mentioned because there are definitely people who encountered them, and if you try the same environment, you are on the right path to become one of them.

The worst that can (and usually does) happen with Docker is that it seems okay during the proofs of concept, and you only begin to notice and understand the issues far down the line, when you cannot easily move away from it.

CoreOS

CoreOS is an operating system that can only run containers and is exclusively intended to run them.

Last article, the conclusion was that it might be the only operating system that may be able to run Docker. This may or may not be accurate.

We abandoned the idea of running CoreOS.

First, the main benefit of Docker is to unify dev and production. Having a separate OS in production only for containers totally ruins this point.

Second, Debian (we were on Debian) announced the next major release for Q1 2017. It takes a lot of effort to understand and migrate everything to CoreOS, with no guarantee of success. It’s wiser to just wait for the next Debian.

CentOS/RHEL

CentOS/RHEL 6

Docker on CentOS/RHEL 6 is no-go: known filesystem failures, full volume data loss

  1. Various known issues with the devicemapper driver.
  2. Critical issues with LVM volumes in combination with devicemapper causing data corruption, container crash, and docker daemon freeze requiring hard reboot to fix.
  3. The Docker packages are not maintained on this distribution. There are numerous critical bug fixes that were released in the CentOS/RHEL 7 packages but were not backported to the CentOS/RHEL 6 packages.
The only way to migrate to Docker in a big company still running on RHEL 6 => Don’t do it. EMERGENCY ABORT before it’s too late!

CentOS/RHEL 7

CentOS/RHEL 7 originally ran kernel 3, and RedHat has been backporting kernel 4 features into it, since those features are mandatory for running Docker.

This caused problems at times because Docker failed to detect the custom kernel version and the features available in it, so it could not set proper system settings and failed in various mysterious ways. Every time this happened, it could only be resolved by Docker publishing a fix to feature detection for specific kernels, which is neither a timely nor a systematic process.
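To illustrate why version-based feature detection misfires on a kernel full of backports, here is a minimal Python sketch contrasting a version check with a direct probe. It is not Docker’s actual detection code: the function names are hypothetical, and it assumes overlay support is the feature in question (overlayfs entered the mainline kernel in 3.18, while RHEL 7 kernels still report a 3.10 version string).

```python
import re


def version_says_overlay(kernel_release: str) -> bool:
    """Naive check: trust the version string (overlayfs is mainline since 3.18)."""
    match = re.match(r"(\d+)\.(\d+)", kernel_release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= (3, 18)


def probe_says_overlay(proc_filesystems: str) -> bool:
    """Probe check: look for 'overlay' in the contents of /proc/filesystems.

    Each line is an optional 'nodev' flag followed by a filesystem name.
    The overlay module must already be loaded (e.g. via modprobe) to be listed.
    """
    for line in proc_filesystems.splitlines():
        fields = line.split()
        if fields and fields[-1] == "overlay":
            return True
    return False


# A RHEL 7 style kernel: old version number, backported feature present.
release = "3.10.0-514.el7.x86_64"
filesystems = "nodev\tsysfs\nnodev\tproc\nnodev\toverlay\n\txfs\n"

print(version_says_overlay(release))    # False: the version check misses the backport
print(probe_says_overlay(filesystems))  # True: the probe sees the feature itself
```

On a live host, the probe would read `/proc/filesystems` directly. A version check can only ever be as good as the list of custom kernels it already knows about, which is exactly the catch-up game described above.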

There are various issues with the usage of LVM volumes, depending on the version.

Otherwise, it’s a mixed bag. Your mileage may vary.

As of CentOS 7.0, RedHat recommended some settings, but I can’t find the page on their website anymore. Anyway, there are tons of critical bug fixes in later versions, so you MUST update to the latest version.

As of CentOS 7.2, RedHat recommends and supports exclusively XFS and they give special flags for the configuration. AUFS doesn’t exist, OverlayFS is officially considered unstable, BTRFS is beta (technology preview).

RedHat employees themselves admit that they struggle pretty hard to get Docker working in proper conditions, which is a major problem because they gotta resell it as part of their OpenShift offering. Try making a product on an unstable core.

If you like playing with fire, it looks like that’s the OS of choice.

Note that for once, this is a case where you surely want RHEL and not CentOS, meaning timely updates and helpful support at your disposal.

Debian

Debian 8 jessie (stable)

A major cause of the issues we experienced was because our production OS was Debian stable, as explained in the previous article.

Basically, Debian froze the kernel at a version that doesn’t support anything Docker needs, and the few components that are present are riddled with bugs.

Docker on Debian is a major no-go: there is a wide range of bugs in the AUFS driver (among others), usually crashing the host, potentially corrupting the data, and that’s just the tip of the iceberg.

Docker is 100% guaranteed suicide on Debian 8 and it’s been since the inception of Docker a few years ago. It’s killing me no one ever documented this earlier.

I wanted to show you a graph of AWS instances going down like dominoes but I didn’t have a good monitoring and drawing tool to do that, so instead I’ll illustrate with a piano chart that looks the same.


Typical Docker cascading failure on our test systems. A test slave crashes… the next one retries two minutes later… and dies too. This specific cascade took 6 tries to go past the bug, slightly more than usual, but nothing fancy.

You should have CloudWatch alarms to restart dead hosts automatically and send crash notifications.
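A minimal sketch of such an alarm in Python, assuming boto3 and AWS credentials with CloudWatch access. It uses the EC2 reboot alarm action on the StatusCheckFailed_Instance metric, which is the check that fails when a kernel panic takes the instance down. The alarm name, the three-minute window and the optional SNS topic are illustrative assumptions, not our production settings.

```python
from typing import Optional


def reboot_alarm_params(instance_id: str, region: str,
                        sns_topic_arn: Optional[str] = None) -> dict:
    """Keyword arguments for CloudWatch's put_metric_alarm.

    Reboots the instance when its instance status check fails for three
    consecutive one-minute periods, and optionally notifies an SNS topic.
    Name, period and thresholds are illustrative choices.
    """
    actions = [f"arn:aws:automate:{region}:ec2:reboot"]
    if sns_topic_arn:
        actions.append(sns_topic_arn)  # crash notification
    return {
        "AlarmName": f"reboot-dead-host-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_Instance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


def create_reboot_alarm(instance_id: str, region: str,
                        sns_topic_arn: Optional[str] = None) -> None:
    """Create (or overwrite) the alarm; needs boto3 and valid AWS credentials."""
    import boto3  # third-party, imported lazily so the builder works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(**reboot_alarm_params(instance_id, region, sns_topic_arn))
```

Keeping the parameter builder separate from the API call makes it easy to review and test the alarm definition before anything touches a real account.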

Fancy: You can also have a CloudWatch alarm automatically send a customized issue report to your regulator whenever an issue persists for more than 5 minutes.

Not to brag, but we got quite good at containing Docker. Forget about Chaos Monkey, that’s child’s play; try running trading systems handling billions of dollars on Docker [1].

[1] Please don’t do that. That’s a terrible idea.

Debian 9 stretch

Debian stretch is planned to become the stable edition in 2017. (Note: it might be released while I write and edit this article.)

It will feature kernel 4.9, the latest one, which also happens to be an LTS kernel.

At the time of release, Debian Stretch will be the most up to date stable operating system and it will allegedly have all the shiny things necessary to run Docker (until the Docker requirements change again).

It may resolve a lot of the issues, and it may create tons of new ones.

We’ll see how it goes.

Ubuntu

Ubuntu has always been more up to date than the regular server distributions.

Sadly, I am not aware of any serious companies that run on Ubuntu. This has been a source of much misunderstanding in the Docker community because developers and amateur bloggers try things on the latest Ubuntu (not even the LTS [1]), yet it’s utterly unrepresentative of production systems in the real world (RHEL, CentOS, Debian or one of the exotic Unix/BSD/Solaris).

I cannot comment on LTS 16 as I do not use it. It’s the only distribution to have Overlay2 and ZFS available, which gives some more options to try and maybe find something that works?

LTS 14 is a definite no-go: too old, it doesn’t have the required components.

[1] I received quite a few comments and unfriendly emails from people saying to “just” use the latest Ubuntu beta. As if migrating all live systems, changing distributions and running on a beta platform that didn’t even exist at the time was an actual solution.

AWS Container Service

AWS has an AMI dedicated to running Docker. It is based on an Ubuntu.

As confirmed by internal sources, they had massive trouble getting Docker working in any decent condition.

Ultimately, they released an AMI for it, running a custom OS with a custom Docker package with custom bug fixes and custom backports. They went through, and are still going through, extensive efforts and testing to keep things together.

If you are locked in on Docker and running on AWS, your only salvation might be to let AWS handle it for you.

Google Container Service

Google offers containers as a service, but more importantly, as confirmed by internal sources, their offering is 100% NOT Dockerized.

Google merely exposes a Docker interface; all the containers run on internal Google containerization technologies, which cannot possibly suffer from all the Docker implementation flaws.

That is a huge label of quality: Containers without docker.

Don’t get me wrong. Containers are great as a concept, the problem is not the theoretical aspect, it’s the practical implementation and tooling we have (i.e. Docker) which are experimental at best.

If you really want to play with Docker (or containers) and you are not operating on AWS, that leaves Google as the single strongest choice. Better yet, it comes with Kubernetes for orchestration, putting it in a league of its own.

That should still be considered experimental and playing with fire. It just happens that it’s the only thing that may deliver the promises and also the only thing that comes with containers AND orchestration.

OpenShift / Cloud Foundry

It’s not possible to build a stable product on a broken core, yet both Pivotal and RedHat are trying.

From the feedback I had, they are both struggling pretty hard to mitigate the Docker issues, with variable success. Your mileage may vary.

Considering that they both appeal to large companies, who have quite a lot to lose, I’d really question the choice of going that route (i.e. anything built on top of Docker).

You should try the regular clouds instead: AWS or Google or Azure. Using virtual machines and some of the hosted services will achieve 90% of what Docker does, 90% of what Docker doesn’t do, and it’s dependable. It’s also a better long-term strategy.

Chances are that you want to do OpenShift / Cloud Foundry because you can’t do public cloud. Well, that’s a tough spot to be in. (Good luck with that. Please write a blog post in reply to talk about your experience.)

Summary

  • CentOS/RHEL: Russian roulette
  • Debian: Jumping off a plane naked
  • Ubuntu: Not sure
  • CoreOS: Not worth the effort
  • AWS Containers: Your only salvation if you are locked-in with Docker and on AWS
  • Google Containers: The only practical way to run Docker that is not entirely insane.
  • Cloud Foundry: Not sure. Depends on how well the support and engineers can manage.
  • OpenShift: Same as Cloud Foundry.

A Business Perspective

Docker has no business model and no way to monetize. It’s fair to say that they are releasing to all platforms (Mac/Windows) and integrating all kinds of features (Swarm) as a desperate move to 1) not let any competitor have any distinctive feature 2) get everyone to use Docker and Docker tools 3) lock customers completely into their ecosystem 4) publish a ton of news, articles and releases in the process, increasing hype 5) justify their valuation.

It is extremely tough to execute an expansion both horizontally and vertically to multiple products and markets. (Ignoring whether that is an appropriate or sustainable business decision, which is a different aspect).

In the meantime, the competitors, namely Amazon, Microsoft, Google, Pivotal and RedHat, all compete in various ways and make more money on containers than Docker does, while CoreOS is working on an OS (CoreOS) and a competing containerization technology (Rocket).

That’s a lot of big names with a lot of firepower directed to compete intensively and decisively against Docker. They have zero interest whatsoever in letting Docker lock anyone in. If anything, they individually and collectively have an interest in killing Docker and replacing it with something else.

Let’s call that the war of containers. We’ll see how it plays out.

Currently, Google is leading the way: they already killed Docker (GKE runs on internal Google technology, not Docker) and they are the only ones to provide out-of-the-box orchestration (Kubernetes).

Conclusion

Did I say that Docker is an unstable toy project?

Invariably some people will say that the issues are not real or in the past. They are not in the past, the challenges and the issues are very current and very real. There is definite proof and documentation that Docker has suffered from critical bugs making it plain unusable on ALL major distributions, bugs that ran rampant for years, some still present as of today.

If you look for any combination of “docker + version + filesystem + OS” on Google, you’ll find a trail of issues with various impacts going all the way back to Docker’s birth. It’s a mystery how something could fail that badly for that long with no one writing about it. (Actually, there are a few articles; they were just lost under the mass of advertisements and quick evaluations.) The last software to achieve that level of expectation with that level of failure was MongoDB.

I didn’t manage to find anyone on the planet using Docker seriously AND successfully AND without major hassle. The experiences mentioned in this article were acquired by blood, the blood of employees and companies who learned Docker the hard way while every second of downtime was a $1000 loss.

Hopefully, you can learn from our past, so as not to repeat it.

mistake - it could be that the purpose of your life is only to serve as a warning to others

If you were wondering whether you should have adopted Docker years ago => The answer is hell no, you dodged a bullet. You can tell that to your boss. (It’s still not that useful today if you don’t have proper orchestration around it, which is itself an experimental subject.)

If you are wondering whether you should adopt it now… while what you run is satisfactory and you care at all about quality => The reasonable answer is to wait until RHEL 8 and Debian 10. No rush. Things need to mature, and the packages ain’t gonna move faster than the distributions you’ll run them on.

If you like to play with fire => Full-on Google Container Engine on Google Cloud. Definitely high risk, probably high reward.

Would this article have more credibility if I linked numerous bug reports, screenshots of kernel panics, personal charts of system failures over the day, relevant forum posts and disclosed private conversations? Probably.

Do I want to spend yet another hundred hours to dig that up, once again? Nope. I’d rather spend my evenings on Tinder than on Docker. Bye bye Docker.

Moving On

Back to me. My action plan to lead the way on Containers and Clouds had a major flaw I missed: the average tenure in tech companies is still not counted in years. Thus, 2017 began with me being poached.

Bad news: No more cloud and no more Docker where I am going. Meaning no more groundbreaking news and you are on your own to figure it out.

Good news: No more toying around with billions of dollars of other people’s money… since I am moving up by at least 3 orders of magnitude! I am moderately confident that my new immediate playground includes the pensions of a few million Americans, including a lot of people who read this blog.

Rest assured: Your pension is in good hands! =D

250 GB/day of logs with Graylog: Lessons Learned


Architecture

Graylog Architecture
  • Load Balancer: Load balancer for log input (syslog, kafka, GELF, …)
  • Graylog: Logs receiver and processor + Web interface
  • ElasticSearch: Logs storage
  • MongoDB: Configuration, user accounts and sessions storage

Costs Planning

Hardware requirements

  • Graylog: 4 cores, 8 GB memory (4 GB heap)
  • ElasticSearch: 8 cores, 60 GB memory (30 GB heap)
  • MongoDB: 1 core, 2 GB memory (whatever comes cheap)

AWS bill

 + $ 1656 elasticsearch instances (r3.2xlarge)
 + $  108   EBS optimized option
 + $ 1320   12TB SSD EBS log storage
 + $  171 graylog instances (c4.xlarge)
 + $  100 mongodb instances (t2.small :D)
===========
 = $ 3355
 x    1.1 premium support
===========
 = $ 3690 per month on AWS

GCE bill

 + $  760 elasticsearch instances (n1-highmem-8)
 + $ 2040 12 TB SSD persistent disk log storage
 + $  201 graylog instances (n1-standard-4)
 + $   68 mongodb (g1-small :D)
===========
 = $ 3069 per month on GCE

GCE is 9% cheaper than the AWS bill before premium support, and about 17% cheaper once the 10% support premium is included. Admire how the bare elasticsearch instances are 54% cheaper on GCE (ignoring the EBS flag and support options).

The gap is narrowed by SSD volumes being more expensive on GCE than on AWS ($0.17/GB vs $0.11/GB). This setup is a huge consumer of disk space, so the higher disk pricing eats part of the savings on instances.

Note: The GCE volume may deliver 3 times the IOPS and throughput of its AWS counterpart. You get what you pay for.
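The totals and percentages above can be reproduced with a few lines of Python, using only the line items from the two bills. One assumption: 12 TB is taken as 12,000 GB (decimal units), which is what the per-GB prices work out to.

```python
# Monthly line items from the two bills above, in USD.
aws = {
    "elasticsearch (r3.2xlarge)": 1656,
    "EBS optimized option": 108,
    "12 TB SSD log storage": 1320,
    "graylog (c4.xlarge)": 171,
    "mongodb (t2.small)": 100,
}
gce = {
    "elasticsearch (n1-highmem-8)": 760,
    "12 TB SSD log storage": 2040,
    "graylog (n1-standard-4)": 201,
    "mongodb (g1-small)": 68,
}

aws_base = sum(aws.values())
aws_total = aws_base * 11 // 10  # +10% premium support, truncated to whole dollars
gce_total = sum(gce.values())
storage_gb = 12_000              # 12 TB in decimal units


def pct_cheaper(cheap: float, expensive: float) -> float:
    """How much cheaper `cheap` is than `expensive`, in percent."""
    return 100 * (1 - cheap / expensive)


print(aws_base, aws_total, gce_total)                  # 3355 3690 3069
print(round(pct_cheaper(gce_total, aws_base)))         # 9
print(round(pct_cheaper(gce_total, aws_total)))        # 17
print(round(pct_cheaper(760, 1656)))                   # 54
print(round(1320 / storage_gb, 2), round(2040 / storage_gb, 2))  # 0.11 0.17
```

The 9% figure compares against the AWS bill before premium support; including support widens the gap to about 17%.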

Capacity Planning

Performance (approximate)

  • 1600 log/s average, over the day
  • 5000 log/s sustained, during active hours
  • 20000 log/s burst rate

Storage (as measured in production)

  • 138 906 326 logs per day (averaged over the last 7 days)
  • 2200 GB used, for 9 days of data
  • 1800 bytes/log on average

Our current logs require 250 GB of space per day. 12 TB will allow for 36 days of log history (at 75% disk usage).

We want 30 days of searchable logs. Job done!
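A quick Python check of the capacity numbers, using only the measurements above and treating 12 TB as 12,000 GB. The helper name days_of_history is just for illustration.

```python
SECONDS_PER_DAY = 86_400

logs_per_day = 138_906_326   # measured, averaged over 7 days
bytes_per_log = 1800         # measured average size

daily_gb = logs_per_day * bytes_per_log / 1e9   # decimal GB per day
average_rate = logs_per_day / SECONDS_PER_DAY   # log/s over the whole day


def days_of_history(disk_gb: float, max_usage: float, gb_per_day: float) -> float:
    """Days of logs that fit on disk while staying under the usage ceiling."""
    return disk_gb * max_usage / gb_per_day


print(round(daily_gb))                     # 250
print(round(average_rate))                 # 1608
print(round(2200 / 9))                     # 244, from the 9-day measurement
print(days_of_history(12_000, 0.75, 250))  # 36.0
```

The per-log estimate (about 250 GB/day) and the direct disk measurement (about 244 GB/day) agree, and both the 1600 log/s daily average and the 36 days of history follow from them.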

Competitors

ELK

Dunno, never seen it, never used it. Probably a lot of the same.

Splunk Licensing

The Splunk license is based on the volume ingested in GB/day. Experience has taught us that we usually get what we pay for, therefore we love to pay for great expensive tools (note: ain’t saying Splunk is awesome, don’t know, never used it). In the case of Splunk vs ELK vs Graylog, it’s hard to justify the enormous cost against two free tools which are seemingly okay.

We experienced a DoS one afternoon, a few weeks after our initial small setup: 8000 log/s for a few hours while we were planning for 800 log/s.

A few weeks later, the volume suddenly went up from 800 log/s to 4000 log/s again. This time it was because debug logs and PostgreSQL performance logs were both turned on in production. One team was tracking a Heisenbug while another team felt like doing some performance analysis. They didn’t bother to synchronise.

These unexpected events made two things clear. First, Graylog proved to be reliable and scalable during trial by fire. Second, log volumes are unpredictable and highly variable. A volume-based licensing is a highway to hell, we are so glad to not have had to put up with it.

Judging by the information on Splunk website, the license for our current setup would be in the order of $160k a year. OMFG!

How about the cloud solutions?

One word  : No.
Two words: Strong No.

The amount of sensitive information and private user data available in logs make them the ultimate candidate for not being outsourced, at all, ever.

No amount of marketing from SumoLogic is gonna change that.

Note: We may be legally forbidden to send our log data to a third party, though it would take a lawyer to confirm or deny that for sure.

Log management explained

Feel free to read “Graylog” as “<other solution>”. They’re all very similar with most of the same pros and cons.

What Graylog is good at

  1. debugging & postmortem
  2. security and activity analysis
  3. regulations

Good: debugging & postmortem

Logs allow us to dive into what happened millisecond by millisecond. They are the first and last resort tool when it comes to debugging issues in production.

That’s the main reason logs are critical in production. We NEED the logs to debug issues and keep the site running.

Good: activity analysis

Logs give an overview of the activity and the traffic. For instance, where are most frontend requests coming from? who connected to ssh recently?

Good: regulations

When we gotta have searchable logs and it’s not negotiable, we gotta have searchable logs and it’s not negotiable. #auditing

What Graylog is bad at

  1. (non trivial) analytics
  2. graphing and dashboards
  3. metrics (à la Graphite)
  4. alerting

Bad: (non trivial) Analytics

Facts:

1) ElasticSearch cannot do joins or processing (à la MapReduce)
2) Log fields have weak typing
3) [Many] applications send erroneous or shitty data (e.g. nginx)

Everyone knows that an HTTP status code is an integer. Well, not for nginx. It can log an upstream_status of ‘200’ or ‘’ or ‘503, 503, 503’. Searching nginx logs is tricky, and statistics fail with NaN (Not a Number) errors.

Elasticsearch itself has weak typing. It tries to detect field types automatically with variable success (i.e. systematic failure when receiving ambiguous data, defaulting to string type).

The only workaround is to write field pre/post-processors to sanitize inputs, but it’s cumbersome when there are countless applications and fields, each requiring a unique correction.
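As an example of such a pre-processor for the nginx case, here is a minimal Python sketch. It assumes messages arrive as dicts of strings (which is what the json_logs format in this article produces) and relies on nginx’s documented separators: commas between the servers tried in one upstream group, colons between groups after an internal redirect. The function names and the added fields (upstream_status_code, upstream_attempts, http_status_code) are hypothetical; where it runs (an extractor, a pipeline stage, a script in front of Graylog) is left open.

```python
import re
from typing import Dict, List, Optional


def parse_upstream_status(raw: str) -> List[int]:
    """Split nginx's $upstream_status into integer codes.

    Handles values such as '200', '' and '503, 503, 503', plus the colon
    that separates statuses from different upstream groups ('502 : 200').
    Anything that is not a plain number is dropped.
    """
    return [int(tok) for tok in re.split(r"[,:]", raw) if tok.strip().isdigit()]


def to_int(raw: str) -> Optional[int]:
    """'200' -> 200; anything else (including '' and '-') -> None."""
    raw = raw.strip()
    return int(raw) if raw.isdigit() else None


def sanitize(message: Dict[str, str]) -> Dict[str, object]:
    """Copy an nginx JSON log message and add typed fields next to the raw strings."""
    clean: Dict[str, object] = dict(message)
    codes = parse_upstream_status(message.get("upstream_status", ""))
    clean["upstream_status_code"] = codes[-1] if codes else None  # last attempt
    clean["upstream_attempts"] = len(codes)
    clean["http_status_code"] = to_int(message.get("http_status", ""))
    return clean


print(sanitize({"upstream_status": "503, 503, 200", "http_status": "200"}))
```

Writing the typed values to new fields, rather than overwriting the raw strings, avoids mapping conflicts with indices that already stored those fields as strings.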

In the end, the poor input data can break simple searches, and the inability to do joins prevents running complex queries at all.

It would be possible to do analytics by sanitizing log data daily and saving the result to BigQuery/RedShift but it’s too much effort. We better go for a dedicated analytics solution, with a good data pipeline (i.e. NOT syslog).

Lesson learnt: Graylog doesn’t replace a full fledged analytics service.

Bad: Graphing and dashboards

Graylog doesn’t support many kinds of graphs. It’s either “how-many-logs-per-minute” or “see-most-common-values-of-that-field” in the past X minutes. (Hopefully there will be more graphs as the product matures.) We could make dashboards, but we’re lacking interesting graphs to put into them.

edit: Graylog v2 is out; it adds automatic geolocation of IP addresses and a map visualization widget.

Bad: Metrics and alerting

Graylog is not meant to handle metrics. It doesn’t gather metrics. The graphing and dashboard capabilities are too limited to make anything useful even if metrics were present. The alerting capability is [almost] nonexistent.

Lesson learnt: Graylog is NOT a substitute for a monitoring system. It is not in competition with Datadog and statsd.

Special configuration

ElasticSearch field data

indices.fielddata.cache.size: 20%

By design, field data are loaded in memory when needed and never evicted. They will fill the memory until an OutOfMemory error. It’s not a bug, it’s a feature.

It’s critical to configure a cache limit to stop that “feature“.

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html

ElasticSearch shards are overrated

elasticsearch_shards = 1
elasticsearch_replicas = 1

Shards split an index logically [a shard is equivalent to a virtual index]. Operations on an index are transparently distributed and aggregated across its shards. This architecture allows scaling horizontally by distributing shards across nodes.

Sharding makes sense when a system is designed to use a single [big] index. For instance, a 50 GB index for http://www.apopularforum.com can be split into 5 shards of 10 GB and run on a 5-node cluster. (Note that a shard MUST fit in the Java heap for good performance.)

Graylog (and ELK) have a special mode of operation (inherent to log handling) in which new indices are created periodically. Thus, there is no need to shard each individual index because the architecture is already sharded at a higher level (across indices).

Log retention MUST be based on size

Retention = retention criteria * maximum number of indexes in the cluster.

e.g. 1 GB per index * 1000 indices = 1 TB of logs retained

The retention criteria can be a maximum time period [per index], a maximum size [per index], or a maximum document count [per index].

The ONLY viable retention criteria is to limit by maximum index size.

The other strategies are unpredictable and unreliable. Imagine a “fixed rotation every 1 hour” setting, the storage and memory usage of the index will vary widely at 2-3am, at daily peak time, and during a DDoS.
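To make the retention formula concrete, here is a minimal Python simulation of size-based rotation with a cap on the number of indices. It is not Graylog’s implementation: it rotates exactly when the next message would overflow the current index, which a real deployment only approximates, and the function names are hypothetical.

```python
from collections import deque
from typing import Iterable, List


def max_retained_bytes(max_index_bytes: int, max_indices: int) -> int:
    """Retention = size limit per index * maximum number of indices."""
    return max_index_bytes * max_indices


def simulate_size_rotation(message_sizes: Iterable[int], max_index_bytes: int,
                           max_indices: int) -> List[int]:
    """Ingest messages, rotating by size and deleting the oldest index past the cap.

    Returns the byte size of each retained index, oldest first.
    """
    indices = deque([0])
    for size in message_sizes:
        if indices[-1] > 0 and indices[-1] + size > max_index_bytes:
            indices.append(0)          # rotate: open a new index
            if len(indices) > max_indices:
                indices.popleft()      # retention: drop the oldest index
        indices[-1] += size
    return list(indices)


# Whatever the volume, retained storage never exceeds the formula's bound.
retained = simulate_size_rotation([1800] * 22_000, max_index_bytes=1_000_000, max_indices=5)
print(len(retained), sum(retained) <= max_retained_bytes(1_000_000, 5))  # 5 True
print(max_retained_bytes(10**9, 1000) == 10**12)                        # True: 1 GB * 1000 = 1 TB
```

Feeding ten times more messages only changes which indices survive, not how much disk they use. With time-based rotation, the same burst would produce indices of wildly different sizes and an unbounded total.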

mongodb and small files

smallfiles: true

mongodb is used for storing settings, user accounts and tokens. It’s a small load that can be accommodated by small instances.

By default, mongodb preallocates journals and database files. Running an empty database takes 5 GB on disk (and indirectly memory, for file caching and mmap).

The configuration to use smaller files (e.g. 128MB journal instead of 1024MB) is critical to run on small instances with little memory and little disk space.

elasticsearch is awesome

elasticsearch is the easiest database to set up and run in a cluster.

It’s easy to set up, it rebalances automatically, it shards, it scales, it can add/remove nodes at any time. It’s awesome.

Elasticsearch drops consistency in favour of uptime. It will continue to operate in most circumstances (in ‘yellow’ or ‘red’ state, depending on whether replicas are available for recovering data) and try to self-heal. In the meantime, it ignores the damage and works with a partial view.

As a consequence, elasticsearch is unsuitable for high-consistency use cases (e.g. managing money) which must stop on failure and provide transactional rollback. It’s awesome for everything else.

mongodb is the worst database in the universe

There is extensive documentation about mongodb fucking up, being unreliable and destroying all data.

We came to a definitive conclusion after spending (read: wasting) lots of time with mongodb, in a clustered setup, in production. All the shit about mongodb is true.

We stopped counting the bugs, the configuration issues, and the number of times the cluster got deadlocked or corrupted (sometimes both).

Integrating with Graylog

The ugly unspoken truth of log management is that having a solution in place is only 20% of the work. The rest of the work is integrating applications and systems into it. Sadly, that has to be done one at a time.

JSON logs

The way to go is JSON logs. JSON format is clean, simple and well defined.

Reconfigure application libraries to send JSON messages. Reconfigure middleware to log JSON messages.
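For Python applications, a minimal sketch using only the standard library logging module: a formatter that renders each record as one JSON object per line. The field names (time_iso, level, logger, message) are assumptions chosen to echo the nginx format in this article, and the logger name "billing" is purely illustrative.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time_iso": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("billing")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("invoice %s sent", 42)
```

To route these messages through the local syslog daemon like everything else, the same formatter can be attached to `logging.handlers.SysLogHandler(address=("127.0.0.1", 514))`, assuming a syslog listener on UDP port 514.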

nginx

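# escape=json (nginx 1.11.8+) escapes quotes and control characters in variable values, keeping each line valid JSON
log_format json_logs escape=json '{ '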
 '"time_iso": "$time_iso8601",'

 '"server_host": "$host",'
 '"server_port": "$server_port",'
 '"server_pid": "$pid",'

 '"client_addr": "$remote_addr",'
 '"client_port": "$remote_port",'
 '"client_user": "$remote_user",'

 '"http_request_method": "$request_method",'
 '"http_request_uri": "$request_uri",'
 '"http_request_uri_normalized": "$uri",'
 '"http_request_args": "$args",'
 '"http_request_protocol": "$server_protocol",'
 '"http_request_length": "$request_length",'
 '"http_request_time": "$request_time",'

 '"ssl_protocol": "$ssl_protocol",'
 '"ssl_session_reused": "$ssl_session_reused",'

 '"http_header_cf_ip": "$http_cf_connecting_ip",'
 '"http_header_cf_country": "$http_cf_ipcountry",'
 '"http_header_cf_ray": "$http_cf_ray",'

 '"http_response_size": "$bytes_sent",'
 '"http_response_body_size": "$body_bytes_sent",'

 '"http_content_length": "$content_length",'
 '"http_content_type": "$content_type",'

 '"upstream_server": "$upstream_addr",'
 '"upstream_connect_time": "$upstream_connect_time",'
 '"upstream_header_time": "$upstream_header_time",'
 '"upstream_response_time": "$upstream_response_time",'
 '"upstream_response_length": "$upstream_response_length",'
 '"upstream_status": "$upstream_status",'

 '"http_status": "$status",'
 '"http_referer": "$http_referer",'
 '"http_user_agent": "$http_user_agent"'
 ' }';
access_log syslog:server=127.0.0.1,severity=notice json_logs;
error_log syslog:server=127.0.0.1 warn;

syslog-ng

We use syslog-ng to deliver system logs to Graylog.

options {
 # log with microsecond precision
 ts-format(iso);
 frac-digits(6);

 # detect dead TCP connection
 mark-freq(5);
 
 # DNS failover
 time-reopen(10);
 dns-cache-expire(30);
 dns-cache-expire-failed(30);
};
destination d_graylog {
 # DNS balancing
 syslog("graylog-server.internal.brainshare.com" transport("tcp") port(1514));
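};
# Route system logs to Graylog; assumes a source named s_src, as in Debian's default syslog-ng.conf
log { source(s_src); destination(d_graylog); };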

Conclusion

It is perfectly normal to spend 10-20% of the infrastructure costs in monitoring.

Graylog is good. Elasticsearch is awesome. mongodb sucks. Splunk costs an arm (or two). Nothing new in the universe.

From now on, applications should log messages in JSON format. That’s the best way we’ll be able to extract meaningful information out of them.

GCE vs AWS in 2016: Why you shouldn’t use Amazon


Foreword

This story relates my experience at a typical web startup. We are running hundreds of instances on AWS, and we’ve been doing so for some time, growing at a sustained pace.

Our full operation is in the cloud: webservers, databases, micro-services, git, wiki, BI tools, monitoring… That includes everything a typical tech company needs to operate.

We have a few switches and a router left in the office to provide internet access and that’s all, no servers on-site.

The following highlights many issues encountered day to day on AWS so that [hopefully] you don’t make the same mistakes we made by picking AWS.

What does the cloud provide?

There are a lot of clouds: GCE, AWS, Azure, Digital Ocean, RackSpace, SoftLayer, OVH, GoDaddy… Check out our article Choosing a Cloud Provider: AWS vs GCE vs SoftLayer vs DigitalOcean vs …

We’ll focus only on GCE and AWS in this article. They are the two major, fully featured, shared-infrastructure IaaS offerings.

They both provide everything needed in a typical datacenter.

Infrastructure and Hardware:

  • Get servers with various hardware specifications
  • In multiple datacenters across the planet
  • Remote and local storage
  • Networking (VPC, subnets, firewalls)
  • Start, stop, delete anything in a few clicks
  • Pay as you go

Additional Managed Services (optional):

  • SQL Database (RDS, Cloud SQL)
  • NoSQL Database (DynamoDB, Bigtable)
  • CDN (CloudFront, Google CDN)
  • Load balancer (ELB, Google Load Balancer)
  • Long term storage (S3, Google Storage)

Things you must know about Amazon

GCE vs AWS pricing: Good vs Evil

Real costs on the AWS side:

  • Base instance plus storage cost
  • Add provisioned IOPS for databases (normal EBS IO are not reliable enough)
  • Add local SSD ($675 per 800 GB + 4 CPU + 30 GB. ALWAYS ALL together)
  • Add 10% on top of everything for Premium Support (mandatory)
  • Add 10% for dedicated instances or dedicated hosts (if subject to regulations)

Real costs on the GCE side:

  • Base instance plus storage cost
  • Enjoy fast and dependable IOPS out-of-the-box on remote SSD volumes
  • Add local SSD ($82 per 375 GB, attachable to any existing instance)
  • Enjoy automatic discount for sustained usage (~30% for instances running 24/7)
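The two lists above can be turned into a rough monthly bill model. This is a sketch under stated assumptions. The percentages come from the lists, and the $1500 monthly dedicated fee is the one quoted further down. The instances and storage inputs are hypothetical monthly list prices that you would take from each provider's own pricing page, and real list prices differ between the two clouds.

```python
# Surcharges and discounts from the lists above (approximate, 2016)
AWS_PREMIUM_SUPPORT = 0.10        # +10% on the whole bill
AWS_DEDICATED_MARKUP = 0.10       # +10% on dedicated instance prices
AWS_DEDICATED_REGION_FEE = 1500   # fixed $/month per region using dedicated capacity
GCE_SUSTAINED_DISCOUNT = 0.30     # ~30% off instances running 24/7


def aws_monthly_bill(instances: float, storage: float, dedicated_regions: int = 0) -> float:
    """instances: on-demand instance cost; storage: EBS incl. P-IOPS. Both in $/month."""
    if dedicated_regions:
        instances = instances * (1 + AWS_DEDICATED_MARKUP)
    subtotal = instances + storage + AWS_DEDICATED_REGION_FEE * dedicated_regions
    # Premium support is charged on everything, fees included
    return subtotal * (1 + AWS_PREMIUM_SUPPORT)


def gce_monthly_bill(instances: float, storage: float) -> float:
    """Sustained-use discount on instances running all month; storage unchanged."""
    return instances * (1 - GCE_SUSTAINED_DISCOUNT) + storage


if __name__ == "__main__":
    # Hypothetical inputs, identical on both sides only to show how the extras stack up
    print(f"AWS: ${aws_monthly_bill(10_000, 2_000, dedicated_regions=1):,.2f}")
    print(f"GCE: ${gce_monthly_bill(10_000, 2_000):,.2f}")
```

On AWS the surcharges compound, because premium support is a percentage of a bill that already includes the dedicated markup and fee. On GCE the only adjustment is a discount.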

AWS IO is expensive and inconsistent

EBS SSD volumes: IOPS, and P-IOPS

We are forced to pay for Provisioned-IOPS whenever we need dependable IO.

The P-IOPS are NOT really faster. They are slightly faster, but most importantly they have lower variance (i.e. better 90th to 99.9th percentile latency). This is critical for some workloads (e.g. databases) because normal IOPS are too inconsistent.

Overall, P-IOPS can get very expensive, and they are pathetic compared to what any drive can do nowadays ($720/month for 10k P-IOPS, in addition to $0.14 per GB).
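The sketch below shows where the money goes on a P-IOPS volume. It assumes the two prices quoted above are the only cost terms. That puts one provisioned IOPS at $720 / 10,000 = $0.072 per month. Actual rates vary by region and over time.

```python
PRICE_PER_GB_MONTH = 0.14              # $ per GB of P-IOPS volume per month (quoted above)
PRICE_PER_PIOPS_MONTH = 720 / 10_000   # $0.072 per provisioned IOPS per month (derived)


def piops_volume_monthly_cost(size_gb: int, provisioned_iops: int) -> float:
    """Storage plus provisioned IOPS for one EBS P-IOPS volume, in $/month."""
    return size_gb * PRICE_PER_GB_MONTH + provisioned_iops * PRICE_PER_PIOPS_MONTH


if __name__ == "__main__":
    # Hypothetical 500 GB database volume with 10k P-IOPS
    storage = 500 * PRICE_PER_GB_MONTH
    iops = 10_000 * PRICE_PER_PIOPS_MONTH
    print(f"storage ${storage:.2f} + IOPS ${iops:.2f} = ${storage + iops:.2f}/month")
```

For a volume of that size, the IOPS line is roughly ten times the storage line. The IOPS rate, not the GB rate, is what dominates a database bill.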

Local SSD storage

Local SSD storage is only available via the i2 instance family, which is the most expensive on AWS (and across all clouds).

There is no granularity possible. CPU, memory and SSD storage all DOUBLE between the few i2.xxx instance types available. They grow in multiples of 4 CPU + 30 GB memory + 800 GB SSD, and each multiple costs $765/month.

These limitations make local SSD storage expensive to use and special to manage.
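A short sketch makes the lack of granularity concrete. It uses the per-step bundle from the paragraph above: 4 CPU + 30 GB memory + 800 GB SSD at $765/month, doubling at each size. The earlier bullet list quotes $675 for the same bundle; we use $765 here. It also uses the GCE local SSD price from the earlier list ($82 per 375 GB). The GCE figure covers the SSD alone, added to an instance you already pay for. The comparison only shows how much unrelated capacity the i2 family makes you buy.

```python
# Per-step bundle for the i2 family, as quoted above (approximate, 2016)
UNIT_CPU, UNIT_MEMORY_GB, UNIT_SSD_GB, UNIT_USD_MONTH = 4, 30, 800, 765
I2_SIZES = [("i2.xlarge", 1), ("i2.2xlarge", 2), ("i2.4xlarge", 4), ("i2.8xlarge", 8)]

# GCE local SSD, as quoted earlier: $82/month per 375 GB partition
GCE_SSD_PARTITION_GB, GCE_SSD_PARTITION_USD = 375, 82


def smallest_i2_for(ssd_gb_needed: int) -> dict:
    """Smallest i2 size with enough local SSD, and everything bundled with it."""
    for name, units in I2_SIZES:
        if units * UNIT_SSD_GB >= ssd_gb_needed:
            return {
                "type": name,
                "cpu": units * UNIT_CPU,
                "memory_gb": units * UNIT_MEMORY_GB,
                "ssd_gb": units * UNIT_SSD_GB,
                "usd_month": units * UNIT_USD_MONTH,
            }
    raise ValueError(f"no single i2 instance offers {ssd_gb_needed} GB of local SSD")


def gce_local_ssd_usd_month(ssd_gb_needed: int) -> int:
    """Cost of enough 375 GB partitions to cover the need (ceiling division)."""
    partitions = -(-ssd_gb_needed // GCE_SSD_PARTITION_GB)
    return partitions * GCE_SSD_PARTITION_USD


if __name__ == "__main__":
    need = 1_000  # hypothetical: 1 TB of local SSD for a database
    print(smallest_i2_for(need))
    print(gce_local_ssd_usd_month(need))
```

Needing just over 800 GB forces the jump to the next size. That doubles the CPU, memory and price along with the SSD.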

AWS Premium Support is mandatory

The premium support is +10% on top of the total AWS bill (i.e. EC2 instances + EBS volumes + S3 storage + traffic fees + everything).

Handling spikes in traffic

ELBs cannot handle sudden spikes in traffic. They need to be pre-warmed (scaled up manually) by support beforehand.

An unplanned spike means a guaranteed 5 minutes of unreachable site with 503 errors.

Handling limits

All resources are artificially limited by a hardcoded quota, which is very low by default. Limits can only be increased manually, one by one, by sending a ticket to support.

I cannot fully express the frustration of trying to spawn two c4.large instances (we already had 15), only to fail because of “limit exhaustion: 15 c4.large in eu-central region”. Message support and wait a day of back-and-forth emails. Then try again and fail again because of “limit exhaustion: 5TB of EBS GP2 in eu-central region”.

This circus goes on every few weeks, sometimes hitting 3 limits in a row. There are limits for all resources: by region, by availability zone, by resource type and by resource-specific criteria.

Paying for premium support guarantees a reply to a limit ticket within 24h. Without it, you might have to wait a week (maybe more), unable to work in the meantime. It is an absurd yet very real reason pushing for premium support.
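One way to avoid hitting limits one after the other is to check a planned change against every known limit before launching anything, then request all the increases in a single ticket. The sketch below assumes you track current usage and limits yourself. The resource keys are hypothetical labels and the numbers mirror the example above. The sketch does not query AWS.

```python
def limits_to_raise(usage: dict, planned: dict, limits: dict) -> list:
    """Return (resource, needed, limit) for every limit the planned change would exceed."""
    exceeded = []
    for resource, extra in planned.items():
        needed = usage.get(resource, 0) + extra
        limit = limits.get(resource)
        if limit is not None and needed > limit:
            exceeded.append((resource, needed, limit))
    return exceeded


if __name__ == "__main__":
    # Hypothetical bookkeeping for one region
    usage = {"eu-central/c4.large": 15, "eu-central/gp2_tb": 4.9}
    limits = {"eu-central/c4.large": 15, "eu-central/gp2_tb": 5}
    planned = {"eu-central/c4.large": 2, "eu-central/gp2_tb": 0.2}
    for resource, needed, limit in limits_to_raise(usage, planned, limits):
        print(f"request increase: {resource} needs {needed:g}, limit is {limit}")
```

Both limits from the story above show up in one pass. That turns two days of back and forth into one ticket.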

Handling failures on the AWS side

There is NO log and NO indication of what’s going on in the infrastructure. Support is required whenever something goes wrong.

For example, an ELB started dropping requests erratically. After we contacted support, they acknowledged having no idea what was going on and took action: “Thank you for your request. One of the ELB was acting weird, we stopped it and replaced it with a new one”.

The issue was fixed. Sadly, they don’t provide any insight or meaningful information. This is a strong pain point for debugging and for planning against future failures.

Note: We are barring further managed services from being introduced into our stack. At first they were tried because they were easy to set up (read: limited human time and a bit of curiosity). They soon proved to cause periodic issues while being impossible to debug and troubleshoot.

ELBs are unsuitable for many workloads

[updated paragraph after comments on HN]

ELBs are only accessible through a hostname. The underlying IPs have a DNS TTL of 60s and can change at any minute.

This makes ELBs unsuitable for all services requiring a fixed IP and for all services resolving the IP only once at startup.
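For client code that resolves a hostname only once at startup, a common workaround is to re-resolve periodically. The sketch below assumes a Python client. It caches the addresses behind a hostname for a fixed 60 seconds, matching the ELB TTL above, because socket.getaddrinfo does not expose the real DNS TTL. It then tries each address in turn. This does not help services that need a fixed IP. It only stops a client from holding on to stale IPs forever.

```python
import socket
import time


class ResolvingEndpoint:
    """Connect to host:port, re-resolving the hostname at most every `ttl` seconds."""

    def __init__(self, host: str, port: int, ttl: float = 60.0):
        self.host = host
        self.port = port
        self.ttl = ttl
        self._addresses: list = []
        self._expires_at = 0.0

    def addresses(self) -> list:
        now = time.monotonic()
        if not self._addresses or now >= self._expires_at:
            infos = socket.getaddrinfo(self.host, self.port, type=socket.SOCK_STREAM)
            # sockaddr[0] is the IP string for both IPv4 and IPv6
            self._addresses = sorted({info[4][0] for info in infos})
            self._expires_at = now + self.ttl
        return self._addresses

    def connect(self, timeout: float = 5.0) -> socket.socket:
        last_error = None
        for address in self.addresses():
            try:
                return socket.create_connection((address, self.port), timeout=timeout)
            except OSError as exc:
                last_error = exc
        # Every cached address failed: force a fresh lookup on the next attempt
        self._expires_at = 0.0
        raise last_error or OSError(f"no address found for {self.host}")
```

A client would create one endpoint per load balancer hostname at startup and call connect() for each new connection, instead of caching an IP once.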

ELBs are impossible to debug when they fail (they do fail), they can’t handle sudden spikes and the CloudWatch graphs are terrible. (Truth be told, we are paying Datadog $18/month per node to replace CloudWatch entirely.)

Load balancing is a core aspect of high-availability and scalable design. Redundant load balancing is the next one. ELBs are not up to the task.

The alternative to ELB is to deploy our own HAProxy in pairs with VRRP/keepalived. It takes multiple weeks to setup properly and deploy in production.

By comparison, we can achieve that with Google load balancers in a few hours. A Google load balancer can have a single fixed IP. That IP can go from 1k/s to 10k/s requests instantly without losing traffic. It just works.

Note: Today, we’ve seen one service in production go from 500 requests/s to 15000 requests/s in less than 3 seconds. We don’t trust an ELB to be in the middle of that.

Dedicated Instances

Dedicated instances are Amazon EC2 instances that run in a virtual private cloud (VPC) on hardware that’s dedicated to a single customer. Your Dedicated instances are physically isolated at the host hardware level from your instances that aren’t Dedicated instances and from instances that belong to other AWS accounts.

Dedicated instances/hosts may be mandatory for some services because of legal compliance, regulatory requirements and not-having-neighbours.

We have to comply with a few regulations, so we have a few dedicated options here and there. It’s 10% on top of the instance price (plus a $1500 fixed monthly fee per region).

Note: Amazon doesn’t explain in great detail what dedicated entails and doesn’t commit to anything clear. Strangely, no regulators have pointed that out so far.

Answer to HN comments: Google doesn’t provide “GCE dedicated instances”. There is no need for it. The trick is that regulators and engineers don’t complain about not having something which is non-existent, they just live without it and our operations get simpler.

Reserved Instances are bullshit

A reservation is attached to a specific region, an availability zone, an instance type, a tenancy, and more. In theory the reservation can be edited, in practice that depends on what to change. Some combinations of parameters are editable, most are not.

Plan carefully and get it right on the first try, there is no room for errors. Every hour of a reservation will be paid along the year, no matter whether the instance is running or not.

For the most common instance types, it takes 8-10 months to break even on a yearly reservation. Think of it as a gambling game in a casino. A right reservation is -20% and a wrong reservation is +80% on the bill. You have to be right MORE than 4 times out of 5 to save any money.
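The casino analogy can be checked with a little arithmetic. In this sketch, r is the cost of a one-year reservation as a fraction of running the same instance on demand all year. For example, r = 0.8 means breaking even after 0.8 × 12 = 9.6 months. A right reservation saves 1 - r. A wrong one is paid for nothing while the real instance still runs on demand, which adds r to the bill. The model ignores price drops and partial use, so treat it as a rough estimate.

```python
def break_even_months(r: float) -> float:
    """Months of on-demand usage that a one-year reservation costs."""
    return 12 * r


def expected_bill_change(r: float, hit_rate: float) -> float:
    """Expected change per reservation, as a fraction of one year on demand (negative = savings)."""
    saving_if_right = 1 - r
    loss_if_wrong = r
    return (1 - hit_rate) * loss_if_wrong - hit_rate * saving_if_right


def required_hit_rate(r: float) -> float:
    """Hit rate at which a reservation strategy stops losing money."""
    saving_if_right = 1 - r
    loss_if_wrong = r
    return loss_if_wrong / (loss_if_wrong + saving_if_right)


if __name__ == "__main__":
    for r in (8 / 12, 0.8, 10 / 12):
        print(f"break even after {break_even_months(r):.1f} months -> "
              f"must be right {required_hit_rate(r):.0%} of the time")
```

Expanding the expectation gives r - hit_rate, so the required hit rate is simply r. With 8-10 months to break even, you need to be right roughly 67% to 83% of the time, and more than 4 times out of 5 at r = 0.8.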

Keep in mind that reserved instances will NOT benefit from the regular price drops happening every 6-12 months. If there is a price drop early on, you’re automatically losing money.

Critical Safety Notice: A 3-year reservation is the most dramatic way to lose money on AWS. We’re talking about a potential five-digit loss here, per click. Do not go this route. Do not let your co-workers go this route without a warning.

What GCE offers by comparison is a PURELY AWESOME MONTHLY AUTOMATIC DISCOUNT. Instance hours are counted at the end of every month and the discount is applied automatically (e.g. 30% for instances running 24/7). The algorithm also accounts for multiple started/stopped/renewed instances, in a way that is STRONGLY in your favour.
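Here is a sketch of how a tiered monthly discount of this kind reaches about 30%. It assumes the sustained-use schedule GCE published at the time for predefined machine types, as we understand it. Under that schedule, each successive quarter of the month is billed at 100%, 80%, 60% and 40% of the base rate. The sketch does not model how GCE combines stopped and restarted instances.

```python
# Assumed schedule for predefined machine types: (share of month, billed rate)
TIERS = [(0.25, 1.00), (0.25, 0.80), (0.25, 0.60), (0.25, 0.40)]


def sustained_use_multiplier(fraction_of_month: float) -> float:
    """Effective price multiplier for an instance running `fraction_of_month` of the month."""
    if not 0 < fraction_of_month <= 1:
        raise ValueError("fraction_of_month must be in (0, 1]")
    billed = 0.0
    remaining = fraction_of_month
    for width, rate in TIERS:
        used = min(width, remaining)
        billed += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return billed / fraction_of_month


if __name__ == "__main__":
    for fraction in (0.25, 0.5, 0.75, 1.0):
        discount = 1 - sustained_use_multiplier(fraction)
        print(f"running {fraction:.0%} of the month -> {discount:.0%} discount")
```

A full month averages (1.0 + 0.8 + 0.6 + 0.4) / 4 = 0.7 of the base rate, which is where the 30% comes from. Unlike a reservation, nothing has to be decided in advance.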

Reserving capacity does not belong to the age of Cloud, it belongs to the age of data centers.

AWS Networking is sub-par

Network bandwidth allowance is correlated with the instance size.

Instances with 1-2 cores peak around 100-200 Mbps. This is very little in an increasingly connected world where so many things rely on the network.

Typical operations slowed down by the rate-limited networking:

  • Instance provisioning, OS install and upgrade
  • Docker/Vagrant image deployment
  • rsync/sftp/ftp file copying
  • Backups and snapshots
  • Load balancers and gateways
  • General disk read/writes (EBS is network storage)

Our most important backup takes 97 seconds to be copied from the production host to another site. Half the time is spent saturating the network bandwidth (130 Mbps cap), the other half saturating the EBS volume on the receiving host (the file is buffered in memory during the initial transfer, then 100% iowait at the EBS bandwidth cap).

The same backup operation would only take 10-20 seconds on GCE with the same hardware.
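The bandwidth cap alone puts a floor under any copy. Here is a minimal sketch, assuming the link is the only bottleneck (no disk or protocol overhead) and using a hypothetical 1 GB file:

```python
def transfer_seconds(size_bytes: float, bandwidth_mbps: float) -> float:
    """Lower bound on copy time over a link capped at bandwidth_mbps (1 Mbps = 10**6 bit/s)."""
    return size_bytes * 8 / (bandwidth_mbps * 1_000_000)


if __name__ == "__main__":
    one_gb = 1_000_000_000  # hypothetical file size
    for mbps in (130, 1_000):
        print(f"{mbps} Mbps: {transfer_seconds(one_gb, mbps):.1f} s")
```

At the 130 Mbps cap measured above, every gigabyte costs about a minute before the receiving disk even comes into play.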

Cost Comparison

This post wouldn’t be complete without an instance-to-instance price comparison.

In fact, it is so important that it was split into a dedicated article: Google Cloud is 50% cheaper than AWS.

Hidden fees everywhere + unreliable capabilities = human time wasted in workarounds

Capacity planning and day to day operations

Capacity planning is unnecessarily hard with non-scalable resources, unreliable performance, insufficient granularity and hidden constraints everywhere. Planning costs is a nightmare.

Every time we have to add an instance, we have to read the instances page, the pricing page and the EBS page again. There are way too many choices, some of which are hard to change later. Printed out, they could cover a 4x7-foot table. By comparison, it takes only one double-sided page to pick an appropriate instance on Google.

Optimizing usage is doomed to fail

The time spent optimizing reserved instances costs about as much as the savings achieved.

Between CPU count, memory size, EBS volume size, IOPS and P-IOPS, everything is over-provisioned on AWS. This is partly because there are too many things for a human being to follow and optimize, partly as a workaround for the inconsistent capabilities, and partly because some instances are hard to fix once they are live in production.

All these issues are directly related to the underlying AWS platform itself, which is messy and unable to scale horizontally in a clean way, whether in hardware options, hardware capabilities or cost.

Every time we think about changing something to reduce costs, it is usually more expensive than NOT doing anything (when accounting for engineering time).

Conclusion

AWS has a lot of hidden costs and limitations. System capabilities are unsatisfying and cannot scale consistently. Choosing AWS was a mistake. GCE is always a better choice.

GCE is systematically 20% to 50% cheaper for the equivalent infrastructure, without having to do any thinking or optimization. Last but not least it is also faster, more reliable and easier to use day-to-day.

The future of our company

Unfortunately, our infrastructure on AWS is working and migrating is a serious undertaking.

I learned recently that we are a profitable company, more so than I thought. On a ranking of companies by revenue per employee, we’d be in the top 10. We are stuck with AWS for the near future, and the issues will have to be worked around with lots of money. The company is able to cover the expenses and cost optimisation ain’t a top priority at the moment.

There’s a saying “throwing money at a problem”. We shall say “throwing houses at the problem” from now on as it better represents the status quo.

If we get to keep growing at the current pace, we’ll have to scale vertically, and by that we mean “throwing buildings at Amazon” 😀

The official AWS answer to all their issues: “Get bigger instances”

Choosing the right cloud provider: AWS vs GCE vs Digital Ocean vs OVH


No worries, it’s a lot simpler than it seems. Each cloud provider is oriented toward a different type of customer and usage.

We grouped cloud providers by type. We’ll explain the purpose of each type, how they differ, which one is the most appropriate for each use case, and which cloud provider is the best in its category.

General Purpose Clouds

Competitors: Amazon AWS, Google Compute Engine, Microsoft Azure

Quick test: A general purpose cloud is the best fit if you answer yes to any of the following questions.

  • Do you run more than 50 virtual machines?
  • Do you spend more than 1000 dollars/month on hosting?
  • Does your infrastructure span multiple datacenters?

When to use: A general purpose cloud is meant to run anything and everything. It can replace a full rack of servers, as much as it can replace an ENTIRE datacenter. It provides the usual infrastructure plus some advanced bits that would be very hard to come by otherwise.

It is the go-to solution for running many heterogeneous applications requiring a variety of hardware. Its versatility makes it ideal for running an entire operation in the cloud. It’s a perfect fit for an entire tech company, or a [big] tech project.

General purpose clouds make complex infrastructure available at the tip of your fingers:

  • Get servers of various sizes and types of hardware
  • Design your own networking and firewalls (same as in a real datacenter)
  • Group and isolate instances from each other and from the internet
  • Easily go multi-sites, worldwide
  • Order, change or redesign ANYTHING in 60 seconds (while staying put on your chair)

A general purpose cloud is a full ecosystem. It includes equivalents to all the services typically found (and required) in datacenters/enterprise environments:

  • SAN disks (EBS, Google Disks)
  • Scalable Storage and backups (S3, Google Storage, Snapshots)
  • Hardware load balancers (ELB, Google Load Balancer)

Which provider to use: GCE is vastly superior to its competitors. It’s cheaper and easier to manage. If you go cloud, go GCE.

AWS is 25-100% more expensive to run the same infrastructure, in addition to being slower and having fewer capabilities.

Note: We have no experience with Microsoft Azure and cannot comment on it. The little feedback we have heard so far was rather negative. It may need time to mature.

Cheap Clouds

Competitors: Digital Ocean, Linode

Quick test: A cheap cloud is the best fit if you answer yes to any of the following questions.

  • Do you run fewer than 5 virtual machines?
  • Do you spend less than 100 dollars/month on hosting?
  • Would you be in big trouble if you received a bill double what you expected?
  • Would you qualify yourself as either an amateur or a hobbyist?

When to use: A cheap cloud is meant to offer proper servers to the masses, “proper” meaning decent hardware and good internet connectivity, at an affordable price. It is simply not possible to get that from a home or an office (note: recycling an old laptop on a broadband line is not a comparable substitute for a proper server).

It is the go-to solution for all basic needs: for example, professionals running a few simple services with low-to-moderate traffic, agencies in need of simple hosting to deliver to a client, and amateurs and hobbyists doing experiments.

Generally speaking, it’s the best choice for anyone who is looking for [at most] a couple of servers, especially if the main criteria are “easy to manage” and “good bang for the buck”.

Cheap clouds make servers affordable and easy to get:

  • Get real servers (server-grade hardware, good internet connectivity)
  • Simple, easy to use, easy to manage and convenient
  • Predictable costs, well-defined capabilities, no bullshit
  • Buy or sell a server in 60 seconds

Which provider to use: The next-generation cheap clouds are DigitalOcean and Linode. Go for Digital Ocean.

blog update 2016-10: Linode suffered significant downtime in the past month, similar to the outages of last year. These outages are the result of major DDoS attacks targeted against Linode itself (i.e. not against one of the customers running on it). We recommend Digital Ocean as a safer choice.

Challengers: There is a truckload of historical and minor players (OVH, GoDaddy, Hetzner, …). They have offerings similar to the cheap cloud providers, but these are hidden somewhere in a poor UI trying to accommodate and sell 10 unrelated products and services. They may or may not be worth digging into (probably not).

Dedicated Clouds

Competitors: IBM SoftLayer, OVH, Hetzner

When to use: As a rule of thumb, the general purpose clouds are limited to 16 physical cores and 128 GB memory and 8 TB SAN drives, with the price increasing linearly along the specifications (double the memory = double the price). The dedicated clouds can provide much bigger servers and the high-end specs are significantly cheaper.

This is the go-to solution for special tasks running 24/7 that require exotic hardware, especially vertical scaling. Dedicated clouds are only suited to special purposes.

Special case: We’ve seen people rent a single big dedicated server with vSphere and run numerous virtual machines on it. It allows plenty of experimentation at a fixed and fairly reasonable cost.

IBM SoftLayer:

  • Choose the hardware, tailored to the intended workload
  • Ultimate performance (bare-metal, no virtualization)
  • Quad CPU, 96 total cores is an option
  • 1 TB memory, f*** yeah!
  • 24 HDD or SSD drives in a single box

Which provider to use: IBM SoftLayer is the only one to offer the next generation of dedicated cloud. Getting servers works the same way as buying servers from the Dell website (select a server enclosure and pick the components), except it’s rented and the price is per month. (Common configurations are available immediately; specialized hardware may need ordering and take a few days.)

SoftLayer takes care of the hardware transparently: shipment, delivery, installation, parts, repair, maintenance. It’s like having our own racks and servers… without the hassle of having them.

Challengers: There are a few historical big players (OVH, Hetzner, …). They are running on an antiquated model, providing only a predefined set of boxes with limited choice. They can compare positively to SoftLayer (read: cheaper and not harder to manage/use) when running a few servers with nothing too exotic.

Housing & Colocation

When to use: Never. It’s always a bad decision.

There are 3 kinds of people who do housing on purpose:

  • People who genuinely think it’s cheaper (it is NOT when accounting for time)
  • People who genuinely got their maths wrong (hence thinking it was cheaper =D)
  • Students, amateurs, hobbyists, single server usage and not-for-profit

Let’s ignore the hobbyist. He has a decent server sitting in the garage. He might as well put it into a datacenter with 24h electricity and good internet to tinker around. That’s how he’ll learn. This is the only valid use case for housing.

What’s wrong with housing & colocation:

  • Unproductive time to go back and forth to the datacenters, repeatedly
  • Lost time and health moving tons of hardware (a 2U server is 20-40 kg)
  • Be forced to deal with hardware suppliers (DELL/HP)
  • Burn out, burst in rage and eventually attempt to strangle one colleague after having dealt with supplier bullshit for most of the afternoon (based on a real story)
  • Waste 3 weeks between ordering something and receiving it
  • Cry when something breaks and there are no spare parts
  • Cry some more because the parts are end-of-life and can’t be ordered anymore
  • Suffer 100 times more than initially expected because of the network and the storage (the most expensive and the most difficult parts to get right in an infrastructure)
  • Renew the hardware after 3-5 years, hit all the aforementioned issues in a row
  • Be unable to have multiple sites, never go worldwide

These are major pain points. Nonetheless, it is easy to find cloud vs colocation comparisons that ignore them and pretend to save $500k per month by buying your own hardware.

Abandoning hardware management has been an awesome life-changing experience. We are never going back to lifting tons of burden in miserable journeys to the mighty datacenter.

Make Your Own Datacenter

When to use: This was the go-to solution for hosting companies and the older internet giants.

The internet giants (Google, Amazon, Microsoft) started at a time when there was no provider available for their needs, let alone at a reasonable cost. They had to craft their own infrastructure to be able to sustain their activity.

Nowadays, they have opened their infrastructure and are offering it for sale to the world. Top-notch web-scale infrastructure has become an accessible commodity. A tech company doesn’t need its own datacenters anymore, no matter how big it grows.

Cheat Sheet

Run an entire tech company in the cloud, or run only a single [big] project requiring more than 10 servers? Google Compute Engine

Run fewer than 10 servers, for as little cost as possible? Digital Ocean

Run only beefy servers ( > 100GB RAM) or have special hardware requirements? IBM SoftLayer or OVH
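The cheat sheet can be read as a small decision function. This is only a sketch. The thresholds come from the three questions above, but the order of the checks and the handling of exactly 10 servers (treated as GCE territory) are our own choices.

```python
def suggest_provider(servers: int, smallest_server_ram_gb: int = 0,
                     special_hardware: bool = False, whole_company: bool = False) -> str:
    """Map the cheat-sheet questions to a provider name."""
    # Only beefy servers (> 100 GB RAM each) or special hardware requirements
    if special_hardware or smallest_server_ram_gb > 100:
        return "IBM SoftLayer or OVH"
    # An entire tech company, or a single big project with 10+ servers
    if whole_company or servers >= 10:
        return "Google Compute Engine"
    # Fewer than 10 servers, as cheap as possible
    return "Digital Ocean"
```

Checking the hardware question first reflects that vertical scaling needs are the one case the general purpose clouds cannot cover.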

Conclusion

The cloud is awesome. No matter what we want, where and when we want it, there is always a server ready for us at the click of a button (and the typing of our credit card details).

The most surprising thing we notice daily on these services is how new everything is: a recurring “available since XXX” written in a corner of the page, stating it’s only been there for 1-3 years.

These notes tell a story. The cloud has had enough time to mature and it is ready to be mainstream. Maintaining physical servers is an era of the past.

How to export Amazon EC2 instances to a CSV file


The Amazon web console is limited to 50 instances per page. Viewing lots of instances is a pain, and it doesn’t support exporting to CSV/TSV/Excel/other out of the box. The only fix is to use the CLI.

Requirements

  • An AWS account with access rights to see your servers
  • A pair of AWS keys (Users -> [username] -> Security Credentials -> Create Access Key)
# Install AWS packages
sudo apt-get install -y python python-pip
sudo pip install aws-shell

Listing Instances

# aws-shell will show a wizard to configure your account and region the first time you use it
aws-shell
ec2 describe-instances --output text --query 'Reservations[*].Instances[*].[InstanceId, InstanceType, ImageId, State.Name, LaunchTime, Placement.AvailabilityZone, Placement.Tenancy, PrivateIpAddress, PrivateDnsName, PublicDnsName, [Tags[?Key==`Name`].Value] [0][0], [Tags[?Key==`purpose`].Value] [0][0], [Tags[?Key==`environment`].Value] [0][0], [Tags[?Key==`team`].Value] [0][0] ]' > instances.tsv
# open instances.tsv with Excel
# enjoy

You can modify the command to pick the information you want. Refer to the official AWS command line reference.
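If you need an actual CSV file, for example so that tag values containing commas survive, a few lines of Python can convert the TSV output. This sketch assumes one instance per line, with tab-separated fields in the same order as the --query above. The column names are our own labels.

```python
import csv
import io
import sys

# Our labels for the 14 fields selected by the --query above, in the same order
COLUMNS = ["InstanceId", "InstanceType", "ImageId", "State", "LaunchTime",
           "AvailabilityZone", "Tenancy", "PrivateIpAddress", "PrivateDnsName",
           "PublicDnsName", "Name", "purpose", "environment", "team"]


def tsv_lines_to_rows(lines) -> list:
    """Split each non-blank line of `--output text` into its tab-separated fields."""
    return [line.rstrip("\r\n").split("\t") for line in lines if line.strip()]


def rows_to_csv(rows) -> str:
    """Render a header plus rows as CSV text, quoting fields where needed."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], newline="") as src:
        sys.stdout.write(rows_to_csv(tsv_lines_to_rows(src)))
```

Run it as python tsv_to_csv.py instances.tsv > instances.csv (the script name is arbitrary). The csv module quotes any field that contains a comma, so Excel keeps each tag in its own column.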