Google Cloud is 50% cheaper than AWS


Let’s revisit Google and Amazon pricing since the AWS November 2016 Price Reduction.

We’ll analyse instance costs, for various workloads and usages. All prices are given in dollars per month (720 hours) for servers located in Europe (eu-west-1).

Shared CPU Instances

Shared CPU instances give only a bit of CPU. The physical processor is over allocated and shared with many other instances running on the same host. A shared CPU instance may burst to 100% CPU usage for short periods but it may also be starved of CPU and paused. Note that these instances are cheap but they are not reliable for non-negligible continuous workloads.

google cloud vs aws pricing shared CPU instances

The smallest instances on both cloud is 500MB and a few percent of CPU. That’s the cheapest instance. It’s usable for testing and minimal needs (can’t do much with only 5% of CPU and 500MB).

The infamous t2.small and it’s rival the g1-small are usually the most common instance types in use. They come with 2GB of memory and a bit of CPU. It’s cheap and good enough for many use cases (excluding production and time critical processing, which need dedicated CPU time).

The Cheapest Production Instances

Production instances are all the instances with dedicated CPU time (i.e. everything but the shared CPU instances).

Most services will just run on the cheapest production instance available. That instance is very important because it determines the entry price and the specifications for everything.

google cloud vs aws pricing cheapest production instances

The cheapest production instance on Google Cloud is the n1-standard-1 which gives 1 CPU and 4 GB of memory.

AWS is more complex. The m3.medium is 1 CPU and 4 GB of memory. The c4.large is 2 CPU and 4 GB of memory.

m3/c3 are the previous family generation (pre-2015), using older hardware and an ancient virtualisation technology. c4/m4 are the current generation, it has enhanced networking and reserved bandwidth for EBS, among other system improvements.

Either way, the Google entry-level instance is significantly cheaper than both AWS entry-level instances. There will be a lot of these running, expect massive costs savings by using Google cloud.


I’m a believer that one should optimize for manageability and not raw costs. That means adopting c4/m4 as the standard for deployments (instead of c3/m3).

Given this decision, the smallest production instance on AWS is the c4.large (2 CPU, 4GB memory), a rather big instance when compared to the n1-standard-1 (1 CPU, 4GB memory). Why are we forced to pay for two CPUs as the minimal choice on AWS? That does set a high base price.

Not only Google is cheaper because it’s more competitive but it also offers more tailored options. The result is a massive 68% discount on the most commonly used production instance.

Personal Note: I would criticize the choice of AWS to discontinue the line of m4.medium instance type (1 CPU).


Instances by usage

A server has 3 dimensions of specifications: CPU performances, memory size and network speed.

Most applications only have a hard requirement in a single dimension. We’ll analyse the pricing separately for each usage pattern.

google cloud vs aws pricing instances by usage

Network Heavy

Typical Consumers: load balancers, file transfers, uploads/downloads, backups and generally speaking everything that uses the network.

What should we order to have  1Gbps and how much will it be?

  • The minimum on Google Cloud is the n1-highcpu-4 instance (4 CPU, 4 GB memory).
  • The minimum on AWS is the c4.4xlarge instance (16 CPU, 30 GB memory).

AWS bandwidth allowance is limited and correlated to the instance size. The big instances -with decent bandwidth- are incredibly expensive.

To give a point of comparison, the c4|m4|r3.large instances have a hard cap at 220 Mbits/s of network traffic (Note: It also applies internally within a VPC).

figure2_7001
Source: Network and cloud storage benchmark in 2015

All Google instances have significantly faster network than the equivalent [and even bigger] AWS instances, to the point where they’re not even playing in the same league.

Google has been designing networks and manufacturing their own equipment for decades. It’s fair to assume than AWS doesn’t have the technology to compete.

CPU

Typical Consumers: web servers, data analysis, simulations, processing and computations.

Google is cheaper per CPU.

Google CPU instances have half the memory of AWS CPU instances[1]. While that could have justified a 10% difference, it doesn’t justify double[2].

Note: The performances per CPU are equivalent on both cloud (though the CPU models and serial numbers may vary).

[1] A sane design decision. Most CPU bound workloads don’t need much memory. (Note: if they do, they can be run on “standard” instances).

[2] Pricing is mostly linked to CPU count. Additional memory is cheap.

Memory

Typical Consumers: database, caches and in-memory workloads.

Google is cheaper per GB of memory.

Google memory instances have 15% less memory than AWS CPU instances. While that could have justified a few percent difference, it sure as hell doesn’t justify double[2].

[2] Pricing is mostly linked to CPU count. Additional memory is cheap.

Local SSD and Scaling Up

There are software that can only scale up, typically SQL databases. A database holding tons of data will require fast local disks and truckloads of memory to operate non-sluggishly.

Scaling up is the most typical use case for beefy dedicated servers, but we’re not gonna rent a single server in another place just for one application. The cloud provider will have to accommodate that need.

Google allows to attach local 400GB SSDs to any instance type ($85 a month per disk).

Some AWS instances comes with small local SSD (16-160GB), you’re out of luck if you need more space than that. The only option to get big local SSD is the special i2 instances family, they have specifications in powers of 800GB local SSD + 4 CPU + 15 GB RAM (for $655 a month).

The Google SSD model is superior. It’s significantly more modular and cheaper (and more performant but that’s a different topic).

aws-vs-gce-pricing-instances-with-local-ssd
The requirements to fulfil are between parenthesis.

Disk Intensive Load: A job that requires high volume fast disks (i.e. local SSD) but not much memory.

AWS forces you to buy a big instance (i2.xlarge) to get enough SSD space whereas Google allows you to attach a SSD to a small instance (n1-highcpu-4). The lack of flexibility from AWS has a measurable impact, the AWS setup is 406% the costs of the Google setup to achieve the same need.

Database: A typical database. Fast storage and sizeable memory.

Bigger Database: Sometimes there is no choice but to scale up, to whatever resources are commanded by the application.

On AWS (i2.8xlarge) 32 cores, 244GB memory, 2 x 800 GB local SSD in RAID1 (+ 6 SSD unused yet gotta pay for it).

On Google Cloud (n1-highmem-32): 32 cores, 208 GB memory, 4 x 375 GB local SSD in RAID10.

This last number is meant to show that the lack of flexibility of AWS can (and will) snowball quickly. Only a very particular instance can fulfil the requirements, it comes with many cores and 4800 GB of unnecessary local SSD. The AWS bill is $4k (273%) higher than the equivalent setup on Google Cloud.

Custom Instances

Google offers custom machine types. You can pick how much CPU and memory you want, you’ll get that exact instance with a tailored pricing.

It is quite flexible. For instance, we could recreate any instance from AWS on Google Cloud.

Of course, there are physical bounds inherent to hardware (e.g. you can’t have a single core with 100 GB of memory).

Reserved Instances

Reserved Instances are bullshit!

Reserving capacity is a dangerous and antiquated pricing model that belongs to the era of the datacenter.

The numbers given in this article do not account for any AWS reservation. However, they all account for Google sustained use discount (30% automatic discount on instances that ran for the entire month).

If your infrastructure is so small that you can reserve all your 4 instances upfront, you should reconsider why you use AWS in the first place. There are more appropriate and cheaper options available.

If your infrastructure is big enough that you have dozens of servers (or thousands), you should already be aware that:

  1. Long term commitment is a huge risk. Most people underestimate it.
  2. Predictions are always off. Most people are overconfident in their predictions.
  3. You are no exception to most people.
  4. Reservation is a mess when having many AWS accounts (dev, staging, prod).
  5. Anything that is testing/transient is too short-lived to be reserved.
  6. Less than 50% of reservable stuff can actually be reserved (margin for change/error).

Most people managers are stubborn. If you your manager is stubborn and really insists on reserving instances, you should bet exclusively on “1 year full upfront“.

fishing with gr
Safety Warning: There is no confirmation button when you purchase reserved instances. You can absolutely spend $73185 without seeing nor confirming an invoice.

Conclusion

google cloud vs aws pricing summary relative costs

AWS was the first generation of cloud, Google is the second. The second generation is always better because it can learn from the mistakes of the first and it doesn’t have the old legacy to support.

2016 should be remembered as the year Google became a better choice than AWS. If 50% cheaper is not a solid argument, I don’t know what is.


References:

Cloud Storage Performance, a benchmark with graphs on network performance.

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, A Google Research Paper, the story on what powers their internal network.

Amazon does everything wrong, and Google does everything right, A message by an employee from Amazon than Google, not directly relevant but still a good read.

What is docker and how it’s used


Docker is a packaging and deployment system. It allows you to package an application as a “docker image“, then deploy it easily on some servers with a single “docker start <image>” command.

Packaging an application

Packaging an application without Docker

built pipeline without docker
The Standard Build Pipeline
  1. A developer pushes a change
  2. The CI sees that new code is available. It rebuilds the project, runs the tests and generates a package
  3. The CI saves all files in”dist/*” as build artifacts

The application is available for download from “ci.internal.mycompany.com/<project>/<build-id>/dist/installer.zip

Packaging an application with Docker

build pipeline with docker
The Build Pipeline with Docker
  1. A developer pushes a change
  2. The CI sees that new code is available it. It rebuilds the project, runs the tests and generates a docker image
  3. The docker image is saved to the docker registry

The application is available for download as a docker image named “auth:latest” from the registry “docker-registry.internal.mycompany.com”.

You need a CI pipeline

A CI pipeline requires a source code repository (GitLab, GitHub, VisualSVN Server) and a continuous integration system (Jenkins, GitLab CI, TeamCity). Docker also needs a docker registry.

A functional CI pipeline is a must-have for any software development project. It will ensure that your application(s) are automatically re-run, re-tested and re-packaged on every change.

The developers gotta write scripts to build their application, to run tests and to generate packages. Only the developers of an application can do that because they are the only ones to have the knowledge about how things are supposed/expected to work.

Generally speaking, the CI jobs should mostly consist into calling external scripts, like “./build.sh && ./tests.sh”. The scripts themselves must be part of the source code, they’ll evolve with the application.

You need to know your applications

Please answer the following questions:

  • What does the application need to be built?
  • What’s the command/script to build it?
  • What does the application need to run?
  • What configuration file is needed and where to put it?
  • What’s the command to start/stop the application?

You need to be able to answer all these questions, for all the applications you’re writing and managing.

If you don’t know the answers, you have a problem and Docker is NOT the solution. You gotta figure out how things works and write documentation! (Better hope the guys who were in charge are still working here and gave a thought about all that).

If you know the answers, then you’re good. You know what has to be done. Whether it will be executed by bash, ansible, DockerFile, spec or zip is just an implementation detail.

Deploying an application

Deploying an application without Docker

  1. Download the application
  2. Setup dependencies, services and configuration files
  3. Start the application
# ansible pseudo code 
hosts: hosts_auth
serial: 1 #rolling deploy, one server at a time
become: yes
 
tasks:
  name: instance is removed from the load balancer
  elb_instance:
    elb_name: auth
    instance_id: "{{ ansible_ansible_id }}"
    state: absent
 
  name: service is stopped
  service:
    name: auth
    state: stopped
 
  name: existing application is deleted
  file:
    path: /var/lib/auth/
    match: "*"
    recursive: yes
    state: absent
 
  name: application is deployed
  unarchive: 
    url: https://ci.internal.mycompany.com/auth/last/artifacts/installer.zip
    destination:: /var/lib/auth
 
  name: virtualenv is setup
  pip:
    requirements: /var/lib/auth/requirements.txt
    virtualenv: /var/lib/auth/.venv
 
  name: application configuration is updated
  template:
    src: auth.conf
    dst: /etc/mycompany/auth/auth.conf

  name: service configuration is updated
  template: 
    src: auth.service
    dst: /etc/init.d/mycompany-auth
 
  name: service is started
  service:
    name: auth
    state: running
 
  name: instance is added to the load balancer
  elb_instance:
    elb_name: auth
    instance_id: "{{ ansible_ansible_id }}"
    state: present

Deploying an application with Docker

  1. Create a configuration file
  2. Start the docker image with the configuration file
# ansible pseudo code
hosts: hosts_auth
serial: 1 #rolling deploy, one server at a time
become: yes
 
tasks:
  name: instance is removed from the load balancer
  elb_instance:
    elb_name: auth
    instance_id: "{{ ansible_ansible_id }}"
    state: absent
 
  name: container is stopped
  docker:
    name: auth
    state: stopped
 
  name: configuration is updated
  template: 
    src: auth.conf
    dst: /etc/mycompany/auth/auth.conf

  name: container is started
  docker:
    name: auth
    image: docker-registry.internal.mycompany.com/auth:latest
    state: started
    mount:
      /etc/mycompany/auth/auth.conf:/etc/mycompany/auth/auth.conf
    port:
      8101:8101
 
  name: instance is added to the load balancer
  elb_instance:
    elb_name: auth
    instance_id: "{{ ansible_ansible_id }}"
    state: present

Notable differences

With docker, the python setup/virtualenv and the service configuration is done during the image creation rather than during the deployment. (The commands are the same, they’re just done in an earlier build stage).

The configuration files are deployed on the host and mounted inside Docker. It would be possible to bake the configuration file into the image but some configurations might only be determined at deployment time and we’d rather not store secrets in the image.

Infrastructure

Docker is only a packaging and deployment tool.

Docker doesn’t handle auto scaling, it doesn’t have service discovery, it doesn’t reconfigure load balancers, it doesn’t move containers when servers fail.

Orchestration systems (notably Kubernetes) are supposed to help with that. Currently, they are quite experimental and very difficult to setup [beyond a proof of concept]. The lack of proper orchestration will limit Docker to only be a hype packaging & deployment tool for the foreseeable future.

Docker [even with Kubernetes] needs an existing environment to run, including servers and networks. It ain’t gonna install and configure itself either.

All of that has to be done manually. Order servers in the cloud. Create OS images with Packer. Configure VPC and networking with Terraform. Setup the servers and systems with Ansible. Install and deploy the applications (including docker images) with Ansible.

Cheat Sheet

  1. Figure out what is required and how to build the applications
  2. Write build, test and packaging scripts
  3. Document that in the README
  4. Setup a CI system
  5. Configure automatic builds after every change
  6. Figure out the application dependencies and how to run it
  7. Add that to the README
  8. Write deploy and setup scripts (with Ansible or Salt)

Conclusion

Packaging and deploying applications is a real and challenging job. A Debian package has some good practices and standards to follow whereas Docker comes with no good practices and no rules whatsoever. Docker is a [marketing] success in part because it gives the illusion that the task is easy, with a sense of coolness.

In practise though, it is hard and there is no way around it. You’ll have to figure out your needs and decide on a practical way to deploy and package your applications that will be tailored just for you. Docker is not the solution to the problem, it’s just a random tool among many others, that may or may not help you.

It’s fair to say that the docker ecosystem is infinitely complex and has a long learning curve. If you have neat applications with clear and limited dependencies, they should be relatively manageable and docker can’t make it any easier. On the contrary, it has the potential to make it harder.

Docker shines to package applications with complex messy dependencies (typical NodeJS and Ruby environments). The dependency hell is taken away from the host and moved into the image and the image creation scripts.

Docker is handsome for dev and test environments. It allows to run multiple applications easily on the same host, isolated from each other. Better yet, some applications have conflicting dependencies and would be impossible to run on a single host otherwise.

You should investigate a configuration management system (Ansible) if you don’t already have one. It will help you to manage, configure and setup [numerous] remote servers, à la SSH on steroid. It’s way more general and practical than Docker (and you’re gonna need it to install docker and deploy images anyway).

Reminder: In spite of the practical use cases, docker should be considered as a beta tool not quite ready for serious production.

Docker in Production: A History of Failure


Introduction

My first encounter with docker goes back to early 2015. Docker was experimented with to find out whether it could benefit us. At the time it wasn’t possible to run a container [in the background] and there wasn’t any command to see what was running, debug or ssh into the container. The experiment was quick, Docker was useless and closer to an alpha prototype than a release.

Fast forward to 2016. New job, new company and docker hype is growing like mad. Developers here have pushed docker into production projects, we’re stuck with it. On the bright side, the run command finally works, we can start, stop and see containers. It is functional.

We have 12 dockerized applications running in production as we write this article, spread over 31 hosts on AWS (1 docker app per host [note: keep reading to know why]).

The following article narrates our journey with Docker, an adventure full of dangers and unexpected turns.

so it begins, the greatest fuck up of our time

Production Issues with Docker

Docker Issue: Breaking changes and regressions

We ran all these versions (or tried to):

1.6 => 1.7 => 1.8 => 1.9 => 1.10 => 1.11 => 1.12

Each new version came with breaking changes. We started on docker 1.6 early this year to run a single application.

We updated 3 months later because we needed a fix only available in later versions. The 1.6 branch was already abandoned.

The versions 1.7 and 1.8 couldn’t run. We moved to the 1.9 only to find a critical bug on it two weeks later, so we upgraded (again!) to the 1.10.

There are all kind of subtle regressions between Docker versions. It’s constantly breaking unpredictable stuff in unexpected ways.

The most tricky regressions we had to debug were network related. Docker is entirely abstracting the host networking. It’s a big mess of port redirection, DNS tricks and virtual networks.

Bonus: Docker was removed from the official Debian repository last year, then the package got renamed from docker.io to docker-engine. Documentation and resources predating this change are obsolete.

Docker Issue: Can’t clean old images

The most requested and most lacking feature in Docker is a command to clean older images (older than X days or not used for X days, whatever). Space is a critical issue given that images are renewed frequently and they may take more than 1GB each.

The only way to clean space is to run this hack, preferably in cron every day:

docker images -q -a | xargs --no-run-if-empty docker rmi

It enumerates all images and remove them. The ones currently in use by running containers cannot be removed (it gives an error). It is dirty but it gets the job done.

The docker journey begins with a clean up script. It is an initiation rite every organization has to go through.

Many attempts can be found on the internet, none of which works well. There is no API to list images with dates, sometimes there are but they are deprecated within 6 months. One common strategy is to read date attribute from image files and call ‘docker rmi‘ but it fails when the naming changes. Another strategy is to read date attributes and delete files directly but it causes corruption if not done perfectly, and it cannot be done perfectly except by Docker itself.

Docker Issue: Kernel support (or lack thereof)

There are endless issues related to the interactions between the kernel, the distribution, docker and the filesystem

We are using Debian stable with backports, in production. We started running on Debian Jessie 3.16.7-ckt20-1 (released November 2015). This one suffers from a major critical bug that crashes hosts erratically (every few hours in average).

Linux 3.x: Unstable storage drivers

Docker has various storage drivers. The only one (allegedly) wildly supported is AUFS.

The AUFS driver is unstable. It suffers from critical bugs provoking kernel panics and corrupting data.

It’s broken on [at least] all “linux-3.16.x” kernel. There is no cure.

We follow Debian and kernel updates very closely. Debian published special patches outside the regular cycle. There was one major bugfix to AUFS around March 2016. We thought it was THE TRUE ONE FIX but it turned out that it wasn’t. The kernel panics happened less frequently afterwards (every week, instead of every day) but they were still loud and present.

Once during this summer there was a regression among a major update, that brought back a previous critical issue. It started killing CI servers one by one, with 2 hours in average between murders. An emergency patch was quickly released to fix the regression.

There were multiple fixes to AUFS published along the year 2016. Some critical issues were fixed but there are many more still left. AUFS is unstable on [at least] all “linux-3.16.x” kernels.

  • Debian stable is stuck on kernel 3.16. It’s unstable. There is nothing to do about it except switching to Debian testing (which can use the kernel 4).
  • Ubuntu LTS is running kernel 3.19. There is no guarantee that this latest update fixes the issue. Changing our main OS would be a major disruption but we were so desperate that we considered it for a while.
  • RHEL/CentOS-6 is on kernel 2.x and RHEL/CentoS-7 is on kernel 3.10 (with many later backports done by RedHat).

Linux 4.x: The kernel officially dropped docker support

It is well-known that AUFS has endless issues and it’s regarded as dead weight by the developers. As a long-standing goal, the AUFS filesystem was finally dropped in kernel version 4.

There is no unofficial patch to support it, there is no optional module, there is no backport whatsoever, nothing. AUFS is entirely gone.

[dramatic pause]

.

.

.

How does docker work without AUFS then? Well, it doesn’t.

[dramatic pause]

.

.

.

So, the docker guys wrote a new filesystem, called overlay.

OverlayFS is a modern union filesystem that is similar to AUFS. In comparison to AUFS, OverlayFS has a simpler design, has been in the mainline Linux kernel since version 3.18 and is potentially faster.” — Docker OverlayFS driver

Note that it’s not backported to existing distributions. Docker never cared about [backward] compatibility.

Update after comments: Overlay is the name of both the kernel module to support it (developed by linux maintainers) and the docker storage driver to use it (part of docker, developed by docker). They are two different components [with a possible overlap of history and developers]. The issues seem mostly related to the docker storage driver, not the filesystem itself.

The debacle of Overlay

A filesystem driver is a complex piece of software and it requires a very high level of reliability. The long time readers will remember the Linux migration from ext3 to ext4. It took time to write, more time to debug and an eternity to be shipped as the default filesystem in popular distributions.

Making a new filesystem in 1 year is an impossible mission. It’s actually laughable when considering that the task is assigned to Docker, they have a track record of unstability and disastrous breaking changes, exactly what we don’t want in a filesystem.

Long story short. That did not go well. You can still find horror stories with Google.

Overlay development was abandoned within 1 year of its initial release.

[dramatic pause]

.

.

.

Then comes Overlay2.

The overlay2 driver addresses overlay limitations, but is only compatible with Linux kernel 4.0 [or later] and docker 1.12” — Overlay vs Overlay2 storage drivers

Making a new filesystem in 1 year is still an impossible mission. Docker just tried and failed. Yet they’re trying again! We’ll see how it turns out in a few years.

Right now it’s not supported on any systems we run. We can’t use it, we can’t even test it.

Lesson learnt: As you can see with Overlay then Overlay2. No backport. No patch. No retro compatibility. Docker only moves forward and breaks things. If you want to adopt Docker, you’ll have to move forward as well, following the releases from docker, the kernel, the distribution, the filesystems and some dependencies.

Bonus: The worldwide docker outage

On 02 June 2016, at approximately 9am (London Time). New repository keys are pushed to the docker public repository.

As a direct consequence, any run of “apt-get update” (or equivalent) on a system configured with the broken repo will fail with an error “Error https://apt.dockerproject.org/ Hash Sum mismatch

This issue is worldwide. It affects ALL systems on the planet configured with the docker repository. It is confirmed on all Debian and ubuntu versions, independent of OS and docker versions.

All CI pipelines in the world which rely on docker setup/update or a system setup/update are broken. It is impossible to run a system update or upgrade on an existing system. It’s impossible to create a new system and install docker on it.

After a while. We get an update from a docker employee: “To give an update; I raised this issue internally, but the people needed to fix this are in the San Francisco timezone [8 hours difference with London], so they’re not present yet.

I personally announce that internally to our developers. Today, there is no Docker CI and we can’t create new systems nor update existing systems which have a dependency on docker. All our hope lies on a dude in San Francisco, currently sleeping.

[pause waiting for the fix, that’s when free food and drinks come in handy]

An update is posted from a Docker guy in Florida at around 3pm (London Time). He’s awake, he’s found out the issue and he’s working on the fix.

Keys and packages are republished later.

We try and confirm the fix at around 5pm (London Time).

That was a 7 hours interplanetary outage because of Docker. All that’s left from the outage is a few messages on a GitHub issue. There was no postmortem. It had little (none?) tech news or press coverage, in spite of the catastrophic failure.

Docker Registry

The docker registry is storing and serving docker images.

Automatic CI build  ===> (on success) push the image to ===> docker registry
Deploy command <=== pull the image from <=== docker registry

There is a public registry operated by docker. As an organization, we also run our own internal docker registry. It’s a docker image running inside docker on a docker host (that’s quite meta). The docker registry is the most used docker image.

There are 3 versions of the docker registry. The client can pull indifferently from any:

Docker Registry Issue: Abandon and Extinguish

The docker registry v2 is as a full rewrite. The registry v1 was retired soon after the v2 release.

We had to install a new thing (again!) just to keep docker working. They changed the configuration, the URLs, the paths, the endpoints.

The transition to the registry v2 was not seamless. We had to fix our setup, our builds and our deploy scripts.

Lesson learnt: Do not trust on any docker tool or API. They are constantly abandoned  and extinguished.

One of the goal of the registry v2 is to bring a better API. It’s documented here, a documentation that we don’t remember existed 9 months ago.

Docker Registry Issue: Can’t clean images

It’s impossible to remove images from the docker registry. There is no garbage collection either, the doc mentions one but it’s not real. (The images do have compression and de-duplication but that’s a different matter).

The registry just grows forever. Our registry can grow by 50 GB per week.

We can’t have a server with an unlimited amount of storage. Our registry ran out of space a few times, unleashing hell in our build pipeline, then we moved the image storage to S3.

Lesson learnt: Use S3 to store images (it’s supported out-of-the-box).

We performed a manual clean-up 3 times in total. In all cases we had to stop the registry, erase all the storage and start a new registry container. (Luckily, we can re-build the latest docker images with our CI).

Lesson learnt: Deleting any file or folder manually from the docker registry storage WILL corrupt it.

To this day, it’s not possible to remove an image from the docker registry. There is no API either. (One of the point of the v2 was to have a better API. Mission failed).

Docker Issue: The release cycle

The docker release cycle is the only constant in the Docker ecosystem:

  1. Abandon whatever exists
  2. Make new stuff and release
  3. Ignore existing users and retro compatibility

The release cycle applies but is not limited to: docker versions, features, filesystems, the docker registry, all API…

Judging by the past history of Docker, we can approximate that anything made by Docker has a half-life of about 1 year, meaning that half of what exist now will be abandoned [and extinguished] in 1 year. There will usually be a replacement available, that is not fully compatible with what it’s supposed to replace, and may or may not run on the same ecosystem (if at all).

We make software not for people to use but because we like to make new stuff.” — Future Docker Epitaph

The current status-quo on Docker in our organization

Growing in web and micro services

Docker first came in through a web application. At the time, it was an easy way for the developers to package and deploy it. They tried it and adopted it quickly. Then it spread to some micro services, as we started to adopt a micro services architecture.

Web applications and micro services are similar. They are stateless applications, they can be started, stopped, killed, restarted without thinking. All the hard stuff is delegated to external systems (databases and backend systems).

The docker adoption started with minor new services. At first, everything worked fine in dev, in testing and in production. The kernel panics slowly began to happen as more web services and web applications were dockerized. The stability issues became more prominent and impactful as we grew.

A few patches and regressions were published over the year. We’ve been playing catchup & workaround with Docker for a while now. It is a pain but it doesn’t seem to discourage people from adopting Docker. Support and demand is still growing inside the organisation.

Note: None of the failures ever affected any customer or funds. We are quite successful at containing Docker.

Banned from the core

We have some critical applications running in Erlang, managed by a few guys in the ‘core’ team.

They tried to run some of their applications in Docker. It didn’t work. For some reasons, Erlang applications and docker didn’t go along.

It was done a long time ago and we don’t remember all the details. Erlang has particular ideas about how the system/networking should behave and the expected load was in thousands of requests per second. Any unstability or incompatibility could justify an outstanding failure. (We know for sure now that the versions used during the trial suffered from multiple major unstability issues).

The trial raised a red flag. Docker is not ready for anything critical. It was the right call. The later crashes and issues managed to confirm it.

We only use Erlang for critical applications. For example, the core guys are responsible for a payment system that handled $96,544,800 in transaction this month. It includes a couple of applications and databases, all of which are under their responsibilities.

Docker is a dangerous liability that could put millions at risk. It is banned from all core systems.

Banned from the DBA

Docker is meant to be stateless. Containers have no permanent disk storage, whatever happens is ephemeral and is gone when the container stops. Containers are not meant to store data. Actually, they are meant by design to NOT store data. Any attempt to go against this philosophy is bound to disaster.

Moreover. Docker is locking away processes and files through its abstraction, they are unreachable as if they didn’t exist. It prevents from doing any sort of recovery if something goes wrong.

Long story short. Docker SHALL NOT run databases in production, by design.

It gets worse than that. Remember the ongoing kernel panics with docker?

A crash would destroy the database and affect all systems connecting to it. It is an erratic bug, triggered more frequently under intensive usage. A database is the ultimate IO intensive load, that’s a guaranteed kernel panic. Plus, there is another bug that can corrupt the docker mount (destroying all data) and possibly the system filesystem as well (if they’re on the same disk).

Nightmare scenario: The host is crashed and the disk gets corrupted, destroying the host system and all data in the process.

Conclusion: Docker MUST NOT run any databases in production, EVER.

Every once in a while, someone will come and ask “why don’t we put these databases into docker?” and we’ll tell some of our numerous war stories, so far, no-one asked twice.

Note: We started going over our Docker history as an integral part of our on boarding process. That’s the new damage control philosophy, kill the very idea of docker before it gets any chance to grow and kill us.

A Personal Opinion

Docker is gaining momentum, there is some crazy fanatic support out there. The docker hype is not only a technological liability any more, it has evolved into a sociological problem as well.

The perimeter is controlled at the moment, limited to some stateless web applications and micro services. It’s unimportant stuff, they can be dockerized and crash once a day, I do not care.

So far, all people who wanted to use docker for important stuff have stopped after a quick discussion. My biggest fear is that one day, a docker fanatic will not listen to reason and keep pushing. I’ll be forced to barrage him and it might not be pretty.

Nightmare scenario: The future accounting cluster revamp, currently holding $23M in customer funds (the M is for million dollars). There is already one guy who genuinely asked the architect “why don’t you put these databases into docker?“, there is no word to describe the face of the architect.

My duty is to customers. Protecting them and their money.

Surviving Docker in Production

gif-what-docker-pretends-to-be
What docker pretends to be.
gif-what-docker-really-is
What docker really is.

Follow releases and change logs

Track versions and change logs closely for kernel, OS, distributions, docker and everything in between. Look for bugs, hope for patches, read everything with attention.

ansible '*' -m shell -a "uname -a"

Let docker crash

Let docker crash. self-explanatory.

Once in a while, we look at which servers are dead and we force reboot them.

Have 3 instances of everything

High availability require to have at least 2 instances per service, to survive one instance failure.

When using docker for anything remotely important, we should have 3 instances of it. Docker die all the time, we need a margin of error to support 2 crashes in a raw to the same service.

Most of the time, it’s CI or test instances that crash. (They run lots of intensive tests, the issues are particularly outstanding). We’ve got a lot of these. Sometimes there are 3 of them crashing in a row in an afternoon.

Don’t put data in Docker

Services which store data cannot be dockerized.

Docker is designed to NOT store data. Don’t go against it, it’s a recipe for disaster.

On top, there are current issues killing the server and potentially destroying the data so that’s really a big no-go.

Don’t run anything important in Docker

Docker WILL crash. Docker WILL destroy everything it touches.

It must be limited to applications which can crash without causing downtime. That means mostly stateless applications, that can just be restarted somewhere else.

Put docker in auto scaling groups

Docker applications should be run in auto-scaling groups. (Note: We’re not fully there yet).

Whenever an instance is crashed, it’s automatically replaced within 5 minutes. No manual action required. Self healing.

Future roadmap

Docker

The impossible challenge with Docker is to come with a working combination of kernel + distribution + docker version + filesystem.

Right now. We don’t know of ANY combination that is stable (Maybe there isn’t any?). We actively look for one, constantly testing new systems and patches.

Goal: Find a stable ecosystem to run docker.

It takes 5 years to make a good and stable software, Docker v1.0 is only 28 months old, it didn’t have time to mature.

The hardware renewal cycle is 3 years, the distribution release cycle is 18-36 months. Docker didn’t exist in the previous cycle so systems couldn’t consider compatibility with it. To make matters worse, it depends on many advanced system internals that are relatively new and didn’t have time to mature either, nor reach the distributions.

That could be a decent software in 5 years. Wait and see.

Goal: Wait for things to get better. Try to not go bankrupt in the meantime.

Use auto scaling groups

Docker is limited to stateless applications. If an application can be packaged as a Docker Image, it can be packaged as an AMI. If an application can run in Docker, it can run in an auto scaling group.

Most people ignore it but Docker is useless on AWS and it is actually a step back.

First, the point of containers is to save resources by running many containers on the same [big] host. (Let’s ignore for a minute the current docker bug that is crashing the host [and all running containers on it], forcing us to run only 1 container per host for reliability).

Thus containers are useless on cloud providers. There is always an instance of the right size. Just create one with appropriate memory/CPU for the application. (The minimum on AWS is t2.nano which is $5 per month for 512MB and 5% of a CPU).

Second, the biggest gain of containers is when there is a complete orchestration system around them to automatically manage creation/stop/start/rolling-update/canary-release/blue-green-deployment. The orchestration systems to achieve that currently do not exist. (That’s where Nomad/Mesos/Kubernetes will eventually come in, there are not good enough in their present state).

AWS has auto scaling groups to manage the orchestration and life cycle of instances. It’s a tool completely unrelated to the Docker ecosystem yet it can achieve a better result with none of the drawbacks and fuck-ups.

Create an auto-scaling group per service and build an AMI per version (tip: use Packer to build AMI). People are already familiar with managing AMI and instances if operations are on AWS, there isn’t much more to learn and there is no trap. The resulting deployment is golden and fully automated. A setup with auto scaling groups is 3 years ahead of the Docker ecosystem.

Goal: Put docker services in auto scaling groups to have failures automatically handled.

CoreOS

Update after comments: Docker and CoreOS are made by separate companies.

To give some slack to Docker for once, it requires and depends on a lot of new advanced system internals. A classic distribution cannot upgrade system internals outside of major releases, even if it wanted to.

It makes sense for docker to have (or be?) a special purpose OS with an appropriate update cycle. It may be the only way to have a working bundle of kernel and operating system able to run Docker.

Goal: Trial the CoreOS ecosystem and assess stability.

In the grand scheme of operations, it’s doable to separate servers for running containers (on CoreOS) from normal servers (on Debian). Containers are not supposed to know (or care) about what operating systems they are running.

The hassle will be to manage the new OS family (setup, provisioning, upgrade, user accounts, logging, monitoring). No clue how we’ll do that or how much work it might be.

Goal: Deploy CoreOS at large.

Kubernetes

One of the [future] major breakthrough is the ability to manage fleets of containers abstracted away from the machines they end up running on, with automatic start/stop/rolling-update and capacity adjustment,

The issue with Docker is that it doesn’t do any of that. It’s just a dumb container system. It has the drawbacks of containers without the benefits.

There are currently no good, battle tested, production ready orchestration system in existence.

  • Mesos is not meant for Docker
  • Docker Swarm is not trustworthy
  • Nomad has only the most basic features
  • Kubernetes is new and experimental

Kubernetes is the only project that intends to solve the hard problems [around containers]. It is backed by resources that none of the other projects have (i.e. Google have a long experience of running containers at scale, they have Googley amount of resources at their disposal and they know how to write working software).

Right now, Kubernetes is young & experimental and it’s lacking documentation. The barrier to entry is painful and it’s far from perfection. Nonetheless, it is [somewhat] working and already benefiting a handful of people.

In the long-term, Kubernetes is the future. It’s a major breakthrough (or to be accurate, it’s the final brick that is missing for containers to be a major [r]evolution in infrastructure management).

The question is not whether to adopt Kubernetes, the question is when to adopt it?

Goal: Keep an eye on Kubernetes.

Note: Kubernetes needs docker to run. It’s gonna be affected by all docker issues. (For example, do not try Kubernetes on anything else than CoreOS).

Google Cloud: Google Container Engine

As we said before, there is no known stable combination of OS + kernel + distribution + docker version, thus there is no stable ecosystem to run Kubernetes on. That’s a problem.

There is a potential workaround: Google Container Engine. It is a hosted Kubernetes (and Docker) as a service, part of Google Cloud.

Google gotta solve the Docker issues to offer what they are offering, there is no alternative. Incidentally, they might be the only guys who can find a stable ecosystem around Docker, fix the bugs, and sell that ready-to-use as a cloud managed service. We might have a shared goal for once.

They already offer the service so that should mean that they already worked around the Docker issues. Thus the simplest way to have containers working in production (or at-all) may be to use Google Container Engine.

Goal: Move to Google Cloud, starting with our subsidiaries not locked in on AWS. Ignore the rest of the roadmap as it’s made irrelevant.

Google Container Engine: One more reason why Google Cloud is the future and AWS is the past (on top of 33% cheaper instances with 3 times the network speed and IOPS, in average).


Why docker is not yet succeeding in production, July 2015, from the Lead Production Engineer at Shopify.

Docker is not ready for primetime, August 2016.

Docker in Production: A retort, November 2016, a response to this article.


Disclaimer (please read before you comment)

A bit of context missing from the article. We are a small shop with a few hundreds servers. At core, we’re running a financial system moving around multi-million dollars per day (or billions per year).

It’s fair to say that we have higher expectations than average and we take production issues rather (too?) seriously.

Overall, it’s “normal” that you didn’t experience all of these issues if you’re not using docker at scale in production and/or if you didn’t use it for long.

I’d like to point out that these are issues and workarounds happening over a period of [more than] a year, summarized all together in a 10 minutes read. It does amplify the dramatic and painful aspect.

Anyway, whatever happened in the past is already in the past. The most important section is the Roadmap. That’s what you need to know to run Docker (or use auto scaling groups instead).

How to present a GitHub project for your resume


Introduction

Companies ask for a GitHub profile. Recruiters ask for a GitHub profile. The question “Do you contribute to open-source?” is now one of the most common questions asked in phone screens.

If people want a GitHub, we shall give them a GitHub. This article will explain how to present a GitHub project for use in a resume.

The given advice can be red from two point of views. As a candidate, it is what to write to introduce and present a software (not necessary on GitHub). As an interviewer (or a fellow developer), it is what to look for to judge the experience of the developer(s) and the quality of a software.

submit application form with a github link
When having a GitHub is mandatory, just like having a name.

Link to a specific project

Put a link to your GitHub in your resume and every application forms you have to fill.

That link must send directly to a project. Never link to the root of your GitHub profile, it doesn’t show anything useful and it’s hard to navigate from there.

It means that you must have ONE project to show. A single demonstration project is enough, don’t need more.

This project will be the “landing page” in web buzzword. This is the first page the employer will see. They will rarely go past it (and they shouldn’t have to) so the page should be a good enough by itself. If they go past it, it’s only because the page grabbed their interests and they wanted to see more.

We’ll write the project page to give a good first impression and show off skills as a software engineer.

Project Structure

A software project can be judged in 5 seconds by looking at the directory structure.

An inexperienced developer is easy to spot. His project doesn’t have any structure. Files are either in unpredictable places or all in the top directory.

There is one project structure to rule them all. There MUST be separate directories for source, test, libraries, compiled binaries, etc…

Whether the naming convention will be “doc” or “docs” is an unimportant detail. For example, here are Simple Folder Structure Conventions for GitHub projects:

.
├── build                   # Compiled files (alternatively `dist`)
├── docs                    # Documentation files (alternatively `doc`)
├── src                     # Source files (alternatively `lib` or `app`)
├── test                    # Automated tests (alternatively `spec` or `tests`)
├── tools                   # Tools and utilities
├── LICENSE
└── README.md
software project structure
A well-organized project

Have a README

Have a README to:

  • Describe the purpose of the project
  • Screenshots/videos
  • Usage
  • Link to the installer/webpage

Have screenshots in the readme

A picture is worth a thousand words.

People are not going to install the application just to see it. Give them screenshots.

Have videos in the readme

A picture is worth a thousand words. A video is worth a thousand pictures.

There is nothing better than a video when it comes to giving a demonstration or showing off an application.

snake game preview animated gif
Great demo from a random snake project on GitHub

Note: GitHub does not allow to embed video files in the readme, use animated gif instead.

Link to a website or an installer

Link to the site if it’s a web application project. Of course, a web application should be running somewhere and publicly accessible, that’s the point of web applications.

Link to the installer if it’s a desktop application project. It’s unlikely that the user will install it but that looks professional, that’s how desktop applications are distributed after all.

Integrate GitHub tools

GitHub has a rich ecosystem of free tools for building, packaging, testing and much more. All these tools are mandatory for professional software development.

It used to be hell to setup the tooling but now everything is readily available for free through GitHub and the setup is dead simple. There is no excuse to not use the tooling.

github-integration-icons

This one is a sample C++ project for a Connect Four. From left to right:

  1. Build on Linux (Travis CI)
  2. Build on Windows (AppVeyor)
  3. Unit tests and coverage analysis (Coveralls)

What about the source code?

Nobody cares about your code. It was quite a shocking moment when I learned this in my programming career. I would take great care in polishing my code only to find out nobody actually cares. It’s not the code that counts, it’s the product. ” — Source

A paragraph to explain the purpose of the application is 10 times faster than guessing it. A quick start video of an [non-trivial] application is 100 times faster than figuring it out. A design diagram is 1000 times faster than reverse engineering the application. All of these could be achieved by reading the source code, at the cost of orders of magnitude more time and headache. It’s extremely slow and difficult to read code (or should we say to decode code). It should only ever be a last resort.

Lesson #1: Noone cares about your source code. Noone is gonna read it.

Lesson #2: Don’t expect people to read it. Don’t force them to.

What if I don’t have big projects to show?

Good. Smaller projects are easier to show, easier to explain and easier to understand for the interviewer. For instance, everyone can grasp a good old Connect Four.

It is not a trivial project despite what it looks like at first. Write a decent UI, put some colors, allow a two players option, add a “hint” to show the best next move, add an AI to play against.

While the game is conceptually simple there is a lot of work to turn it into a good and polished software. That leaves plenty of depth to talk about in a face-to-face interview.

Did you know that the first player in a connect four game always wins? [if playing perfectly] Did you know that the second player can always draw the game, if the first player doesn’t take the middle position as his first move?

Source: A Knowledge-based Approach of Connect-Four, The Game is Solved: White Wins, Victor Allis

Do interviewers really look at GitHub?

As a matter of fact: No, they don’t.

github traffic statistics
GitHub Traffic Statistics

We’ve done the tests. Here are the statistics after sending a bunch of resumes. The 3 views are from myself, I accessed the project while writing this article, without being authenticated to GitHub. Oops, my bad.

From personal experience from the last time I looked for a job. After a dozen of phone interviews (1 dev per call) and a couple of on-sites (4 to 7 devs per on-site), there was only 1 visit to my profile.

Conclusion: Noone cares about GitHub. Noone is gonna read it. Everyone is gonna ask for it nonetheless, cause it’s hype.

Bonus: Since noone will check the link they’re given, you too can refuse to participate in the GitHub masquerade by only linking to the ultimate hello world repository. (Worst case scenario: Just talk about this blog if the interviewers spot the trickery. They love candidates who read blogs).

Cheat Sheet

  1. Structure the project
  2. Have a README
  3. Write a paragraph to explain the purpose of the project
  4. Put screenshots and videos
  5. Distribute an installer (desktop app) or give the website (web app)
  6. Integrate development tools (CI, unit test, packager, etc…)

That’s good practices for software projects. It’s not limited to GitHub.

analytics pipeline architecture overview

Building an Analytics Pipeline in 2016: The Ultimate Guide


Introduction

Imagine a B2C startup. It’s small but profitable and growing.

To be fair it’s doing rather well. New users are registering daily. Revenue growth over 1000% year to year, putting the company straight in a spot comparable to the top 10 companies by revenue per employee.

How did we do that? Who are our main users? Where did they come from? Many questions to which we have no clue. Truth be told, we got this far with zero analytics, zero insights.

It’s past time to track people and their every moves.

i dont know
CEO to Marketing: “Why did we have 1187 paid sign up yesterday? Is it the new TV ad from our main competitor?”

Some numbers

We recently setup a log management solution so we have some numbers for sure.

3.591M HTTP requests per day on our frontends (cached and static contents are not served by these servers). Let’s consider this as page views and say that we want to track every page view

That’s 3.591M views per day, for which we want:

  • IP
  • city
  • country
  • user id
  • page visited
  • referer
  • affiliate source (if any)
  • device
  • operating system
  • date

How much storage does that take?

Some of the string fields can be more than 100 bytes. We’ll add more fields later (when we’ll figure what important stuff we forgot). Indexes and metadata take space on top of the actual data.

As a rule of thumb, let’s assume that each record is 1k on disk.

Thus the analytics data would take 3.6 GB per day (or 1314 GB pear year).

That’s a naive extrapolation. A non-naive plan would account for our traffic growing 5% month-to-month.

When accounting for our sustained growth, we’ll be generating 6.14 GB per day in one year from now. (At which point, the current year’s history will be consuming 1714 GB)

That quick estimation gives a rough approximation on the future volume of data. We’ll want to track more events in the future (e.g. sign-ups, deposits, withdrawals, cancellation), that shouldn’t affect the order of magnitude because page views are the most frequent actions by far. Let’s keep things simple, with a sane target.

Real Life Story

We remember the first attempt of the company at analytics. One dev decided to do analytics single-handedly, for real this time[1]! His first move was to create a new AWS instance with 50GB of disk and install PostgreSQL.

There wasn’t any forethought about what he was doing, the actual needs or the future capacity. A typical case of “just use PostgreSQL“.

In retrospect, that thing was bound to catastrophic failure (again! [1]) within the first month of going live and it was killed during the first design review, for good.

Then we started taking analytics seriously, as the hard problem that it is. We’ll summarize everything we’ve learnt on the way.

[1] That’s not the first attempt at analytics in the company.

analytics pipeline architecture overview
What does an analytics pipeline looks like after 1000 hours heads in

Storage

Storage is a critical component of the analytics.

Spoiler alert: Expect a database of some sort.

What are the hard limits of SQL databases?

As always, the first choice is to take a look at SQL databases.

Competitors:

These numbers are hard limitations, at which point the database will stop accepting writes (and potentially destroy existing data). That gives a definitive indication of when RDBMS are out of their league. As a rule of thumb, it’s time to ditch open-source SQL databases when going over 1 TB.

Notice that the paid databases have significantly higher limits, they have smart storage engines splitting data across files (among other optimizations). Most of the open-source free databases are storing each table as a single file, suffering from the filesystem limitations plus additional hardcoded limitations of the software.

We need a system supporting sharding and replication. It’s critical to manage the sheer volume of data, to not suffer from a single point of failure, and (less important) to improve performances.

For once, relational databases are not the right tool for the job. Let’s look past them.

Note: We are not saying it’s impossible to achieve something with one of these SQL databases, just that it’s not worth the effort.

NoSQL Databases

Competitors: ElasticSearch, Cassandra, MongoDB[1], DynamoDB, BigTable.

The newer generation of NoSQL databases are easier to administer and to maintain. We can add resources and adjust capacity without downtime. When one instance fails, the cluster keeps working and we’re NOT paged at 3 AM. Any these NoSQL databases would be okay, they are similar to each other.

However to support horizontal scaling, these NoSQL databases had to drop “JOINS” support. Joins are mandatory to run complex queries and discover interesting things. That is a critical feature for analytics.

Thus NoSQL databases are not [the best] fit for the purpose of analytics. We need something with horizontal scaling AND joins. Let’s look further.

Note: We are not saying it’s impossible to achieve something with one of these NoSQL databases, just that it’s not worth the effort.

[1] Just kidding about MongoDB. Never use it. It’s poorly designed and too unreliable.

Data Warehouse Databases

Competitors: Hadoop, RedShift, BigQuery

There is a new generation of databases for “data warehouse“. They are meant to store and analyse truckload of data. Exactly what we want to do.

They have particular properties and limitations compared to traditional SQL and NoSQL databases:

  • Data can only be appended in batch jobs
  • Real time queries are not supported

RedShift interface is (mostly) standard SQL, BigQuery interface is a variation of SQL.

Note: Hadoop is a very different beast. It’s meant for Petabyte scale and it’s a lot more complex to setup and use. We’ll ignore Hadoop here.

Database Choice

The right tool for the job is RedShift or BigQuery.

We’re planning to run that thing on AWS so we’ll refer to RedShift storage for the rest of the article.

Client vs Server side analytics

Events are coming from various sources. A common question is client vs server side analytics, which one to do?

The answer is both! They are complimentary.

analytics trackers
Sources: Various trackers, API and services

Client side analytics

It means that events are sent from the customer system, from the customer address. The most common example are JavaScript trackers, they run in the browser, in the customer environment.

The issue with client side scripts is that they run in the client environment and we can’t control it. First, a lot of customers are blocking trackers [1], we won’t receive any information about them. Second, the tracker endpoint must be publicly open, anyone can reverse engineer and flood it with meaningless data [2].

On the other hand, client side scripts are easy to do and they can get some information (e.g. mouse clicks) that are not available by any other means. So we should do client side analytics.

[1] 45% of users had blockers last year. It’s over 50% this year.

[2] As trivial as curling one million times “thesite.com/analytics.js?event=signup&email=bob@mail.com

Server side analytics

It means that events are sent from our servers. For instance, when a customer registers an account, one of our application will receive the request and create the account in our database, this application could send a sign-up event to the analytics service.

Analytics services provide API for developers in the most common languages (Java, python, ruby…) to send events directly from the applications.

Server side analytics have higher quality data and they don’t suffer from poor internet connections.

It’s practical to track specifics events at the place where they happen. For instance, all our applications (website, android and iOS) are calling a single “account management microservice“. We can add one line to that service to track accounts at critical stages (signed-up, confirmed email, added an address).

Integrations

In the end, all analytics should be available in one place: Our new analytics system.

A good analytics system should import data directly from the most common services. In particular we want to import analytics from MailChimp and ZenDesk.

Events Aggregator

This service is responsible for receiving and aggregating events.

It has to be reliable and scale. It is responsible for providing APIs (client-side and server-side) and supporting third-party integrations. It saves events to the storage engine (need RedShift/BigQuery support).

This is the central (and difficult) point of the design.

Segment

The uncontested SaaS leader.

Historically, it was built as an abstraction API allowing to send analytics events to different services (Google Analytics, MixPanel, KissMetrics). It evolved into a complete platform, with hundreds of pluggable components (input source, storage engine and miscellaneous services).

Pros:

  • No maintenance required
  • Fully featured
  • Support more than 100 inputs/outputs out of the box
  • Cheap (for us)

Cons:

  • Bad privacy policy (they sort-of reserve the right to resell everything)
  • It forces sending all data to a third-party
  • Possible regulations and privacy issues

If you’re starting with analytics, you should begin with Segment. You can see and query data right away. You can add other blocks later as your understanding of analytics improve and your needs evolve.

SnowPlow Analytics

The uncontested open-source free on-premise leader.

SnowPlow itself is an event pipeline. It comes with a bunch of API to send events (to one side of the pipeline). The output is written to RedShift (the other side of the pipeline).

As an open-source on-premise solution. We have to deploy and maintain the “pipeline” ourselves. The full guide is on GitHub.

In practice, that “pipeline” is a distributed system comprising 3-6 different applications written in different languages running on different platforms (keywords: elastic beanstalk, scala, kafka, hadoop and some more). It’s a clusterfuck and we are on our own to put it together and make it work. We found the barrier of entry to SnowPlow to be rather high.

snowplow architecture
SnowPlow Architecture

Sadly, SnowPlow is alone in its market (on premise). There are no equivalent paid tools to do the exact same thing with a better architecture and an easier setup. We are cornered here. Either deal with the SnowPlow monster or go with a competitor (which are all cloud services).

Pros:

  • Free (as in no money)
  • On-premise
  • Keep your data to yourself

Cons:

  • A clusterfuck to setup and maintain
  • Unclear capabilities[1] and roadmap

[1] Some critical components are marked as “not ready for production” in the documentation (as of September 2016).

Alooma

Alooma is a recent challenger that fits in a gap between the other players.

It comes with API and common integrations. It outputs data to RedShift.

Alooma itself is a real-time queuing system (based on kafka). Trackers, databases and scripts are components with an input and/or an output. They are arranged into the queuing system to form a complete pipeline. Fields and types can be mapped and converted automatically.

What makes Alooma special:

  1. Real time visualization of the queues
  2. Write custom python scripts to filter/transform fields[0]
  3. Automatic type mapping [0]
  4. Replay capabilities [0]
  5. Queue incoming messages on errors, resume processing later [0]
  6. Data is in-transit. It is not stored in Alooma [1]
  7. Clear data ownership and confidentiality terms

Under some jurisdictions, Alooma is not considered as “a third-party with whom you are sharing private personal identifiable information” because it doesn’t store data[1]. That means less legalese to deal with.

The topic of this post is building an Analytics pipeline. Technically speaking, it will always be a distributed queuing system (the best middleware for that purpose being Kafka) with trackers as input and database as output, plus special engineering to handle the hard problems[0].

That’s exactly what Alooma is selling. They made the dirty work and expose it with limited abstraction. It’s easy to understand and to integrate with. See the the 5 minute quick start video.

Pros:

  • Simple. Essential features only. Limited abstraction.
  • No maintenance required
  • Modular
  • Special middle ground between the other solutions

Cons:

  • Limited integrations (only the most common at the moment)

A word on aggregators

SaaS is cheaper, easier to use and require no maintenance from us.

But we’d rather not go for SaaS because we don’t want to give all our data to a third-party. Especially private customer information (real name, address, email…). Especially when the service has clauses in the order of “We reserve the right to use, access and resell data to anyone for any purpose”.

On premise keeps the privacy and the control.

But all the on premise solutions are free open-source tools. We’d rather not go for that because it takes too much effort to deploy it and keep it running in production. Especially when the documentation is half-arsed and the software is only half-tested and missing major features.

There is no silver bullet here. We’ll have to compromise and find a mix of solutions to make something out.

Visualization

We have the data. We want to look at cute graphs and dashboards.

Some great tools emerged recently. We have solid options here.

Looker

The on-premise leader.

Unanimous positives reviews. One of the next unicorn to look for.

The main page has good screenshots. Try and see for yourself.

looker integrations

Looker is on-premise. We can open the firewall between the looker instances and our critical databases to run queries right away (security note: make a slave with a read-only account). There is no need to send any data to external actors.

ChartIO

The SaaS leader.

Same thing as Looker but in the Cloud.

See the 1 hour training video.

It can query many databases and services (including RedShift). The integrations require special access rights, the worst case scenario is to have the database accessible over a public IP (security note: lock down access to specific client IP with a firewall). There is a hard limitations on what can be reasonably opened to ChartIO.

Periscope

The cheap open-source free tools, as in do it yourself.

It’s just in the list for posterity. Not good enough. We’d rather spend money on Looker.

Final results

We have all the building blocks. Let’s play Lego!

analytics pipeline architecture overview
Components overview

Best in class externalized analytics pipeline

externalized analytics pipeline segment chartio
Best in class fully outsourced analytics

Special trick: No RedShift required. Segment stores everything in an internal SQL database and ChartIO can interface directly with it.

This solution has a very low price to entry, it’s easy to get going and it can evolve gradually.

Pros:

  • Very easy to setup and get started
  • Many integrations and possibilities
  • Modular, start slowly and evolve over time
  • No hardware or software to maintain

Cons:

  • Everything is externalized
  • Give all your data to third party

Pricing (approximate):

Segment is priced per unique user per month. The pricing increases linearly with the number of user, starting fairly low.

ChartIO used to be $99/month for startup, then $499. Not sure what it is now. Gotta speak to sales.

Note: Segment alone is enough to have a working solution. You can ignore ChartIO entirely if can live without the great visualizations (or can’t afford it).

Best in class (kinda) on-premise analytics solution

on premise analytics pipeline alooma redshift looker
Best in class (kinda) on-premise analytics

This solution is advised to bigger companies. It’s more expensive and requires more efforts upfront. The pricing doesn’t grow linearly with the amount of unique users making it advantageous for high volume sites. Looker can query production databases and make cross referencing right away, as it is on-premise sitting next to them[3][4].

Pros:

  • Easy to setup and get started
  • Modular, start slowly and evolve analytics over time
  • No hardware or software to maintain
  • Cover more advanced use cases and run special queries
  • Query from internal databases out-of-the box[3]
  • Analyse sensitive data without having to share them[4]

Cons:

  • Need ALL the components up before it’s usable
  • The price to entry is too high for small companies

Pricing (approximate):

Alooma. To quote a public conversation from the author “Alooma pricing varies greatly. Our customers are paying anywhere between $1000 and $15000 per month. Because the variance is so big, we prefer to have a conversation before providing a quote. There is a two weeks free trial though, to test things out“.

RedShift. The minimum is $216/month for an instance with 160GB of storage. The next bump is $684 for an instance with 2TB of storage. Then it goes on linearly by adding instances. (One instance is a hard minimum, think of it as the base price). Add a few percent for bandwidth and S3.

Looker is under “entreprisey” pricing. They announced a $65k/year standard price list the last time we talked to them. Expect more or less zeros depending on the size of your company. Prepare your sharks to negotiate.

Cheap open-source on-premise analytics solution

Each open-source tool taken separately is inferior to the paid equivalent in terms of features, maintenance, documentation AND polish. The combination of all of them is sub-par but we are presenting it anyway for the sake of history.

SnowPlow analytics + Luigi => redshift => periscope

Small company or lone man with no money and no resources? Forget about this stack and go for segment.com instead. Segment is two orders of magnitude easier to get going, it will save much time and give higher returns quicker. Your analytics can evolve gradually around Segment later (if necessary) as it is extremely modular.

Big company or funded startup in growth stage? Forget about this stack. The combined cost of hardware plus engineering time is more expensive than paying for the good tools right away. Not to mention that the good tools are better.

Personal Note: By now, it should be clear that we are biased again cheap open-source software. Please stop doing that and make great software that is worth paying for instead!

Conclusion

Analytics. Problem solved.

What was impossible 10 years ago and improbable 5 years ago is readily available today. In 5 years from now, people will laugh at how trivial analytics are.

Assembling the pipeline is half the road. The next step is to integrate existing systems with it. Well, time for us to get back to work.

Thank you for reading. Comments, questions and information are welcome.


References:

Streaming Messages from Kafka into RedShift in near Real-Time (Yelp Blog), the long journey of building a custom analytics pipeline at Yelp, similar to what building Alooma in-house would be.

Buffer’s New Data Architecture: How Redshift, Hadoop and Looker Help Us Analyze 500 Million Records in Seconds (Buffer Blog).

Building Out the SeatGeek Data Pipeline (SeatGeek Blog), The solution: Looker, RedShift, and Luigi.

Building Analytics at 500px (500px blog) + The discussion on Hacker News (Hacker News Comments), the discussion is mixing users and founders of various solutions, some of which are not discussed here.

Why we witches from mixpanel and segment to kiss metrics, information about other analytics services, that can complement what we recommend here.

HAProxy vs nginx: Why you should NEVER use nginx for load balancing!


Load balancers are the point of entrance to the datacenter. They are on the critical path to access anything and everything.

That give them some interesting characteristics. First, they are the most important thing to monitor in an infrastructure. Second, they are in a unique position to give insights not only about themselves but also about every service that they are backing.

There are two popular open-source software load balancers: HAProxy and nginx. Let’s see how they compare in this regard.

Enable monitoring on the load balancers

The title is self explanatory. It should be systematic for everything going to production.

  1. Install something new
  2. Enable stats and monitoring stuff
  3. Enable logs

Enabling nginx status page

Edit /etc/nginx/nginx.conf:

server {
    listen 0.0.0.0:6644;
    access_log off;
    
    allow 127.0.0.0/8;
    allow 10.0.0.0/8;
    deny all;
    
    location / {
         stub_status on;
    }
}

Enabling HAProxy stats page

Edit /etc/haproxy/haproxy.cfg:

listen stats 0.0.0.0:6427
    mode http
    maxconn 10
    no log
    
    acl network_allowed src 127.0.0.0/8
    acl network_allowed src 10.0.0.0/8
    tcp-request connection reject if !network_allowed
    
    stats enable
    stats uri /

Collecting metrics from the load balancer

There are standard monitoring solutions: datadog, signalfx, prometheus, graphite… [2]

These tools gather metrics from applications, servers and infrastructure. They allow to explore the metrics, graph them and send alerts.

Integrating the load balancers into our monitoring system is critical. We need to know about active clients, requests/s, error rate, etc…

Needless to say, the monitoring capabilities will be limited by what information is measured and provided by the load balancer.

[2] Sorted by order of awesomeness. Leftmost is better.

Metrics available from nginx

nginx provide only 7 different metrics.

Nginx only gives the sum, over all sites. It is NOT possible to get any number per site nor per application.

Active connections: The current number of active client connections
    including Waiting connections.
accepts: The total number of accepted client connections. 
handled: The total number of handled connections. Generally, the 
    parameter value is the same as accepts unless some resource
    limits have been reached (for example, the worker_connections limit). 
requests: The total number of client requests. 
Reading: The current number of connections where nginx is reading the
    request header. 
Writing: The current number of connections where nginx is writing the
    response back to the client. 
Waiting: The current number of idle client connections waiting for a request.

Source: https://nginx.org/en/docs/http/ngx_http_stub_status_module.html

Metrics available from haproxy

HAProxy provide 61 different metrics.

The numbers are given globally, per frontend and per backend (whichever makes sense). They are available on a human readable web page and in a raw CSV format.

0. pxname [LFBS]: proxy name
1. svname [LFBS]: service name (FRONTEND for frontend, BACKEND for backend,
any name for server/listener)
2. qcur [..BS]: current queued requests. For the backend this reports the
number queued without a server assigned.
3. qmax [..BS]: max value of qcur
4. scur [LFBS]: current sessions
5. smax [LFBS]: max sessions
6. slim [LFBS]: configured session limit
7. stot [LFBS]: cumulative number of connections
8. bin [LFBS]: bytes in
9. bout [LFBS]: bytes out
[...]
32. type [LFBS]: (0=frontend, 1=backend, 2=server, 3=socket/listener)
33. rate [.FBS]: number of sessions per second over last elapsed second
34. rate_lim [.F..]: configured limit on new sessions per second
35. rate_max [.FBS]: max number of new sessions per second
36. check_status [...S]: status of last health check, one of:
37. check_code [...S]: layer5-7 code, if available
38. check_duration [...S]: time in ms took to finish last health check
39. hrsp_1xx [.FBS]: http responses with 1xx code
40. hrsp_2xx [.FBS]: http responses with 2xx code
41. hrsp_3xx [.FBS]: http responses with 3xx code
42. hrsp_4xx [.FBS]: http responses with 4xx code
43. hrsp_5xx [.FBS]: http responses with 5xx code
44. hrsp_other [.FBS]: http responses with other codes (protocol error)
[...]

Source: http://www.haproxy.org/download/1.5/doc/configuration.txt

Monitoring the load balancer

The aforementioned metrics are used to generate a status on the running systems.

First, we’ll see what kind of status page is provided out-of-the-box by each load balancer. Then we’ll dive into third-party monitoring solutions.

nginx status page

The 7 nginx metrics are displayed on a human readable web page, accessible at 127.0.0.1:6644/

nginx-status-page
Nginx Status Page

No kidding. This is what nginx considers a “status page“. WTF?!

It doesn’t display what applications are load balanced. It doesn’t display what servers are online (is there anything even running???). There is nothing to see on that page and it won’t help to debug any issue, ever.

HAProxy stats page

For comparison, let’s see the HAProxy monitoring page, accessible at 127.0.0.1:6427

haproxy-status-page
HAProxy Stats Page

Here we can see which servers are up or down, how much bandwidth is used, how many clients are connected and much more. That’s what monitoring is meant to be.

As an experienced sysadmin once told me: “This page is the most important thing in the universe.” [1]

Whenever something goes wonky. First, you open http://www.yoursite.com in a browser to see how bad it’s broken. Second, you open the HAProxy stats page to find what is broken. At this point, you’ve spot the source of the issue 90% of the time.

[0] This is especially true in environments where there is limited monitoring available, or worse, no monitoring tools at all. The status page is always here ready to help (and if it’s not, it’s only a few config lines away).

Integrating nginx with monitoring systems

All we can get are the 7 metrics from the web status page, of which only the requests/s is noteworthy. It’s not exposed in an API friendly format and it’s impossible to get numbers per site. The only hack we can do is parse the raw text, hopping no spacing will change in future versions.

Given that nginx doesn’t expose any useful information, none of the existing monitoring tools can integrate with it. When there is nothing to get, there is nothing to display and nothing to alert on.

Note: Some monitoring tools actually pretend to support nginx integrations. It means that they parse the text and extract the request/s number. That’s all they can get.

Integrating HAProxy with monitoring systems

In additional to the nice human readable monitoring page, all the HAProxy metrics are available in a CSV format. Tools can (and do) take advantage of it.

For instance, this is the default HAProxy dashboard provided by Datadog:

haproxydash
Datadog pre-made dashboard for HAProxy

Source: http://docs.datadoghq.com/integrations/haproxy/

A Datadog agent installed on the host gathers the HAProxy metrics periodically. The metrics can be graphed, the graphs can be arranged into dashboards (this one is an example), and last but not least we can configure automatic alerts.

The HAProxy stats page gives the current status (at the time the page is generated) whereas the monitoring solution saves the history and allows for debugging back in time.

Why does nginx have no monitoring?

All monitoring capabilities are missing from nginx on purpose. They are not and will never be available for free. Period.

If you are already locked-in by nginx and you need a decent monitoring page and a JSON API for integrating, you will have to pay for the “Nginx Plus” edition. The price starts at $1900 per server per year.

See: https://www.nginx.com/products/pricing/

Conclusion: Avoid nginx at all costs

Load balancers are critical points of transit and the single most important things to monitor in an infrastructure.

Nginx stripped all monitoring features for the sake of money, while pretending to be open-source.

Being left entirely blind on our operations is not acceptable. Stay away from nginx. Use HAProxy instead.

graylog architecture overview

250 GB/day of logs with Graylog: The good, the bad and the ugly


Architecture

graylog-architecture
Graylog Architecture
  • Load Balancer: Load balancer for log input (syslog, kafka, GELF, …)
  • Graylog: Logs receiver and processor + Web interface
  • ElasticSearch: Logs storage
  • MongoDB: Configuration, user accounts and sessions storage

Costs Planning

Hardware requirements

  • Graylog: 4 cores, 8 GB memory (4 GB heap)
  • ElasticSearch: 8 cores, 60 GB memory (30 GB heap)
  • MongoDB: 1 core, 2 GB memory (whatever comes cheap)

AWS bill

 + $ 1656 elasticsearch instances (r3.2xlarge)
 + $  108   EBS optimized option
 + $ 1320   12TB SSD EBS log storage
 + $  171 graylog instances (c4.xlarge)
 + $  100 mongodb instances (t2.small :D)
===========
 = $ 3355
 x    1.1 premium support
===========
 = $ 3690 per month on AWS

GCE bill

 + $  760 elasticsearch instances (n1-highmem-8)
 + $ 2040 12 TB SSD EBS log storage
 + $  201 graylog instances (n1-standard-4)
 + $   68 mongodb (g1-small :D)
===========
 = $ 3069 per month on GCE

GCE is 9% cheaper in total. Admire how the bare elasticsearch instances are 55% cheaper on GCE (ignoring the EBS flag and support options).

The gap is diminished by SSD volumes being more expensive on GGE than AWS ($0.17/GB vs $0.11/GB). This setup is a huge consumer of disk space. The higher disk pricing is eating part of the savings on instances.

Note: The GCE volume may deliver 3 times the IOPS and throughput of its AWS counterpart. You get what you pay for.

Capacity Planning

Performances (approximate)

  • 1600 log/s average, over the day
  • 5000 log/s sustained, during active hours
  • 20000 log/s burst rate

Storage (as measured in production)

  • 138 906 326 logs per day (averaged over the last 7 days)
  • 2200 GB used, for 9 days of data
  • 1800 bytes/log in average

Our current logs require 250 GB of space per day. 12 TB will allow for 36 days of log history (at 75% disk usage).

We want 30 days of searchable logs. Job done!

Competitors

ELK

Dunno, never seen it, never used it. Probably a lot of the same.

Splunk Licensing

The Splunk licence is based on the volume ingested in GB/day. Experience has taught us that we usually get what we pay for, therefore we love to pay for great expensive tools (note: ain’t saying splunk is awesome, don’t know, never used it). In the case of Splunk vs ELK vs Graylog. It’s hard to justify the enormous cost against two free tools which are seemingly okay.

We experienced a DoS an afternoon, a few weeks after our initial small setup: 8000 log/s for a few hours while we were planning for 800 log/s.

A few weeks later, the volume suddenly went up from 800 log/s to 4000 log/s again. This time because debug logs and postgre performance logs were both turned on in production. One team was tracking an Heisenbug while another team felt like doing some performance analysis. They didn’t bother to synchronise.

These unexpected events made two things clear. First, Graylog proved to be reliable and scalable during trial by fire. Second, log volumes are unpredictable and highly variable. A volume-based licensing is a highway to hell, we are so glad to not have had to put up with it.

Judging by the information on Splunk website, the license for our current setup would be in the order of $160k a year. OMFG!

How about the cloud solutions?

One word  : No.
Two words: Strong No.

The amount of sensitive information and private user data available in logs make them the ultimate candidate for not being outsourced, at all, ever.

No amount of marketing from SumoLogic is gonna change that.

Note: We may to be legally forbidden to send our logs data to a third party. Even thought that would take a lawyer to confirm or deny it for sure.

Log management explained

Feel free to read “Graylog” as “<other solution>”. They’re all very similar with most of the same pros and cons.

What Graylog is good at

  1. debugging & postmortem
  2. security and activity analysis
  3. regulations

Good: debugging & postmortem

Logs allow to dive into what happened millisecond by millisecond. It’s the first and last resort tool when it comes to debugging issues in production.

That’s the main reason logs are critical in production. We NEED the logs to debug issues and keep the site running.

Good: activity analysis

Logs give an overview of the activity and the traffic. For instance, where are most frontend requests coming from? who connected to ssh recently?

Good: regulations

When we gotta have searchable logs and it’s not negotiable, we gotta have searchable logs and it’s not negotiable. #auditing

What Graylog is bad at

  1. (non trivial) analytics
  2. graphing and dashboards
  3. metrics (ala. graphite)
  4. alerting

Bad: (non trivial) Analytics

Facts:

1) ElasticSearch cannot do join nor processing (ala mapreduce)
2) Log fields have weak typing
3) [Many] applications send erroneous or shitty data (e.g. nginx)

Everyone knows that an HTTP status code is an integer. Well, not for nginx. It can log an upstream_status_code ‘200‘ or ‘‘ or ‘503, 503, 503‘. Searching nginx logs is tricky and statistics are failing with NaN errors (Not a Number).

Elasticsearch itself has weak typing. It tries to detect field types automatically with variable success (i.e. systematic failure when receiving ambiguous data, defaulting to string type).

The only workaround around is to write field pre/post processors to sanitize inputs but it’s cumbersome when there are unlimited applications and fields each requiring a unique correction.

In the end, the poor input data can break simple searches. The inability to do joins prevents from running complex queries at all.

It would be possible to do analytics by sanitizing log data daily and saving the result to BigQuery/RedShift but it’s too much effort. We better go for a dedicated analytics solution, with a good data pipeline (i.e. NOT syslog).

Lesson learnt: Graylog doesn’t replace a full fledged analytics service.

Bad: Graphing and dashboards

Graylog doesn’t support many kind of graphs. It’s either “how-many-logs-per-minute” or “see-most-common-values-of-that-field” in the past X minutes. (There will be more graphs as the product mature, hopefully). We could make dashboards but we’re lacking interesting graphs to put into them.

edit: graylog v2 is out, it adds automatic geolocation of IP addresses and a map visualization widget.

Bad: Metrics and alerting

Graylog is not meant to handle metrics. It doesn’t gather metrics. The graphs and dashboards capabilities are too limited to make anything useful even if metrics were present. The alerting capability is [almost] non existent.

Lesson learnt: Graylog does NOT substitute to a monitoring system. It is not in competition with datadog and statsd.

Special configuration

ElasticSearch field data

indices.fielddata.cache.size: 20%

By design, field data are loaded in memory when needed and never evicted. They will fill the memory until OutOfMemory exception. It’s not a bug, it’s a feature.

It’s critical to configure a cache limit to stop that “feature“.

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html

ElasticSearch shards are overrated

elasticsearch_shards = 1
elasticsearch_replicas = 1

Shards allow to split an index logically into shards [a shard is equivalent to a virtual index]. Operations on an index are transparently distributed and aggregated across its shards. This architecture allows to scale horizontally by distributing shards across nodes.

Sharding makes sense when a system is designed to use a single [big] index. For instance, a 50 GB index for http://www.apopularforum.com can be split in 5 shards of 10GB and run on a 5 nodes cluster. (Note that a shard MUST fit in the java heap for good performances.)

Graylog (and ELK) have a special mode of operation (inherent to log handling) in where new indices are created periodically. Thus, there is no need to shard each individual index because the architecture is already sharded on a higher level (across indices).

Log retention MUST be based on size

Retention = retention criteria * maximum number of indexes in the cluster.

e.g. 1GB per index * 1000 indices =  1TB of logs are retained

The retention criteria can be a maximum time period [per index], a maximum size [per index], or a maximum document count [per index].

The ONLY viable retention criteria is to limit by maximum index size.

The other strategies are unpredictable and unreliable. Imagine a “fixed rotation every 1 hour” setting, the storage and memory usage of the index will vary widely at 2-3am, at daily peak time, and during a DDoS.

mongodb and small files

smallfiles: true

mongodb is used for storing settings, user accounts and tokens. It’s a small load that can be accommodated by small instances.

By default, mongodb is preallocating journals and database files. Running an empty database takes 5GB on disk (and indirectly memory for file caching and mmap).

The configuration to use smaller files (e.g. 128MB journal instead of 1024MB) is critical to run on small instances with little memory and little disk space.

elasticsearch is awesome

elasticsearch is the easiest database to setup and run in a cluster.

It’s easy to setup, it rebalances automatically, it shards, it scales, it can add/remove nodes at anytime. It’s awesome.

Elasticsearch drops consistency in favour of uptime. It will continue to operate in most circumstances (in ‘yellow’ or ‘red’ state, depending whether replica are available for recovering data) and try to self heal. In the meantime, it ignores the damages and works with a partial view.

As a consequence, elasticsearch is unsuitable for high-consistency use cases (e.g. managing money) which must stop on failure and provide transactional rollback. It’s awesome for everything else.

mongodb is the worst database in the universe

There are extensive documentation about mongodb fucking up, being unreliable and destroying all data.

We came to a definitive conclusion after wasting spending lots of time with mongodb, in a clustered setup, in production. All the shit about mongodb is true.

We stopped counting the bugs, the configuration issues, and the number of times the cluster got deadlocked or corrupted (sometimes both).

Integrating with Graylog

The ugly unspoken truth of log management is that having a solution in place is only 20% of the work. Then most of the work is integrating applications and systems into it.Sadly, it has to be done one at a time.

JSON logs

The way to go is JSON logs. JSON format is clean, simple and well defined.

Reconfigure applications libraries to send JSON messages. Reconfigure middleware to log JSON messages.

nginx

log_format json_logs '{ '
 '"time_iso": "$time_iso8601",'

 '"server_host": "$host",'
 '"server_port": "$server_port",'
 '"server_pid": "$pid",'

 '"client_addr": "$remote_addr",'
 '"client_port": "$remote_port",'
 '"client_user": "$remote_user",'

 '"http_request_method": "$request_method",'
 '"http_request_uri": "$request_uri",'
 '"http_request_uri_normalized": "$uri",'
 '"http_request_args": "$args",'
 '"http_request_protocol": "$server_protocol",'
 '"http_request_length": "$request_length",'
 '"http_request_time": "$request_time",'

 '"ssl_protocol": "$ssl_protocol",'
 '"ssl_session_reused": "$ssl_session_reused",'

 '"http_header_cf_ip": "$http_cf_connecting_ip",'
 '"http_header_cf_country": "$http_cf_ipcountry",'
 '"http_header_cf_ray": "$http_cf_ray",'

 '"http_response_size": "$bytes_sent",'
 '"http_response_body_size": "$body_bytes_sent",'

 '"http_content_length": "$content_length",'
 '"http_content_type": "$content_type",'

 '"upstream_server": "$upstream_addr",'
 '"upstream_connect_time": "$upstream_connect_time",'
 '"upstream_header_time": "$upstream_header_time",'
 '"upstream_response_time": "$upstream_response_time",'
 '"upstream_response_length": "$upstream_response_length",'
 '"upstream_status": "$upstream_status",'

 '"http_status": "$status",'
 '"http_referer": "$http_referer",'
 '"http_user_agent": "$http_user_agent"'
 ' }';
access_log syslog:server=127.0.0.1,severity=notice json_logs;
 error_log syslog:server=127.0.0.1 warn;

syslog-ng

We use syslog-ng to deliver system logs to Graylog.

options {
 # log with microsecond precision
 ts-format(iso);
 frac-digits(6);

 # detect dead TCP connection
 mark-freq(5);
 
 # DNS failover
 time-reopen(10);
 dns-cache-expire(30);
 dns-cache-expire-failed(30);
}
destination d_graylog {
 # DNS balancing
 syslog("graylog-server.internal.brainshare.com" transport("tcp") port(1514));
};

Conclusion

It is perfectly normal to spend 10-20% of the infrastructure costs in monitoring.

Graylog is good. Elasticsearch is awesome. mongodb sucks. Splunk costs an arm (or two). Nothing new in the universe.

From now on and forward, applications should log messages in JSON format. That’s the best way we’ll be able to extract meaningful information out of them.

HackerRank Testing: A glimpse at the company side


HackerRank is an online coding platform. It provides coding tests and questions for companies to screen candidates.

We remember the first time we had to do a test (before joining the company), unsure what were the expectations. Later, we were designing new tests (after joining the company), unsure what to expect from candidates.

We decided to release some insights on our experience, full disclosure. How good people are doing? How the test is evaluated?

Hopefully, that will give everyone a better understanding of what is going on.

Statistics

hr funnel
Last month – 79 candidates

Do or not do, there is no try

We invited 79 people to do the test in the last month… 29% of them never tried.

On the bright side, the more candidates who kick themselves out, the more time we can dedicate to the remaining ones.

You can be a top 71% performer by simply trying! =D

Details

We inaugurated a new test last week and 5 candidates did it over the weekend. They happen to be a representative sample:

  1. Didn’t attempt any of the coding exercises
  2. Answered all coding exercises with “return true” or equivalent algorithm.
  3. Answered exercises not with code but with comments about the train’s Wi-Fi being terrible, especially after the train started moving
  4. Had trouble to solve the SSH-to-our-server exercise without sudo, until he hacked the webserver with a fresh 0-day to elevated his privileges.
  5. Answered all simple questions with simple algorithms, didn’t finish the hard one.

Three failed and two passed. It’s self-evident who is who.

Highest bang for the buck

There is no other form of screening that can scale as well as HackerRank. It is also the fairest interview process since it never discriminates on age, race, years of experience, school or anything.

Designing the test takes a few day.

We pay $5 per invitation and the correction takes 5-15 minutes.

Hall of Shame

Internet is required to complete the test

One candidate tried to do the test on a laptop, in a moving train, over the train’s Wi-Fi. It didn’t go well and he sent us a long email to complaint right after the test.

On the bright side, he wrote long comments in English. On the dark side, he didn’t code any of the simple things (not requiring internet or any documentation) and all the writings prove the internet connection was not that bad.

We considered about giving him a second chance and then we just dropped the case after much confusion and more emails.

Did he think that internet is unnecessary to access http://www.hackerrank.com? Is the connectivity usually good in train? Does he do the same thing for Skype interviews? We don’t know and we’ll never know. We are still puzzled to this day.

We’ve added a note to our introductory email to clarify: “Internet access is required, for the whole duration of the test“.

“return true” is NOT the ultimate answer to everything

We are seeing a lot of stupid answers. Probably just to grab some points.

Class Solution {
    // str : firstname|lastname|phonenumber|address|zipcode|country
    bool filter(String str) {
        return true;
    }
}
int max(int array[], int size) {
    return array[0];
}

Booleans are about 50-50 by the law of probability, integers can get lucky with 0 or -1, arrays with the first or last element.

Passing 50% of tests is good value for the time invested but it won’t survive a code review. (Not to mention that 80% of the point could be on the harder test cases).

Tip and tricks for candidates

Complexity

As a candidate, you cannot see the unit tests content, the edge cases or the complexity expected.

The question gives bounds on the input size. The title and tags gives a hint about the expected solution (e.g. dynamic programming). Read that wisely.

64 bits integer

Many questions require 64 bits integers but it’s NEVER mentioned. Go for 64 bits integers as default whenever there is an array with thousands of integers and some additions (e.g. all trading-like and number-crunching questions).

Unit Tests

The unit tests are NOT ordered in ascending difficulty and they may have limited variety.

For instance, if there are 8 tests (excluding examples), that could be 4 tests with 64 bits results + 6 tests with 50 MB of input data + 1 test with a single number.

A slight difference in complexity or an unhandled edge case may turn around many tests.

Timeouts

A test case has between 1 and 5 seconds to be run (depending on the language). A “timeout error” on a test means that it didn’t finish in the given time and was terminated. Gotta write faster code.

All your code is reviewed

On the recruiter interface, we can see the code that was submitted, we have the input and the output of all test cases. Including errors and partial output.

We review everything, we evaluate algorithms, we evaluate complexity, we read comments, we consider special hacks/tricks, we check edge cases.

Points

HackerRank gives points per question and per unit test successful. We get a general sense of completion when we open the review windows “x/300 points” but ultimately the decision comes down to the code review.

Time Spent

We have an overview of the time spent on the test.

hr test time report
1-4: MCQ question, 5-8: coding exercise, total: 60 minutes

HackerRank is simple

Whatever a test contains, the candidate will usually advance to the next round if he can answer some of the coding exercises.

A developer should be able to code some solutions to some [simple] problems. That’s exactly what HackerRank is testing.

HackerRank is good for everyone

Once in a while there is a company with a crazy impossible test that is rejecting everyone. The company would do the same thing if it were face-to-face. You just avoided an awkward 4h on-site interview.

Sample Test

There is only one important thing to do before attempting a test. Try the the sample test  to familiarize yourself with the platform and ensure everything is working.

Conclusion

Recruiting takes a huge amount of effort on everyone involved. HackerRank’s purpose is to save a lot of time and effort by weeding out people earlier [especially utterly unqualified people]. Most of these would fail in the same way in a phone or face-to-face interview.

It’s good and it’s extremely effective. It can replace the initial phone screen.

 

Cracking the HackerRank Test: 100% score made easy


Foreword

It’s well known that most programmers wannabes can’t code their way out of a paper bag. As a consequence, the tech industry is pushing for longer, harder and evermore extreme screening.

The whiteboard interview has been the standard for a while, followed by puzzles [now abandoned], then FizzBuzz.

The latest fad is HackerRank. It’s introducing automated programming tests to be done by the candidate before he’s allowed to talk to anyone in the company.

A lot of very good companies are using HackerRank as a pre-screening tool. If we can’t avoid it, we gotta embrace it.

What to find in a HackerRank test?

There are 3 types of questions to be encountered in a test:

  • Multiple Choice Questions: “What is the time complexity to find an element in a red and black tree?” -A- -B- -C- -D-
  • Coding Exercise: “<Long description of a problem to be solved>, <input data format>, <output data format>.” Start coding a solution.
  • SudoRank Exercise: “Your ssh credentials are tester:QWERTUIOP@123.98.45.76 <long description of what’s wrong with that server>.” SSH to the server and start fixing.

Any amount of any question can be put together, in any order, to make a complete test. A company should give some indications on what to expect in its test.

HackerRank provides a library of hundreds of questions and exercises ready to use. It’s also possible for a company to write their own (and recommended).

Defeating Multiple Choice Question

The majority of the multiple choice questions can be solved by an appropriate Google search. Usually on the title, sometimes on a few select words from the text.

hr question dropping privileges
Select Text => Right Click => Quick Search
hr google dropping privileges
Google has spoken! => all in favour of setuid()

Defeating Coding Exercises

Searching for a 10 lines long paragraph in Google is not an acceptable option. Not to mention that the HackerRank website disables copy/paste in the description area.

The workaround is to search for the title of the exercise. A title uniquely identifies a question on HackerRank. It will be mentioned in related solutions and blog posts. Perfect for being indexed by Google.

hr question lonely integer
Select Text => Right Click => Quick Search
hr google lonely integer.png
Touché!

The first result is the question, the second result is the solution. Well, that was easy.

Bonus: That google solution is actually wrong… yet it gives all the points.

// [boilerplate omitted]
int main() {
    int N;
    cin >> N;
    int tmp, result = 0;
    for (int i = 0; i < N; i++) {         cin >> tmp;
        result ^= tmp;
    }
    cout << result;
    return 0;
}

This solution is only correct if duplicated numbers are in pairs. All the HackerRank unit tests happen to fit this criteria by pure coincidence [cf. Addendum].

Originally, we put this simple question at the beginning of a test for warm-up. We received that answer from a candidate in our first batch of applicants. It was quite puzzling, what are the odds that someone would come up with an algorithm that convoluted if given only the text from the question? A quick investigation quickly revealed the source.


Addendum:

The “Lonely Integer” question is worded slightly differently in the public HackerRank site and the private HackerRank library but the input, output and unit tests are the same. Hence why the solution is off but works. HackerRank is obviously copying questions from the community into the professional library. That’s another copy-cat busted!


Recruiter Insights: Cheating brought to the next level

We have a lot of candidates coming from recruiters. How are they comparing to candidates from other sources?

Let’s see the statistics on a hard question [i.e. dynamic programming trading algorithm].

hr insights stock maximize distribution
Distribution over all attempts, by all companies (log scale). 1234 zero vs 303 full score.

Most candidates get 0 points: ran out of time, unable to answer, wrong algorithm, or incomplete/partial solutions (i.e. good start but not enough to pass any unit test yet).

We wanted to show the same distribution over our pool of candidates but HackerRank doesn’t provide that graph anymore. It used to.

Anyway, we remember approximate numbers. The distribution for our candidates is about 50/50% on each extreme. That’s significantly better than the 75/17% from the general population. We can correlate that number with the time spent on the question and a visual code review.

The result is that candidates coming from recruiters perform better, especially on hard exercises. In fact it is unbelievable how much better they perform! (Special kudos to the guys who are able to solve a problem -with a perfectly optimized solution- in less time than it would take to actually read it =D).

The conclusion is plain and simple: Our recruiters give away the test to the candidates.

everybody does it, it's just that nobody talkes about it
Do recruiters give away your questions? Of course!

Lesson learnt:

  • For candidates: Remember to ask the recruiter for support before the test.
  • For recruiters: Remember to coach the candidate for the test and instruct him to write down changes (if any).
  • For companies: Beware high-score candidates coming from recruiters! In particular, don’t calibrate scoring based on extremes scores from a few cheaters.

Challenge: How long does it take you to solve a trading challenge? [dynamic programming, medium difficulty].

Custom HackerRank Tests

Companies can write custom exercises and they should. It’s hard and it requires particular skills but it is definitely worthwhile.

It is the only effective solution against Google, if done carefully. (It’s actually surprisingly  difficult to make exercises that are both simple AND not easily found with Google on 1000 tutorials and coding forums).

Sadly, it won’t help against recruiters. (Excluding the first batch of candidates who will be sacrificed as scouts).

Conclusion

Did we just ruin HackerRank pre-screening? Of course not! There is a never-ending supply of bozos unable to tell the difference between Internet and Internet Explorer.

We could write a book teaching the answers to 90% of programming interviews problems, yet 99% of job seekers would never read it. Hell, it’s been written for a while and it had no impact whatsoever.

Only the handful of devs following blogs/news or searching for “What is HackerRank?” will be able to come better prepared.

If anything, this article makes HackerRank better and more relevant. Now a test is about looking for help on Google and fixing subtly broken snippets of unindented code written in the wrong language.

HackerRank is finally screening for capabilities relevant to the job!

GCE vs AWS in 2016: Why you should NEVER use Amazon!


Foreword

This story relates my experience at a typical web startup. We are running hundreds of instances on AWS, and we’ve been doing so for some time, growing at a sustained pace.

Our full operation is in the cloud: webservers, databases, micro-services, git, wiki, BI tools, monitoring… That includes everything a typical tech company needs to operate.

We have a few switches and a router left in the office to provide internet access and that’s all, no servers on-site.

The following highlights many issues encountered day to day on AWS so that [hopefully] you don’t do the same mistakes we’ve done by picking AWS.

What does the cloud provide?

There are a lot of clouds: GCE, AWS, Azure, Digital Ocean, RackSpace, SoftLayer, OVH, GoDaddy… Check out our article Choosing a Cloud Provider: AWS vs GCE vs SoftLayer vs DigitalOcean vs …

We’ll focus only on GCE and AWS in this article. They are the two majors, fully featured, shared infrastructure, IaaS offerings.

They both provide everything needed in a typical datacenter.

Infrastructure and Hardware:

  • Get servers with various hardware specifications
  • In multiple datacenters across the planet
  • Remote and local storage
  • Networking (VPC, subnets, firewalls)
  • Start, stop, delete anything in a few clicks
  • Pay as you go

Additional Managed Services (optional):

  • SQL Database (RDS, Cloud SQL)
  • NoSQL Database (DynamoDB, Big Table)
  • CDN (CloudFront, Google CDN)
  • Load balancer (ELB, Google Load Balancer)
  • Long term storage (S3, Google Storage)

Things you must know about Amazon

GCE vs AWS pricing: Good vs Evil

Real costs on the AWS side:

  • Base instance plus storage cost
  • Add provisioned IOPS for databases (normal EBS IO are not reliable enough)
  • Add local SSD (675$ per 800 GB + 4 CPU + 30 GB. ALWAYS ALL together)
  • Add 10% on top of everything for Premium Support (mandatory)
  • Add 10% for dedicated instances or dedicated hosts (if subject to regulations)

Real costs on the GCE side:

  • Base instance plus storage cost
  • Enjoy fast and dependable IOPS out-of-the-box on remote SSD volumes
  • Add local SSD (82$ per 375 GB, attachable to any existing instance)
  • Enjoy automatic discount for sustained usage (~30% for instances running 24/7)

AWS IO are expensive and inconsistent

EBS SSD volumes: IOPS, and P-IOPS

We are forced to pay for Provisioned-IOPS whenever we need dependable IO.

The P-IOPS are NOT really faster. They are slightly faster but most importantly they have a lower variance (i.e. 90%-99.9% latency). This is critical for some workload (e.g. databases) because normal IOPS are too inconsistent.

Overall, P-IOPS can get very expensive and they are pathetic compared to what any drive can do nowadays (720$/month for 10k P-IOPS, in addition to $0.14 per GB).

Local SSD storage

Local SSD storage is only available via the i2 instances family which are the most expensive instances on AWS (and over all clouds).

There is no granularity possible. CPU, memory and SSD storage amount all DOUBLE between the few i2.xxx instance types available. They grow in powers of 4CPU + 30GB memory + 800 GB SSD and the multiplier is $765/month.

These limitations make local SSD storage expensive to use and special to manage.

AWS Premium Support is mandatory

The premium support is +10% on top of the total AWS bill (i.e. EC2 instances + EBS volumes + S3 storage + traffic fees + everything).

Handling spikes in traffic

ELB cannot handle sudden spikes in traffic. They need to be scaled manually by support beforehand.

An unplanned event is a guaranteed 5 minutes of unreachable site with 503 errors.

Handling limits

All resources are artificially limited by a hardcoded quota, which is very low by default. Limits can only be increased manually, one by one, by sending a ticket to the support.

I cannot fully express the frustration when trying to spawn two c4.large instances (we already got 15) only to fail because “limit exhaustion: 15 c4.large in eu-central region“. Message support and wait for one day of back and forth email. Then try again and fail again because “limit exhaustion: 5TB of EBS GP2 in eu-central region“.

This circus goes on every few weeks, sometimes hitting 3 limits in a row. There are limits for all resources, by region, by availability zone, by resource types and by resource specifics criteria.

Paying guarantees a 24h SLA to get a reply to a limit ticket. The free tiers might have to wait for a week (maybe more), being unable to work in the meantime. It is an absurd yet very real reason pushing for premium support.

Handling failures on the AWS side

There is NO log and NO indication of what’s going on in the infrastructure. The support is required whenever something wrong happens.

For example. An ELB started dropping requests erratically. After contacting support, they acknowledged to have no idea what’s going on and took action “Thank you for your request. One of the ELB was acting weird, we stopped it and replaced it with a new one“.

The issue was fixed. Sadly, they don’t provide any insight or meaningful information. This is a strong pain point for debugging and planning future failures.

Note: We are barraging further managed service from being introduced in our stack. At first they were tried because they were easy to setup (read: limited human time and a bit of curiosity). They soon proved to be causing periodic issues while being impossible to debug and troubleshoot.

ELB are unsuitable to many workloads

[updated paragraph after comments on HN]

ELB are only accessible with a hostname. The underlying IPs have a TTL of 60s and can change at any minute.

This makes ELB unsuitable for all services requiring a fixed IP and all services resolving the IP only once at startup.

ELB are impossible to debug when they fail (they do fail), they can’t handle sudden spike and the CloudWatch graphs are terrible. (Truth be told. We are paying Datadog $18/month per node to entirely replace CloudWatch).

Load balancing is a core aspect of high-availability and scalable design. Redundant  load balancing is the next one. ELB are not up to the task.

The alternative to ELB is to deploy our own HAProxy in pairs with VRRP/keepalived. It takes multiple weeks to setup properly and deploy in production.

By comparison, we can achieve that with google load balancers in a few hours. A Google load balancer can have a single fixed IP. That IP can go from 1k/s to 10k/s requests instantly without loosing traffic. It just works.

Note: Today, we’ve seen one service in production go from 500 requests/s to 15000 requests/s in less than 3 seconds. We don’t trust an ELB to be in the middle of that

Dedicated Instances

Dedicated instances are Amazon EC2 instances that run in a virtual private cloud (VPC) on hardware that’s dedicated to a single customer. Your Dedicated instances are physically isolated at the host hardware level from your instances that aren’t Dedicated instances and from instances that belong to other AWS accounts.

Dedicated instances/hosts may be mandatory for some services because of legal compliance, regulatory requirements and not-having-neighbours.

We have to comply to a few regulations so we have a few dedicated options here and there. It’s 10% on top of the instance price (plus a $1500 fixed monthly fee per region).

Note: Amazon doesn’t explain in great details what dedicated entails and doesn’t commit to anything clear. Strangely, no regulators pointed that out so far.

Answer to HN comments: Google doesn’t provide “GCE dedicated instances”. There is no need for it. The trick is that regulators and engineers don’t complain about not having something which is non-existent, they just live without it and our operations get simpler.

Reserved Instances are bullshit

A reservation is attached to a specific region, an availability zone, an instance type, a tenancy, and more. In theory the reservation can be edited, in practice that depends on what to change. Some combinations of parameters are editable, most are not.

Plan carefully and get it right on the first try, there is no room for errors. Every hour of a reservation will be paid along the year, no matter whether the instance is running or not.

For the most common instance types, it takes 8-10 months to break even on a yearly reservation. Think of it as gambling game in a casino. A right reservation is -20% and a wrong reservation is +80% on the bill. You have to be right MORE than 4/5 times to save any money.

Keep in mind that the reserved instances will NOT benefit from the regular price drop happening every 6-12 months. If there is a price drop early on, you’re automatically loosing money.

Critical Safety Notice: 3 years reservation is the most dramatic way to loose money on AWS. We’re talking potential 5 digits loss here, per click. Do not go this route. Do not let your co-workers go this route without a warning. 

What GCE does by comparison is a PURELY AWESOME MONTHLY AUTOMATIC DISCOUNT. Instances hours are counted at the end of every month and discount is applied automatically (e.g. 30% for instances running 24/7). The algorithm also accounts for multiple started/stopped/renewed instances, in a way that is STRONGLY in your favour.

Reserving capacity does not belong to the age of Cloud, it belongs to the age of data centers.

AWS Networking is sub-par

Network bandwidth allowance is correlated with the instance size.

The 1-2 cores instances peak around 100-200 Mbps. This is very little in a world more and more connected where so many things rely on the network.

Typical things experiencing slow down because of the rate limited networking:

  • Instance provisioning, OS install and upgrade
  • Docker/Vagrant image deployment
  • sync/sftp/ftp file copying
  • Backups and snapshots
  • Load balancers and gateways
  • General disk read/writes (EBS is network storage)

Our most important backup takes 97 seconds to be copied from the production host to another site location. Half time is saturating the network bandwidth (130 Mbps bandwidth cap), half time is saturating the EBS volume on the receiving host (file is buffered in memory during initial transfer then 100% iowait, EBS bandwidth cap).

The same backup operation would only take 10-20 seconds on GCE with the same hardware.

Cost Comparison

This post wouldn’t be complete without an instance to instance price comparison.

In fact, it is so important that it was split to dedicated article: A typical cost comparison between GCE an AWS

Hidden fees everywhere + unreliable capabilities = human time wasted in workarounds

Capacity planning and day to day operations

Capacity planning is unnecessary hard with the not-scalable resources, unreliable performances capabilities, insufficient granularity, and hidden constraints everywhere. Planning cost is a nightmare.

Every time we have to add an instance. We have to read the instances page, pricing page, EBS page again. There are way too many choices, some of which being hard to change latter. That could be printed on papers and cover a4x7 feet table. By comparison it takes only 1 page both-sided to pick an appropriate instance from Google.

Optimizing usage is doomed to fail

The time taken to optimizing reserved instance is a similar cost to the savings done.

Between CPU count, memory size, EBS volume size, IOPS, P-IOPS. Everything is over-provisioned on AWS. Partly because there are too many things to follow and  optimize for a human being, partly as workaround against the inconsistent capabilities, partly because it is hard to fix later for some instances live in production.

All these issues are directly related to the underlying AWS platform itself, being not neat and unable to scale horizontal cleanly, neither in hardware options, nor in hardware capabilities nor money-wise.

Every time we think about changing something to reduce costs, it is usually more expensive than NOT doing anything (when accounting for engineering time).

Conclusion

AWS has a lot of hidden costs and limitations. System capabilities are unsatisfying and cannot scale consistently. Choosing AWS was a mistake. GCE is always a better choice.

GCE is systematically 20% to 50% cheaper for the equivalent infrastructure, without having to do any thinking or optimization. Last but not least it is also faster, more reliable and easier to use day-to-day.

The future of our company

Unfortunately, our infrastructure on AWS is working and migrating is a serious undertaking.

I learned recently that we are a profitable company, more so than I thought. Looking at the top 10 companies by revenue per employee, we’d be in the top 10. We are stuck with AWS in the near future and the issues will have to be worked around with lots of money. The company is able to cover the expenses and cost optimisation ain’t a top priority at the moment.

There’s a saying “throwing money at a problem“. We shall say “throwing houses at the problem” from now on as it better represents the status quo.

If we get to keep growing at the current pace, we’ll have to scale vertically, and by that we mean “throwing buildings at Amazon”😀

burning money
The official AWS answer to all their issues: “Get bigger instances”