Foreword

This story relates my experience at a typical web startup. We are running hundreds of instances on AWS, and we’ve been doing so for some time, growing at a sustained pace.

Our full operation is in the cloud: webservers, databases, micro-services, git, wiki, BI tools, monitoring… That includes everything a typical tech company needs to operate.

We have a few switches and a router left in the office to provide internet access and that’s all, no servers on-site.

The following highlights many issues encountered day to day on AWS so that [hopefully] you don’t do the same mistakes we’ve done by picking AWS.

What does the cloud provide?

There are a lot of clouds: GCE, AWS, Azure, Digital Ocean, RackSpace, SoftLayer, OVH, GoDaddy… Check out our article Choosing a Cloud Provider: AWS vs GCE vs SoftLayer vs DigitalOcean vs …

We’ll focus only on GCE and AWS in this article. They are the two majors, fully featured, shared infrastructure, IaaS offerings.

They both provide everything needed in a typical datacenter.

Infrastructure and Hardware:

Get servers with various hardware specifications
In multiple datacenters across the planet
Remote and local storage
Networking (VPC, subnets, firewalls)
Start, stop, delete anything in a few clicks
Pay as you go

Additional Managed Services (optional):

SQL Database (RDS, Cloud SQL)
NoSQL Database (DynamoDB, Big Table)
CDN (CloudFront, Google CDN)
Load balancer (ELB, Google Load Balancer)
Long term storage (S3, Google Storage)
…

Things you must know about Amazon

GCE vs AWS pricing: Good vs Evil

Real costs on the AWS side:

Base instance plus storage cost
Add provisioned IOPS for databases (normal EBS IO are not reliable enough)
Add local SSD (675$ per 800 GB + 4 CPU + 30 GB. ALWAYS ALL together)
Add 10% on top of everything for Premium Support (mandatory)
Add 10% for dedicated instances or dedicated hosts (if subject to regulations)

Real costs on the GCE side:

Base instance plus storage cost
Enjoy fast and dependable IOPS out-of-the-box on remote SSD volumes
Add local SSD (82$ per 375 GB, attachable to any existing instance)
Enjoy automatic discount for sustained usage (~30% for instances running 24/7)

AWS IO are expensive and inconsistent

EBS SSD volumes: IOPS, and P-IOPS

We are forced to pay for Provisioned-IOPS whenever we need dependable IO.

The P-IOPS are NOT really faster. They are slightly faster but most importantly they have a lower variance (i.e. 90%-99.9% latency). This is critical for some workload (e.g. databases) because normal IOPS are too inconsistent.

Overall, P-IOPS can get very expensive and they are pathetic compared to what any drive can do nowadays (720$/month for 10k P-IOPS, in addition to $0.14 per GB).

Local SSD storage

Local SSD storage is only available via the i2 instances family which are the most expensive instances on AWS (and over all clouds).

There is no granularity possible. CPU, memory and SSD storage amount all DOUBLE between the few i2.xxx instance types available. They grow in powers of 4CPU + 30GB memory + 800 GB SSD and the multiplier is $765/month.

These limitations make local SSD storage expensive to use and special to manage.

AWS Premium Support is mandatory

The premium support is +10% on top of the total AWS bill (i.e. EC2 instances + EBS volumes + S3 storage + traffic fees + everything).

Handling spikes in traffic

ELB cannot handle sudden spikes in traffic. They need to be scaled manually by support beforehand.

An unplanned event is a guaranteed 5 minutes of unreachable site with 503 errors.

Handling limits

All resources are artificially limited by a hardcoded quota, which is very low by default. Limits can only be increased manually, one by one, by sending a ticket to the support.

I cannot fully express the frustration when trying to spawn two c4.large instances (we already got 15) only to fail because “limit exhaustion: 15 c4.large in eu-central region“. Message support and wait for one day of back and forth email. Then try again and fail again because “limit exhaustion: 5TB of EBS GP2 in eu-central region“.

This circus goes on every few weeks, sometimes hitting 3 limits in a row. There are limits for all resources, by region, by availability zone, by resource types and by resource specifics criteria.

Paying guarantees a 24h SLA to get a reply to a limit ticket. The free tiers might have to wait for a week (maybe more), being unable to work in the meantime. It is an absurd yet very real reason pushing for premium support.

Handling failures on the AWS side

There is NO log and NO indication of what’s going on in the infrastructure. The support is required whenever something wrong happens.

For example. An ELB started dropping requests erratically. After contacting support, they acknowledged to have no idea what’s going on and took action “Thank you for your request. One of the ELB was acting weird, we stopped it and replaced it with a new one“.

The issue was fixed. Sadly, they don’t provide any insight or meaningful information. This is a strong pain point for debugging and planning future failures.

Note: We are barraging further managed service from being introduced in our stack. At first they were tried because they were easy to setup (read: limited human time and a bit of curiosity). They soon proved to be causing periodic issues while being impossible to debug and troubleshoot.

ELB are unsuitable to many workloads

[updated paragraph after comments on HN]

ELB are only accessible with a hostname. The underlying IPs have a TTL of 60s and can change at any minute.

This makes ELB unsuitable for all services requiring a fixed IP and all services resolving the IP only once at startup.

ELB are impossible to debug when they fail (they do fail), they can’t handle sudden spike and the CloudWatch graphs are terrible. (Truth be told. We are paying Datadog $18/month per node to entirely replace CloudWatch).

Load balancing is a core aspect of high-availability and scalable design. Redundant load balancing is the next one. ELB are not up to the task.

The alternative to ELB is to deploy our own HAProxy in pairs with VRRP/keepalived. It takes multiple weeks to setup properly and deploy in production.

By comparison, we can achieve that with google load balancers in a few hours. A Google load balancer can have a single fixed IP. That IP can go from 1k/s to 10k/s requests instantly without loosing traffic. It just works.

Note: Today, we’ve seen one service in production go from 500 requests/s to 15000 requests/s in less than 3 seconds. We don’t trust an ELB to be in the middle of that

Dedicated Instances

“Dedicated instances are Amazon EC2 instances that run in a virtual private cloud (VPC) on hardware that’s dedicated to a single customer. Your Dedicated instances are physically isolated at the host hardware level from your instances that aren’t Dedicated instances and from instances that belong to other AWS accounts.”

Dedicated instances/hosts may be mandatory for some services because of legal compliance, regulatory requirements and not-having-neighbours.

We have to comply to a few regulations so we have a few dedicated options here and there. It’s 10% on top of the instance price (plus a $1500 fixed monthly fee per region).

Note: Amazon doesn’t explain in great details what dedicated entails and doesn’t commit to anything clear. Strangely, no regulators pointed that out so far.

Answer to HN comments: Google doesn’t provide “GCE dedicated instances”. There is no need for it. The trick is that regulators and engineers don’t complain about not having something which is non-existent, they just live without it and our operations get simpler.

Reserved Instances are bullshit

A reservation is attached to a specific region, an availability zone, an instance type, a tenancy, and more. In theory the reservation can be edited, in practice that depends on what to change. Some combinations of parameters are editable, most are not.

Plan carefully and get it right on the first try, there is no room for errors. Every hour of a reservation will be paid along the year, no matter whether the instance is running or not.

For the most common instance types, it takes 8-10 months to break even on a yearly reservation. Think of it as gambling game in a casino. A right reservation is -20% and a wrong reservation is +80% on the bill. You have to be right MORE than 4/5 times to save any money.

Keep in mind that the reserved instances will NOT benefit from the regular price drop happening every 6-12 months. If there is a price drop early on, you’re automatically loosing money.

Critical Safety Notice: 3 years reservation is the most dramatic way to loose money on AWS. We’re talking potential 5 digits loss here, per click. Do not go this route. Do not let your co-workers go this route without a warning.

What GCE does by comparison is a PURELY AWESOME MONTHLY AUTOMATIC DISCOUNT. Instances hours are counted at the end of every month and discount is applied automatically (e.g. 30% for instances running 24/7). The algorithm also accounts for multiple started/stopped/renewed instances, in a way that is STRONGLY in your favour.

Reserving capacity does not belong to the age of Cloud, it belongs to the age of data centers.

AWS Networking is sub-par

Network bandwidth allowance is correlated with the instance size.

The 1-2 cores instances peak around 100-200 Mbps. This is very little in a world more and more connected where so many things rely on the network.

Typical things experiencing slow down because of the rate limited networking:

Instance provisioning, OS install and upgrade
Docker/Vagrant image deployment
sync/sftp/ftp file copying
Backups and snapshots
Load balancers and gateways
General disk read/writes (EBS is network storage)

Our most important backup takes 97 seconds to be copied from the production host to another site location. Half time is saturating the network bandwidth (130 Mbps bandwidth cap), half time is saturating the EBS volume on the receiving host (file is buffered in memory during initial transfer then 100% iowait, EBS bandwidth cap).

The same backup operation would only take 10-20 seconds on GCE with the same hardware.

Cost Comparison

This post wouldn’t be complete without an instance to instance price comparison.

In fact, it is so important that it was split to dedicated article: Google Cloud is 50% cheaper than AWS.

Hidden fees everywhere + unreliable capabilities = human time wasted in workarounds

Capacity planning and day to day operations

Capacity planning is unnecessary hard with the not-scalable resources, unreliable performances capabilities, insufficient granularity, and hidden constraints everywhere. Planning cost is a nightmare.

Every time we have to add an instance. We have to read the instances page, pricing page, EBS page again. There are way too many choices, some of which being hard to change latter. That could be printed on papers and cover a4x7 feet table. By comparison it takes only 1 page both-sided to pick an appropriate instance from Google.

Optimizing usage is doomed to fail

The time taken to optimizing reserved instance is a similar cost to the savings done.

Between CPU count, memory size, EBS volume size, IOPS, P-IOPS. Everything is over-provisioned on AWS. Partly because there are too many things to follow and optimize for a human being, partly as workaround against the inconsistent capabilities, partly because it is hard to fix later for some instances live in production.

All these issues are directly related to the underlying AWS platform itself, being not neat and unable to scale horizontal cleanly, neither in hardware options, nor in hardware capabilities nor money-wise.

Every time we think about changing something to reduce costs, it is usually more expensive than NOT doing anything (when accounting for engineering time).

Conclusion

AWS has a lot of hidden costs and limitations. System capabilities are unsatisfying and cannot scale consistently. Choosing AWS was a mistake. GCE is always a better choice.

GCE is systematically 20% to 50% cheaper for the equivalent infrastructure, without having to do any thinking or optimization. Last but not least it is also faster, more reliable and easier to use day-to-day.

The future of our company

Unfortunately, our infrastructure on AWS is working and migrating is a serious undertaking.

I learned recently that we are a profitable company, more so than I thought. Looking at the top 10 companies by revenue per employee, we’d be in the top 10. We are stuck with AWS in the near future and the issues will have to be worked around with lots of money. The company is able to cover the expenses and cost optimisation ain’t a top priority at the moment.

There’s a saying “throwing money at a problem“. We shall say “throwing houses at the problem” from now on as it better represents the status quo.

If we get to keep growing at the current pace, we’ll have to scale vertically, and by that we mean “throwing buildings at Amazon” 😀