GCE vs AWS in 2016: Why you shouldn’t use Amazon


Foreword

This story relates my experience at a typical web startup. We are running hundreds of instances on AWS, and we’ve been doing so for some time, growing at a sustained pace.

Our full operation is in the cloud: webservers, databases, micro-services, git, wiki, BI tools, monitoring… That includes everything a typical tech company needs to operate.

We have a few switches and a router left in the office to provide internet access and that’s all, no servers on-site.

The following highlights many issues encountered day to day on AWS so that [hopefully] you don’t make the same mistakes we made by picking AWS.

What does the cloud provide?

There are a lot of clouds: GCE, AWS, Azure, Digital Ocean, RackSpace, SoftLayer, OVH, GoDaddy… Check out our article Choosing a Cloud Provider: AWS vs GCE vs SoftLayer vs DigitalOcean vs …

We’ll focus only on GCE and AWS in this article. They are the two major, fully featured, shared-infrastructure IaaS offerings.

They both provide everything needed in a typical datacenter.

Infrastructure and Hardware:

  • Get servers with various hardware specifications
  • In multiple datacenters across the planet
  • Remote and local storage
  • Networking (VPC, subnets, firewalls)
  • Start, stop, delete anything in a few clicks
  • Pay as you go

Additional Managed Services (optional):

  • SQL Database (RDS, Cloud SQL)
  • NoSQL Database (DynamoDB, Big Table)
  • CDN (CloudFront, Google CDN)
  • Load balancer (ELB, Google Load Balancer)
  • Long term storage (S3, Google Storage)

Things you must know about Amazon

GCE vs AWS pricing: Good vs Evil

Real costs on the AWS side:

  • Base instance plus storage cost
  • Add provisioned IOPS for databases (normal EBS IO are not reliable enough)
  • Add local SSD ($675 per 800 GB + 4 CPUs + 30 GB memory, ALWAYS all together)
  • Add 10% on top of everything for Premium Support (mandatory)
  • Add 10% for dedicated instances or dedicated hosts (if subject to regulations)

Real costs on the GCE side:

  • Base instance plus storage cost
  • Enjoy fast and dependable IOPS out-of-the-box on remote SSD volumes
  • Add local SSD ($82 per 375 GB, attachable to any existing instance)
  • Enjoy automatic discount for sustained usage (~30% for instances running 24/7)
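The local SSD gap in the lists above is worth putting in per-GB terms. This is a rough sketch using only the prices quoted in the bullets, not a pricing tool:

```python
# Rough per-GB comparison of local SSD, using the prices quoted above.
aws_local_ssd = 675 / 800    # $/GB/month, forcibly bundled with 4 CPUs + 30 GB RAM
gce_local_ssd = 82 / 375     # $/GB/month, attachable to any instance

print(f"AWS local SSD: ${aws_local_ssd:.3f}/GB/month")
print(f"GCE local SSD: ${gce_local_ssd:.3f}/GB/month")
print(f"AWS is {aws_local_ssd / gce_local_ssd:.1f}x more expensive per GB")
```

And that ratio ignores the fact that the AWS figure also forces you to buy the CPU and memory that come welded to it.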

AWS IO are expensive and inconsistent

EBS SSD volumes: IOPS, and P-IOPS

We are forced to pay for Provisioned-IOPS whenever we need dependable IO.

The P-IOPS are NOT really faster. They are slightly faster but, most importantly, they have a lower variance (i.e. 90%-99.9% latency). This is critical for some workloads (e.g. databases) because normal IOPS are too inconsistent.

Overall, P-IOPS can get very expensive and they are pathetic compared to what any drive can do nowadays ($720/month for 10k P-IOPS, in addition to $0.14 per GB).
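To make the bill concrete, here is a back-of-envelope calculation derived purely from the figures in this section ($720/month for 10k P-IOPS works out to $0.072 per provisioned IOPS per month; the volume size is an assumption for illustration):

```python
# Monthly cost of a Provisioned-IOPS volume, derived from the figures above.
piops = 10_000
size_gb = 1_000                      # assumed volume size, for illustration
monthly = piops * 0.072 + size_gb * 0.14
print(f"${monthly:,.0f}/month for {size_gb} GB at {piops} P-IOPS")
```

A single consumer SSD delivering far more IOPS costs a few hundred dollars, once.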

Local SSD storage

Local SSD storage is only available via the i2 instance family, which contains the most expensive instances on AWS (and across all clouds).

There is no granularity possible. CPU, memory and SSD storage all DOUBLE between the few i2.xxx instance types available. They grow in steps of 4 CPUs + 30 GB memory + 800 GB SSD, and each step adds $765/month.

These limitations make local SSD storage expensive to use and awkward to manage.

AWS Premium Support is mandatory

The premium support is +10% on top of the total AWS bill (i.e. EC2 instances + EBS volumes + S3 storage + traffic fees + everything).
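Note that the +10% for premium support and the +10% for dedicated tenancy mentioned in the cost list compound rather than simply add, because support is billed on the already-raised total. A quick sketch:

```python
# The two 10% surcharges compound rather than add up:
base = 1.00
with_dedicated = base * 1.10           # +10% for dedicated tenancy
with_support = with_dedicated * 1.10   # +10% premium support, on the whole bill
print(f"effective surcharge: +{(with_support - base) * 100:.0f}%")
```

So a regulated shop on premium support pays 21% over list price before consuming a single byte.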

Handling spikes in traffic

ELBs cannot handle sudden spikes in traffic. They need to be scaled manually by support beforehand.

An unplanned event is a guaranteed 5 minutes of unreachable site with 503 errors.

Handling limits

All resources are artificially limited by hardcoded quotas, which are very low by default. Limits can only be increased manually, one by one, by sending a ticket to support.

I cannot fully express the frustration of trying to spawn two c4.large instances (we already had 15) only to fail because of “limit exhaustion: 15 c4.large in eu-central region”. Message support and wait through a day of back-and-forth email. Then try again and fail again because of “limit exhaustion: 5 TB of EBS GP2 in eu-central region”.

This circus goes on every few weeks, sometimes hitting 3 limits in a row. There are limits for all resources: by region, by availability zone, by resource type and by resource-specific criteria.

Paying guarantees a 24h SLA for a reply to a limit ticket. Free-tier users might have to wait a week (maybe more), unable to work in the meantime. It is an absurd yet very real reason to pay for premium support.

Handling failures on the AWS side

There is NO log and NO indication of what’s going on in the infrastructure. Support has to be contacted whenever something goes wrong.

For example, an ELB started dropping requests erratically. After contacting support, they acknowledged having no idea what was going on and took action: “Thank you for your request. One of the ELB was acting weird, we stopped it and replaced it with a new one”.

The issue was fixed. Sadly, they don’t provide any insight or meaningful information. This is a strong pain point for debugging and planning future failures.

Note: We are blocking further managed services from being introduced in our stack. At first they were tried because they were easy to set up (read: limited human time and a bit of curiosity). They soon proved to cause periodic issues while being impossible to debug and troubleshoot.

ELBs are unsuitable for many workloads

[updated paragraph after comments on HN]

ELBs are only accessible via a hostname. The underlying IPs have a TTL of 60s and can change at any minute.

This makes ELBs unsuitable for any service requiring a fixed IP and any service resolving the IP only once at startup.
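The "resolve once at startup" failure mode can be sketched in a few lines. The hostname below is a stand-in, not a real ELB:

```python
# Why caching an ELB's IP at startup breaks: the DNS records only carry
# a ~60s TTL, so a long-lived client must keep re-resolving the hostname.
import socket

def current_ips(hostname: str) -> set:
    """Return the current set of IPv4 addresses behind a hostname."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, 443, socket.AF_INET)}

# "localhost" stands in for an ELB name like my-elb-1234.eu-central-1.elb.amazonaws.com.
# Call this per connection (or honour the TTL), never once at process start.
print(current_ips("localhost"))
```

A client that pins the first answer will silently start talking to a dead IP the next time Amazon rotates the balancer.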

ELBs are impossible to debug when they fail (they do fail), they can’t handle sudden spikes, and the CloudWatch graphs are terrible. (Truth be told, we are paying Datadog $18/month per node to entirely replace CloudWatch.)

Load balancing is a core aspect of high-availability and scalable design. Redundant load balancing is the next one. ELBs are not up to the task.

The alternative to ELB is to deploy our own HAProxy pairs with VRRP/keepalived. It takes multiple weeks to set up properly and deploy in production.

By comparison, we can achieve that with Google load balancers in a few hours. A Google load balancer can have a single fixed IP. That IP can go from 1k to 10k requests per second instantly without losing traffic. It just works.

Note: Today, we’ve seen one service in production go from 500 requests/s to 15,000 requests/s in less than 3 seconds. We don’t trust an ELB to be in the middle of that.

Dedicated Instances

Dedicated instances are Amazon EC2 instances that run in a virtual private cloud (VPC) on hardware that’s dedicated to a single customer. Your Dedicated instances are physically isolated at the host hardware level from your instances that aren’t Dedicated instances and from instances that belong to other AWS accounts.

Dedicated instances/hosts may be mandatory for some services because of legal compliance, regulatory requirements and not-having-neighbours.

We have to comply with a few regulations, so we have a few dedicated options here and there. It’s 10% on top of the instance price (plus a $1500 fixed monthly fee per region).

Note: Amazon doesn’t explain in great detail what “dedicated” entails and doesn’t commit to anything clear. Strangely, no regulator has pointed that out so far.

Answer to HN comments: Google doesn’t provide “GCE dedicated instances”. There is no need for them. The trick is that regulators and engineers don’t complain about lacking something which doesn’t exist; they just live without it and our operations get simpler.

Reserved Instances are bullshit

A reservation is attached to a specific region, availability zone, instance type, tenancy, and more. In theory the reservation can be edited; in practice that depends on what you want to change. Some combinations of parameters are editable, most are not.

Plan carefully and get it right on the first try; there is no room for error. Every hour of a reservation will be paid for over the year, whether the instance is running or not.

For the most common instance types, it takes 8-10 months to break even on a yearly reservation. Think of it as a gambling game in a casino: a right reservation is -20% and a wrong reservation is +80% on the bill. You have to be right MORE than 4 times out of 5 to save any money.

Keep in mind that reserved instances will NOT benefit from the regular price drops happening every 6-12 months. If there is a price drop early on, you’re automatically losing money.

Critical Safety Notice: a 3-year reservation is the most dramatic way to lose money on AWS. We’re talking a potential 5-digit loss here, per click. Do not go this route. Do not let your co-workers go this route without a warning.

What GCE does by comparison is a PURELY AWESOME MONTHLY AUTOMATIC DISCOUNT. Instance hours are counted at the end of every month and the discount is applied automatically (e.g. 30% for instances running 24/7). The algorithm also accounts for instances that are started, stopped and renewed, in a way that is STRONGLY in your favour.
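The break-even arithmetic of the reservation gamble can be checked directly. This sketch uses only the -20%/+80% figures quoted in this section:

```python
# The reservation "gamble" as an expected value:
# a right reservation is -20% on the bill, a wrong one is +80%.
def expected_bill_change(p_right: float, save: float = 0.20, penalty: float = 0.80) -> float:
    """Expected fractional change of the bill, given the odds of guessing right."""
    return -save * p_right + penalty * (1 - p_right)

for p in (0.5, 0.8, 0.9):
    print(f"p(right) = {p}: {expected_bill_change(p):+.0%}")
```

The expected value crosses zero exactly at p = 0.8, which is why you must be right more than 4 times out of 5 before reservations save anything.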

Reserving capacity does not belong to the age of Cloud, it belongs to the age of data centers.

AWS Networking is sub-par

Network bandwidth allowance is correlated with the instance size.

The 1-2 core instances peak around 100-200 Mbps. This is very little in an ever more connected world where so many things rely on the network.

Typical things experiencing slow down because of the rate limited networking:

  • Instance provisioning, OS install and upgrade
  • Docker/Vagrant image deployment
  • sync/sftp/ftp file copying
  • Backups and snapshots
  • Load balancers and gateways
  • General disk read/writes (EBS is network storage)

Our most important backup takes 97 seconds to copy from the production host to another site location. Half the time is spent saturating the network bandwidth (130 Mbps cap), the other half saturating the EBS volume on the receiving host (the file is buffered in memory during the initial transfer, then it’s 100% iowait against the EBS bandwidth cap).

The same backup operation would only take 10-20 seconds on GCE with the same hardware.
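These timings are easy to sanity-check. The 130 Mbps cap is from the text; the backup size is an assumption picked only to illustrate the arithmetic:

```python
# Back-of-envelope transfer times for a backup file at two bandwidth caps.
backup_gb = 1.5   # assumed backup size, for illustration
for label, mbps in (("AWS, 130 Mbps cap", 130), ("uncapped gigabit NIC", 1000)):
    seconds = backup_gb * 8 * 1000 / mbps   # GB -> megabits -> seconds
    print(f"{label}: ~{seconds:.0f}s")
```

At a 130 Mbps cap the wire alone costs a minute and a half; on a gigabit-class link the same bytes move in seconds, which matches the GCE comparison above.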

Cost Comparison

This post wouldn’t be complete without an instance-to-instance price comparison.

In fact, it is so important that it was split into a dedicated article: Google Cloud is 50% cheaper than AWS.

Hidden fees everywhere + unreliable capabilities = human time wasted in workarounds

Capacity planning and day to day operations

Capacity planning is unnecessarily hard with the non-scalable resources, unreliable performance, insufficient granularity, and hidden constraints everywhere. Cost planning is a nightmare.

Every time we have to add an instance, we have to read the instances page, the pricing page and the EBS page again. There are way too many choices, some of which are hard to change later. Printed on paper, that would cover a 4x7 feet table. By comparison, it takes only one double-sided page to pick an appropriate instance from Google.

Optimizing usage is doomed to fail

The time spent optimizing reserved instances costs about as much as the savings achieved.

Between CPU count, memory size, EBS volume size, IOPS and P-IOPS, everything is over-provisioned on AWS. Partly because there are too many things for a human to follow and optimize, partly as a workaround for the inconsistent capabilities, and partly because some things are hard to fix later once instances are live in production.

All these issues are directly related to the underlying AWS platform itself: it is not clean and cannot scale horizontally cleanly, neither in hardware options, nor in hardware capabilities, nor money-wise.

Every time we think about changing something to reduce costs, it is usually more expensive than NOT doing anything (when accounting for engineering time).

Conclusion

AWS has a lot of hidden costs and limitations. System capabilities are unsatisfying and cannot scale consistently. Choosing AWS was a mistake. GCE is always a better choice.

GCE is systematically 20% to 50% cheaper for the equivalent infrastructure, without any thinking or optimization required. Last but not least, it is also faster, more reliable and easier to use day-to-day.

The future of our company

Unfortunately, our infrastructure on AWS is working and migrating is a serious undertaking.

I learned recently that we are a profitable company, more so than I thought. Measured by revenue per employee, we’d rank among the top 10 companies. We are stuck with AWS in the near future and the issues will have to be worked around with lots of money. The company is able to cover the expenses and cost optimisation ain’t a top priority at the moment.

There’s a saying: “throwing money at a problem”. We shall say “throwing houses at the problem” from now on, as it better represents the status quo.

If we get to keep growing at the current pace, we’ll have to scale vertically, and by that we mean “throwing buildings at Amazon” 😀

The official AWS answer to all their issues: “Get bigger instances”

29 thoughts on “GCE vs AWS in 2016: Why you shouldn’t use Amazon”

  1. This article is amateurish, really inaccurate and a bit immature.
    AWS has its cons, but it also has a lot of advantages over other cloud platforms. The article does not mention that and many other important facts. Very, very biased.

    Like

    • Feel free to provide more details. We are always happy to get feedback on the article and we’d love to hear about your experience on these platforms.

      (Note that the comparison only stands against GCE, which provides the exact same services and features as AWS.)

      Like

      • If you are only saving 20% on reservations you are doing something very very wrong. (like doing a no upfront, 1 year term). We save 62+% on our reservations. We have standardized on instance classes (C4 and M4 for 95% of our workload) and make extensive use of spot instances where we can (which saves us upwards of 90% in many cases). We use orchestration solutions so we don’t need to manage individual hosts and simply deploy a fleet of m4.4xl instances. Your argument that a 3 year reservation is a terrible idea is incredibly myopic. Annual price reductions tend to be around 5% which means you are sacrificing 62% every year and hoping to make up for it with a 5% reduction … does that make any sense?

        In addition- AWS now offers convertible and regional reservations if you are worried about switching instance types or don’t want to manage reservations on a per AZ basis.

        Furthermore, your argument that the lack of dedicated hardware on GCE means no one cares would be laughed out of the room at the companies I work for. “They don’t offer it” means it doesn’t get used, not “we don’t have to worry about it”.

        There are plenty of other problems with GCE- e.g. the Google VPN BGP solution is kludgy and GCE doesn’t offer nearly as many services as AWS does.

        Now having said all that- they both have their pros and cons and companies need to go into the selection process with their eyes open. You need to evaluate your workload against each cloud’s offerings and choose the one that makes the most sense for you.

        Like

        • Bullshit! There’s no reservation that saves 62% on any C4 or M4, even if you reserve 3 years all upfront.

          If you have a perfect orchestration solution and you can move things around however you like, good for you. That doesn’t change the fact that reserving is a huge risk for the majority of people out there, especially 3 years ahead.

          I won’t comment on convertible reservations; I haven’t seen them yet. It’s just one more obstacle to understanding reservations and a higher bar to entry. AWS should lower the complexity of their offering, not increase it.

          Like

      • I can’t reply to comment 133 because of a CSS rendering issue (it blocks the reply button for comments that are more than 2 deep!), so I’m replying here.

        As a heads up, fully-upfront standard 3 year reservations do actually give you ~60% discount off of the on-demand price. Specifically, 61% for C4 and 63% for M4 instances. When you get to the point where you both have a predictable base load AND have a few hundred to few thousand instances that make up your base load, reservations make a TON of sense.

        I agree that the GCP pricing model is less confusing for people. As a product, GCP has gotten considerably better and more competitive over the past few years (BigQuery is the shit) and I would use GCP for anything new. A few years back, that was DEFINITELY not the case.

        As far as the future goes, I really think that serverless (a la GCF/Lambda) is going to be huge.

        Like

  2. Hi, there are plenty more instance types on AWS offering local SSD storage. We run a bunch of workloads on them too. This is a good site for checking those out: http://www.ec2instances.info/ . There’s the M3, C3, C1, … families which offer local SSD storage. So the granularity is not that bad, although you make some good points about GCE vs AWS.

    Like

    • These are past-generation instances: no HVM virtualisation, no EBS-optimized storage, older hardware, older software. It’s terrible to be forced to use them 😦

      There is no more local SSD on the c4 family. The r3 and m4 still have only 32 GB of SSD (growing with the instance size), which we consider a joke given their limitations and our usage.

      When the local drive is unused, it is mounted automatically to special folders: /swap /tmp (and a few others).

      Like

  3. Could you expand on what you mean by this:

    >ELB and S3 cannot handle sudden spikes in traffic. They need to be scaled manually by support beforehand.
    >An unplanned event is a guaranteed 5 minutes of unreachable site with 503 errors.

    I’ve never heard of needing to scale up S3 or having availability issues?

    Like

  4. It’s rather uncommon to reach the S3 limits and not that well known.
    From what we remember, the way to hit the hard limit is to make a fresh new bucket and start pushing data into it at a fast and sustained pace. The bucket storage is meant to scale slowly over time and it cannot do so in these circumstances, thus becoming unavailable.

    There was a discussion on HN where some experienced AWS users explained a few edge cases that hit S3 limits. Can’t seem to find it again.
    It’s somewhat tricky to trigger though; most users should be fine with S3 (not so much with ELB).

    Like

    • S3, under the hood, is just really efficient sharding, and it works by using the beginning of the filename, or ‘key’, of the object to decide which shard to place the object in.
      For example, if you have 2 files in S3 named ‘test/some/file.png’ and ‘test/my/file.png’, these two files will hit the same shard, so you’ll be throttled at roughly 300 rps (I can’t remember exactly). If you were to name them by their md5 hashes instead of using S3 as a folder structure, they’d hopefully have different starting hashes, so they would be on different shards, doubling the rps.
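The key-prefix trick this comment describes can be sketched in a few lines. A sketch only: the 4-character prefix length is an arbitrary choice, not an S3 requirement:

```python
# Prepend a short hash so that otherwise similar keys start with different
# characters and (per the comment above) land on different partitions.
import hashlib

def spread_key(key: str) -> str:
    """Return the key prefixed with the first 4 hex chars of its md5 hash."""
    prefix = hashlib.md5(key.encode()).hexdigest()[:4]
    return f"{prefix}/{key}"

print(spread_key("test/some/file.png"))
print(spread_key("test/my/file.png"))
```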

      Like

  5. Hey, I’m not an AWS fanboy or anything, but we do use it heavily. You are either misinformed or purposefully writing a hit piece. This article is 25% totally wrong, 50% somewhat true but exaggerated, and 25% true, in my opinion.

    Like

    • And your comment is 100% useless, Antonio, if you have no real details to offer. This article is full of facts that show exactly how the pricing and performance differ, and they can be confirmed by several other reports.

      Liked by 1 person

  6. “Every time we think about changing something to reduce costs, it is usually more expensive than NOT doing anything (when accounting for engineering time).”

    Couldn’t some of this be due to your own system architecture and not entirely due to AWS?

    Like

    • It’s 80% because of simple maths, 20% because of hardware limitations and AWS complexity.

      You need time to understand the documentation, plus time to analyse what to optimize (e.g. reservations), plus potential downtime (most things can’t be changed on the fly), plus extra time for specifics (e.g. I’ve seen volume snapshots take 10 hours).

      It’s basic maths. The usual optimization may take some days to perform but it’s only gonna save $100… that’s not worth it. Not doing it.

      Bonus: We’re handling trading systems. Generally speaking, for any company that’s profitable and doing well, it’s risky to change anything in production that is currently working well. If you factor in the risks, cost savings are [almost] never a good justification for changes.

      Like

  7. Thanks, I completely agree with your article. Same story here: AWS IOPS, network, logging, ELB and instance consistency all suck. We invested heavily in AWS three years ago and evaluated GCE too late to convince our management to migrate. I wouldn’t complain about AWS support too much; it works for us in most cases. It’s probably just that GCE support felt immature when we tried it.

    Like

    • Too early to tell, no experience with it yet. So far, Azure competes neither on price nor on performance. I personally have no incentive to try it.

      Don’t know about the features. I heard they’ve got nice integrations with Microsoft tools lately, and the easiest managed SQL with multi-master support.

      Like

      • Thanks for the response. I’ve personally been using Azure for nearly a year and they are constantly making updates, which is great: plugging holes and creating some more :-). You are right on the portal point, it can be quite difficult to work out how things are done and what the relevant options actually do and their impact. It does take a while to find the right route/option if you are starting out.

        What would be nice to see is an actual comparison between the 3 main players in terms of cost and performance. We mainly use Azure for Web Apps at the moment but are looking at other parts as well. The main barrier to fully adopting it at work is that cost is always mentioned (cloud is very expensive compared to in-house), and then security concerns. It would be nice to see which ones come out on top.

        Like

        • Agreed. Azure (and Google) are releasing and improving very quickly. It’s challenging to come up with comparisons that won’t be outdated in a couple of months (let alone years).

          Between us, I had more cost and performance benchmarks coming… but I just got poached by a major financial institution over Christmas. It’s gonna delay my plan till I can get a $1M budget approved for cloud testing.

          Like

  8. > They are the two majors, fully featured, shared infrastructure, IaaS offerings.

    Erm, Azure kind of is too. Also, Google’s market share is most likely only a small fraction of Azure’s.

    Like
