Cracking the HackerRank Test: 100% score made easy


Foreword

It’s well known that most programmers wannabes can’t code their way out of a paper bag. As a consequence, the tech industry is pushing for longer, harder and evermore extreme screening.

The whiteboard interview has been the standard for a while, followed by puzzles [now abandoned], then FizzBuzz.

The latest fad is HackerRank. It’s introducing automated programming tests to be done by the candidate before he’s allowed to talk to anyone in the company.

A lot of very good companies are using HackerRank as a pre-screening tool. If we can’t avoid it, we gotta embrace it.

What to find in a HackerRank test?

There are 3 types of questions to be encountered in a test:

  • Multiple Choice Questions: “What is the time complexity to find an element in a red and black tree?” -A- -B- -C- -D-
  • Coding Exercise: “<Long description of a problem to be solved>, <input data format>, <output data format>.” Start coding a solution.
  • SudoRank Exercise: “Your ssh credentials are tester:QWERTUIOP@123.98.45.76 <long description of what’s wrong with that server>.” SSH to the server and start fixing.

Any amount of any question can be put together, in any order, to make a complete test. A company should give some indications on what to expect in its test.

HackerRank provides a library of hundreds of questions and exercises ready to use. It’s also possible for a company to write their own (and recommended).

Defeating Multiple Choice Question

The majority of the multiple choice questions can be solved by an appropriate Google search. Usually on the title, sometimes on a few select words from the text.

hr question dropping privileges
Select Text => Right Click => Quick Search
hr google dropping privileges
Google has spoken! => all in favour of setuid()

Defeating Coding Exercises

Searching for a 10 lines long paragraph in Google is not an acceptable option. Not to mention that the HackerRank website disables copy/paste in the description area.

The workaround is to search for the title of the exercise. A title uniquely identifies a question on HackerRank. It will be mentioned in related solutions and blog posts. Perfect for being indexed by Google.

hr question lonely integer
Select Text => Right Click => Quick Search
hr google lonely integer.png
Touché!

The first result is the question, the second result is the solution. Well, that was easy.

Bonus: That google solution is actually wrong… yet it gives all the points.

// [boilerplate omitted]
int main() {
    int N;
    cin >> N;
    int tmp, result = 0;
    for (int i = 0; i < N; i++) {         cin >> tmp;
        result ^= tmp;
    }
    cout << result;
    return 0;
}

This solution is only correct if duplicated numbers are in pairs. All the HackerRank unit tests happen to fit this criteria by pure coincidence [cf. Addendum].

Originally, we put this simple question at the beginning of a test for warm-up. We received that answer from a candidate in our first batch of applicants. It was quite puzzling, what are the odds that someone would come up with an algorithm that convoluted if given only the text from the question? A quick investigation quickly revealed the source.


Addendum:

The “Lonely Integer” question is worded slightly differently in the public HackerRank site and the private HackerRank library but the input, output and unit tests are the same. Hence why the solution is off but works. HackerRank is obviously copying questions from the community into the professional library. That’s another copy-cat busted!


Recruiter Insights: Cheating brought to the next level

We have a lot of candidates coming from recruiters. How are they comparing to candidates from other sources?

Let’s see the statistics on a hard question [i.e. dynamic programming trading algorithm].

hr insights stock maximize distribution
Distribution over all attempts, by all companies (log scale). 1234 zero vs 303 full score.

Most candidates get 0 points: ran out of time, unable to answer, wrong algorithm, or incomplete/partial solutions (i.e. good start but not enough to pass any unit test yet).

We wanted to show the same distribution over our pool of candidates but HackerRank doesn’t provide that graph anymore. It used to.

Anyway, we remember approximate numbers. The distribution for our candidates is about 50/50% on each extreme. That’s significantly better than the 75/17% from the general population. We can correlate that number with the time spent on the question and a visual code review.

The result is that candidates coming from recruiters perform better, especially on hard exercises. In fact it is unbelievable how much better they perform! (Special kudos to the guys who are able to solve a problem -with a perfectly optimized solution- in less time than it would take to actually read it =D).

The conclusion is plain and simple: Our recruiters give away the test to the candidates.

everybody does it, it's just that nobody talkes about it
Do recruiters give away your questions? Of course!

Lesson learnt:

  • For candidates: Remember to ask the recruiter for support before the test.
  • For recruiters: Remember to coach the candidate for the test and instruct him to write down changes (if any).
  • For companies: Beware high-score candidates coming from recruiters! In particular, don’t calibrate scoring based on extremes scores from a few cheaters.

Challenge: How long does it take you to solve a trading challenge? [dynamic programming, medium difficulty].

Custom HackerRank Tests

Companies can write custom exercises and they should. It’s hard and it requires particular skills but it is definitely worthwhile.

It is the only effective solution against Google, if done carefully. (It’s actually surprisingly  difficult to make exercises that are both simple AND not easily found with Google on 1000 tutorials and coding forums).

Sadly, it won’t help against recruiters. (Excluding the first batch of candidates who will be sacrificed as scouts).

Conclusion

Did we just ruin HackerRank pre-screening? Of course not! There is a never-ending supply of bozos unable to tell the difference between Internet and Internet Explorer.

We could write a book teaching the answers to 90% of programming interviews problems, yet 99% of job seekers would never read it. Hell, it’s been written for a while and it had no impact whatsoever.

Only the handful of devs following blogs/news or searching for “What is HackerRank?” will be able to come better prepared.

If anything, this article makes HackerRank better and more relevant. Now a test is about looking for help on Google and fixing subtly broken snippets of unindented code written in the wrong language.

HackerRank is finally screening for capabilities relevant to the job!

Advertisements

GCE vs AWS in 2016: Why you shouldn’t use Amazon


Foreword

This story relates my experience at a typical web startup. We are running hundreds of instances on AWS, and we’ve been doing so for some time, growing at a sustained pace.

Our full operation is in the cloud: webservers, databases, micro-services, git, wiki, BI tools, monitoring… That includes everything a typical tech company needs to operate.

We have a few switches and a router left in the office to provide internet access and that’s all, no servers on-site.

The following highlights many issues encountered day to day on AWS so that [hopefully] you don’t do the same mistakes we’ve done by picking AWS.

What does the cloud provide?

There are a lot of clouds: GCE, AWS, Azure, Digital Ocean, RackSpace, SoftLayer, OVH, GoDaddy… Check out our article Choosing a Cloud Provider: AWS vs GCE vs SoftLayer vs DigitalOcean vs …

We’ll focus only on GCE and AWS in this article. They are the two majors, fully featured, shared infrastructure, IaaS offerings.

They both provide everything needed in a typical datacenter.

Infrastructure and Hardware:

  • Get servers with various hardware specifications
  • In multiple datacenters across the planet
  • Remote and local storage
  • Networking (VPC, subnets, firewalls)
  • Start, stop, delete anything in a few clicks
  • Pay as you go

Additional Managed Services (optional):

  • SQL Database (RDS, Cloud SQL)
  • NoSQL Database (DynamoDB, Big Table)
  • CDN (CloudFront, Google CDN)
  • Load balancer (ELB, Google Load Balancer)
  • Long term storage (S3, Google Storage)

Things you must know about Amazon

GCE vs AWS pricing: Good vs Evil

Real costs on the AWS side:

  • Base instance plus storage cost
  • Add provisioned IOPS for databases (normal EBS IO are not reliable enough)
  • Add local SSD (675$ per 800 GB + 4 CPU + 30 GB. ALWAYS ALL together)
  • Add 10% on top of everything for Premium Support (mandatory)
  • Add 10% for dedicated instances or dedicated hosts (if subject to regulations)

Real costs on the GCE side:

  • Base instance plus storage cost
  • Enjoy fast and dependable IOPS out-of-the-box on remote SSD volumes
  • Add local SSD (82$ per 375 GB, attachable to any existing instance)
  • Enjoy automatic discount for sustained usage (~30% for instances running 24/7)

AWS IO are expensive and inconsistent

EBS SSD volumes: IOPS, and P-IOPS

We are forced to pay for Provisioned-IOPS whenever we need dependable IO.

The P-IOPS are NOT really faster. They are slightly faster but most importantly they have a lower variance (i.e. 90%-99.9% latency). This is critical for some workload (e.g. databases) because normal IOPS are too inconsistent.

Overall, P-IOPS can get very expensive and they are pathetic compared to what any drive can do nowadays (720$/month for 10k P-IOPS, in addition to $0.14 per GB).

Local SSD storage

Local SSD storage is only available via the i2 instances family which are the most expensive instances on AWS (and over all clouds).

There is no granularity possible. CPU, memory and SSD storage amount all DOUBLE between the few i2.xxx instance types available. They grow in powers of 4CPU + 30GB memory + 800 GB SSD and the multiplier is $765/month.

These limitations make local SSD storage expensive to use and special to manage.

AWS Premium Support is mandatory

The premium support is +10% on top of the total AWS bill (i.e. EC2 instances + EBS volumes + S3 storage + traffic fees + everything).

Handling spikes in traffic

ELB cannot handle sudden spikes in traffic. They need to be scaled manually by support beforehand.

An unplanned event is a guaranteed 5 minutes of unreachable site with 503 errors.

Handling limits

All resources are artificially limited by a hardcoded quota, which is very low by default. Limits can only be increased manually, one by one, by sending a ticket to the support.

I cannot fully express the frustration when trying to spawn two c4.large instances (we already got 15) only to fail because “limit exhaustion: 15 c4.large in eu-central region“. Message support and wait for one day of back and forth email. Then try again and fail again because “limit exhaustion: 5TB of EBS GP2 in eu-central region“.

This circus goes on every few weeks, sometimes hitting 3 limits in a row. There are limits for all resources, by region, by availability zone, by resource types and by resource specifics criteria.

Paying guarantees a 24h SLA to get a reply to a limit ticket. The free tiers might have to wait for a week (maybe more), being unable to work in the meantime. It is an absurd yet very real reason pushing for premium support.

Handling failures on the AWS side

There is NO log and NO indication of what’s going on in the infrastructure. The support is required whenever something wrong happens.

For example. An ELB started dropping requests erratically. After contacting support, they acknowledged to have no idea what’s going on and took action “Thank you for your request. One of the ELB was acting weird, we stopped it and replaced it with a new one“.

The issue was fixed. Sadly, they don’t provide any insight or meaningful information. This is a strong pain point for debugging and planning future failures.

Note: We are barraging further managed service from being introduced in our stack. At first they were tried because they were easy to setup (read: limited human time and a bit of curiosity). They soon proved to be causing periodic issues while being impossible to debug and troubleshoot.

ELB are unsuitable to many workloads

[updated paragraph after comments on HN]

ELB are only accessible with a hostname. The underlying IPs have a TTL of 60s and can change at any minute.

This makes ELB unsuitable for all services requiring a fixed IP and all services resolving the IP only once at startup.

ELB are impossible to debug when they fail (they do fail), they can’t handle sudden spike and the CloudWatch graphs are terrible. (Truth be told. We are paying Datadog $18/month per node to entirely replace CloudWatch).

Load balancing is a core aspect of high-availability and scalable design. Redundant load balancing is the next one. ELB are not up to the task.

The alternative to ELB is to deploy our own HAProxy in pairs with VRRP/keepalived. It takes multiple weeks to setup properly and deploy in production.

By comparison, we can achieve that with google load balancers in a few hours. A Google load balancer can have a single fixed IP. That IP can go from 1k/s to 10k/s requests instantly without loosing traffic. It just works.

Note: Today, we’ve seen one service in production go from 500 requests/s to 15000 requests/s in less than 3 seconds. We don’t trust an ELB to be in the middle of that

Dedicated Instances

Dedicated instances are Amazon EC2 instances that run in a virtual private cloud (VPC) on hardware that’s dedicated to a single customer. Your Dedicated instances are physically isolated at the host hardware level from your instances that aren’t Dedicated instances and from instances that belong to other AWS accounts.

Dedicated instances/hosts may be mandatory for some services because of legal compliance, regulatory requirements and not-having-neighbours.

We have to comply to a few regulations so we have a few dedicated options here and there. It’s 10% on top of the instance price (plus a $1500 fixed monthly fee per region).

Note: Amazon doesn’t explain in great details what dedicated entails and doesn’t commit to anything clear. Strangely, no regulators pointed that out so far.

Answer to HN comments: Google doesn’t provide “GCE dedicated instances”. There is no need for it. The trick is that regulators and engineers don’t complain about not having something which is non-existent, they just live without it and our operations get simpler.

Reserved Instances are bullshit

A reservation is attached to a specific region, an availability zone, an instance type, a tenancy, and more. In theory the reservation can be edited, in practice that depends on what to change. Some combinations of parameters are editable, most are not.

Plan carefully and get it right on the first try, there is no room for errors. Every hour of a reservation will be paid along the year, no matter whether the instance is running or not.

For the most common instance types, it takes 8-10 months to break even on a yearly reservation. Think of it as gambling game in a casino. A right reservation is -20% and a wrong reservation is +80% on the bill. You have to be right MORE than 4/5 times to save any money.

Keep in mind that the reserved instances will NOT benefit from the regular price drop happening every 6-12 months. If there is a price drop early on, you’re automatically loosing money.

Critical Safety Notice: 3 years reservation is the most dramatic way to loose money on AWS. We’re talking potential 5 digits loss here, per click. Do not go this route. Do not let your co-workers go this route without a warning. 

What GCE does by comparison is a PURELY AWESOME MONTHLY AUTOMATIC DISCOUNT. Instances hours are counted at the end of every month and discount is applied automatically (e.g. 30% for instances running 24/7). The algorithm also accounts for multiple started/stopped/renewed instances, in a way that is STRONGLY in your favour.

Reserving capacity does not belong to the age of Cloud, it belongs to the age of data centers.

AWS Networking is sub-par

Network bandwidth allowance is correlated with the instance size.

The 1-2 cores instances peak around 100-200 Mbps. This is very little in a world more and more connected where so many things rely on the network.

Typical things experiencing slow down because of the rate limited networking:

  • Instance provisioning, OS install and upgrade
  • Docker/Vagrant image deployment
  • sync/sftp/ftp file copying
  • Backups and snapshots
  • Load balancers and gateways
  • General disk read/writes (EBS is network storage)

Our most important backup takes 97 seconds to be copied from the production host to another site location. Half time is saturating the network bandwidth (130 Mbps bandwidth cap), half time is saturating the EBS volume on the receiving host (file is buffered in memory during initial transfer then 100% iowait, EBS bandwidth cap).

The same backup operation would only take 10-20 seconds on GCE with the same hardware.

Cost Comparison

This post wouldn’t be complete without an instance to instance price comparison.

In fact, it is so important that it was split to dedicated article: Google Cloud is 50% cheaper than AWS.

Hidden fees everywhere + unreliable capabilities = human time wasted in workarounds

Capacity planning and day to day operations

Capacity planning is unnecessary hard with the not-scalable resources, unreliable performances capabilities, insufficient granularity, and hidden constraints everywhere. Planning cost is a nightmare.

Every time we have to add an instance. We have to read the instances page, pricing page, EBS page again. There are way too many choices, some of which being hard to change latter. That could be printed on papers and cover a4x7 feet table. By comparison it takes only 1 page both-sided to pick an appropriate instance from Google.

Optimizing usage is doomed to fail

The time taken to optimizing reserved instance is a similar cost to the savings done.

Between CPU count, memory size, EBS volume size, IOPS, P-IOPS. Everything is over-provisioned on AWS. Partly because there are too many things to follow and optimize for a human being, partly as workaround against the inconsistent capabilities, partly because it is hard to fix later for some instances live in production.

All these issues are directly related to the underlying AWS platform itself, being not neat and unable to scale horizontal cleanly, neither in hardware options, nor in hardware capabilities nor money-wise.

Every time we think about changing something to reduce costs, it is usually more expensive than NOT doing anything (when accounting for engineering time).

Conclusion

AWS has a lot of hidden costs and limitations. System capabilities are unsatisfying and cannot scale consistently. Choosing AWS was a mistake. GCE is always a better choice.

GCE is systematically 20% to 50% cheaper for the equivalent infrastructure, without having to do any thinking or optimization. Last but not least it is also faster, more reliable and easier to use day-to-day.

The future of our company

Unfortunately, our infrastructure on AWS is working and migrating is a serious undertaking.

I learned recently that we are a profitable company, more so than I thought. Looking at the top 10 companies by revenue per employee, we’d be in the top 10. We are stuck with AWS in the near future and the issues will have to be worked around with lots of money. The company is able to cover the expenses and cost optimisation ain’t a top priority at the moment.

There’s a saying “throwing money at a problem“. We shall say “throwing houses at the problem” from now on as it better represents the status quo.

If we get to keep growing at the current pace, we’ll have to scale vertically, and by that we mean “throwing buildings at Amazon” 😀

burning money
The official AWS answer to all their issues: “Get bigger instances”

What’s The Best Cloud Provider in 2016? AWS vs Digital Ocean vs Google Cloud vs OVH


No worries, it’s a lot simpler than it seems. Each cloud provider is oriented toward a different type of customer and usage.

We grouped cloud providers by type. We’ll explain what is the purpose of each type? How do they differ? Which one is the most appropriate per use case? and Which cloud provider is the best in its respective category?

General Purpose Clouds

Competitors: Amazon AWS, Google Compute Engine, Microsoft Azure

Quick test: A general purpose cloud is the best fit if you answer yes to any of the following questions.

  • Do you run more than 50 virtual machines?
  • Do you spend more than 1000 dollars/month on hosting?
  • Does your infrastructure span multiple datacenters?

When to use: A general purpose cloud is meant to run anything and everything. It can replace a full rack of servers, as much as it can replace an ENTIRE datacenter. It provides the usual infrastructure plus some advanced bits that would be very hard to come by otherwise.

It is the go-to solution for running many heterogeneous applications requiring a variety of hardware. It’s versatility makes it ideal to run an entire operation in the cloud. It’s a perfect fit for an entire tech company, or a [big] tech project.

General purpose clouds make complex infrastructure available at the tip of your fingers:

  • Get servers of various sizes and types of hardware
  • Design your own networking and firewalls (same as in a real datacenter)
  • Group and isolate instances from each other and from the internet
  • Easily go multi-sites, worldwide
  • Order, change or redesign ANYTHING in 60 seconds (while staying put on your chair)

A general purpose cloud is a full ecosystem. It includes equivalents to all the services typically found (and required) in datacenters/enterprise environments:

  • SAN disks (EBS, Google Disks)
  • Scalable Storage and backups (S3, Google Storage, Snapshots)
  • Hardware load balancers (ELB, Google Load Balancer)

Which provider to use: Google cloud is vastly superior to its competitors. It’s cheaper, faster and easier to manage. If you go cloud, go Google Cloud.

AWS is twice the price to run the same infrastructure, in addition to being slower and having fewer capabilities. That being said, noone gets fired for buying AWS. Do you run on big company’s money and prefer to play safe with management? or do you run on limited funds and prefer to go with the better platform? That’s the current state of AWS vs Google Cloud.

Note: We have limited experience with Microsoft Azure and will not comment on it.

Cheap Clouds

Competitors: Digital Ocean, Linode

Quick test: A cheap cloud is the best fit if you answer yes to any of the following questions.

  • Do you run less than 5 virtual machines?
  • Do you spend less than 100 dollars/month on hosting?
  • Are you in big triple if you receive a bill double of what you expected?
  • Would you qualify yourself as either an amateur or a hobbyist?

When to use: A cheap cloud is meant to offer proper servers to the masses, “proper” meaning decent hardware and good internet connectivity, at an affordable price. It is simply not possible to get that from a home or an office (note: recycling an old laptop on a broadband line is not a comparable substitute for a proper server).

It is the go-to solution for all basics needs. For examples, professionals running a few simple services with low-to-moderate traffic, agencies in need of a simple hosting to deliver back to the client, amateurs and hobbyists doing experiments.

Generally speaking, it’s the best choice for anyone who is looking for [at most] a couple of servers, especially if the main criteria are “easy to manage” and “good bang for the bucks“.

Cheap clouds make servers affordable and easy to get:

  • Get real servers (server-grade hardware, good internet connectivity)
  • Simple, easy to use, easy to manage and convenient
  • Predictable costs, well-defined capabilities, no bullshit
  • Buy or sell a server in 60 seconds

Which provider to use: The next-generation cheap clouds are DigitalOcean and Linode. Go for Digital Ocean.

blog update 2016-10: Linode suffered from significant downtimes in the past month, similar to the downtimes from last year. These outages are the result of major DDoS attacks targeted against Linode itself (i.e. not one of the customer running on it). We recommend Digital Ocean as a safer choice.

Challengers: There is a truckload of historical and minor players (OVH, GoDaddy, Hetzner, …). They have some similar offerings to the cheap cloud providers, but it’s hidden somewhere in the poor UI trying to accommodate and sell 10 unrelated products and services. They may or may not be worth digging a bit (probably not).

Dedicated Clouds

Competitors: IBM SoftLayer, OVH, Hetzner

When to use: As a rule of thumb, the general purpose clouds are limited to 16 physical cores and 128 GB memory and 8 TB SAN drives, with the price increasing linearly along the specifications (double the memory = double the price). The dedicated clouds can provide much bigger servers and the high-end specs are significantly cheaper.

This is the go-to solution for special tasks running 24/7 that require exotic hardware, especially vertical scaling. Dedicated clouds are only fitted for special purposes.

Special case: We’ve seen people rent a single big dedicated server with vSphere and run numerous virtual machines on it. It allows to do plenty of experimentations at a fixed and fairly reasonable costs.

IBM SoftLayer:

  • Choose the hardware, tailored to the intended workload
  • Ultimate performance (bare-metal, no virtualization)
  • Quad CPU, 96 total cores is an option
  • 1 TB memory, f*** yeah!
  • 24 HDD or SSD drives in a single box

Which provider to useIBM SoftLayer is the only one to offer the next generation of dedicated cloud. Getting servers works the same way as buying servers from the Dell website (select a server enclosure and pick the components) except it’s rented and the price is per month. (Common configurations are available immediately, specialized hardware may need ordering and take a few days).

SoftLayer takes care of the hardware transparently: shipment, delivery, installation, parts, repair, maintenance. It’s like having our own racks and servers… without the hassle of having them.

Challengers: There are a few historical big players (OVH, Hetzner, …). They are running on an antiquated model, providing only a predefined set of boxes with limited choice. They can compare positively to SoftLayer (read: cheaper and not harder to manage/use) when running a few servers with nothing too exotic.

Housing & Collocation

When to use: Never. It’s always a bad decision.

There are 3 kinds of people who do housing on purpose:

  • People who genuinely think it’s cheaper (it is NOT when accounting for time)
  • People who genuinely got their maths wrong (hence thinking it was cheaper =D)
  • Students, amateurs, hobbyists, single server usage and not-for-profit

Let’s ignore the hobbyist. He got a decent server sitting in the garage. He might as well put it into a datacenter with 24h electricity and good internet to tinker around. That’s how he’ll learn. This is the only valid use case for housing.

What’s wrong with housing & collocation:

  • Unproductive time to go back and forth to the datacenters, repeatedly
  • Lost time and health moving tons of hardware (a 2U server is 20-40 kg)
  • Be forced to deal with hardware suppliers (DELL/HP)
  • Burn out, burst in rage and eventually attempt to strangle one colleague after having dealt with supplier bullshit for most of the afternoon (based on a real story)
  • Waste 3 weeks between ordering something and receiving it
  • Cry when something broke and there are no spare parts
  • Cry some more because the parts are end-of-life and can’t be ordered anymore
  • Suffer 100 times what initially expected because of the network and the storage (it’s the most expensive and the most difficult to get right in an infrastructure)
  • Renew the hardware after 3-5 years, hit all the aforementioned issues in a row
  • Be unable to have multiple sites, never go worldwide

These are major pain points to be encountered. Nonetheless it is easy to find cloud vs collocation comparisons not accounting for them and pretending to save $500k per month by buying your own hardware.

Abandoning hardware management has been an awesome life-changing experience. We are never going back to lifting tons of burden in miserable journeys to the mighty datacenter.

Make Your Own Datacenter

When to use: This was the go-to solution for hosting companies and the older internet giants.

The internet giants (Google, Amazon, Microsoft) started at a time when there was no provider available for their needs, let alone at a reasonable cost. They had to craft their own infrastructure to be able to sustain their activity.

Nowadays, they have opened their infrastructure and are offering it for sale to the world.  Top-notch web-scale infrastructure has become an accessible commodity. A tech company doesn’t need its own datacenters anymore, no matter how big it grows.

Cheat Sheet

Run an entire tech company in the cloud, or run only a single [big] project requiring more than 10 servers? Google Compute Engine

Run less than 10 servers, for as little cost as possible? Digital Ocean

Run only beefy servers ( > 100GB RAM) or have special hardware requirements? IBM SoftLayer or OVH

Conclusion

The cloud is awesome. No matter what we want, where and when we want it. There is always a server ready for us at the click of a button (and the typewriting of our credit cards details).

The most surprising thing we encounter daily on these services is to notice how everything is so new. A recurrent “available since XXX” written in a corner of the page, stating it’s only been there for 1-3 years.

These writings are telling a story. The cloud have had enough time to mature and it is ready to be mainstream. Maintaining physical servers is an era from the past.

Stack Overflow Survey Results: Money does buy happiness!


Developer Happiness By Salary

money_buys_happiness

The Stack Overflow Survey asks developers around the world about their current situation.

The answers provided by 46,122 respondents this year finally prove that money can buy happiness. In fact, the more money you get, the more happiness you get!

 

money shower
Enjoyable, isn’t it?

Source: Stack Overflow Developer Survey

System Design: Combining HAProxy, nginx, Varnish and more into the big picture


This comes from a question posted on stack overflow: Ordering: 1. nginx 2. varnish 3. haproxy 4. webserver?

I’ve seen people recommend combining all of these in a flow, but they seem to have lots of overlapping features so I’d like to dig in to why you might want to pass through 3 different programs before hitting your actual web server.

My answer explains what are these applications for, how do they fit together in the big pictures and when do they shine. [Original answer on ServerFault]

Foreword

As of 2016. Things are evolving, all servers are getting better, they all support SSL and the web is more amazing than ever.

Unless stated, the following is targeted toward professionals in business and start-ups, supporting thousands to millions of users.

These tools and architectures require a lot of users/hardware/money. You can try this at a home lab or to run a blog but that doesn’t make much sense.

As a general rule, remember that you want to keep it simple. Every middleware appended is another critical piece of middleware to maintain. Perfection is not achieved when there is nothing to add but when there is nothing left to remove.

Some Common and Interesting Deployments

HAProxy (balancing) + nginx (php application + caching)

The webserver is nginx running php. When nginx is already there it might as well handle the caching and redirections.

HAProxy —> nginx-php
A —> nginx-php
P —> nginx-php
r —> nginx-php
o —> nginx-php
x —> nginx-php
y —> nginx-php


HAProxy (balancing) + Varnish (caching) + Tomcat (Java application)

HAProxy can redirect to Varnish based on the request URI (*.jpg *.css *.js).

HAProxy —> tomcat
A —> tomcat
—> tomcat
P —> tomcat tomcat varnish varnish nginx:443 -> webserver:8080
A —> nginx:443 -> webserver:8080
P —> nginx:443 -> webserver:8080
r —> nginx:443 -> webserver:8080
o —> nginx:443 -> webserver:8080
x —> nginx:443 -> webserver:8080
y —> nginx:443 -> webserver:8080

Middleware

HAProxy: THE load balancer

Main Features:

  • Load balancing (TCP, HTTP, HTTPS)
  • Multiple algorithms (round robin, source ip, headers)
  • Session persistence
  • SSL termination

Similar Alternatives: nginx (multi-purpose web-server configurable as a load balancer)

Different Alternatives: Cloud (Amazon ELB, Google load balancer), Hardware (F5, fortinet, citrix netscaler), Other&Worldwide (DNS, anycast, CloudFlare)

What does HAProxy do and when do you HAVE TO use it?

Whenever you need load balancing. HAProxy is the go to solution.

Except when you want very cheap OR quick & dirty OR you don’t have the skills available, then you may use an ELB 😀

Except when you’re in banking/government/similar requiring to use your own datacenter with hard requirements (dedicated infrastructure, dependable failover, 2 layers of firewall, auditing stuff, SLA to pay x% per minute of downtime, all in one) then you may put 2 F5 on top of the rack containing your 30 application servers.

Except when you want to go past 100k HTTP(S) [and multi-sites], then you MUST have multiples HAProxy with a layer of [global] load balancing in front of them (cloudflare, DNS, anycast). Theoretically, the global balancer could talk straight to the webservers allowing to ditch HAProxy. Usually however, you SHOULD keep HAProxy(s) as the public entry point(s) to your datacenter and tune advanced options to balance fairly across hosts and minimize variance.

Personal Opinion: A small, contained, open source project, entirely dedicated to ONE TRUE PURPOSE. Among the easiest configuration (ONE file), most useful and most reliable open source software I have came across in my life.

Nginx: Apache that doesn’t suck

Main Features:

  • WebServer HTTP or HTTPS
  • Run applications in CGI/PHP/some other
  • URL redirection/rewriting
  • Access control
  • HTTP Headers manipulation
  • Caching
  • Reverse Proxy

Similar Alternatives: Apache, Lighttpd, Tomcat, Gunicorn…

Apache was the de-facto web server, also known as a giant clusterfuck of dozens modules and thousands lines httpd.conf on top of a broken request processing architecture. nginx redo all of that, with less modules, (slightly) simpler configuration and a better core architecture.

What does nginx do and when do you HAVE TO use it?

A webserver is intended to run applications. When your application is developed to run on nginx, you already have nginx and you may as well use all its features.

Except when your application is not intended to run on nginx and nginx is nowhere to be found in your stack (Java shop anyone?) then there is little point in nginx. The webservers features are likely to exist in your current webserver and the other tasks are better handled by the appropriate dedicated tool (HAProxy/Varnish/CDN).

Except when your webserver/application is lacking features, hard to configure and/or you’d rather die job than look at it (Gunicorn anyone?), then you may put an nginx in front (i.e. locally on each node) to perform URL rewriting, send 301 redirections, enforce access control, provide SSL encryption, and edit HTTP headers on-the-fly. [These are the features expected from a webserver]

Varnish: THE caching server

Main Features:

  • Caching
  • Advanced Caching
  • Fine Grained Caching
  • Caching

Similar Alternatives: nginx (multi-purpose web-server configurable as a caching server)

Different Alternatives: CDN (Akamai, Amazon CloudFront, CloudFlare), Hardware (F5, Fortinet, Citrix NetScaler)

What does Varnish do and when do you HAVE TO use it?

It does caching, only caching. It’s usually not worth the effort and it’s a waste of time. Try CDN instead. Be aware that caching is the last thing you should care about when running a website.

Except when you’re running a website exclusively about pictures or videos then you should look into CDN thoroughly and think about caching seriously.

Except when you’re forced to use your own hardware in your own datacenter (CDN ain’t an option) and your webservers are terrible at delivering static files (adding more webservers ain’t helping) then Varnish is the last resort.

Except when you have a site with mostly-static-yet-complex-dynamically-generated-content (see the following paragraphs) then Varnish can save a lot of processing power on your webservers.

Static caching is overrated in 2016

Caching is almost configuration free, money free, and time free. Just subscribe to CloudFlare, or CloudFront or Akamai or MaxCDN. The time it takes me to write this line is longer that the time it takes to setup caching AND the beer I am holding in my hand is more expensive than the median CloudFlare subscription.

All these services work out of the box for static *.css *.js *.png and more. In fact, they mostly honour the Cache-Control directive in the HTTP header. The first step of caching is to configure your webservers to send proper cache directives. Doesn’t matter what CDN, what Varnish, what browser is in the middle.

Performance Considerations

Varnish was created at a time when the average web servers was choking to serve a cat picture on a blog. Nowadays a single instance of the average modern multi-threaded asynchronous buzzword-driven webserver can reliably deliver kittens to an entire country. Courtesy of sendfile().

I did some quick performance testing for the last project I worked on. A single tomcat instances could serve 21 000 to 33 000 static files per second over HTTP (testing files from 20B to 12kB with varying HTTP/client connections count). The sustained outbound traffic is beyond 2.4 Gb/s. Production will only have 1 Gb/s interfaces. Can’t do better than the hardware, no point in even trying Varnish.

Caching Complex Changing Dynamic Content

CDN and caching servers usually ignore URL with parameters like ?article=1843, they ignore any request with sessions cookies or authenticated users, and they ignore most MIME types including the application/json from /api/article/1843/info. There are configuration options available but usually not fine grained, rather “all or nothing”.

Varnish can have custom complex rules (see VCL) to define what is cachable and what is not. These rules can cache specific content by URI, headers and current user session cookie and MIME type and content ALL TOGETHER. That can save a lot of processing power on webservers for some very specific load pattern. That’s when Varnish is handy and AWESOME.

Conclusion

It took me a while to understand all these pieces, when to use them and how they fit together. Hope this can help you.

Configuring timeouts in HAProxy


This comes from a question posted on stack overflow: By what criteria do you tune timeouts in HA Proxy config?

When configuring HA Proxy, how do you decide what values to assign to the timeouts? I’ve read a half dozen samples in various blogs, and everyone uses different timeouts and no one discusses why.

My originally answer was posted on ServerFault.

Foreword

I’ve been tuning HAProxy for a while and done a lot of performance testing on it. From 100 HTTP requests/s to 50 000 HTTP requests/s.

The first advice is to enable the statistics page on HAProxy. You NEED monitoring, no exception. You will also need fine tuning if you intend to go past 10 000 requests/s.

Timeouts are a confusing beast because they have a huge range of possible values, most of them having no observable difference. I have yet to see something fails because of a number 5% lower or 5% higher. 10000 vs 11000 milliseconds, who cares? Probably not your system.

Configuration

I cannot in good conscience give a couple of numbers as ‘best timeouts ever for everyone’.

What I can tell instead is the MOST aggressive timeouts which are always acceptable for HTTP(S) load balancing. If you encounter lower than these, it’s time to reconfigure your load balancer.

timeout connect 5000
timeout check 5000
timeout client 30000
timeout server 30000

timeout client:

The inactivity timeout applies when the client is expected to acknowledge or
send data. In HTTP mode, this timeout is particularly important to consider
during the first phase, when the client sends the request, and during the
response while it is reading data sent by the server.

Read: This is the maximum time to receive HTTP request headers from the client.

3G/4G/56k/satellite can be slow at times. Still, they should be able to send HTTP headers in a few seconds, NOT 30.

If someone has a connection so bad that it needs more than 30s to request a page (then more than 10*30s to request the 10 embedded images/CSS/JS), I believe it is acceptable to reject him.

timeout server:

The inactivity timeout applies when the server is expected to acknowledge or
send data. In HTTP mode, this timeout is particularly important to consider
during the first phase of the server’s response, when it has to send the
headers, as it directly represents the server’s processing time for the
request. To find out what value to put there, it’s often good to start with
what would be considered as unacceptable response times, then check the logs
to observe the response time distribution, and adjust the value accordingly.

Read: This is the maximum time to receive HTTP response headers from the server (after it received the full client request). Basically, this is the processing time from your servers, before it starts sending the response.

If your server is so slow that it requires more than 30s to start giving an answer, then I believe it is acceptable to consider it dead.

Special Case: Some RARE services doing very heavy processing might take a full minute or more to give an answer. This timeout may need to be increased a lot for this specific usage. (Note: This is likely to be a case of bad design, use an async style communication or don’t use HTTP at all.)

timeout connect

Set the maximum time to wait for a connection attempt to a server to succeed.

Read: The maximum time a server has to accept a TCP connection.

Servers are in the same LAN as HAProxy so it should be fast. Give it at least 5 seconds because that’s how long it may take when anything unexpected happens (a lost TCP packet to retransmit, a server forking a new process to take the new requests, spike in traffic).

Special Case: When servers are in a different LAN or over an unreliable link. This timeout may need to be increased a lot. (Note: This is likely to be a case of bad architecture.)

timeout check

Set additional check timeout, but only after a connection has been already
established.

Set additional check timeout, but only after a connection has been already
If set, haproxy uses min(“timeout connect”, “inter”) as a connect timeout
for check and “timeout check” as an additional read timeout. The “min” is
used so that people running with very long “timeout connect” (eg. those
who needed this due to the queue or tarpit) do not slow down their checks.
(Please also note that there is no valid reason to have such long connect
timeouts, because “timeout queue” and “timeout tarpit” can always be used
to avoid that).

Read: When performing a healthcheck, the server has timeout connect to accept the connection then timeout check to give the response.

All servers MUST have a HTTP(S) health check configured. That’s the only way for the load balancer to know whether a server is available. The healthcheck is a simple /isalive page always answering OK.

Give this timeout at least 5 seconds because that’s how long it may take when anything unexpected happens (a lost TCP packet to retransmit, a server forking a new process to take the new requests, spike in traffic).

War Story: A lot of people wrongly believe that the server can always answer this simple page in 3 ms. They set an aggressive timeout (< 2000ms) with aggressive failover (2 failed checks = server dead). I have seen entire websites going down because of that. Typically there is a slight spike in traffic, backend servers get slower, the healthchecks are delayed… until suddenly they all timeout together, HAProxy thinks ALL servers died at once and the entire site goes down.

Conclusion

Hope you understand timeouts better now.

Lessons Learned #0: The HAProxy statistics page is your best friend for monitoring connections, timeouts and everything.

Lessons Learned #1: Timeouts aren’t that important.

Lessons Learned #2: Be gentle on timeout configuration (especially timeout check and timeout connect). There has never been any issue because of “slightly too long timeout” but there are regular cases of “too short timeout” that put entire websites down.

Monitoring in the Cloud: Datadog vs Server Density vs SignalFX vs StackDriver vs BMC Boundary vs Wavefront vs NewRelic


We’re a tech company and we have more than 100 AWS instances to run our services. It is critical that we have good monitoring, metrics collections, graphs and alerting.

Current Setup

We have an in-house monitoring solution built over more than 9 tools, including but not limited to:

  • statsd
  • collectd
  • graphite
  • grafana
  • nagios
  • cacti
  • riemann
  • icinga

All are open-source solutions (as in build-it and maintain-it yourself). Most are tools coming straight from the 90’s with an old UI, they are hard to use and they are hard to maintain. None of these can scale or run on more than a single node.

That’s a total of 8 independent points of failure, put under constant pressure by many hosts and metrics, unable to understand AWS hosts going up and down regularly. So far, the palm of the worst-in-class belong to riemann. Its configuration is a 1000 lines file written in Clojure with up to 12 levels of indentation.

We’ve been babysitting this setup again and again every time it breaks and it’s been a major pain in the ass. We’re reached a desperate point were we just want to throw everything away and stop the pain.

What if we don’t want to send our data to 3rd party?

Neither do we.

We thought about it and we came to the conclusion that CPU percentage and memory usage are not critical information to be kept private at all cost. They don’t give away any user data and they don’t give away critical business information.

If there is something out there that is worthy to graph it out, so be it.

Actually, it’s a fake dilemma. We’ve tried “the build and maintain it ourself” already and it’s a major failure. Let’s not burn out more time and people to go that wrong route.

What to expect from a monitoring solution

The MUST have:

  • Short interval between metrics (our current collectd is about 15s-20s)
  • Graph by min, average AND max
  • Easy deployment
  • Cute graphs (colors, zoom, legend, easily readable)
  • Responsive site
  • Monitor the basics (memory, disk, I/O, …)
  • Custom dashboards
  • Custom alerting

The SHOULD have:

  • Compare graphs (arrange in grid, superimpose, align axes…)
  • Advanced alerting (moving time windows, multiple metrics, outlier detection)
  • Integrate with middleware (PostgreSQL metrics, nginx metrics, …)
  • Easily add/remove hosts (AWS environment is constantly evolving)

Options:

  • Collectd + Graphite + Grafana + Icinga + Riemann (the on-site crowd)
  • Server Density
  • Datadog (cloud)
  • BMC truesightpulse (ex. Boundary)
  • [Google] StackDriver
  • SignalFX
  • WaveFront
  • NewRelic

Trial by trialing

collectd + graphite + grafana + icinga + Riemann (on-site)

The standard on-site solution that everyone knows. Not worth presenting since we’re trying to run away from it.

Server Density

A London company (close to us :D) who raised some money in 2010, 2011, 2015. We had received positive feedbacks about Server Density before. Let’s go for the trial.

Agent Installation

The agent was painful to install.

Each host has to be registered individually with the service. It gets unique keys and a unique configuration. It was a pain in the ass to automate the deployment. Multiple REST API calls to their services and to get piece of configuration depending on the current state of the host in their service.

Web Interface

  • Metrics interval is 1 minute at best. An ENTIRE minute
  • No filtering by min, average, max
  • No legends on graphs. No clue what the lines are showing
  • No integration with any middleware or application
  • The website fails to load way too often

The site fails to load every few pages. After a few hours surfing for the trial, we were genuinely thinking that our office internet connection was broken. Thankfully it is not our internet but the server density site which is extremely buggy.

Conclusion

Removed that s**** after 48 hours, cleaned agents, killed all the hosts where they ever was an agent.

Between the site failing randomly, the terrible UI and all the basics features missing. This is one of the worst product we are have ever come across. We cannot comprehend how it ever managed to get positive reviews or raise money 3 times.

Datadog

An American companies founded somewhere around 2008. Raised 15 M$ in 2014, then 31 M$ in 2015 and finally 97 M$ in 2016.

Long story short. It’s very good and it does everything we wan. (We’ll publish an article dedicated to Datadog later).

Once in a lifetime, you get the opportunity to look at two companies of the same age in the same market. One of them (Datadog) just happening to raise 50 times more money than the other one (Server Density). It turns out to be a definitive indicator of how good the products are relative to each other.

[Google] StackDriver

An American company founded around 2012. Raised 5 M$ in 2012, acquired by Google in 2014.

The main site http://www.stackdriver.com/ is still online. The screenshots are nice and we want to try that thing.

There is an issue though. We try to try it and we can’t because there is no way to try it. Parts of the site are inaccessible, parts redirect to google, some sections are missing.

Google bought it in may 2014, it is now may 2016. The product should be available and the site should be up (eventually all under a different name and logo) but it’s not.

It looks like the service was killed as a result of the Google acquisition. This could have been a good monitoring tool but we’ll never know. If anyone had the opportunity to try and has experience with it, please comment.

June update: There are references to Google StackDriver suddenly appearing all over the GCE documentation. A closed-beta is available on-request for premium customers.

July update: It’s now clear that StackDriver is being integrated to Google. It will become part of their cloud offering and it will be available as a standalone product. Expecting a release within 1-2 years.

BMC truesightpulse (ex. Boundary)

American company founded around 2010. Bought for 15 M$ in 2012 by BMC and became truesightpulse.

We had heard of Boundary multiple times but couldn’t find it. We already settled for Datadog  (and were satisfied) by the time we understood that Boundary was acquired and renamed by BMC.

Judging by what we can see on the website. The screenshots are good, it can get metrics from all the common databases/webservers, it integrates with AWS/GCE. The pricing is a bit cheaper than Datadog ($12/month per host).

It’s the historic direct competitor to Datadog. They’re mostly copy cat of each other.

SignalFX

[July 2016 update: added SignalFX]

Yet another monitoring company that raised millions. A late comer to the market.

Basically, it’s a direct copy-cat of Datadog and BMC. The UI is nice and the graphs are cute (same as the competitors). It’s lagging behind in terms of advanced features and integrations though, not sure if it can catch up with the leader.

The price point is per metric stream per month which may make it cheaper than Datadog while somewhat equivalent for simple basic monitoring.

If you have to trial only two services. The first pick is Datadog and the second pick is SignalFX. (BMC is a fair second pick as well, note that we’re biased against bigger companies with more products and less focus).

Wavefront

[July 2016 update: added Wavefront]

We received a link to Wavefront during our holidays right after we closed the evaluation. It’s another late comer and perfect copy-cat (we’re crossing a line here: some icons and UI are identical pixel wise).

We open the link on our laptop in battery saving mode and… Firefox freezes for a minute. Who thought that a full screen HD video of a dude surfing was a good thing to put as a main page?

Well, we will have to wait for the end of the holidays to see the website, until we have access to our work computers again (i7 8 cores, 32 GB memory, SSD).

Once we get back to work and check the website, it turns out that Wavefront doesn’t display any price publicly and gives no trial either. Can’t do a anything without talking to their sales guys first.

At this point, we’ve already done weeks of trial and we’ve got 3 strong competitors who have better products and are more accessible. For the sake of it, we’ll just pretend that Wavefront doesn’t exist.

NewRelic

No need to introduce NewRelic. Maybe the most advertised company of 2015, one of the highest valuation ever done for monitoring related tools, world best in class Application Monitoring Performance (APM).

We already used NewRelic APM to monitor our applications and we love it. It gives very deep performance information about the application (detailed profiler, call stack, debugging). If they have a server monitoring thing, we could expand our deployment.

NewRelic doesn’t do monitoring

It turns out that NewRelic don’t have any product to do server monitoring.

Still thought about NewRelic to monitor the database/webservers because it would be nice to have performance indications, query timings and things like that. It turns out that they don’t support PostgreSQL at all. In fact they don’t support ANY database. NewRelic APM is only available to monitor applications written in Java, Python, C# and a few others. That’s it. Nothing more.

We checked out the NewRelic plugins. There are 3 plugins for PostgreSQL, all of them written pre-2014, being abandoned GitHub project by a random dude. They can barely get 5-10 metrics and provide no profiling whatsoever. Not to mention that the comments averaging 2/5 stars are scary.

As a conclusion, NewRelic cannot do server monitoring. (They’re really awesome in the application performance market though).

Conclusion

#MonitoringSucks is over. We’ve got a pack of great monitoring tools all invented at once.

The world best in class is Datadog (we’ll write a dedicated article later). It’s older and more mature. It has the most features and integrations. When you have to pick a monitoring tool for the future of your tech company, that’s the horse you want to put your money on.

The challengers are SignalFX and BMC truesightpulse.

 

How to export Amazon EC2 instances to a CSV file


Amazon website is limited to 50 instances per page. Viewing lots of instances is a pain and it doesn’t support exporting to CSV/TSV/Excel/other out of the box. The only fix is to use the CLI.

Requirements

  • An AWS account with access rights to see your servers
  • A pair of AWS keys (Users -> [username] -> Security Credentials -> Create Access Key)
# Install AWS packages
sudo apt-get install -y python python-pip
sudo pip install aws-shell

Listing Instances

# aws-shell will show a wizard to configure your account and region the first time you use it
aws-shell
ec2 describe-instances --output text --query 'Reservations[*].Instances[*].[InstanceId, InstanceType, ImageId, State.Name, LaunchTime, Placement.AvailabilityZone, Placement.Tenancy, PrivateIpAddress, PrivateDnsName, PublicDnsName, [Tags[?Key==`Name`].Value] [0][0], [Tags[?Key==`purpose`].Value] [0][0], [Tags[?Key==`environment`].Value] [0][0], [Tags[?Key==`team`].Value] [0][0] ]' > instances.tsv
# open instances.tsv with Excel
# enjoy

You can modify the command to pick the information you want. Refer to the official AWS command line reference.

CV Francais vs UK resume


CV Français

+---------------------------------------------------------------------+
| nom                                                                 |
| ville                                                               |
| age                           photo                            blog |
| nationalité (VISA ?)          photo                       portfolio |
| marié/célibataire             photo                          github |
| permis B                   (optionnel)                stackoverflow |
| téléphone                                                           |
| mail                                                                |
|                                                                     |
|                           === TITRE ===                             |
|                           === TITRE ===                             |
|                           === TITRE ===                             |
|                                                                     |
| === FORMATION                                                       |
|   - 2013, Master <spécialité>, école <nom>                          |
|   - 2009, DUT <spécialité><spécialité>, université <nom>            |
|                                                                     |
| === EXPÉRIENCES PROFESSIONNELLES                                    |
|                                                                     |
|    *** Dates - Société1 - Role1 - Ville ****                        |
|      - lorem ipsum dolor                                            |
|      - sit amet and cookies                                         |
|      = MOTS-CLÉS: latin, langues mortes                             |
|                                                                     |
|    *** Dates - Société2 - Role2 - Ville ****                        |
|      - the quick brown fox                                          |
|      - jumps over the lazy cat                                      |
|      = MOTS-CLÉS: anglais, animaux                                  |
|                                                                     |
| === COMPÉTENCES                                                     |
|   - Latin: Victoriae, mundis, mundis, lacrima                       |
|   - Cuisine: Omelette au fromage                                    |
|                                                                     |
| === INTÉRÊTS ET LANGUES                                             |
|   - Ping-pong                                                       |
|   - Équitation                                                      |
|   - Anglais: bilingue (TOEIC )                                      |
|                                                                     |
|                      === OBJECTIFS ===                              |
|     Je recherche un stage en x de y a z dans la ville v.            |
|                                                                     |
+---------------------------------------------------------------------+

UK Resume

+---------------------------------------------------------------------+
| name                                                           blog |
| city                                                      portfolio |
| nationality (VISA status)                                    github |
| phone                                                 stackoverflow |
| email                                                               |
|                                                                     |
|                           === TITLE ===                             |
|                           === TITLE ===                             |
|                           === TITLE ===                             |
|                                                                     |
| === PROFILE                                                         |
|    Lots of bullshit about how qualified I am.                       |
|    Currently seeking an internship in x from y to z date in c City. |
|                                                                     |
| === PROFESSIONAL EXPERIENCES                                        |
|                                                                     |
|    *** Dates - Company1 - Role1 - City ****                         |
|      - lorem ipsum dolor                                            |
|      - sit amet and cookies                                         |
|      = KEYWORDS: Latin, dead languages                              |
|                                                                     |
|    *** Dates - Company2 - Role2 - City ****                         |
|      - the quick brown fox                                          |
|      - jumps over the lazy cat                                      |
|      = KEYWORDS: English, animals                                   |
|                                                                     |
| === EDUCATION                                                       |
|   - 2013, MEng. <domain>, university <name>                         |
|   - 2010, BEng. <domain>, university <name>                         |
|                                                                     |
| === SKILLS                                                          |
|   - Latin: Victoriae, mundis, mundis, lacrima                       |
|   - Cooking: Omelette au fromage                                    |
|                                                                     |
| === INTERESTS                                                       |
|   - Ping Pong               - Horse Riding                          |
|                                                                     |
| === LANGUAGES                                                       |
|   - French: native          - English: fluent (certifications)      |
|                                                                     |
+---------------------------------------------------------------------+

The Differences

Personal Information

The UK resume have NO age, NO marital status, NO photo. UK has strict anti discrimination law to which companies actually obey. They’re not allowed to ask about those and you should not write them at all. That is a mistake punishable by instant destruction. Luckily for you. Companies used to receiving candidates from abroad, especially in tech, are not too strict on this and may not even know about the law.

The UK only ‘profile’ section

The UK resume ALWAYS begins with a profile section. A few lines to introduce [s]yourself[s] your professional profile and eventually some context (e.g.  Looking for internship in London in x from y to z date). It can be the hardest to find inspiration for. One useful advise is to stay factual and avoid the bullshit. The expectations are not set in stone, your mileage may vary depending on the reader.

Factual: “8 years experience in Project Management and Business Analysis within Finance and Telecommunications, backed by a 1st class degree and MSc.”
Bullshit: “I’m self motivated, able to work in a team and on my own.”

The French only ‘objectifs’ section (aka. ‘goals’)

Usually not present on most French resumes. There is some debate about including it or not. You’re obviously sending a resume because your goal is to find a job, right ? Well, not necessarily any job anywhere anytime. It is worth writing that you’re available from specific dates, for an internship, or in a given city only. If you have such restrictions, this section definitely makes sense and should be included. The British are smart enough that they have a profile section to put these things into.

Is it a CV or a Resume ?

It is called a CV in French (and others Latin languages). In the UK it is either called a CV or a resume, whichever you prefer.

Some people say that one word is British and the other is American, however it is not clear which is which. Some people argue that resume is the 1-2 pages short version and CV is the longer version.

I like to get things right -so does my inner grammar Nazi- and I did some digging. Supposing there ever was a strict difference between those, it doesn’t matter any more. It is long lost and forgotten. Pick whichever you prefer and stick to it.

LDAP Benchmark: 10 million users with OpenDJ


Let’s store 10 million users in a LDAP. It is possible? If so, what hardware does it take?

Benchmark Setup

Test Scenario

Same scenario and procedure as the previous article comparing LDAP servers with 100k users accounts.

Hardware

full cluster
Authentication Platform

One rectangle is one server. Lines represent connections.

Load testing is done using the OpenAM REST API.

The load injector is a distributed proprietary solution we pay a lot of money for. It could use a full diagram on its own.

Software and Configuration

  • CentOS 6.7
  • Oracle JDK 7 (special edition)
  • OpenDJ 2.6
  • OpenAM 12 (tomcat7)

We have a special package of the Oracle JDK given by Oracle. Custom stuff and bugfixes before they get to the general public. We’re a Java shop with big critical projects and that’s part of the advantages we get with the support contract.

All applications are tuned for maximum performances:

  • HAProxy: leastconn balancing, persistent sessions with cookies, 10 Gb networking, up to 16 000 HTTP requests/s during this benchmark
  • OpenAM: disable debug logs, log level info, LDAP connections pool = 65, tomcat thread pool = 400, JVM Heap = 4GB
  • OpenDJ: log level info, multi-master replication, memory for database caching = 80%, JVM heap = 50 GB

We spent a lot of time to tune the JVM settings. Evaluate the ConcMarkSweep against the G1 Garbage Collector, the effects of generation sizes and settings. That was an utter waste of time. It barely has any influence on the applications (+- 1%).

A bit of history: OpenDJ was originally developed by Sun, who invented Java and the JVM. The story is that used internally OpenDJ as a benchmark during the development of the G1 garbage collector. This is the latest java garbage collector recommended when using more than 8 cores and 30 GB memory.

A Word About OpenDJ Replication

OpenDJ supports master-master replication. Any instance can be written to and changes are propagated to other instances.

Master-slave replication is not supported and will never be. [It’s never needed anyway].

OpenDJ replication
Worldwide OpenDJ deployment with 104 directories + 8 replication nodes

“This is a worldwide deployment with many directory services in 4 regions and 8 replication services fully connected. Each directory service is connected to a single replication server, but can failover in matter of seconds, by priority in the same region.”

Source: visualizing the opendj replication topology, Ludovic Poitou Blog, ForgeRock Product Manager

There are dedicated directory nodes to store user accounts (outer circle) and dedicated replication nodes to handle replication (inner circle). This architecture is scalable and it’s intended to scale to multiple datacenters, worldwide. For this benchmark we’re using a single site with 4 directory nodes and 2 replication nodes.

There is a common myth saying that LDAP replication sucks. To be accurate, it is only the OpenLDAP replication (and ApacheDS) that is poorly done and buggy. The OpenDJ replication is well designed and working flawlessly.

There are very few databases in existence supporting worldwide multi-site replication, let alone intended for critical production environment. OpenDJ is part of the elite few along Cassandra from Facebook plus BigTable and Spanner from Google.

Tests Results

Performances

full cluster authrate
OpenAM REST API Login Rate Comparison
full cluster latency
OpenAM REST API Median Login Duration Comparison

An OpenAM server will only connect to one OpenDJ server at any time. This current setup is a simple architecture with the same amount of OpenAM and OpenDJ servers, with the connection affinity configured to group them in pairs.

The bottleneck here is the CPU usage on OpenAM. The tests are pushing OpenAM servers to constant 100% CPU usage while OpenDJ servers are running at 30%.

We could EITHER double the amount of cores on all the OpenAM servers to double the throughput (for a slightly increased latency) OR get the same performances results with only two OpenDJ directories.

OpenDJ Memory and Disk Consumption

OpenDJ database size
OpenDJ Database Size

Disk usage increases linearly with the number of user accounts. The official ForgeRock recommendation is to set OpenDJ heap to [at least] twice the size of the on-disk database to be able to cache everything.

OpenAM Memory Per Session

OpenAM memory per session
OpenAM Memory per Session

This is an estimation of the memory usage per opened session. There is no way to measure it directly with confidence, this is a high bound obtained by averaging different memory metrics across multiple OpenAM instances over a few test runs.

The high peak at the beginning is because we can’t distinguish properly memory used by sessions and general server memory usage, having only a few sessions at the start gives an overrated average. The peak at the end is because the JVM is optimizing and reclaiming memory aggressively when the process is close to its maximum heap allowance.

One sessions requires ~ 15k of memory. This is consistent with the official ForgeRock recommendation to allocate 3GB memory heap to OpenAM for 100 000 sessions. For more sessions it’s recommended to scale horizontally with each instance having 100 000 sessions or less.

Conclusion

There is a boost in performances under 50 000 user accounts. Afterwards, performances are flat no matter the amount of user accounts. Thus performances is not linked to the amount of user accounts. Who would have guessed?

OpenDJ and OpenAM were flawless during all stress tests. No bug, no crash, no error, no comment to be done.

When you want to store user accounts in a LDAP and authenticate through OpenID Connect & SAML with optional 2 factor authentication, for 10 million users, across the globe. OpenAM with OpenDJ is the go-to solution. In fact it’s the only solution available on the planet to support all of this.