This is the story of a debugging case with a happy ending.
TL;DR This is a bug in the glibc malloc(). It mainly affects 64 bits multi-threading applications. With a special mention to Java applications because the JVM seems to always trigger the worst case.
The write up is quite long, skip to third section glibc malloc() if you just want the ending.
A developer reported an issue in one of their service and asked for some help to investigate. Sure.
Digging into the service logs, I noticed that there were no logs for a handful of instances. It’s barely noticeable, only a few percents of the normal activity goes there and is not accounted for.
I tracked this down to pair of dev and production hosts, part of larger clusters with tens of hosts.
How come services that should be running there are not? More worrying there were no failure notifications, or alerts, or issues reported by users?
This is a bank and those are fairly critical. These cannot fail nor be missing.
To the extent of my knowledge they have only failed twice in recent years. The first time due to a bug happening in rare circumstances in one region, the second time due to the complete and sudden loss of a datacenter.
In both cases it took a significant amount of time to understand the root cause (almost one hour to conclude with certainty that a datacenter has exploded? What?!). Both were mostly non-events as the failover kicked in within 30 seconds, re-routing most activity to the other side of the world, long before anybody could realize anything happened or do anything about it. Disaster recovery at it’s peak.
What are the odds that we are looking at the soon-to-be third issue, except this time we’re catching it early before other have noticed?
Gathering another coworker and back into serious debugging.
After careful checking. The absent jobs are all running and working well, which is reassuring. However they are “ghosts” on today, for lack of a better word. There is no trace of activity or debug logs.
How is that possible? Let’s investigate the hosts and figure out what’s wrong with them in particular.
It doesn’t take long to realize that the hosts are missing pieces around logging/monitoring.
One of the easier bits to investigate further might be one of the Logstash agents. It is not running on any of the affected machines.
The configuration looks good. Run a configuration check and it passes. There are starting logs from yesterday so it must have worked at some point.
Let’s try to restart it. Done.
Odd. It is running fine now. Back filling data that was missing. Why would it be dead in the first place?
Data disappear after a few minutes. What? Double checking. There is no more process.
Restart it again, while paying more attention. It starts and dies one minute later. Repeat one more time, same.
Victory, we know for sure something is broken and we have a repro! Having a repro is the most critical ingredient to the resolution of a tricky bug.
There are no meaningful error indicating a configuration issue or an application issue or a JVM issue.
Let’s restart it again keeping a close look at the memory consumption. It’s going up and up to multiple gigabytes and die.
Working software doesn’t stop working by itself, but due to some changes in its configuration or environment. What changed recently?
For context, the firm is mono repo with thousands of developers and hundreds of thousands of jobs deployed every day. Almost exclusively a python shop.
There is one change happening firmwide over the recent weeks that may be related. The firm is deprecating 32 bits. Literally everything that is 32 bits.
If 2020 is the year python 2 must die, then 2021 is the year 32 bits must die.
This makes perfect sense. Server operating systems have been 64 bits for two decades, desktops for one. Many applications, databases and middleware can only run 64 bits because they require more than 4 GB of memory. Major libraries are dropping 32 bits support.
One challenge with transitioning from 32 bits to 64 bits is applications with memory leaks combined with the lack of memory limits.
In a 32 bits world, a leaky application can consume more and more memory until it’s approaching the 4 GB memory limit adressable by 32 bits and crash.
In a 64 bits world, a leaky application can consume more and more memory with no limit, exhausting resources on the host and crashing other tasks. This is a major recurring source of issues with deploying applications.
The solution is almost trivial, set a memory limit by default. Let’s say 8 GB. It’s more than what was available in 32 bits (4GB) with a large margin for extra overheard.
Sure enough. The planned depreciation of 32 bits came along -default- memory limits on every process. Well announced and staged gradually.
Of course there were a couple services that blew up after that and they were immediately fixed by setting a memory limit of 50 GB or whatever high number. Urgency did not call for accuracy.
Back to our few dying services. What do they have in common? The hosts fit the criteria for early staging. They must have received the update.
Quick check. Yes, the deployed processes get a memory limit.
Again, restart a failing process and monitor carefully, it works fine for about one minute, with the memory increasing, then dies. What are the odds that it is killed when reaching the memory limit?
One problem though. There is no way that job can approach anywhere near 4 GB of memory. This is a java application and the JVM preallocates memory on startup, it is configured to use exactly 1024 MB of heap
Where does the extra memory comes from?
Long story short, this is due to a bug in malloc(). Well, it’s not a bug it’s a feature.
malloc() preallocates large chunks of memory, per thread. This is meant as a performance optimization, to reduce memory contention in highly threaded applications.
In 32 bits runtime, it can preallocate up to 2 * 64 MB * cores.
In 64 bits runtime, it can preallocate up to 8 * 64 MB * cores.
On a typical physical server, dual Xeon CPU with a terabyte of RAM. The core count is easily 40 or above. 10 cores * 2 CPU * 2 for hyper threading.
This means a preallocation of up to 20 GB of RAM in the process.
How much exactly depends on the application, but as a rule of thumb it requires a multi-threaded application to have the most. For some reasons the JVM always seem to trigger the worst case.
The allocated memory counts toward the memory limit.
That’s why the process fails by dying after one minute, it’s allocating/deallocating memory in different threads, triggering more and more preallocation as a side effect, until it reaches the limit and gets terminated.
It’s a nasty bug between the preallocation in glibc malloc() and the memory limit in rlimit. Any application that has a low rlimit (tens of gigabytes or less) is at risk to hit this bug and be terminated.
How to mitigate?
Preallocation behavior can be adjusted with the environment variable
A deployment system that sets a memory limit must set this variable too. The major schedulers and cloud hosting have mitigations out-of-the-box, see Heroku or Hadoop YARN.
A value of 1 can disable it, but is likely to reduce performance by a few percents, so not recommended.
Interestingly enough, this is supposed to be a performance optimization but the few benchmarks on the internet all show a net degradation of performance as the value goes up in the 2 digits (from around 10 to higher). The default in the hundreds seems out of whack (512 arenas with 40 cores).
Edit: found after the write up. The issue is referenced in a glibc wiki page from 2013 Todo/Enhancing_malloc “Throttle number of arenas based on process rlimit on process startup and/or everytime RLIMIT_AS is modified.“