We’re a tech company and we run our services on more than 100 AWS instances. It is critical that we have good monitoring, metrics collection, graphs and alerting.
We have an in-house monitoring solution built over more than 9 tools, including but not limited to:
All are open-source solutions (as in build-it and maintain-it yourself). Most are tools straight from the 90’s with an old UI; they are hard to use and hard to maintain. None of these can scale or run on more than a single node.
That’s a total of 8 independent points of failure, put under constant pressure by many hosts and metrics, unable to understand AWS hosts going up and down regularly. So far, the worst-in-class prize goes to Riemann. Its configuration is a 1000-line file written in Clojure with up to 12 levels of indentation.
We’ve been babysitting this setup again and again every time it breaks and it’s been a major pain in the ass. We’ve reached a desperate point where we just want to throw everything away and stop the pain.
What if we don’t want to send our data to a 3rd party?
Neither do we.
We thought about it and we came to the conclusion that CPU percentage and memory usage are not critical information to be kept private at all cost. They don’t give away any user data and they don’t give away critical business information.
If there is something out there that is worthy of graphing them, so be it.
Actually, it’s a false dilemma. We’ve already tried the “build it and maintain it ourselves” route and it was a major failure. Let’s not burn more time and people going down that wrong route.
What to expect from a monitoring solution
The MUST have:
- Short interval between metrics (our current collectd is about 15s-20s)
- Graph by min, average AND max
- Easy deployment
- Cute graphs (colors, zoom, legend, easily readable)
- Responsive site
- Monitor the basics (memory, disk, I/O, …)
- Custom dashboards
- Custom alerting
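To illustrate why we insist on min, average AND max, here is a minimal Python sketch. The `rollup` helper and the CPU samples are made up for this example (they are not taken from any of the tools discussed): it groups 15-second samples into 1-minute buckets, which is roughly what a graph does when it zooms out.

```python
from statistics import mean


def rollup(samples, bucket_seconds=60):
    """Group (timestamp, value) samples into fixed-size buckets.

    Returns {bucket_start: (min, avg, max)}.
    """
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % bucket_seconds)  # align to the bucket start
        buckets.setdefault(start, []).append(value)
    return {start: (min(vals), mean(vals), max(vals))
            for start, vals in sorted(buckets.items())}


# Hypothetical CPU percentages, one sample every 15 seconds.
cpu = [(0, 10.0), (15, 12.0), (30, 98.0), (45, 12.0),
       (60, 11.0), (75, 9.0), (90, 10.0), (105, 10.0)]

for start, (lo, avg, hi) in rollup(cpu).items():
    print(f"t={start:>3}s min={lo:.1f} avg={avg:.1f} max={hi:.1f}")
```

In the first minute the average is 33% while the max is 98%: a graph that only shows the average hides the spike completely, and a 1-minute collection interval could miss it altogether.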
The SHOULD have:
- Compare graphs (arrange in grid, superimpose, align axes…)
- Advanced alerting (moving time windows, multiple metrics, outlier detection)
- Integrate with middleware (PostgreSQL metrics, nginx metrics, …)
- Easily add/remove hosts (AWS environment is constantly evolving)
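As an idea of what we mean by alerting on a moving time window, here is a small Python sketch. The `MovingWindowAlert` class, the window size and the threshold are assumptions for illustration only; real products expose this through their own configuration, not through this code.

```python
from collections import deque


class MovingWindowAlert:
    """Fire when the average of the last `window` samples exceeds `threshold`."""

    def __init__(self, window, threshold):
        self.values = deque(maxlen=window)  # old samples fall off automatically
        self.threshold = threshold

    def observe(self, value):
        """Record a sample; return True if the alert should fire."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough data yet
        return sum(self.values) / len(self.values) > self.threshold


# Hypothetical CPU readings: one isolated spike, then sustained load.
alert = MovingWindowAlert(window=4, threshold=90.0)
readings = [50, 99, 60, 95, 96, 97, 98]
fired = [alert.observe(v) for v in readings]
print(fired)
```

The isolated 99% reading does not trigger anything; the alert only fires on the last sample, once the load has stayed high for the whole window. That is the difference with a naive "value > threshold" check that pages you for every blip.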
The contenders:
- Collectd + Graphite + Grafana + Icinga + Riemann (the on-site crowd)
- Server Density
- Datadog (cloud)
- BMC truesightpulse (ex. Boundary)
- [Google] StackDriver
Trial by trialing
collectd + graphite + grafana + icinga + Riemann (on-site)
The standard on-site solution that everyone knows. Not worth presenting since we’re trying to run away from it.
Server Density
A London company (close to us :D) that raised some money in 2010, 2011 and 2015. We had received positive feedback about Server Density before. Let’s go for the trial.
The agent was painful to install.
Each host has to be registered individually with the service. It gets unique keys and a unique configuration. It was a pain in the ass to automate the deployment: it takes multiple REST API calls to their service to fetch pieces of configuration, depending on the current state of the host in their service.
- Metrics interval is 1 minute at best. An ENTIRE minute
- No filtering by min, average, max
- No legends on graphs. No clue what the lines are showing
- No integration with any middleware or application
- The website fails to load way too often
The site fails to load every few pages. After a few hours surfing for the trial, we genuinely thought that our office internet connection was broken. Thankfully it is not our internet but the Server Density site, which is extremely buggy.
Removed that s**** after 48 hours, cleaned agents, killed all the hosts where there ever was an agent.
Between the site failing randomly, the terrible UI and all the basic features missing, this is one of the worst products we have ever come across. We cannot comprehend how it ever managed to get positive reviews or raise money 3 times.
Datadog (cloud)
An American company founded somewhere around 2008. Raised 15 M$ in 2014, then 31 M$ in 2015 and finally 97 M$ in 2016.
Long story short: it’s very good and it does everything we want. (We’ll publish an article dedicated to Datadog later).
Once in a lifetime, you get the opportunity to look at two companies of the same age in the same market. One of them (Datadog) just happens to have raised 50 times more money than the other one (Server Density). It turns out to be a definitive indicator of how good the products are relative to each other.
[Google] StackDriver
An American company founded around 2012. Raised 5 M$ in 2012, acquired by Google in 2014.
The main site http://www.stackdriver.com/ is still online. The screenshots are nice and we want to try that thing.
There is an issue though. We try to try it and we can’t because there is no way to try it. Parts of the site are inaccessible, parts redirect to Google, some sections are missing.
Google bought it in May 2014, it is now May 2016. The product should be available and the site should be up (possibly under a different name and logo) but it’s not.
It looks like the service was killed as a result of the Google acquisition. This could have been a good monitoring tool but we’ll never know. If anyone had the opportunity to try it and has experience with it, please comment.
June update: There are references to Google StackDriver suddenly appearing all over the GCE documentation. A closed-beta is available on-request for premium customers.
July update: It’s now clear that StackDriver is being integrated into Google. It will become part of their cloud offering and it will be available as a standalone product. Expecting a release within 1-2 years.
BMC truesightpulse (ex. Boundary)
American company founded around 2010. Bought for 15 M$ in 2012 by BMC and became truesightpulse.
We had heard of Boundary multiple times but couldn’t find it. We already settled for Datadog (and were satisfied) by the time we understood that Boundary was acquired and renamed by BMC.
Judging by what we can see on the website, the screenshots are good, it can get metrics from all the common databases/webservers, and it integrates with AWS/GCE. The pricing is a bit cheaper than Datadog ($12/month per host).
It’s the historic direct competitor to Datadog. They’re mostly copycats of each other.
[July 2016 update: added SignalFX]
Yet another monitoring company that raised millions. A latecomer to the market.
Basically, it’s a direct copy-cat of Datadog and BMC. The UI is nice and the graphs are cute (same as the competitors). It’s lagging behind in terms of advanced features and integrations though, not sure if it can catch up with the leader.
Pricing is per metric stream per month, which may make it cheaper than Datadog when you send few metrics per host, and roughly equivalent for simple basic monitoring.
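Per-stream versus per-host pricing comes down to how many metric streams each host sends. Here is a rough Python sketch of the arithmetic. The $12/month per host is the BMC figure quoted earlier and the 100 hosts is our own fleet size; the per-stream price of 10 cents is a made-up placeholder, NOT a real SignalFX price. Amounts are in cents to keep the math exact.

```python
def monthly_cost_per_host(hosts, price_per_host_cents):
    """Total monthly bill under per-host pricing, in cents."""
    return hosts * price_per_host_cents


def monthly_cost_per_stream(hosts, streams_per_host, price_per_stream_cents):
    """Total monthly bill under per-metric-stream pricing, in cents."""
    return hosts * streams_per_host * price_per_stream_cents


def break_even_streams(price_per_host_cents, price_per_stream_cents):
    """Streams per host at which both pricing models cost the same."""
    return price_per_host_cents / price_per_stream_cents


HOSTS = 100              # our fleet size, roughly
PER_HOST_CENTS = 1200    # $12/month per host (BMC figure from the article)
PER_STREAM_CENTS = 10    # HYPOTHETICAL placeholder, not a real vendor price

print(break_even_streams(PER_HOST_CENTS, PER_STREAM_CENTS))
print(monthly_cost_per_host(HOSTS, PER_HOST_CENTS) / 100)
print(monthly_cost_per_stream(HOSTS, 50, PER_STREAM_CENTS) / 100)
```

With these placeholder numbers, per-stream pricing wins as long as a host sends fewer than 120 streams. Plug in the real prices and your real stream count before drawing any conclusion.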
If you have to trial only two services, the first pick is Datadog and the second pick is SignalFX. (BMC is a fair second pick as well; note that we’re biased against bigger companies with more products and less focus).
[July 2016 update: added Wavefront]
We received a link to Wavefront during our holidays right after we closed the evaluation. It’s another latecomer and a perfect copycat (we’re crossing a line here: some icons and UI are identical pixel-wise).
We open the link on our laptop in battery saving mode and… Firefox freezes for a minute. Who thought that a full screen HD video of a dude surfing was a good thing to put as a main page?
Well, we will have to wait for the end of the holidays to see the website, until we have access to our work computers again (i7 8 cores, 32 GB memory, SSD).
Once we get back to work and check the website, it turns out that Wavefront doesn’t display any price publicly and gives no trial either. Can’t do anything without talking to their sales guys first.
At this point, we’ve already done weeks of trial and we’ve got 3 strong competitors who have better products and are more accessible. For the sake of it, we’ll just pretend that Wavefront doesn’t exist.
NewRelic
No need to introduce NewRelic. Maybe the most advertised company of 2015, one of the highest valuations ever for monitoring-related tools, world best-in-class Application Performance Monitoring (APM).
We already used NewRelic APM to monitor our applications and we love it. It gives very deep performance information about the application (detailed profiler, call stack, debugging). If they have a server monitoring thing, we could expand our deployment.
NewRelic doesn’t do monitoring
It turns out that NewRelic doesn’t have any product to do server monitoring.
We still thought about NewRelic to monitor the database/webservers because it would be nice to have performance indications, query timings and things like that. It turns out that they don’t support PostgreSQL at all. In fact they don’t support ANY database. NewRelic APM is only available to monitor applications written in Java, Python, C# and a few others. That’s it. Nothing more.
We checked out the NewRelic plugins. There are 3 plugins for PostgreSQL, all of them written pre-2014, all of them abandoned GitHub projects by some random dude. They can barely get 5-10 metrics and provide no profiling whatsoever. Not to mention that the reviews averaging 2/5 stars are scary.
As a conclusion, NewRelic cannot do server monitoring. (They’re really awesome in the application performance market though).
#MonitoringSucks is over. We’ve got a pack of great monitoring tools all invented at once.
The world best in class is Datadog (we’ll write a dedicated article later). It’s older and more mature. It has the most features and integrations. When you have to pick a monitoring tool for the future of your tech company, that’s the horse you want to put your money on.