Monitoring in the Cloud: Datadog vs Server Density vs SignalFX vs StackDriver vs BMC Boundary vs Wavefront vs NewRelic


We’re a tech company and we have more than 100 AWS instances to run our services. It is critical that we have good monitoring, metrics collection, graphs and alerting.

Current Setup

We have an in-house monitoring solution built over more than 9 tools, including but not limited to:

  • statsd
  • collectd
  • graphite
  • grafana
  • nagios
  • cacti
  • riemann
  • icinga

All are open-source solutions (as in build-it and maintain-it yourself). Most are tools straight from the 90’s with an old UI; they are hard to use and hard to maintain. None of these can scale or run on more than a single node.

That’s a total of 8 independent points of failure, put under constant pressure by many hosts and metrics, unable to handle AWS hosts going up and down regularly. So far, the palm for worst-in-class belongs to riemann. Its configuration is a 1000-line file written in Clojure with up to 12 levels of indentation.

We’ve been babysitting this setup again and again every time it breaks and it’s been a major pain in the ass. We’ve reached a desperate point where we just want to throw everything away and stop the pain.

What if we don’t want to send our data to a 3rd party?

Neither do we.

We thought about it and we came to the conclusion that CPU percentage and memory usage are not critical information to be kept private at all costs. They don’t give away any user data and they don’t give away critical business information.

If there is a service out there that does a worthy job of graphing them, so be it.

Actually, it’s a false dilemma. We’ve tried the “build and maintain it ourselves” route already and it was a major failure. Let’s not burn more time and people going down that wrong route.

What to expect from a monitoring solution

The MUST have:

  • Short interval between metrics (our current collectd is about 15s-20s)
  • Graph by min, average AND max
  • Easy deployment
  • Cute graphs (colors, zoom, legend, easily readable)
  • Responsive site
  • Monitor the basics (memory, disk, I/O, …)
  • Custom dashboards
  • Custom alerting
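
To show why we insist on min, average AND max, here is a minimal Python sketch. It is our own illustration, not the method of any vendor listed here; the `rollup` function and the sample values are hypothetical. It rolls 15-second samples up into one-minute buckets and keeps all three aggregates.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_seconds=60):
    """Group (timestamp, value) samples into fixed buckets, keeping min, avg and max."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the bucket it falls into.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return {
        start: {"min": min(values), "avg": mean(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    }


# Hypothetical CPU percentages sampled every 15 seconds, with one short spike to 100%.
cpu = [(0, 10), (15, 10), (30, 100), (45, 10)]
print(rollup(cpu))
```

With these made-up samples the one-minute average is 32.5 while the max is 100. A graph that only shows the average hides the spike entirely, which is why a long metrics interval combined with average-only graphs is a deal breaker for us.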

The SHOULD have:

  • Compare graphs (arrange in grid, superimpose, align axes…)
  • Advanced alerting (moving time windows, multiple metrics, outlier detection)
  • Integrate with middleware (PostgreSQL metrics, nginx metrics, …)
  • Easily add/remove hosts (AWS environment is constantly evolving)
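
As an illustration of what we mean by alerting on a moving time window, here is a small Python sketch. It is a simplified assumption of how such a rule could work, not the implementation of any product in this comparison; the `MovingWindowAlert` class, the window size and the threshold are all made up for the example. The alert fires only when the average of the last N readings is above the threshold, so a single spike does not page anyone.

```python
from collections import deque


class MovingWindowAlert:
    """Fire when the average of the last `window_size` readings exceeds `threshold`."""

    def __init__(self, window_size, threshold):
        # A bounded deque drops the oldest reading automatically.
        self.values = deque(maxlen=window_size)
        self.threshold = threshold

    def add(self, value):
        """Record a reading and return True if the alert should fire."""
        self.values.append(value)
        window_full = len(self.values) == self.values.maxlen
        return window_full and sum(self.values) / len(self.values) > self.threshold


# Hypothetical CPU readings: isolated spikes first, then sustained high load.
alert = MovingWindowAlert(window_size=4, threshold=80)
readings = [20, 100, 30, 90, 95, 90, 95]
fired = [alert.add(reading) for reading in readings]
print(fired)
```

In this made-up series the isolated readings of 100 and 90 do not trigger anything; the alert only fires on the last reading, once four consecutive readings average above 80.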

Options:

  • Collectd + Graphite + Grafana + Icinga + Riemann (the on-site crowd)
  • Server Density
  • Datadog (cloud)
  • BMC truesightpulse (ex. Boundary)
  • [Google] StackDriver
  • SignalFX
  • WaveFront
  • NewRelic

Trial by trialing

collectd + graphite + grafana + icinga + Riemann (on-site)

The standard on-site solution that everyone knows. Not worth presenting since we’re trying to run away from it.

Server Density

A London company (close to us :D) that raised some money in 2010, 2011 and 2015. We had received positive feedback about Server Density before. Let’s go for the trial.

Agent Installation

The agent was painful to install.

Each host has to be registered individually with the service. It gets unique keys and a unique configuration. It was a pain in the ass to automate the deployment: it takes multiple REST API calls to their service to get pieces of configuration that depend on the current state of the host in their service.

Web Interface

  • Metrics interval is 1 minute at best. An ENTIRE minute
  • No filtering by min, average, max
  • No legends on graphs. No clue what the lines are showing
  • No integration with any middleware or application
  • The website fails to load way too often

The site fails to load every few pages. After a few hours of browsing during the trial, we genuinely thought that our office internet connection was broken. Thankfully it was not our internet but the Server Density site, which is extremely buggy.

Conclusion

Removed that s**** after 48 hours, cleaned up the agents, killed all the hosts where there ever was an agent.

Between the site failing randomly, the terrible UI and all the basic features missing, this is one of the worst products we have ever come across. We cannot comprehend how it ever managed to get positive reviews or raise money 3 times.

Datadog

An American company founded somewhere around 2008. Raised 15 M$ in 2014, then 31 M$ in 2015 and finally 97 M$ in 2016.

Long story short: it’s very good and it does everything we want. (We’ll publish an article dedicated to Datadog later.)

Once in a lifetime, you get the opportunity to look at two companies of the same age in the same market. One of them (Datadog) just happens to have raised 50 times more money than the other one (Server Density). It turns out to be a definitive indicator of how good the products are relative to each other.

[Google] StackDriver

An American company founded around 2012. Raised 5 M$ in 2012, acquired by Google in 2014.

The main site http://www.stackdriver.com/ is still online. The screenshots are nice and we want to try that thing.

There is an issue though. We try to try it and we can’t because there is no way to try it. Parts of the site are inaccessible, parts redirect to Google, some sections are missing.

Google bought it in May 2014, it is now May 2016. The product should be available and the site should be up (possibly all under a different name and logo) but it’s not.

It looks like the service was killed as a result of the Google acquisition. This could have been a good monitoring tool but we’ll never know. If anyone had the opportunity to try and has experience with it, please comment.

June update: There are references to Google StackDriver suddenly appearing all over the GCE documentation. A closed-beta is available on-request for premium customers.

July update: It’s now clear that StackDriver is being integrated into Google. It will become part of their cloud offering and it will be available as a standalone product. Expecting a release within 1-2 years.

BMC truesightpulse (ex. Boundary)

American company founded around 2010. Bought for 15 M$ in 2012 by BMC and became truesightpulse.

We had heard of Boundary multiple times but couldn’t find it. We had already settled for Datadog (and were satisfied) by the time we understood that Boundary was acquired and renamed by BMC.

Judging by what we can see on the website, the screenshots are good, it can get metrics from all the common databases/webservers, and it integrates with AWS/GCE. The pricing is a bit cheaper than Datadog ($12/month per host).

It’s the historic direct competitor to Datadog. They’re mostly copycats of each other.

SignalFX

[July 2016 update: added SignalFX]

Yet another monitoring company that raised millions. A latecomer to the market.

Basically, it’s a direct copy-cat of Datadog and BMC. The UI is nice and the graphs are cute (same as the competitors). It’s lagging behind in terms of advanced features and integrations though, not sure if it can catch up with the leader.

The price point is per metric stream per month which may make it cheaper than Datadog while somewhat equivalent for simple basic monitoring.

If you have to trial only two services, the first pick is Datadog and the second pick is SignalFX. (BMC is a fair second pick as well; note that we’re biased against bigger companies with more products and less focus.)

Wavefront

[July 2016 update: added Wavefront]

We received a link to Wavefront during our holidays right after we closed the evaluation. It’s another latecomer and a perfect copy-cat (we’re crossing a line here: some icons and UI elements are identical pixel-wise).

We open the link on our laptop in battery saving mode and… Firefox freezes for a minute. Who thought that a full screen HD video of a dude surfing was a good thing to put as a main page?

Well, we will have to wait for the end of the holidays to see the website, until we have access to our work computers again (i7 8 cores, 32 GB memory, SSD).

Once we get back to work and check the website, it turns out that Wavefront doesn’t display any price publicly and gives no trial either. Can’t do anything without talking to their sales guys first.

At this point, we’ve already done weeks of trial and we’ve got 3 strong competitors who have better products and are more accessible. For the sake of it, we’ll just pretend that Wavefront doesn’t exist.

NewRelic

No need to introduce NewRelic. Maybe the most advertised company of 2015, one of the highest valuations ever for monitoring-related tools, world best in class Application Performance Monitoring (APM).

We already used NewRelic APM to monitor our applications and we love it. It gives very deep performance information about the application (detailed profiler, call stack, debugging). If they have a server monitoring thing, we could expand our deployment.

NewRelic doesn’t do monitoring

It turns out that NewRelic doesn’t have any product to do server monitoring.

Still thought about NewRelic to monitor the database/webservers because it would be nice to have performance indications, query timings and things like that. It turns out that they don’t support PostgreSQL at all. In fact they don’t support ANY database. NewRelic APM is only available to monitor applications written in Java, Python, C# and a few others. That’s it. Nothing more.

We checked out the NewRelic plugins. There are 3 plugins for PostgreSQL, all of them written pre-2014, each an abandoned GitHub project by a random dude. They can barely get 5-10 metrics and provide no profiling whatsoever. Not to mention that the comments averaging 2/5 stars are scary.

As a conclusion, NewRelic cannot do server monitoring. (They’re really awesome in the application performance market though).

Conclusion

#MonitoringSucks is over. We’ve got a pack of great monitoring tools all invented at once.

The world best in class is Datadog (we’ll write a dedicated article later). It’s older and more mature. It has the most features and integrations. When you have to pick a monitoring tool for the future of your tech company, that’s the horse you want to put your money on.

The challengers are SignalFX and BMC truesightpulse.

 

10 thoughts on “Monitoring in the Cloud: Datadog vs Server Density vs SignalFX vs StackDriver vs BMC Boundary vs Wavefront vs NewRelic”

    • For: AppDynamics, New Relic, Dynatrace

      They are not server monitoring software. They are APM (Application Performance Monitoring) designed to analyse the performance of a SINGLE application. They’ll give you details about time spent in each function call, database requests metrics, syscall traces, garbage collection stats and more.

      In theory, they could be adapted for general server monitoring. In practice, they are NOT intended for that usage and are highly inappropriate.

      They simply don’t have the required features and the price model is 10-100 times more per node.

      (e.g. A $1600 license to monitor a single application on a single node is not unheard of with AppDynamics)

  1. It’s a shame you had such a bad experience with Server Density.

    We do have multiple options for automating the agent deployment ranging from our own auto deploy script (which is the default way to install the agent) through to Chef, Puppet, Ansible and Salt Stack installers. These are all linked from the install guide in-app and we email the details out as part of the welcome email. Perhaps you skipped our installer and went directly to our RPM/Deb packages, which are what the automated options install for you? I’ll look at how we can make this more obvious.

    Server Density is a small team so we don’t compare to the likes of Datadog when it comes to features. Indeed, our use case is typically for less sophisticated environments and it’s clear from your list of “must/should have” that you’re in that segment. We have thousands of paying users and work on prioritising requests from their feedback which is why we’ve not necessarily tried to compete with the arms race that would result from copying other, more well funded competitors.

    If you’re at a tech startup then you’re outside our target market. Datadog is a good choice there. In fact, our agent was forked by Datadog some time ago and we forked it back last year so the functionality at the agent level is very similar.

    It sounds like you had a lot of connectivity issues which is annoying. We pride ourselves on the level of support we give customers so it’s a shame you didn’t get in touch so we could figure out what was going on. Or if you did, I’d be interested to know your ticket IDs/email so I can see why it wasn’t fixed. We work on the principle that monitoring software should be more reliable than the infrastructure it’s monitoring so take these kinds of issues very seriously.

    As a small company it’s disappointing to read such a negative review because we’ve put a lot of time and effort into the product. But it sounds like Datadog is the right product for you anyway. You can contact me directly if you wanted to follow up: david@serverdensity.com

    PS. New Relic has had a (free) server monitoring product for a number of years now: https://newrelic.com/server-monitoring

    David Mytton
    Co-Founder & CEO, Server Density

  2. Any reason you missed out Dataloop.IO? We’re also in London, just raised money and run the DevOps Exchange meetup in London so wondering how we missed the list!

    Refreshing to see DevOps teams appreciating the pain of building and managing another in house solution, that’s why we started Dataloop.IO in late 2013 as we found the SaaS solutions like Datadog were good for out the box simplicity, but lacked the flexibility and ecosystem of open source which is why many of our customers chose us over Datadog.

    Anyway would love to show you what we’ve got as I’m sure we would be at the top alongside Datadog! Just email me back.

  3. New Relic does have server monitoring, we use it. It’s quite good, but you can’t alert when a disk has 1GB left for example, you have to alert on percentages and that can be a PITA as 1% of a 1TB system is a lot different than 1% of a 10GB system, one is in an emergency situation and the other is not.

    • Yes. Sysdig gives conferences and flyers to everyone in London. There are 3 fliers from them on my desk right now 😀

      It’s ONLY for docker and it’s $25/host/month. I am not sure whether it’s a performance tool (syscall view, flame graphs) or a monitoring solution (metrics, graphing, alerting).

      They install kernel modules on top of the system to intercept syscalls and actively monitor docker containers. It’s very intrusive, sort of rootkit like. IMO, docker is extremely unstable and I won’t take any risk by adding 3rd party kernel modules on top of it.

      I suppose it can be worth a try for docker shops though.

  4. New Relic has had a (free) server monitoring product for a number of years now. I would appreciate if you can test and include that.

  5. We have used Datadog for 3 years now for everything (except APM). When we started doing the infrastructure, we created those nice dashboards, then the R&D teams picked it up and started doing custom metrics, based on application internals. Now we are full of custom metric dashboards. It’s so cool to see a new micro service start from scratch with a dashboard showing how the service performs. Not because the tool provides it out of the box, but because the tool is so easy to use that it’s a no-brainer to add a call here and there.

    For APM there is only one king and we’ll pay whatever they want to get it – New Relic. I evaluated NR’s server monitoring, in hopes to drop one of the monitoring tools. Unfortunately it’s not worth the time.

    Datadog is going to start doing APM soon, so will see…
