Monitoring and Service Discovery

At Mandrill, API uptime is essential. We have to ensure all the pieces of our architecture are happy and healthy. When they're not, we need to take them out of rotation quickly and automagically so they don't negatively impact our users. We accomplish the automagic bits using some off-the-shelf tools and our own homegrown tooling we call "Pulse."

Configuration management has seen massive improvements recently. You can take your pick of the popular tools—Puppet, Chef, Ansible, Salt, etc.—and any of them will get the job done, but one problem these tools haven't fully solved is service discovery. We needed tooling to tie together deployment, configuration management, and service discovery. And thus, Pulse was born.

How does one node know about the others? There are countless ways to solve this problem, and it's fairly easy when your architecture is relatively static. The problem gets trickier as things become more dynamic, and ephemeral cloud instances are all the rage. In this new world of stateless computing, things change often and generally in an automated fashion. Autoscaling (up and down) is a useful tool for managing dynamic workloads (saving money) and maintaining platform health (saving customers). Overworked cluster? Build more, now! Node X acting funny? Kill it, build a new one, rinse, repeat.

In order to take full advantage of ephemeral computing, you need the ability to rapidly deploy and configure instances. At Mandrill, we chose SaltStack as our configuration management tool, primarily because of its integrated tool, salt-cloud. Salt-cloud is essentially a wrapper around all the popular cloud providers' instance-management APIs. The combination of salt-cloud and Pulse gives us our own autoscaling tool that works on just about any cloud provider available.
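
For a rough sense of what that looks like in practice, salt-cloud exposes its create/destroy operations to Python through salt.cloud.CloudClient. Here's a minimal sketch; the profile and node names are made up, and this isn't Pulse's actual code:

```python
import salt.cloud

# CloudClient wraps the same operations as the salt-cloud CLI.
client = salt.cloud.CloudClient("/etc/salt/cloud")

# Provision a new instance from a named cloud profile
# (the profile and node names here are hypothetical).
client.profile("ec2_us_east_web", names=["web-042"])

# Tear down a misbehaving instance.
client.destroy(names=["web-017"])
```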

So Pulse can drive salt-cloud to create and destroy instances, but how does Pulse know when to create or destroy things?

Health checking and monitoring

Pulse utilizes a couple of different sources to determine a node's health. For things that require direct world (customer) access, we test a node's state with various health checks from Pingdom. Pingdom solves the problem of checking real-world availability from outside of our network. Although Pingdom is great for testing general availability, we still need other methods to ensure complete monitoring coverage. For things Pingdom cannot cover, we use a monitoring tool called Sensu.
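
To illustrate the Pingdom side, here's a minimal sketch that polls check statuses over Pingdom's public REST API (v3.1 with Bearer-token auth, per Pingdom's current docs). The endpoint, fields, and token handling are assumptions, not Pulse's actual integration:

```python
import requests

PINGDOM_API = "https://api.pingdom.com/api/3.1/checks"
TOKEN = "..."  # a Pingdom API token (placeholder)

# Each check reports "up", "down", "paused", etc. in its "status" field.
resp = requests.get(PINGDOM_API, headers={"Authorization": "Bearer %s" % TOKEN})
resp.raise_for_status()
down = [c["name"] for c in resp.json()["checks"] if c["status"] == "down"]
print("checks reporting down:", down)
```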

We use Sensu to monitor everything from system-level health (load, CPU usage, disk usage, etc.) to functional application checks. Sensu is flexible, so we can run any type of test we need, in any language. We write our checks in Python, Ruby, and good ol' Bash. We have checks that run on each node individually, and ones that run from a central location to test flow through our architecture. We can use any measurable or testable aspect of a node to determine its health, so a node, or an entire group of nodes, can be deemed worthy of serving Mandrill users in real time.
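
Sensu checks follow the Nagios exit-code convention (0 = OK, 1 = warning, 2 = critical), so a check can be any executable. Here's a hypothetical disk-usage check in Python with made-up thresholds:

```python
#!/usr/bin/env python
# Hypothetical Sensu check: exit 0 = OK, 1 = WARNING, 2 = CRITICAL.
import os
import sys

WARN, CRIT = 80, 90  # percent used; illustrative thresholds


def pct_used(path="/"):
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * 100.0 / st.f_blocks


used = pct_used()
if used >= CRIT:
    print("CheckDisk CRITICAL: %.1f%% used" % used)
    sys.exit(2)
elif used >= WARN:
    print("CheckDisk WARNING: %.1f%% used" % used)
    sys.exit(1)
print("CheckDisk OK: %.1f%% used" % used)
sys.exit(0)
```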

Sensu also lets us store arbitrary information in a couple of ways, all as blobs of JSON. We keep generally static information about a client in its client JSON—things such as a node's role, IP address, geographical location, etc. We can also use Sensu's "stashes." A stash is another place to store blobs of JSON, but it's tailored toward dynamic information. Stashes can hold any arbitrary JSON a user wishes (though there are some built-in uses, like "silencing" a client or check), and we use them for real-time health information about a node. For instance, we can put a node into maintenance mode via a stash, and Pulse will know to remove it from production routing. Sensu is very API-driven, so we wrote wrappers within Pulse for the functions we need.
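
For example, putting a node into maintenance mode boils down to writing a stash through Sensu's REST API. Here's a minimal sketch; the API address and stash-path convention are our illustrative assumptions, while the /stashes endpoint and its path/content/expire fields come from Sensu Core's documented API:

```python
import requests

SENSU_API = "http://sensu.example.internal:4567"  # hypothetical address

# Flag a node for maintenance; the "maintenance/<node>" path is just
# a convention invented here, since Sensu treats stash paths as arbitrary.
resp = requests.post(
    "%s/stashes" % SENSU_API,
    json={
        "path": "maintenance/node01.example.internal",
        "content": {"reason": "kernel upgrade", "by": "ops"},
        "expire": 3600,  # seconds until Sensu auto-deletes the stash
    },
)
resp.raise_for_status()
```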

Service discovery management

The other component of Pulse is the meaty part—service discovery management. When the health checking and monitoring piece of Pulse decides that a node has entered a naughty state, the service discovery part does the job of "downing" or "unrouting" it. Ask any systems folks what the most annoying thing about monitoring is, and most will mention the agony of false positives. What's better than being woken up at 3am because some node decided to alert on something that wasn't really anything? Everything. False positives are at best annoying and at worst dangerous, so we have a number of failsafes built into Pulse and our Sensu configuration to ensure that when a node is given the mark of death, it actually deserves it.
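
Pulse's actual failsafes aren't spelled out here, but two common guards against false positives look something like this hypothetical sketch: require several consecutive failures before acting, and cap how much of a pool can be unrouted at once:

```python
# Illustrative thresholds, not Pulse's real configuration.
CONSECUTIVE_FAILURES = 3   # repeated failures required before acting
MAX_DOWN_FRACTION = 0.25   # never unroute more than a quarter of a pool


def should_unroute(node, pool):
    """node["failures"] counts consecutive failed checks;
    pool is the list of nodes sharing the same role/region."""
    if node["failures"] < CONSECUTIVE_FAILURES:
        return False
    already_down = sum(1 for n in pool if not n["routed"])
    return (already_down + 1) <= len(pool) * MAX_DOWN_FRACTION
```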

Pulse can perform multiple functions related to service discovery. At the first level, Pulse can simply unroute nodes, via DNS or a load balancer depending on the node's role; we keep enough capacity in each region to handle losing several nodes. Based on thresholds we control, Pulse can also decide to destroy nodes or create new ones. For example, if a particular availability zone in any given cloud provider goes dark, Pulse will automatically spin up new instances of that role in a nearby zone to pick up the load. Likewise, if a node is marked unhealthy for a configured amount of time, it'll be terminated and a new one will be seamlessly provisioned to replace it.
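
A reconciliation pass over a pool might look something like this hypothetical sketch (the threshold, node shape, and profile name are all made up; cloud is a salt.cloud.CloudClient as in the earlier example):

```python
import time
import uuid

UNHEALTHY_TTL = 600  # seconds a node may stay unhealthy; illustrative


def reconcile(nodes, cloud, profile="ec2_us_east_web"):
    """nodes: [{"name": str, "healthy": bool, "unhealthy_since": float or None}]"""
    now = time.time()
    for node in nodes:
        if node["healthy"]:
            node["unhealthy_since"] = None
        elif node["unhealthy_since"] is None:
            node["unhealthy_since"] = now
        elif now - node["unhealthy_since"] > UNHEALTHY_TTL:
            # Terminate the sick node and provision a fresh replacement.
            cloud.destroy(names=[node["name"]])
            cloud.profile(profile, names=["web-%s" % uuid.uuid4().hex[:8]])
```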

CLIs are cool

Mandrill engineers love CLI tools as much as they love APIs. The meaty bits of Pulse run as a background process, but it also provides some nice CLI functions. We can quickly gather information about a single node, or about every node in our infrastructure. The CLI can put nodes, or even entire regions, into maintenance mode, and the background process will do the necessary work to remove them from production use. And although we like it when instances are provisioned automatically, there are times when we need to manually scale a region up or down; the Pulse CLI provides a seamless interface for that. It's our one-stop shop that ties together all the aspects of our operations.
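
To give a flavor of what such a tool might look like, here's a hypothetical sketch with invented subcommands and flags; this is not Pulse's real interface:

```python
#!/usr/bin/env python
"""Hypothetical sketch of a Pulse-style CLI; every name here is invented."""
import argparse

parser = argparse.ArgumentParser(prog="pulse")
sub = parser.add_subparsers(dest="cmd", required=True)

info = sub.add_parser("info", help="show status for one node or the whole fleet")
info.add_argument("target", nargs="?", default="all")

maint = sub.add_parser("maint", help="toggle maintenance mode")
maint.add_argument("target")  # a node name or an entire region
maint.add_argument("--off", action="store_true", help="clear maintenance mode")

scale = sub.add_parser("scale", help="manually scale a region up or down")
scale.add_argument("region")
scale.add_argument("count", type=int, help="desired node count")

args = parser.parse_args()
print("would dispatch:", args)  # the real work happens in the background process
```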

Avoiding outside dependencies

Given the nature of running a fast and reliable API, Mandrill will always need a global distribution of customer-facing nodes, so we'll likely always use service providers to achieve that worldwide coverage. However, we don't like the idea of depending on others for the reliability of our platform. All of our efforts in service discovery, monitoring, deployment, and config management are designed to be service-provider agnostic. That just means if one large cloud provider disappears, Mandrill keeps chugging along and delivering all the emails.

If you're interested in solving these problems too, we're hiring.