At Mandrill, API uptime is essential. We have to ensure all the pieces of our architecture are happy and healthy. When they're not, we need to pull them out of rotation quickly and automagically so they don't negatively impact our users. We accomplish the automagic bits using some off-the-shelf tools and our own homegrown tooling we call "Pulse."
Configuration management has seen massive improvements recently. You can take your pick of the popular tools—Puppet, Chef, Ansible, Salt, my-custom-bash-script.sh, etc.—and all will get the job done, but one problem these tools haven't fully solved is service discovery. We needed some tooling to tie together deployment, configuration management, and service discovery. And thus, Pulse was born.
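As a purely hypothetical illustration (this is not Pulse's actual implementation, and all of the names below are made up), the core idea of "take unhealthy nodes out of rotation automatically" might be sketched like this:

```python
# Hypothetical sketch of automatic node removal. The function and node
# names are illustrative assumptions, not anything from Pulse itself.

def prune_pool(pool, is_healthy):
    """Return only the nodes whose health check passes.

    pool       -- list of node identifiers currently serving traffic
    is_healthy -- callable that returns True if the node should stay
    """
    return [node for node in pool if is_healthy(node)]


# Example: a fake health check where one node is known to be down.
pool = ["app1", "app2", "app3"]
down = {"app2"}

serving = prune_pool(pool, lambda node: node not in down)
print(serving)  # ['app1', 'app3']
```

In practice a loop like this would run continuously, with real health checks (HTTP probes, queue depth, etc.) deciding which nodes stay in the pool.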
At MailChimp, our motto is "Listen Hard and Change Fast." The need for fast iteration permeates every aspect of our culture, including how we deploy code changes to our production servers. Iterating quickly on an infrastructure service like Mandrill, where stability and 100% uptime are absolute requirements, presented some unique challenges. After two years and over 1,200 successful automated deployments with no scheduled downtime, here's what we've learned.
Author's note: This is the first of our 'from the trenches' posts, which are designed to give you a little bit of information on how and why we do some of the things we do here at Mandrill.
When we created Mandrill, we knew that a status page was going to be a pretty important thing for us to build out. It needed to be a place where we could show our current users that their trust is well placed, and where potential users could see that we do a really good job of providing a fast and reliable way to send their email.
Before I started designing our status page, I decided to talk to several people here at the company with a lot of experience using and looking through them: devs from our operations group. One of the general questions I asked was what, for them, makes a good status page. The response was laughter, followed by the (cleaned up for blog posting purposes) reply, "Status pages are, pretty much, universally horrible. Good luck." And that was basically the same reply I got from others in the company.
Status pages are usually a sea of green check marks, and if you're lucky you can find an icon to hover over and see that the service was almost unusable (but still green). They may have a little buried historical data, or an uptime stat that, without any context, doesn't really tell you much. So the problem I was facing was how to balance out the green while also providing data with context.