Author's note: This is the first of our 'from the trenches' posts, which are designed to give you a little information on how and why we do some of the things we do here at Mandrill.
When we created Mandrill we knew that a status page was going to be a pretty important thing for us to build out. It needs to be a place where we can show our current users that their trust is well placed, and where potential users can see that we do a really good job at providing a fast and reliable way to send their email.
Before I started designing our status page, I decided to talk to several people here in the company with a lot of experience using and looking through them: devs from our operations group. One of the general questions I asked them was what, for them, makes a good status page. The response was laughter with the (cleaned up for blog posting purposes) reply "Status pages are, pretty much, universally horrible. Good luck." And that was basically the same reply I got from others in the company.
Status pages are usually a sea of green check marks, and if you're lucky you can find an icon you can hover over to see that the service was almost unusable (but still green). They may have a little buried historical data or an uptime stat that, without any context, doesn't really tell you much. So the problem I was facing was this: how do we balance out the green while also providing data with context?
Start with the basics
Now, while I just railed on the sea of green check marks, they do serve a purpose. They give an 'at a glance' health check. A quick look at the page can tell you if there are any big issues happening. It doesn't give much detail or insight, just that right now this is kinda sorta how the service is running. Since this is basically like saying 'good' or 'bad', it's important that these are just the start of the information you're seeing, not all of it.
We started with the basics and created our 'at a glance' health check. We have servers all over the globe, so we needed to take this into account. If you're checking in from London, the status of our servers in Dublin is probably a little more important to you than the status of Mandrill overall. Rather than make you look through a table of locations, we make it easier and give you context by showing the general status for each region on a map. We also gave the status dots one more job: click any one of them to see detailed stats for the selected region (we'll go over those in a minute).
As an added benefit, you can see just how much closer those servers are to you than if we just had them all in one location here in the United States (remember, SMTP is chatty).
If those dots ever turn yellow or red, we'll also post status messages to let you know what the issue is and what we're doing to fix it, followed by a post-mortem once it's been fixed. These status messages are prominent, so you don't have to hunt or hover to find out what's happening. We also post updates over on Twitter (we're @mandrillapp), so you won't have to keep going to the status page if you don't want to.
Add detailed and historical stats
There's one big problem with green dots and status messages that I haven't mentioned yet: there's a human behind them making human assumptions. They may not be manually flipping the switch from green to yellow, but someone had to decide that certain thresholds on the many stats we monitor will cause a status change to yellow or red. The problem is that there may be issues that don't rise to the level of a status change but still affect your application, such as slight increases in response time. The best way to get around that is to just give you the data, and this is where transparency comes in. We don't filter the data; it comes straight from our monitoring service at the same time we get it. You see the data, whether it's good or bad. This can be really scary for a company, but we felt that we owe it to you to show you how we're doing.
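To make the threshold problem concrete, here is a minimal sketch of how a green/yellow/red dot might be derived from monitored stats. Everything in it is an assumption made for illustration: the `Sample` fields, the `classify` function and the threshold values are hypothetical, not Mandrill's actual monitoring rules.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
YELLOW_RESPONSE_MS = 500.0
RED_RESPONSE_MS = 2000.0
YELLOW_ERROR_RATE = 0.01  # fraction of failed API calls
RED_ERROR_RATE = 0.05


@dataclass
class Sample:
    """One monitoring reading for a region (hypothetical shape)."""
    response_ms: float
    error_rate: float  # fraction between 0 and 1


def classify(sample: Sample) -> str:
    """Map a reading to a status color using the fixed thresholds above."""
    if sample.response_ms >= RED_RESPONSE_MS or sample.error_rate >= RED_ERROR_RATE:
        return "red"
    if sample.response_ms >= YELLOW_RESPONSE_MS or sample.error_rate >= YELLOW_ERROR_RATE:
        return "yellow"
    return "green"


normal = Sample(response_ms=120.0, error_rate=0.001)
slower = Sample(response_ms=450.0, error_rate=0.001)

# Response time has nearly quadrupled, yet the dot does not change.
print(classify(normal), classify(slower))
```

Under these made-up thresholds both readings come back green, even though the second is much slower. That is the gap the raw stats are meant to fill.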
You can see the following current stats and historical data from the last several days for Mandrill overall and for each of our regions:
- API uptime
- SMTP uptime
- Website uptime
- API error rate
- Queue size
- Sending speed
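For a rough idea of what sits behind numbers like uptime and error rate, here is a small sketch of how they could be computed from raw monitoring results. It assumes periodic pass/fail checks and simple call counters; the function names and sample figures are hypothetical and are not taken from our monitoring service.

```python
from typing import Sequence


def uptime_percent(checks: Sequence[bool]) -> float:
    """Percentage of periodic health checks that passed."""
    if not checks:
        raise ValueError("need at least one check to compute uptime")
    return 100.0 * sum(checks) / len(checks)


def error_rate_percent(total_calls: int, failed_calls: int) -> float:
    """Percentage of API calls that returned an error."""
    if total_calls == 0:
        return 0.0
    return 100.0 * failed_calls / total_calls


# Made-up data: one check per minute for a day, with a single failed check.
day_of_checks = [True] * 1439 + [False]

print(round(uptime_percent(day_of_checks), 2))   # 99.93
print(round(error_rate_percent(20000, 30), 2))   # 0.15
```

A single figure like 99.93% hides when the failed check happened and how long it lasted, which is why the historical graphs sit alongside the current numbers.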
Make it responsive
So, there we go: general health, status messages and lots of detailed stats. Great information, but only if you can get to it wherever you are. Our users are going to access this from a wide range of devices, and to help with that we built the status page responsively. This makes it easy to get to and see all aspects of the page, no matter what device you're using. We also made sure to use touch-friendly controls and graphs for those of you with touch screen devices (I think there might be a couple).
That's the beginning of the Mandrill status page. We have some refinements and additions that we'll be working on, but if there are things you'd like to see or if you have questions about what we've done, let me know in the comments.