At MailChimp, our motto is "Listen Hard and Change Fast." The need for fast iteration permeates every aspect of our culture, including how we deploy code changes to our production servers. We found some unique challenges iterating quickly on an infrastructure service like Mandrill where stability and 100% uptime are absolute requirements. After two years and over 1,200 successful automated deployments with no scheduled downtime, here's what we've learned.
Async the important things
The most important thing you can do to have stable releases is to architect your application to expect failure. Failures in a high-scale service come in many forms. Sometimes there's a coding error causing an exception to be thrown. Sometimes a server is down or behaving erratically. Sometimes you're just under heavy load. Being able to respond with "your request has been queued" can save you regardless of the specifics.
This one can be hard depending on the nature of your application. We're lucky to have only one operation that we care a great deal about: sending an email. And email sending is naturally asynchronous. If a bad code push causes a bug or even just a performance regression in the sending path, we'll automatically save the message for later delivery and report a "queued" response back to the user. Since most bugs like this only last a few seconds to a few minutes in production, this is an easy way to smooth over transient issues without interrupting the user experience.
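To make the "queued" fallback concrete, here is a minimal Python sketch of the idea. It is not Mandrill's actual code: the names `send_or_queue`, `drain_queue`, and `retry_queue` are hypothetical, and the in-memory queue stands in for whatever durable storage a real service would use.

```python
from collections import deque
from typing import Callable, Deque, Dict

Message = Dict[str, str]

# Hypothetical in-memory queue; a real service would need durable storage.
retry_queue: Deque[Message] = deque()


def send_or_queue(message: Message, deliver: Callable[[Message], None]) -> str:
    """Try to deliver now; on any failure, save the message and report 'queued'."""
    try:
        deliver(message)
        return "sent"
    except Exception:
        # A coding error, a down server, or heavy load all land here.
        retry_queue.append(message)
        return "queued"


def drain_queue(deliver: Callable[[Message], None]) -> int:
    """Retry each queued message once; return how many were delivered."""
    delivered = 0
    for _ in range(len(retry_queue)):
        message = retry_queue.popleft()
        try:
            deliver(message)
            delivered += 1
        except Exception:
            # Still failing: put it back for a later attempt.
            retry_queue.append(message)
    return delivered
```

The caller always gets an answer, either "sent" or "queued", so a short-lived bug in the delivery path delays messages rather than losing them.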
Release in small increments
It's pretty easy to test and verify the impact of a single, small change in isolation. This gets much harder when you have a bunch of changes interacting with each other. Rather than trying to solve this with lots of manual QA or huge functional test suites, we develop each feature in its own branch, which is then deployed independently.
We need some internal discipline to make sure the branches don't become too big, but small branches have other benefits. Splitting up large changes usually helps us structure things into independent modules better. If something does break with a release, it's much easier to diagnose and roll back or patch problems when the potential scope is smaller. There's a lot less bug hunting when you only need to look at a single file to see where the problem must be.
Automate the entire release process
If you cannot give the instructions to test, review, merge, migrate, and deploy a feature as a single command, it isn't automated enough. Humans are inconsistent and fallible. They will forget the steps, or accidentally run a command twice. Releasing code is a task that should be done frequently and consistently - ideal for a computer, awful for a human.
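As a rough illustration of "a single command", the Python sketch below runs a fixed list of steps for one branch and stops at the first failure. The step names come from the list above; the `release` function and the placeholder step callables are assumptions made for the example, not a description of our actual tooling.

```python
from typing import Callable, List, Tuple

# Each step is a name plus a callable that takes the branch name and
# returns True on success. The callables here are placeholders.
Step = Tuple[str, Callable[[str], bool]]


def release(branch: str, steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run every step in order for one branch, stopping at the first failure.

    Returns (succeeded, names of the steps that completed).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action(branch):
            return False, completed
        completed.append(name)
    return True, completed


def always_ok(branch: str) -> bool:
    """Placeholder step that always succeeds."""
    return True


PIPELINE: List[Step] = [
    ("test", always_ok),
    ("review", always_ok),
    ("merge", always_ok),
    ("migrate", always_ok),
    ("deploy", always_ok),
]

if __name__ == "__main__":
    print(release("example-branch", PIPELINE))
```

Because the order lives in one list and one function, nobody has to remember the steps, and no step runs twice by accident.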
All releases should look the same
It's much easier to monitor, test, and automate one release process than many. Bug fixes and features, small releases and large releases should all go through the exact same steps. Separating them can be convenient in the short term, but you need universal trust in your deployment process and that means no edge cases.
For Mandrill, the unit of deployment is a feature branch. If you make a one-line fix or add a new feature with 10 new files over 50 commits, it's all just a branch. This made building the deployment system easy - it only ever deals with feature branches that each relate to a single ticket. Since it has very few rules and no exceptions, it can be simple and robust.
Two heads are better than one
Though the release process should be automated, humans should still be involved. There are many aspects of quality that cannot be easily measured by an automated test passing or failing. There are many ways to involve humans in deployment, but our preference is mandatory peer review. Automated lint checks and tests are run on every proposed change, but they act as extra context and data for the reviewer who has the final say. Computers are good at precise measurement - humans are better at judgement and nuance. Both are necessary.
These are just a few of the lessons we've learned trying to build a stable but rapidly iterating product for over 300,000 active users. If you're interested in solving these problems too, we're hiring.