

Zeroing in on Zero Downtime

06/05/2012, 7:07am CDT
By Patrick Byrne

Zero downtime is a big deal to us. How big? Read on and find out!

At TST Media, we work very hard to avoid downtime. We serve over a million unique visitors a week, at all hours of the day and night, and we want each one of them to see the scores, stats, and news articles they came for.

Downtime can come from lots of places:

  • Deploying updates to the application
  • Bugs in your code
  • Server maintenance
  • Trouble in the datacenter

We put a lot of effort into reducing or eliminating downtime, and that effort has largely paid off. Earlier this year, we moved between datacenters in different parts of the country with only three minutes of planned downtime. More recently, we upgraded to Ruby 1.9.3 with no downtime whatsoever.

How is this possible? 

Redundancy

The biggest single arrow in our quiver is simple: redundancy. We have more than one of everything: database servers, application servers, and even multiple instances of our application on each server.

Monitoring

If anything stops working on any of our servers, our team is alerted immediately so that we can look into what's wrong. We use Pingdom and New Relic to monitor our application, and alerts from Engine Yard, our hosting provider, to monitor the servers. PagerDuty notifies our on-call developer for non-critical items and alerts the entire team for critical failures.

Our redundancy ensures that our visitors won't notice anything went wrong, since we have enough spare capacity to keep serving them while we fix the problem. This has allowed us to handle unplanned hardware failures in the datacenter, network congestion, full disks, and database failures, all without any trouble for our visitors.

Rolling Deploys

An application server cannot serve traffic while it is restarting after an update. Redundancy lets us deploy code in a rolling fashion to each server in turn. We deploy many times a day, releasing fixes for critical bugs immediately and other improvements as often as we please.

What is a Rolling Deploy?

Instead of deploying the application to each of our servers and restarting them all at once, we deploy to each server, one at a time, so that visitors can still be served by the other servers.

You may be asking yourself why that's necessary. We use Phusion Passenger to manage the application instances on each server. Passenger's default mechanism for restarting an application after a deployment is to touch its tmp/restart.txt file. For a small application this is sufficient: the restart takes only a moment, and at worst a couple of users see some slow page loads.
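
For reference, here is a minimal sketch of what that default approach looks like in a Capistrano 2 deploy script. The task follows the usual convention rather than our actual configuration:

# A sketch of the conventional Passenger restart in Capistrano 2;
# not our actual deploy script.
namespace :deploy do
  task :restart, :roles => :app, :except => { :no_release => true } do
    # Passenger watches tmp/restart.txt and reloads the application on the
    # next request after the file's timestamp changes.
    run "touch #{current_path}/tmp/restart.txt"
  end
end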

Ngin, however, is a very large application with a high volume of traffic. If we were to use this technique, thousands of requests could queue up waiting for the application to restart. We've customized our Capistrano deploy script so that no one has to sit and wait while the application restarts.

Nitty Gritty

Capistrano, for the uninitiated, performs server commands in parallel, which is great for most of our commands, like deploying updated code to the servers. However, for this portion of a deployment, we want to run the commands on a single server. To do this, we wrote the run_on_host method.

# allows you to execute a block of commands on a single host by temporarily
# overriding the ENV['HOSTS'] variable Capistrano uses to choose servers
def run_on_host(host)
  hosts_env_var = ENV['HOSTS']
  ENV['HOSTS'] = host.to_s # switch to running commands just on this host
  yield
ensure
  ENV['HOSTS'] = hosts_env_var # put the host list back to what it was
end

We essentially overwrite the variable Capistrano uses to decide which servers to run a given command on (the HOSTS environment variable), and then put the original list of servers back when we're done. We put this to good use below, in our rolling_restart Capistrano task.
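
For example, a hypothetical one-off use might look like this (the hostname and path are made up for illustration):

# Hypothetical usage: read the deployed revision on a single application
# server, leaving Capistrano's host list untouched for everything else.
deployed_revision = run_on_host("app1.example.com") do
  capture("cat /data/ngin/current/REVISION").strip
end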

Here is the meat of our rolling_restart Capistrano task. The instances variable is an array of the application instances which we retrieve from the Engine Yard API. We loop through these instances, and do the following:

  1. remove it from haproxy, so that it stops receiving traffic,
  2. wait a few seconds to make sure the requests it has already received are complete,
  3. restart our web server,
  4. trigger Passenger to load our application by requesting it over the loopback address (127.0.0.1),
  5. wait a few seconds more,
  6. add it back to haproxy, so that it begins receiving traffic again.

instances.each_with_index do |instance, index|
  puts "================= Restarting instance #{index + 1} of #{instances.size}"

  # comment out this instance's server line in haproxy.cfg so it stops
  # receiving traffic, then reload haproxy
  instance_hostname = run_on_host(instance) { capture("hostname").strip }
  sudo("sed -i -r 's/(.*#{instance_hostname})/#&/g' /etc/haproxy.cfg")
  sudo("/etc/init.d/haproxy reload")

  # give in-flight requests a few seconds to finish
  sleep 5

  # restart the web server on this instance, then warm up Passenger by
  # requesting the application over the loopback address
  run_on_host(instance) do
    sudo("/etc/init.d/nginx restart")
    sudo("curl 127.0.0.1:81 >/dev/null 2>&1")
  end
  sleep 3

  # uncomment the server lines and reload haproxy so the instance starts
  # receiving traffic again
  sudo("sed -i -r 's/^#*\s\sserver/  server/g' /etc/haproxy.cfg")
  sudo("/etc/init.d/haproxy reload")
  sleep 5 # wait a bit for the newly reintroduced instance to get up and running
  puts "================= Restarted instance #{index + 1} of #{instances.size}"
end
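
For context, those sed commands assume haproxy backend entries roughly like the sketch below (the backend and hostnames are hypothetical). The first sed comments out the server line for the instance being restarted; the second strips the leading # to put every server line back:

# hypothetical excerpt of /etc/haproxy.cfg
backend ngin_app
  server app1 app1.internal:81 check
  server app2 app2.internal:81 check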

Treat Data with Sensitivity

The servers and our code are certainly important, but just as important is the data (statistics, news articles, team schedules, and so on) our customers entrust to us. We take keeping this data safe very seriously, with regular off-site backups and duplication across multiple database servers and datacenters.

This impacts deploying updates as well. Because of our rolling deployments, some of our servers will be running the newest code while others are still running the prior version of our application. When we make changes to the underlying data, we take special care to make those changes in a way that keeps everything working whether the newest code is deployed or not. We wrote about one of these techniques last year, in “Deploying When Removing Columns with Rails with Zero Downtime”.
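
As a rough illustration, one common pattern from that era for removing a column safely looks like the sketch below; the model, table, and column names are hypothetical, and the linked post describes the approach we actually used. The first deploy teaches the application to ignore the column so old and new code both run against the same schema, and only a later deploy actually drops it:

# First deploy: hide the column from ActiveRecord so the code works
# whether or not the column still exists.
class NewsArticle < ActiveRecord::Base
  def self.columns
    super.reject { |column| column.name == "legacy_summary" }
  end
end

# Later deploy, once every server is running the code above: drop the column.
class RemoveLegacySummaryFromNewsArticles < ActiveRecord::Migration
  def up
    remove_column :news_articles, :legacy_summary
  end
end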

This can sometimes mean breaking apart changes into the smallest possible chunks, and deploying over time. While we could do it all at once by bringing the site down for a few minutes, that’s not an experience we want to provide.

Read-Only Mode

Every so often, a change to the data or the application is so far-reaching that we can't get away with the tricks we use for every other deployment. For those cases we have one final trick up our sleeves: read-only mode. It was an important piece of the major deployments mentioned above that we completed with little or no downtime.

We reserve these for off-hours, when traffic is lighter and most of our visitors are guests who aren't writing new content, so the impact is as small as possible. We first log out all of our signed-in visitors, prevent new logins, and lock down our services to prevent changes to the data. This allows us to perform the maintenance or deployment we need while still letting everyone view the content they came for.
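
A minimal sketch of one way to enforce such a read-only mode at the controller layer might look like this; the flag and message are assumptions, not our actual implementation:

# A sketch only: reject writes while a read-only flag is set, but keep
# serving GET and HEAD requests normally.
class ApplicationController < ActionController::Base
  before_filter :enforce_read_only_mode

  private

  def enforce_read_only_mode
    return unless ENV['READ_ONLY_MODE'] # hypothetical toggle flipped during maintenance
    unless request.get? || request.head?
      render :text => "The site is temporarily read-only for maintenance.",
             :status => :service_unavailable
    end
  end
end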

While not being able to submit a comment when reading a news article isn’t ideal, it is a far better experience than not being able to see the news article in the first place.

What is This All For?

If we’ve done our jobs correctly, everyone who comes to one of our thousands of sites will get to see what they came for. No downtime and no interruptions.
