
Here you will find ideas and code straight from the Software Development Team at Sport Ngin. Our focus is on building great software products for the world of youth and amateur sports. We are fortunate to be able to combine our love of sports with our passion for writing code.

The Sport Ngin application originated in 2006 as a single Ruby on Rails 1.2 application. Today the Sport Ngin Platform is composed of more than 20 applications built on Rails and Node.js, forming a service oriented architecture that is poised to scale for the future.


Changing Data Centers - One Step at a Time

02/27/2015, 11:00am CST
By Luke Ludwig

How we moved from EC2 Classic to EC2 VPC.

Over the last six months, the Platform Operations team at Sport Ngin has moved each of the platform's 24 applications to the Amazon Web Services (AWS) OpsWorks service. Most of these applications were previously running on Engine Yard, a Platform as a Service that runs on top of AWS.

Migrating applications to AWS, or to the cloud in general, is a popular topic. Our situation was a bit different: we migrated from AWS through Engine Yard to AWS directly. We needed the full flexibility and power of AWS without a layer in between. As part of this move we got all new servers and new IP addresses on a different AWS account, and we moved from AWS EC2 Classic to an AWS Virtual Private Cloud (VPC). We were essentially changing data centers.

We recently moved Ngin, the largest and most complex application on our platform, without downtime. It was a challenging undertaking that required careful preparation. Ngin powers the platform's Site Builder, Registration, and League products. Ngin is a Ruby on Rails application with a MySQL data store. Ngin has 115,177 lines of code, 434 tables, 145 GB of data and serves around 30,000 requests per minute.

The Ngin data center move had three key concerns: Dependency Configuration, Data, and DNS.

Dependency Configuration

Ngin is a large application with many dependencies. Rebuilding our Chef recipes to work with OpsWorks was a big part of this data center move. Ngin’s core dependencies include HAProxy, Nginx, Passenger, Memcache, MySQL, and Delayed Job. There are a number of lesser dependencies as well. Each of these needed to be reconfigured to follow our new Chef recipe structure based on OpsWorks.

A key principle to follow when undertaking a major operation such as this is to change as few variables as possible! The primary variables being changed here are the data center and the Chef recipe configuration. Choosing to do anything else at the same time is generally not a good idea. It can be tempting to upgrade core dependencies or to swap them out, Passenger for Puma for example. It is important to resist such temptations! A major operation like a data center move is a huge effort, so don’t make it larger than it needs to be. On our development team we follow a Code Smaller approach, and the same principle applies here.

Gradually Turn it On

When making a major change such as moving data centers there is a lot of potential to break things in a really bad way. A systematic approach that gradually increases the team’s confidence level is necessary to ensure that everything goes smoothly. We’ve done data center moves like this twice in the last 7 years. The last time we leveraged an approach that used EM-Proxy to duplex production traffic onto the new data center. We took a different approach this time. We used the following small steps to increase our confidence in the data center move:

  1. Manual QA of key user activities on the Staging environment in the new data center
  2. Comprehensive audit comparing configuration files across data centers of key components (Nginx, MySQL, etc.)
  3. Manual QA on Production of a single test domain, whose DNS was modified to use the new data center
  4. Send 1% of production traffic to the new data center - Monitor for errors
  5. Send 5% of production traffic to the new data center - Monitor for errors
  6. Send 20% of production traffic to the new data center - Monitor for errors
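
As an illustration of step 2, a configuration audit can be as simple as diffing the same file from each data center. The sketch below is an assumption about how such an audit might look, not the tooling we actually used; the function name `audit_config` and the sample Nginx settings are hypothetical.

```python
import difflib
from typing import List


def audit_config(old_text: str, new_text: str, name: str) -> List[str]:
    """Return unified-diff lines between two versions of one config file."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=f"old-dc/{name}",
            tofile=f"new-dc/{name}",
            lineterm="",
        )
    )


# Hypothetical file contents, as if fetched from one server in each data center.
old_nginx = "worker_processes 4;\nkeepalive_timeout 65;\n"
new_nginx = "worker_processes 4;\nkeepalive_timeout 30;\n"

for line in audit_config(old_nginx, new_nginx, "nginx.conf"):
    print(line)
```

An empty result means the two files match; any `-` or `+` line is a difference someone should either explain or remove before traffic is sent over.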

Each step above increased our confidence that the data center move would go smoothly. Sending live production traffic to the new data center required opening the correct network ports so that application servers in the new data center could talk to the master MySQL database and the Memcache cluster in the old data center. Four days before the night of the actual data center migration, we had already served 20% of live production traffic successfully from the new data center for several hours!
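
The percentage steps above can be pictured as weighted routing. This post does not describe the mechanism that performed the split, so the following is only a minimal sketch of the idea, assuming a stable client identifier is available; `choose_data_center` and the `client-N` ids are hypothetical.

```python
import hashlib


def choose_data_center(client_id: str, new_dc_percent: int) -> str:
    """Route a stable share of clients to the new data center.

    Hashing the client id, instead of rolling a random number per request,
    keeps each client on the same side while the percentage is unchanged.
    """
    if not 0 <= new_dc_percent <= 100:
        raise ValueError("new_dc_percent must be between 0 and 100")
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # value in 0..99
    return "new" if bucket < new_dc_percent else "old"


clients = [f"client-{i}" for i in range(10_000)]
for percent in (1, 5, 20):
    hits = sum(choose_data_center(c, percent) == "new" for c in clients)
    print(f"target {percent}% -> observed {hits / len(clients):.1%}")
```

A useful property of bucketing this way is that raising the percentage only ever adds clients to the new data center: anyone routed there at 1% is still routed there at 5% and 20%.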

Moving Data across Data Centers

The most important concern in this move was ensuring the data was moved successfully. Ngin’s database setup consists of a master database, 2 read replicas, and a backup replica. Here are the basic steps that were performed to safely move the data over to the new data center:

  1. Set up MySQL replication from the database master in the old data center to each of the databases in the new data center. This was done a week in advance of the late night data center move.
  2. Put the Ngin application into a read-only mode. This logged users out, disabled our Registration product, and prevented users from logging into the platform. Public web pages continued to be served to users who were not logged in.
  3. Put the database master into read-only mode. This ensures that no data loss occurs during the database failover step.
  4. Do a MySQL failover by promoting one of the databases in the new data center to be the master and configuring all the other databases in both data centers to replicate from the new master.
  5. Take the Ngin application out of read-only mode.

By relying on MySQL replication across data centers, the data portion of the late night operation took about 30 minutes. We were also fully confident that all of our data was present and not corrupt. Because replication was hooked back up to the databases in the old data center, we were able to roll back the data center move by redoing steps 2-5 above with a database in the old data center as the master.
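
To make steps 3 and 4 more concrete, here is a rough sketch of the order of SQL statements a classic (non-GTID) MySQL failover might involve. It only builds and prints a plan; it does not connect to any server. The host names and binary log coordinates are hypothetical, and this is an assumption about the general technique, not a record of the exact commands we ran. The statement names follow the MySQL 5.x syntax of the time; newer MySQL releases rename several of them.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (host to run on, SQL statement)


def failover_plan(old_master: str, new_master: str, replicas: List[str],
                  log_file: str, log_pos: int) -> List[Step]:
    """Build an ordered list of statements for promoting new_master.

    log_file and log_pos are the coordinates an operator would read from
    SHOW MASTER STATUS on the promoted server once every replica has caught up.
    """
    change = (f"CHANGE MASTER TO MASTER_HOST='{new_master}', "
              f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos}")
    plan: List[Step] = [
        # Step 3: stop writes on the old master so nothing can be lost.
        (old_master, "SET GLOBAL read_only = ON"),
        # Compare positions to confirm the new master has applied everything.
        (old_master, "SHOW MASTER STATUS"),
        (new_master, "SHOW SLAVE STATUS"),
        # Step 4: detach the new master from the old one.
        (new_master, "STOP SLAVE"),
        (new_master, "RESET SLAVE ALL"),
        (new_master, "SHOW MASTER STATUS"),
    ]
    # Repoint the remaining replicas in both data centers.
    for host in replicas:
        plan += [(host, "STOP SLAVE"), (host, change), (host, "START SLAVE")]
    # The old master becomes a replica too, which is what allows a rollback.
    plan += [(old_master, change), (old_master, "START SLAVE")]
    # Accept writes on the new master only after everything is repointed.
    plan.append((new_master, "SET GLOBAL read_only = OFF"))
    return plan


plan = failover_plan(
    old_master="db-old-master",            # hypothetical host names
    new_master="db-new-1",
    replicas=["db-new-2", "db-new-3", "db-old-replica"],
    log_file="mysql-bin.000001",           # hypothetical coordinates
    log_pos=120,
)
for host, statement in plan:
    print(f"{host}: {statement}")
```

The ordering is the point of the sketch: writes stop on the old master first, and the new master starts accepting writes last, so there is never a moment when two servers accept writes.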

A dump and load approach, where all the data is dumped using mysqldump into a SQL file, transferred over the network, and then loaded into the new database, is another option for a data center move. We have used a dump and load approach for databases with smaller amounts of data. A dump and load of Ngin's data can take 12-24 hours, which is not an acceptable amount of time for the Sport Ngin platform to run in a degraded read-only mode.
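
For a sense of scale, the 12-24 hour figure can be turned into an effective end-to-end rate using the 145 GB data size quoted earlier. The small calculation below assumes decimal units (1 GB = 1000 MB) and treats dump, transfer, and load as one combined pipeline; the function name is hypothetical.

```python
DATA_GB = 145  # Ngin data size quoted earlier in this post


def implied_throughput_mb_per_s(data_gb: float, hours: float) -> float:
    """Effective rate in MB/s implied by moving data_gb in the given hours."""
    return data_gb * 1000 / (hours * 3600)


for hours in (12, 24):
    rate = implied_throughput_mb_per_s(DATA_GB, hours)
    print(f"{hours} h -> {rate:.2f} MB/s")
```

That works out to roughly 1.7 to 3.4 MB/s overall, compared with the 30 minutes of degraded service that replication required.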

Dealing with DNS

DNS was an additional complication with Ngin’s data center move, given that thousands of domains point at Ngin’s load balancer. Ngin’s load balancer would be receiving a new IP address as part of the data center move. The majority of our customers use our name servers, so we are able to update their DNS records ourselves. However, about 2% of our customers point their domains directly at Ngin’s IP address.

The best way to handle this situation smoothly is to turn the old load balancer into a proxy that forwards traffic to the new load balancer. This allows us to work with customers over the course of several weeks to update their DNS to our new IP address while ensuring that no customer website experiences downtime. The following steps were performed to accomplish the DNS portion of the data center migration.

  1. A day beforehand, lower the TTL of the relevant DNS records to 60 seconds so that the DNS change takes effect quickly.
  2. Change the old load balancer to proxy all traffic to the new load balancer.
  3. Modify DNS records to point to the new load balancer’s IP address.
  4. A day after the data center move, raise the TTL of the DNS records back to their original setting (30 minutes).
  5. Work with customers over the next few weeks to change their DNS to point to the new IP address.
  6. Monitor the old load balancer and turn it off once it is receiving no traffic.
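
For step 5, it helps to know which domains still resolve to the old IP address. The sketch below shows one way that check might be done with the Python standard library; it is an illustration, not the tool we used. The function `group_by_target`, the domains, and the IP addresses (taken from documentation-only ranges) are all hypothetical, and the demo uses a fake resolver so that it runs without network access.

```python
import socket
from typing import Callable, Dict, Iterable, List


def group_by_target(domains: Iterable[str], old_ip: str, new_ip: str,
                    resolve: Callable[[str], str] = socket.gethostbyname
                    ) -> Dict[str, List[str]]:
    """Group domains by whether they resolve to the old or new load balancer."""
    groups: Dict[str, List[str]] = {"old": [], "new": [], "other": [], "unresolved": []}
    for domain in domains:
        try:
            ip = resolve(domain)
        except OSError:  # socket.gaierror is a subclass of OSError
            groups["unresolved"].append(domain)
            continue
        if ip == old_ip:
            groups["old"].append(domain)
        elif ip == new_ip:
            groups["new"].append(domain)
        else:
            groups["other"].append(domain)
    return groups


# Demo data: hypothetical addresses and domains, resolved by a fake lookup table.
OLD_IP = "192.0.2.10"
NEW_IP = "198.51.100.20"
FAKE_DNS = {
    "league.example": NEW_IP,
    "club.example": OLD_IP,
    "elsewhere.example": "203.0.113.5",
}


def fake_resolve(domain: str) -> str:
    if domain not in FAKE_DNS:
        raise OSError(f"cannot resolve {domain}")
    return FAKE_DNS[domain]


report = group_by_target(
    ["league.example", "club.example", "elsewhere.example", "missing.example"],
    OLD_IP, NEW_IP, resolve=fake_resolve,
)
print("still on old IP:", report["old"])
```

Leaving out the `resolve` argument makes the function use real DNS lookups through `socket.gethostbyname`. The domains in the `old` group are the customers to contact before the old load balancer is turned off.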

Results

The late night operation to change data centers went smoothly. We did the DNS portion first and then the database portion. For about 30 minutes in the middle of the night our Site Builder and League products ran in a degraded mode, accessible only to users who were not logged in, and our Registration product was unavailable. The next day we monitored our applications closely and not a single issue or new bug appeared. Choosing to change as few variables as possible and systematically increasing our confidence in small steps led to the successful execution of a major operation.

Tag(s): Home  High Availability  AWS