Here you will find ideas and code straight from the Software Development Team at SportsEngine. Our focus is on building great software products for the world of youth and amateur sports. We are fortunate to be able to combine our love of sports with our passion for writing code.
The SportsEngine application originated in 2006 as a single Ruby on Rails 1.2 application. Today the SportsEngine Platform is composed of more than 20 applications built on Rails and Node.js, forming a service-oriented architecture that is poised to scale for the future.
Over the last 6 months the Platform Operations team at Sport Ngin has moved each of the platform's 24 applications to the Amazon Web Services (AWS) OpsWorks service. Most of these applications were previously running on Engine Yard, a Platform as a Service that runs on top of AWS.
Migrating applications to AWS or to the Cloud is a popular topic. Our situation was a bit different: we migrated from AWS with Engine Yard to AWS directly. We needed the full flexibility and power of AWS without a middle layer. As part of this move we were getting all new servers and new IP addresses on a different AWS account, and moving from AWS EC2 Classic to an AWS Virtual Private Cloud (VPC). We were essentially changing data centers.
We recently moved Ngin, the largest and most complex application on our platform, without downtime. It was a challenging undertaking that required careful preparation. Ngin powers the platform's Site Builder, Registration, and League products. Ngin is a Ruby on Rails application with a MySQL data store. Ngin has 115,177 lines of code, 434 tables, 145 GB of data and serves around 30,000 requests per minute.
The Ngin data center move had three key concerns: Dependency Configuration, Data, and DNS.
Ngin is a large application with many dependencies. Rebuilding our Chef recipes to work with OpsWorks was a big part of this data center move. Ngin’s core dependencies include HAProxy, Nginx, Passenger, Memcache, MySQL, and Delayed Job. There are a number of lesser dependencies as well. Each one of these needed to be reconfigured following our new Chef recipe structure based on OpsWorks.
A key principle to follow when undertaking a major operation such as this is to change as few variables as possible! The primary variables being changed are the data center and the Chef recipe configuration. Choosing to do anything else at the same time is generally not a good idea. It can be tempting to upgrade or swap core dependencies, moving from Passenger to Puma for example. It is important to resist such temptations! A major operation like a data center move is a huge effort; don’t make it larger than it needs to be. On our development team we follow a Code Smaller approach, and the same principle applies here.
When making a major change such as moving data centers there is a lot of potential to break things in a really bad way. A systematic approach that gradually increases the team’s confidence level is necessary to ensure that everything goes smoothly. We’ve done data center moves like this twice in the last 7 years. The last time we leveraged an approach that used EM-Proxy to duplex production traffic onto the new data center. We took a different approach this time. We used the following small steps to increase our confidence in the data center move:
Each step above increased our confidence that the data center move would go smoothly. Sending live production traffic to the new data center required opening the correct network ports so that application servers in the new data center could talk to the master MySQL database and the Memcache cluster in the old data center. Four days before the night of the actual data center migration, we had already successfully served 20% of live production traffic from the new data center for several hours!
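To illustrate what sending a fixed share of traffic to the new data center means, here is a minimal Python sketch. It is not how we split traffic in production (a load balancer would normally do this with backend weights). The function name `pick_backend` and the hash-based bucketing are assumptions made purely for illustration.

```python
import hashlib


def pick_backend(request_id: str, new_dc_percent: int = 20) -> str:
    """Return "new" for roughly new_dc_percent% of request ids, else "old".

    Hypothetical helper: hashing the id keeps the choice deterministic,
    so the same id always lands in the same data center.
    """
    if not 0 <= new_dc_percent <= 100:
        raise ValueError("new_dc_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash onto a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "new" if bucket < new_dc_percent else "old"
```

Raising `new_dc_percent` in small increments mirrors the idea of building confidence gradually: a problem in the new data center only affects the share of traffic routed there.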
The most important concern in this move was ensuring the data was moved successfully. Ngin’s database setup consists of a master database, 2 read replicas, and a backup replica. Here are the basic steps that were performed to safely move the data over to the new data center:
Because we relied on MySQL replication across data centers, the data portion of the late-night operation took about 30 minutes. We were also fully confident that all of our data was present and not corrupt. By hooking replication back up to the databases in the old data center, we could roll back the data center move by redoing steps 2-5 above with a database in the old data center as the master.
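A replication-based cutover depends on the new replica being fully caught up before it is promoted. The following is a minimal sketch of such a check, not the exact procedure we ran. It assumes one row of MySQL's `SHOW SLAVE STATUS` output has already been fetched into a dictionary; the function name `ready_for_cutover` and the `max_lag_seconds` parameter are hypothetical.

```python
def ready_for_cutover(status: dict, max_lag_seconds: int = 0) -> bool:
    """Decide whether a replica looks safe to promote.

    `status` is assumed to be one row of SHOW SLAVE STATUS as a dict.
    """
    # Both replication threads must be running.
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    # MySQL reports NULL (None here) when the lag cannot be determined.
    if lag is None:
        return False
    return int(lag) <= max_lag_seconds
```

The same check could be pointed at a replica in the old data center before a rollback, since rolling back is the same operation in the opposite direction.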
Another option for a data center move is a dump and load approach, where all the data is dumped into a SQL file using mysqldump, transferred over the network, and then loaded into the new database. We have used a dump and load approach for databases with smaller amounts of data. A dump and load of Ngin’s data can take 12-24 hours, which is not an acceptable length of time for the Sport Ngin platform to run in a degraded read-only mode.
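To put those numbers in perspective, here is a small back-of-the-envelope calculation using the 145 GB figure quoted earlier. It assumes 1 GB = 1024 MB, and the helper name is hypothetical. It only computes the end-to-end rate implied by the 12-24 hour estimate; it says nothing about which stage (dump, transfer, or load) dominates.

```python
def implied_throughput_mb_per_s(data_gb: float, hours: float) -> float:
    """Average end-to-end rate implied by moving data_gb in the given hours."""
    return data_gb * 1024 / (hours * 3600)


fast = implied_throughput_mb_per_s(145, 12)  # roughly 3.4 MB/s
slow = implied_throughput_mb_per_s(145, 24)  # roughly 1.7 MB/s
```

Compared with a replication-based cutover of about 30 minutes, a 12-24 hour window is 24 to 48 times longer.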
DNS was an additional complication with Ngin’s data center move, given that thousands of domains point at Ngin’s load balancer, which would receive a new IP address as part of the move. The majority of our customers use our name servers, so we are able to update their DNS records ourselves. However, about 2% of our customers point their domains directly at Ngin’s IP address.
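One way to find the domains that need customer follow-up is to resolve every domain after our own name servers have been updated: anything still resolving to the old IP address is managed elsewhere. The sketch below is an assumption about how such an audit could look, not a tool we describe in this post. The function name, the example domains, and the documentation-range IP address are all hypothetical, and the resolver is injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def domains_still_on_old_ip(
    domains: Iterable[str],
    old_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Tuple[List[str], List[str]]:
    """Return (domains resolving to old_ip, domains that failed to resolve)."""
    still_old: List[str] = []
    unresolved: List[str] = []
    for domain in domains:
        try:
            ip = resolve(domain)
        except OSError:  # socket.gaierror is a subclass of OSError
            unresolved.append(domain)
            continue
        if ip == old_ip:
            still_old.append(domain)
    return still_old, unresolved


# Usage with a fake resolver (hypothetical domains, documentation-range IPs).
fake_dns = {"league.example.com": "203.0.113.10", "club.example.org": "198.51.100.7"}
still_old, unresolved = domains_still_on_old_ip(
    ["league.example.com", "club.example.org", "gone.example.net"],
    old_ip="203.0.113.10",
    resolve=lambda name: fake_dns[name] if name in fake_dns else socket.gethostbyname("invalid."),
)
```

The fake resolver above still calls the real resolver for unknown names; in a test suite it is simpler to have the fake raise `OSError` directly.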
The best way to handle this situation smoothly is to turn the old load balancer into a proxy that forwards traffic on to the new load balancer. This allows us to work with customers over the course of several weeks to update their DNS to our new IP address while ensuring that no customer website experiences downtime. The following steps were performed to accomplish the DNS portion of the data center migration.
The late-night operation to change data centers went smoothly. We did the DNS portion first and then the database portion. For about 30 minutes in the middle of the night our Site Builder and League products ran in a degraded mode, accessible only to non-logged-in users, and our Registration product was unavailable. The next day we monitored our applications closely and not a single issue or new bug appeared. Choosing to change as few variables as possible and systematically increasing our confidence in small steps led to a successful execution of a major operation.