Here you will find ideas and code straight from the Software Development Team at SportsEngine. Our focus is on building great software products for the world of youth and amateur sports. We are fortunate to be able to combine our love of sports with our passion for writing code.
The SportsEngine application originated in 2006 as a single Ruby on Rails 1.2 application. Today the SportsEngine Platform is composed of more than 20 applications built on Rails and Node.js, forming a service oriented architecture that is poised to scale for the future.
We take uptime and redundancy very seriously, which means that we have to be ready for many kinds of server or network failure, or even the failure of a whole Amazon Availability Zone of servers (as famously happened a few times last year).
We don’t want our customers to be impacted when this kind of problem occurs, so we spread our servers across three Availability Zones, with enough extra capacity to handle the failure of an entire zone. Traffic is split among the application servers in each zone, and each app server performs most of its database reads against three MySQL slave servers, with the reads distributed by the multi_db gem.
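For context, here is a minimal sketch of how that multi_db wiring might look. Going from the gem’s README as we remember it, slaves are listed in database.yml under suffixed entries and the proxy is installed from an initializer; treat the names below as illustrative rather than our exact configuration.

```ruby
# config/initializers/multi_db.rb -- illustrative sketch, not our exact setup.
#
# multi_db discovers slave connections from database.yml entries named along
# the lines of production_slave_database_1, production_slave_database_2, ...
# (in our case, one slave per Availability Zone). Once the proxy is installed,
# ActiveRecord writes go to the master and reads are spread across the slaves.
MultiDb::ConnectionProxy.setup!
```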
This type of redundancy impacts performance. Network traffic between zones is slightly slower than network traffic within a zone. Based on our testing, ping latencies within a zone are well under 0.5 milliseconds (ms); between zones, they vary from 0.7 to 1.3ms.
| (ms) | b | d | e |
|------|------|------|------|
| b | 0.34 | 1.01 | 0.86 |
| d | 0.71 | 0.49 | 1.29 |
| e | 0.84 | 1.24 | 0.27 |
Now, a change of less than a millisecond may not sound like much, but this cost is paid every time the application requests data from the database, which can happen dozens or, in the worst case, hundreds of times in a given request. An extra half millisecond on each of 100 queries adds 50ms to a single page load. When someone’s looking at their browser waiting for the page to load, every millisecond counts.
Since each zone already has a slave database, we can configure each application server to stop talking to the slaves in the other zones. Instead of writing to the master and reading from one of three slaves, each app server now writes to the master and reads only from the slave within its own zone.
This change was simple to implement, requiring just a few tweaks to our Chef recipes, and carried very little risk.
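To give a flavor of the tweak (not our actual recipe), here is a rough Chef sketch that renders database.yml with only the slave in the server’s own Availability Zone. The node attributes and file paths below are hypothetical placeholders; only `node['ec2']['placement_availability_zone']`, which Ohai populates on EC2 instances, is standard.

```ruby
# Hypothetical sketch of a zone-affinity tweak to a database.yml recipe.
# The mysql.master_host and mysql.slaves_by_zone attributes and the file
# paths are invented for illustration.

# Ohai reports this instance's Availability Zone, e.g. "us-east-1b".
local_zone = node['ec2']['placement_availability_zone']

# Previously the template received every slave; now we hand it only the
# slave that lives in this server's zone.
local_slave = node['mysql']['slaves_by_zone'][local_zone]

template '/srv/app/shared/config/database.yml' do
  source 'database.yml.erb'
  owner  'deploy'
  group  'deploy'
  mode   '0640'
  variables(
    :master_host => node['mysql']['master_host'],
    :slave_hosts => [local_slave]  # one in-zone slave instead of all three
  )
end
```

The corresponding ERB template would simply loop over `slave_hosts` to emit one slave entry per host, so rolling back is just a matter of handing it the full list again.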
We rolled this out to production and immediately saw a noticeable drop in how much time we spent talking to the databases, pictured below.
After deploying this change, we saw an approximately 20% drop in average database time per request.
A few weeks later, one of our database servers came under very heavy load. Once we were alerted to the problem, we were able to cut all traffic to it within minutes by removing the app servers that were talking to it from our load balancer. Without this affinity change, that would have been much more difficult, since every app server would have been sending roughly a third of its reads to that database.
We were also surprised to discover that this change decreased our AWS bill: AWS charges for data transfer between Availability Zones, and by cutting cross-zone database traffic we saw that portion of our bill drop by nearly half. Your mileage may vary.