Coding in the Crease

To Affinity and Beyond

04/09/2013, 5:30pm CDT

By Patrick Byrne

How to improve performance of multi-zone database access without sacrificing redundancy.

Ready for Failure

We take uptime and redundancy very seriously, which means that we have to be ready for many kinds of server or network failure, or even the failure of a whole Amazon Availability Zone of servers (as famously happened a few times last year).

We don’t want our customers to be impacted when this kind of problem occurs, so we spread our servers across three Availability Zones, with enough extra capacity to handle the failure of an entire zone. Traffic gets split among the application servers in each zone, and they each perform most of their database requests against three MySQL slave servers, using the multi_db gem.

The Cost of Redundancy

This type of redundancy impacts performance. Network traffic between zones is slightly slower than network traffic within a zone. Based on our testing, ping latencies within a zone are under 0.5 milliseconds (ms); between zones, they vary from 0.7 to 1.3 ms.

Sample Latencies Between Three AWS Zones (ms)

	b	d	e
b	0.34	1.01	0.86
d	0.71	0.49	1.29
e	0.84	1.24	0.27

Now, a change of less than a millisecond may not sound like much, but this cost is borne every time the application requests data from the database. This could happen dozens or, in the worst case, hundreds of times in a given request. When someone’s looking at their browser waiting for the page to load, every millisecond counts.
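To put rough numbers on this, here is a back-of-the-envelope sketch in Python using the sample latencies from the table above. It assumes each row is the app server's zone and each column the database's zone (the table does not say which), that queries run one after another, and that each query pays one network round trip. The function names are made up for illustration.

```python
# Sample ping latencies (ms) from the table above.
# Assumption: outer key = app server's zone, inner key = replica's zone.
LATENCY_MS = {
    "b": {"b": 0.34, "d": 1.01, "e": 0.86},
    "d": {"b": 0.71, "d": 0.49, "e": 1.29},
    "e": {"b": 0.84, "d": 1.24, "e": 0.27},
}


def mean_latency_any_replica(zone):
    """Average round trip when reads are spread evenly over all three replicas."""
    row = LATENCY_MS[zone]
    return sum(row.values()) / len(row)


def saved_ms_per_request(zone, queries):
    """Estimated network time saved per request by reading only in-zone."""
    per_query = mean_latency_any_replica(zone) - LATENCY_MS[zone][zone]
    return per_query * queries


if __name__ == "__main__":
    for zone in LATENCY_MS:
        for queries in (12, 100):
            saved = saved_ms_per_request(zone, queries)
            print(f"zone {zone}, {queries} queries: ~{saved:.1f} ms saved")
```

With these sample numbers the saving works out to roughly 0.3 to 0.5 ms per query, so a request that issues a hundred queries could spend tens of milliseconds less waiting on the network.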

In Which We Have Our Cake, and Eat it Too

Since each zone already has a slave database, we can configure each application server to stop talking to the slaves in the other zones. Now, instead of being configured to write to the master and read from one of three slaves, each application server writes to the master and reads only from the slave within its own zone.

This change was simple to implement, with just a few tweaks to our Chef recipes, and very low risk.
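The actual change lived in Chef recipes and database configuration that are not shown in this post. As a language-neutral illustration of the idea only, the Python sketch below picks the read hosts for an app server. The host names, the `Replica` type, and the `read_hosts` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    host: str  # hypothetical host name
    zone: str  # availability zone label, e.g. "b"


# Hypothetical inventory: one read replica per zone.
REPLICAS = [
    Replica("db-slave-b.example.internal", "b"),
    Replica("db-slave-d.example.internal", "d"),
    Replica("db-slave-e.example.internal", "e"),
]


def read_hosts(app_zone, replicas, zone_affinity=True):
    """Return the replica hosts an app server in app_zone should read from."""
    if not zone_affinity:
        # Old behaviour: spread reads across every replica in every zone.
        return [r.host for r in replicas]
    local = [r.host for r in replicas if r.zone == app_zone]
    if not local:
        # Fail loudly rather than silently crossing zones.
        raise ValueError(f"no replica configured in zone {app_zone!r}")
    return local


if __name__ == "__main__":
    print(read_hosts("d", REPLICAS))
    print(read_hosts("d", REPLICAS, zone_affinity=False))
```

Raising an error when a zone has no replica is a choice made for this sketch; it keeps a misconfiguration from quietly reintroducing cross-zone reads.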

The End Result

We rolled this out to production and immediately saw a noticeable drop in how much time we spend talking to the databases, pictured below.

Performance graph showing roughly 20% drop in average database time per request

After deploying this change, we saw an approximately 20% drop in average database time per request.

A few weeks later, one of the database servers came under very heavy load. Once we were alerted to the problem, we were able to remove all traffic to it in minutes by pulling the app servers that read from it out of our load balancer. If we hadn't rolled out this affinity, this would have been much more difficult, since every app server would have been sending roughly a third of its queries to this database.
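The difference in blast radius can be sketched in a few lines of Python. The fleet below (two app servers per zone) and all names are hypothetical; the point is only to show which app servers would have to leave the load balancer to take all read traffic off one overloaded slave.

```python
# Hypothetical fleet: app server name -> availability zone.
APP_SERVERS = {
    "app-b1": "b", "app-b2": "b",
    "app-d1": "d", "app-d2": "d",
    "app-e1": "e", "app-e2": "e",
}


def servers_reading_from(replica_zone, app_servers, zone_affinity):
    """App servers that send any read traffic to the replica in replica_zone."""
    if zone_affinity:
        # Only same-zone app servers read from this replica.
        return sorted(n for n, z in app_servers.items() if z == replica_zone)
    # Without affinity, every app server reads from every replica.
    return sorted(app_servers)


if __name__ == "__main__":
    print(servers_reading_from("d", APP_SERVERS, zone_affinity=True))
    print(servers_reading_from("d", APP_SERVERS, zone_affinity=False))
```

In this toy fleet, affinity means draining two of six app servers isolates the slave. Without affinity, every app server touches it, so pulling servers from the load balancer is not an option and each one would need its database configuration changed instead.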

We were also surprised to discover that this change decreased our AWS bill by reducing communication between zones. The inter-zone traffic portion of our bill dropped by nearly half. Your mileage may vary.

Tag(s): Home High Availability