Sharding Your Database

Yesterday I was in a class taught by one of the folks at Percona, the MySQL and LAMP stack performance experts (they're the ones behind mysqlperformanceblog.com). Sharding was covered and though I learned a lot, it also reinforced my opinion that this is a dubious technique which fanbois like to reach for waaaaay too soon.

In the context of this post, I'll use "sharding" to refer to horizontal partitioning of rows across separate servers. Horizontal partitioning isn't bad in and of itself. Further, sharding isn't bad in and of itself. Sharding is useful in those rare cases where you've considered everything else first. For other cases, it's begging for pain. Serious pain. Here are some ways of failing at sharding.

The basic idea is that you find some way of dividing up your dataset to serve across a number of servers. Let's say that you have a large "person" table containing information about the 300 million or so people in the USA. 300 million rows can be easy to work with, but that really depends on your application. Since your system is bogged down, you've decided that you're going to spread these users across multiple servers on multiple machines. How do you do that?

One strategy is to take the first letter of the last name and divide users across 26 servers. Well, a quick check of the most common surnames in the USA shows that we're going to have millions of people with a surname starting with 'M' and only a few starting with 'Z'. Thus, your shards are terribly unbalanced. Not what you want (and this naïvely assumes every surname starts with one of 26 ASCII letters).
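
To make the problem concrete, here's a tiny sketch of that scheme (the shard_for_surname() helper and the shard names are invented for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Shard by the first letter of the surname.  The "shards" here are
    # just labels; in real life they'd map to server connection info.
    sub shard_for_surname {
        my ($surname) = @_;
        my $letter = uc substr( $surname, 0, 1 );
        return "shard_$letter";
    }

    # Smith, Miller, Moore and friends all pile onto a couple of shards,
    # while shard_Q and shard_Z sit nearly idle.
    print shard_for_surname('Miller'), "\n";    # shard_M
    print shard_for_surname('Zabel'),  "\n";    # shard_Z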

Another strategy is something like this:

my $server_num = $user_id % $number_of_servers;

Assuming your user ids are evenly distributed, you have balanced shards. Sounds great! So what happens when you need to add a server? The algorithm fails horribly: the modulus changes, so most users now map to a different server. You'll have to recalculate the server number for every user and figure out a plan to move all of the affected rows. Oops.
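
To put a number on "fails horribly", here's a quick back-of-the-envelope sketch (the million sequential user ids are hypothetical) counting how many users change servers when you grow from 4 servers to 5:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count how many of a million hypothetical user ids map to a
    # different server when the modulus changes from 4 to 5.
    my $moved = 0;
    for my $user_id ( 1 .. 1_000_000 ) {
        $moved++ if ( $user_id % 4 ) != ( $user_id % 5 );
    }
    printf "%d of 1,000,000 users have to move (%.0f%%)\n",
        $moved, 100 * $moved / 1_000_000;

Roughly 80% of the users end up on a different server, which means shuffling most of your data around just to add one box.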

So now you get smart and come up with this:

my $server_num = lookup_server($user_id);

That starts to work better. Because you can tweak the lookup_server() code at will, you can add and remove servers and better balance exactly what's going to be moved. That also helps when, say, there's a Presidential election and whatever server "Obama" is on is getting overloaded and you need to move that to another server with more capacity.
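
For the sake of illustration, a lookup_server() backed by a directory might look something like this (the ranges, the override, and keeping the map in memory are all simplifications I've made up; in practice the mapping would live in its own heavily cached table or service):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # A directory-based lookup: ranges for the common case, plus explicit
    # overrides for accounts that have been moved by hand.
    my @ranges = (
        { min => 1,         max => 1_000_000, server => 1 },
        { min => 1_000_001, max => 2_000_000, server => 2 },
    );
    my %override = ( 44_231 => 3 );    # a "hot" account pinned to its own shard

    sub lookup_server {
        my ($user_id) = @_;
        return $override{$user_id} if exists $override{$user_id};
        for my $range (@ranges) {
            return $range->{server}
                if $user_id >= $range->{min} && $user_id <= $range->{max};
        }
        die "No shard configured for user $user_id";
    }

    print lookup_server(44_231), "\n";    # 3, thanks to the override

Because the mapping is data rather than arithmetic, rebalancing is a matter of copying rows and updating the directory, not rehashing the world.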

So you're sitting there, all smug, with your 42 sharded servers, secure in the knowledge that you've found the answer to your performance problems. And then you get hit with a series of backwards-incompatible database changes. Altering tables can be slow and frustrating, so you can't do all of your servers at once (it's real fun watching 42 database servers collapse at the same time due to a bug). What often happens in this case is that your application code needs to be designed to work with both the old and new schemas, because at any given moment you don't know which shards have been migrated and which haven't. Only the most naïve are going to think of this as "a simple matter of programming". Large systems are hard to work on, and if you have to shard, you probably have a large system by definition.
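
What "work with both the old and new schemas" tends to look like in practice is version checks sprinkled through the data-access layer. A hypothetical sketch (the version map and the column names are made up; in real life the version would come from a metadata table on each shard):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Which schema version each shard is currently running.
    my %schema_version = ( 1 => 2, 2 => 1, 3 => 1 );    # shard => version

    # Build the right SQL fragment for whichever schema this shard has.
    sub full_name_column_sql {
        my ($shard) = @_;
        if ( $schema_version{$shard} >= 2 ) {
            # new schema: name split into first_name/last_name
            return "CONCAT(first_name, ' ', last_name) AS full_name";
        }
        # old schema: single "name" column still in place
        return "name AS full_name";
    }

    print "$_: ", full_name_column_sql($_), "\n" for 1 .. 3;

Multiply that by every query touching the altered tables and "a simple matter of programming" starts to look optimistic.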

The gentleman teaching the class described some of the "fun" consulting engagements he's been on where he's seen problems like those outlined above. In fact, one company sharded their database across 50 servers without even doing any query optimization first. Then they had to update all 50 servers. It sounded like a nightmare.

Before you recommend sharding, make sure that you absolutely understand the system you're talking about. I had a boss once explain that it was impossible to speed up a query because of the "massive" number of rows in a table. There were only 3 million. I got the query to run in about a second, merely by adding appropriate indexes, selecting only the data I needed and moving some columns.

Sharding is a last resort and should probably be restricted to non-volatile schemas. If your application is under rapid development and the database schema is changing just as rapidly, sharding is probably synonymous with shredding. If you must shard, make sure that you can figure out an equitable distribution of data and use a lookup service for the shard id. Then, if you have hot data, make it very easy to move that data to another shard with less traffic.
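
For what it's worth, here's a rough sketch of what "easy to move hot data" can mean: copy first, repoint the lookup, and only then delete. The DSNs, credentials, and shard_map table are invented for illustration, and a real move also has to handle writes that arrive mid-copy:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $user_id = 44_231;
    my $old_dbh = DBI->connect( 'dbi:mysql:users;host=shard07', 'app', 'secret',
        { RaiseError => 1 } );
    my $new_dbh = DBI->connect( 'dbi:mysql:users;host=shard19', 'app', 'secret',
        { RaiseError => 1 } );
    my $map_dbh = DBI->connect( 'dbi:mysql:shardmap;host=lookup', 'app', 'secret',
        { RaiseError => 1 } );

    # 1. Copy the hot user's row to the quieter shard.
    my $row = $old_dbh->selectrow_hashref( 'SELECT * FROM person WHERE id = ?',
        undef, $user_id );
    my @columns      = sort keys %$row;
    my $placeholders = join ', ', ('?') x @columns;
    $new_dbh->do(
        sprintf( 'INSERT INTO person (%s) VALUES (%s)',
            join( ', ', @columns ), $placeholders ),
        undef,
        @{$row}{@columns},
    );

    # 2. Repoint the lookup service at the new shard.
    $map_dbh->do( 'UPDATE shard_map SET server = ? WHERE user_id = ?',
        undef, 19, $user_id );

    # 3. Only then remove the old copy.
    $old_dbh->do( 'DELETE FROM person WHERE id = ?', undef, $user_id );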

8 Comments

Thanks for an easy-to-read explanation of sharding.

I just went through this, where I was a victim of my own best intentions. I'm developing a fairly massive system and designed it to be sharded from the get-go, but that started creating all sorts of development problems, and ultimately performance problems. That's right: because of the way the data is accessed in the database, sharding actually slowed the system down rather than speeding it up.

For me, due to the nature of the data set, it turned out that the solution was to denormalize my data (store multiple copies of the data in multiple tables). Normally this is a huge no-no in relational database design, but it happened to be the perfect solution for this particular problem.

The moral of the story is: listen to Ovid. Don't throw out any ideas until you've explored them all to see which is right for your situation. In other words, use the best tool for the job.

@Ovid: I agree completely. Do what you can to do it right, but ultimately what's "right" is whatever it takes to make your app work the way your users expect.

There's nothing new in the world.

Many years ago I used the LSDM structured design methodology (a cousin of the UK government's SSADM methodology).

The LSDM process for database design, as I was taught it, was: 1) produce a fully normalised database design; 2) recognise that this design would not provide optimum performance; 3) denormalise the design until the top 10% of your application's processes look like they will perform acceptably; 4) go with the denormalised design.

Hi Ovid, thanks for the commentary ;) I think you paraphrased my points very well. If I can just add a couple of comments:

  • The "Obama" problem you mentioned comes from the fact that while most people correctly identify that not all users are created equal, they forget that users don't stay the same over time, either. The official Obama photographer has a Flickr account - imagine what pain they had leading up to the presidential election.

  • Another interesting point that not everyone realizes is that (most) MySQL clients are blocking. A simple unsharded query could be SELECT * FROM users WHERE id IN (123,4345,3254,94586,43586), answered in one round trip. In a sharded environment that could now be (worst case) 5 round trips.
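
To make that worst case concrete, here's a small self-contained sketch, using a simple modulo mapping as a stand-in for whatever lookup the application really uses:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @user_ids = ( 123, 4345, 3254, 94586, 43586 );

    # Group the ids by the shard they live on (modulo is just a stand-in
    # for the application's real lookup).
    my %ids_on_shard;
    for my $id (@user_ids) {
        push @{ $ids_on_shard{ $id % 5 } }, $id;
    }

    # Instead of one "WHERE id IN (...)" query, we now issue one per shard,
    # and a blocking client waits for each in turn.
    for my $shard ( sort keys %ids_on_shard ) {
        my $in_list = join ',', @{ $ids_on_shard{$shard} };
        print "shard $shard: SELECT * FROM users WHERE id IN ($in_list)\n";
    }
    printf "%d round trips instead of 1\n", scalar keys %ids_on_shard;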

Cheers, Morgan

Thanks for introducing me to this technique (that I will now never use :-)).

Added http://foldoc.org/sharding.

About Ovid

Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/