Sharding Your Database
Yesterday I was in a class taught by one of the folks at Percona, the MySQL and LAMP stack performance experts (they're the ones behind mysqlperformanceblog.com). Sharding was covered and though I learned a lot, it also reinforced my opinion that this is a dubious technique which fanbois like to reach for waaaaay too soon.
In the context of this post, I'll mean "sharding" to refer to horizontal partitioning of rows across separate servers. Horizontal partitioning isn't bad in and of itself. Further, sharding isn't bad in an of itself. Sharding is useful in those rare cases where you've considered everything else first. For other cases, it's begging for pain. Serious pain. Here are some ways of failing at sharding.
The basic idea is that you find some way of dividing up your dataset to serve across a number of servers. Let's say that you have a large "person" table containing information about the 300 million or so people in the USA. 300 million rows can be easy to work with, but that really depends on your application. Since your system is bogged down, you've decided that you're going to spread these users across multiple servers on multiple machines. How do you do that?
One strategy is to take the first letter of the last name and divide users across 26 servers. Well, a quick check of the most common surnames in the USA shows that we're going to have millions of people with a surname starting with 'M' and only a few starting with the letter 'Z'. Thus, your shards are terribly unbalanced. Not what you want (and this assumes 26 ASCII letters).
Another strategy is something like this:
my $server_num = $user_id % $number_of_servers;
Assuming your user ids are evenly distributed, you have balanced shards. Sounds great! So what happens when you need to add a server? The algorithm fails horribly. You'll have to recalculate the server number for every user and figure out a plan to move all of the affected users when you add a server. Oops.
So now you get smart and come up with this:
my $server_num = lookup_server($user_id);
That starts to work better. Because you can tweak the lookup_server() code at will, you can add and remove servers and better balance exactly what's going to be moved. That also helps when, say, there's a Presidential election and whatever server "Obama" is on is getting overloaded and you need to move that to another server with more capacity.
So you're sitting there, all smug, with your 42 sharded servers, secure in the knowledge that you've found the answer to your performance problems. And then you get hit with a series of backwards-incompatible database changes. Altering tables can be slow and frustrating, so you can't do all of your servers at once (real fun watching 42 database servers collapse at the same time due to a bug). What often happens in this case is that your application code needs to be designed to work with both the before and after changes because you don't know which shards are going to be updated and which are not. Only the most naïve are going to think of this as "a simple matter of programming". Large systems are hard to work on and if you have to shard, you probably have a large system by definition.
The gentleman talking about this explained some of the "fun" consultancies he's been on where he's seen problems as outlined above. In fact, one company sharded their database across 50 servers without even doing any query optimization. Then they had to update all 50 servers. It sounded like a nightmare.
Before you recommend sharding, make sure that you absolutely understand the system you're talking about. I had a boss once explain that it was impossible to speed up a query because of the "massive" number of rows in a table. There were only 3 million. I got the query to run in about a second, merely by adding appropriate indexes, selecting only the data I needed and moving some columns.
Sharding is a last resort and should probably be restricted to non-volatile schemas. If your application is undergoing rapid development and that database schema is changing rapidly, sharding is probably synonymous with shredding. If you must shard, make sure that you can figure out an equitable distribution of data and use a lookup service for the shard id. Then, if you have hot data, make it very easy to move that data across to another shard with less traffic.