To switch or not to switch - The MySQL Question

Before I let my enthusiasm overpower me and jump into this rewrite with all it's worth, I have to take a long and hard look at what technologies are being used, and what I want to replace them with.

Swapping out Class::DBI with DBIx::Class is an easy choice, and I'll get back to that in a later post, but what of the engine behind it?

Currently, the whole thing is built on MySQL, but during the lifespan of this fairly complex CRM application I have seen some of the limitations of MySQL up close.
A lot of my contacts and friends also insist that while MySQL is fine for a website or any simple CRUD operations, it will let you down when it comes to more complex queries.

Rumor has it that these problems are both in performance and dependability, and while it might have been simple user error, I have seen both.
Keep in mind, however, that I am no DBA. What indexing my databases currently have I have gotten help from friends and colleges to build, or they were already built, and I inherited them.

What I think, however, is that the optimum choice of database is largely dependent on the use case.
For our statistics needs, for example, there is a lot of COUNT usage, and anyone worth their salt knows that PostgreSQL and most (all?) other MVCC-implementing database engines, even InnoDB for MySQL, will COUNT rather slowly.

A splash of googling turned up a PGSQL COUNT(*) workaround.
I must say, that's pretty good. The total number of rows in the database, however, is useless to me.
I need stuff like...
SELECT COUNT(DISTINCT(`caseID`)) FROM events WHERE userID = 31;
...to determine how many unique cases user 32 has logged events for.

Alright, so I update the user record every time a case is assigned, to reflect the "ownership count" for the user that lost it and the user that gained it. Right?
SELECT COUNT(DISTINCT(`caseID`)) FROM events WHERE userID = 31 AND loggedTime > NOW() - INTERVAL 30 DAY;
...Okay, now what? I need to know how many unique cases user 31 has logged events for in the last 30 days. Suddenly, it becomes very complex to keep updated. So complex, in fact, that it's better for overall performance if I just leave it with the straight-up COUNT syntax with no further hooplah.

In the end, it comes down to this: Where do I need the speed?

Both me and the people who will be using the statistics on a daily basis are fine with the statistics taking a few seconds to load. That's not where the performance is needed.
Where it is needed, however, is in the so-called Morphing, where our business rules are applied. It's also needed in Case Selection, where the cases are prioritized and the one that comes up on top is handed to the first available agent.

So, yeah, COUNT is relatively slow, but I won't be using that as much as long, complicated and confusing SELECT that don't have any COUNT in them at all.
It's these 50-line SELECTs I have to worry about. The WHEREwolves.

When it comes to these beasts, I set out to try and find dirt on each database engine. The problem with doing this kind of research is that anything I might find might be very much out of date. A major problem with any DB engine could be fixed in the latest release, so I need the data as fresh as I can get it.

How about June 2009? Since that article, PostgreSQL has advanced from 8.3.7 to 9.0.4 and MySQL from 5.1.30 to 5.5.14.
Obsolete data!

After over an hour just searching for data and even trying to run some benchmarks of my own, I spoke to a coworker on the subject. He's not doing software development for the company, but he runs a website with thousands and thousands of hits per day in his spare time. It runs MySQL.
He made me realize one major upside to MySQL: People know MySQL!

I won't be around to maintain this application for ever, and poking around the company a little reveals a healthy heap of MySQL people, but only the occasional "Yeah, I've used Postgres once".

So, it turns out that this decision won't be about dependability or performance at all. It will be about maintainability down the line, and about what people out there know how to use.
With the performance so seemingly close, and dependability reportedly being so much improved for InnoDB in MySQL in later versions, it seems we're staying with it.

I just never even thought of this reason to do so.

Thanks, Tom Erik. Your perspective was very valuable to me.

15 Comments

Nice story, there is a trade-off between maintainability and performance. We'd all be writing in an Assembly language otherwise! It's a good choice to stick with the tools that are best supported.

I will say though, InnoDB COUNT() is not in itself "slow" - COUNT() without a WHERE clause is the slow one. The queries you mention have WHERE clauses, and thus shouldn't be orders-of-magnitude slower. It's also slower than MyISAM because it's the correct value. MyISAM COUNT() is just an approximate value, not guaranteed to be accurate.

Both MySQL and PostgreSQL have their advantages and disadvantages, but these days you can basically treat them as equals unless one or the other has some very specific feature that you need.

It seems you're doing a good job of weighing those pros and cons. However, I wouldn't test MySQL's performance by using MySQL. Use one of the faster and more stable variants like Percona to run your tests, and if you choose MySQL, deploy on Percona. It's a drop-in replacement that I've found to be 20-25% faster, with better backups, and less issues.

I'll second the vote for using a Percona build of MySQL. The have some nice patches (some contributed by Google) that make it perform better, scale better and easier to measure.

And like JT said, these days the MySQL vs Postgres battle is pretty much a wash unless you're looking for something very specific. They both scale well, they both perform well and they are both accurate when handling complicated queries. The main difference, like you pointed out is that more people know MySQL and in my experience that means the community is more vibrant.

In my opinion you cannot reasonably entrust any data to a database that asks you to explicit SET sql_mode = "TRADITIONAL" to get it to reject rather than mangle bad data…

There are many cases where MySQL’s schizophrenic conceptions of data have bit me. The fact that the date 0000-00-00 both IS and IS NOT NULL, f.ex. Other long-standing limitations have persisted forever – if you have too many TEXT columns and they get too full then Inno just cannot store the row (error 139).

The other big one to me is that Postgres has transactional ALTER TABLE, which means you can implement zero-downtime-redeploy fairly trivially.

MySQL just feels rickety to me. Postgres feels like a power tool. It has limitations, sure, but what is there is well engineered, unlike MySQL, which feels like a pile of slipshod hacks liable to blow up in my face at any inopportune moment.

While it may be true that you can compare MySQL and PostgreSQL if you only want them as data stores, there's no doubt which one wins if you'd like to push functionality into the database.

Btw, the count() problem is true (and hard to solve for MVCC based databases), but most uses in a web environment are for paging, and as such don't need to be 100% exact.

For PostgreSQL you could use the planner's estimate:

http://wiki.postgresql.org/wiki/Count_estimate

@Aristotle, so what you're saying is MySQL is just like Perl? :)

The "less issues" comment is referring to some of the patches that Percona applies to their releases. They have a few extra patches applied that fix some long time bugs, table corruption issues, and some performance bottlenecks. Basically, it comes down to Percona being more on the ball than MySQL/Sun/Oracle is. Eventually mainline MySQL will probably do everything Percona does, but by then Percona will have new releases out that do more than what mainline MySQL does.

In addition to raw database performance you may also choose to consider periodic roll up of reporting data into summary bucket structures such as star schema or data cubes. The idea here is that most of the counting is done in the periodic batch and then reporting itself is only doing sums over the buckets.

For the extra feature list you can go to the comparison table on their site:

http://www.percona.com/software/percona-server/feature-comparison/

A lot of the neat indexing / locking / buffering stuff for performance is listed at the bottom of that chart.

And their benchmarks can be found here:

http://www.percona.com/software/benchmarks/

And the bug patching stuff is an ongoing deal. You pretty much need to read the forums, blogs, and release documents to find out about each of those. Unfortunately there's not one good location where you can get a list of those, at least that I know of.

Mike: would that it were. MySQL actually is a lot like PHP.

Leave a comment

About Demonen

user-pic I'm just a lowly software developer in a huuuuuge company, and I try to do my part. Cog in the machine!