To switch or not to switch - The MySQL Question
Before I let my enthusiasm overpower me and jump into this rewrite with all it's worth, I have to take a long and hard look at what technologies are being used, and what I want to replace them with.
Swapping out Class::DBI with DBIx::Class is an easy choice, and I'll get back to that in a later post, but what of the engine behind it?
Currently, the whole thing is built on MySQL, but during the lifespan of this fairly complex CRM application I have seen some of the limitations of MySQL up close.
A lot of my contacts and friends also insist that while MySQL is fine for a website or any simple CRUD operations, it will let you down when it comes to more complex queries.
Rumor has it that these problems are both in performance and dependability, and while it might have been simple user error, I have seen both.
Keep in mind, however, that I am no DBA. What indexing my databases currently have I have gotten help from friends and colleges to build, or they were already built, and I inherited them.
What I think, however, is that the optimum choice of database is largely dependent on the use case.
For our statistics needs, for example, there is a lot of COUNT usage, and anyone worth their salt knows that PostgreSQL and most (all?) other MVCC-implementing database engines, even InnoDB for MySQL, will COUNT rather slowly.
A splash of googling turned up a PGSQL COUNT(*) workaround.
I must say, that's pretty good. The total number of rows in the database, however, is useless to me.
I need stuff like...
SELECT COUNT(DISTINCT(`caseID`)) FROM events WHERE userID = 31;
...to determine how many unique cases user 32 has logged events for.
Alright, so I update the user record every time a case is assigned, to reflect the "ownership count" for the user that lost it and the user that gained it. Right?
SELECT COUNT(DISTINCT(`caseID`)) FROM events WHERE userID = 31 AND loggedTime > NOW() - INTERVAL 30 DAY;
...Okay, now what? I need to know how many unique cases user 31 has logged events for in the last 30 days. Suddenly, it becomes very complex to keep updated. So complex, in fact, that it's better for overall performance if I just leave it with the straight-up COUNT syntax with no further hooplah.
In the end, it comes down to this: Where do I need the speed?
Both me and the people who will be using the statistics on a daily basis are fine with the statistics taking a few seconds to load. That's not where the performance is needed.
Where it is needed, however, is in the so-called Morphing, where our business rules are applied. It's also needed in Case Selection, where the cases are prioritized and the one that comes up on top is handed to the first available agent.
So, yeah, COUNT is relatively slow, but I won't be using that as much as long, complicated and confusing SELECT that don't have any COUNT in them at all.
It's these 50-line SELECTs I have to worry about. The WHEREwolves.
When it comes to these beasts, I set out to try and find dirt on each database engine. The problem with doing this kind of research is that anything I might find might be very much out of date. A major problem with any DB engine could be fixed in the latest release, so I need the data as fresh as I can get it.
How about June 2009? Since that article, PostgreSQL has advanced from 8.3.7 to 9.0.4 and MySQL from 5.1.30 to 5.5.14.
Obsolete data!
After over an hour just searching for data and even trying to run some benchmarks of my own, I spoke to a coworker on the subject. He's not doing software development for the company, but he runs a website with thousands and thousands of hits per day in his spare time. It runs MySQL.
He made me realize one major upside to MySQL: People know MySQL!
I won't be around to maintain this application for ever, and poking around the company a little reveals a healthy heap of MySQL people, but only the occasional "Yeah, I've used Postgres once".
So, it turns out that this decision won't be about dependability or performance at all. It will be about maintainability down the line, and about what people out there know how to use.
With the performance so seemingly close, and dependability reportedly being so much improved for InnoDB in MySQL in later versions, it seems we're staying with it.
I just never even thought of this reason to do so.
Thanks, Tom Erik. Your perspective was very valuable to me.
Nice story, there is a trade-off between maintainability and performance. We'd all be writing in an Assembly language otherwise! It's a good choice to stick with the tools that are best supported.
I will say though, InnoDB COUNT() is not in itself "slow" - COUNT() without a WHERE clause is the slow one. The queries you mention have WHERE clauses, and thus shouldn't be orders-of-magnitude slower. It's also slower than MyISAM because it's the correct value. MyISAM COUNT() is just an approximate value, not guaranteed to be accurate.
Yes, I've had the COUNT return inaccurate values before, and it's no fun at all, which is one of the reason we've stopped MyISAM usage entirely.
My own humble tests suggest that PostgreSQL is measurably slower to COUNT() than MyISAM, but also slightly slower than InnoDB on the same data, performing the same query.
Still, the difference is so small I don't know we'll notice it in production.
What we might notice in production, however, is that MySQL (supposedly) scales less elegantly than PostgreSQL. For the time being, however, the application in question is an internal-only CRM with about 70 users, so I'm unlikely to hit any performance snags in the database for the first few years.
Unless I do some "creative" indexing, that is.
I've "optimized" away a lot of performance with silly indices before, so I've grown sliiightly scared of them.... More on that later when I'm rebuilding the Model, I fear!
Both MySQL and PostgreSQL have their advantages and disadvantages, but these days you can basically treat them as equals unless one or the other has some very specific feature that you need.
It seems you're doing a good job of weighing those pros and cons. However, I wouldn't test MySQL's performance by using MySQL. Use one of the faster and more stable variants like Percona to run your tests, and if you choose MySQL, deploy on Percona. It's a drop-in replacement that I've found to be 20-25% faster, with better backups, and less issues.
I'll second the vote for using a Percona build of MySQL. The have some nice patches (some contributed by Google) that make it perform better, scale better and easier to measure.
And like JT said, these days the MySQL vs Postgres battle is pretty much a wash unless you're looking for something very specific. They both scale well, they both perform well and they are both accurate when handling complicated queries. The main difference, like you pointed out is that more people know MySQL and in my experience that means the community is more vibrant.
Thanks for the heads up on Percona, guys. I just deployed it on my private VPS and gave it a quick whirl, and it certainly is a 100% drop-in replacement. I host a small forum there, and I don't even think they noticed.
Now for a splash of performance testing!
JT Smith, when you say "less issues", what do you mean? Is there a list of bugfixes or anything? I've been unable to find anything other than "because he said so, and I trust him" sort of documentation on this, and I might need a whole lot more than that to get The Server Guys to approve me running "unknown software" (which I'm sure Percona is to them as well), on the servers...
If I just say it's a better version of MySQL, then they'll just say "use MySQL then" and revoke my root access for talking poorly about them on this blog.
OH WAIT I MEA....
..carrier lost
In my opinion you cannot reasonably entrust any data to a database that asks you to explicit
SET sql_mode = "TRADITIONAL"
to get it to reject rather than mangle bad data…There are many cases where MySQL’s schizophrenic conceptions of data have bit me. The fact that the date 0000-00-00 both
IS
andIS NOT NULL
, f.ex. Other long-standing limitations have persisted forever – if you have too manyTEXT
columns and they get too full then Inno just cannot store the row (error 139).The other big one to me is that Postgres has transactional
ALTER TABLE
, which means you can implement zero-downtime-redeploy fairly trivially.MySQL just feels rickety to me. Postgres feels like a power tool. It has limitations, sure, but what is there is well engineered, unlike MySQL, which feels like a pile of slipshod hacks liable to blow up in my face at any inopportune moment.
While it may be true that you can compare MySQL and PostgreSQL if you only want them as data stores, there's no doubt which one wins if you'd like to push functionality into the database.
Btw, the count() problem is true (and hard to solve for MVCC based databases), but most uses in a web environment are for paging, and as such don't need to be 100% exact.
For PostgreSQL you could use the planner's estimate:
http://wiki.postgresql.org/wiki/Count_estimate
@Aristotle, so what you're saying is MySQL is just like Perl? :)
Aristotle: Yeah, the whole 0000-00-00 IS (SOMETIMES) NULL thing makes me not trust date/datetime/time/timestamp for nullness, and always use <= 0000-00-01 or something silly like that. Also, it's a bad experience with TEXT in the past that has me putting silly limits on myself of "max one TEXT column per table".
Still, as long as you're aware of these, it's fine. The problem is the ones I'm not aware of and things I can't easily control, such as silly dates and stuff.
Well, bad example, since I validate date and time, but ... urr ... yeah, who knows what they might "assist" me with.
My point is: sql_mode = "TRADITIONAL" for sure.
As far as the pile of hacks goes, I used to agree, but it seems MySQL has tightened it down a lot lately. A lot of the features feel a little "tacked on", however.
I have planned maintenance outside business hours to do my ALTER TABLE stuff, but yeah, transactional table shenanigans shows they certainly buttoned down their transactions TIGHT.
kaare: Yes, but when I produce statistics that are used to report to clients I can't in good conscience give it away as accurate, and I'm not comfortable with the "it might be a little off" disclaimer. It doesn't feel very professional.
The "less issues" comment is referring to some of the patches that Percona applies to their releases. They have a few extra patches applied that fix some long time bugs, table corruption issues, and some performance bottlenecks. Basically, it comes down to Percona being more on the ball than MySQL/Sun/Oracle is. Eventually mainline MySQL will probably do everything Percona does, but by then Percona will have new releases out that do more than what mainline MySQL does.
In addition to raw database performance you may also choose to consider periodic roll up of reporting data into summary bucket structures such as star schema or data cubes. The idea here is that most of the counting is done in the periodic batch and then reporting itself is only doing sums over the buckets.
JT: What's your source on this? I can't really find anything concise on it over at the Percona site.
chris: That's already being done for the long-term reporting, but something like 90% of the reporting needs are real-time, or at the very least hour-to-hour.
"Situation RIGHT NOW, split by market, team, site and product type" sort of thing, and "Top and bottom 10 active agents" for real-time coaching and redistribution of work. Waiting 24 hours for the cube to update simply won't cut it :)
For the extra feature list you can go to the comparison table on their site:
http://www.percona.com/software/percona-server/feature-comparison/
A lot of the neat indexing / locking / buffering stuff for performance is listed at the bottom of that chart.
And their benchmarks can be found here:
http://www.percona.com/software/benchmarks/
And the bug patching stuff is an ongoing deal. You pretty much need to read the forums, blogs, and release documents to find out about each of those. Unfortunately there's not one good location where you can get a list of those, at least that I know of.
Mike: would that it were. MySQL actually is a lot like PHP.
Thanks JT, I'll compile a document for our server guys to try and convince them.