cpXXXan is moving ...
... although you probably won't notice.
Executive summary: your disks hate you
Until about 20 minutes ago, cpXXXan ran in a virtual machine on a box that I rent. That box also hosts VMs for CPANdeps, for some of my own CPAN-testing activities, and a few other things. I did it that way because it was cheap and convenient. However, over the three years that it's been running (gosh, is it really that long?!?) this has become a rather, umm, "sub-optimal" solution.
That's because the CPAN has got much larger, as has the number of CPAN-testers reports. Even worse, both have been growing at an ever-increasing rate. The amount of work to be done for the daily imports of new data, both for cpXXXan and for CPANdeps, has therefore increased dramatically, so the jobs take longer, and scheduling them has become a Hard Problem.
Why would that matter? Well, the import jobs for both sites were very disk-intensive. That's partly unavoidable because, for example, cpXXXan has to index all the distributions on the CPAN just like PAUSE does (actually it's worse - cpXXXan has to index the BackPAN). But it also happened because I didn't have enough memory available to configure the databases behind both sites properly, so for the rather large queries they were doing, MySQL had to keep going to disk - building temporary tables, reading tables into memory, and so forth. Disks are, as you know, SLOOOOW. A typical process, such as building an index of what modules work with perl 5.6.2, would take something like 30 minutes, of which 28 minutes was just waiting around while the disk thrashed. And that's a small dataset. It would take four days to build a point-in-time index of what the CPAN looked like at midnight on 2012-01-01.
So, why would scheduling matter? Well, deep down in the bowels of your operating system is some code that attempts to optimise disk accesses. It works by doing things like opportunistically reading ahead in files that you've recently accessed on the assumption that you're about to read more of the same file, and by keeping a cache in memory of what has been read recently. But it makes one really bad assumption - that it is the only disk access scheduler running. This is clearly a terrible assumption to make if you have two or more disk-intensive VMs running on the same hardware and accessing the same physical disk. The clever optimisations become an active hindrance, as the individual schedulers can have no idea of what the others are doing, and so the disk thrashes.
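To make the thrashing a bit more concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the block numbers, the stream lengths and the idea of measuring "cost" as head travel are all made up for illustration, and it ignores caching and read-ahead entirely. It simply compares how far a disk head would have to move when two VMs' sequential reads are serviced one after the other, versus when they arrive interleaved.

```python
def total_seek_distance(requests):
    """Sum of head movements (in blocks) when requests are serviced in order."""
    head = requests[0]
    distance = 0
    for block in requests:
        distance += abs(block - head)
        head = block
    return distance


def sequential_stream(start, length):
    """Block numbers for one VM reading a file from start to finish."""
    return list(range(start, start + length))


# Two hypothetical VMs, each reading 1000 consecutive blocks
# from widely separated regions of the same physical disk.
vm_a = sequential_stream(0, 1000)
vm_b = sequential_stream(500_000, 1000)

# Best case: the disk serves one VM's stream, then the other's.
one_after_another = vm_a + vm_b

# Worst case: the two streams arrive strictly interleaved.
interleaved = [block for pair in zip(vm_a, vm_b) for block in pair]

print("one after another:", total_seek_distance(one_after_another))
print("interleaved:      ", total_seek_distance(interleaved))
```

Each VM thinks it is doing a nice sequential read, but in the interleaved case the head spends nearly all of its time travelling between the two regions.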
So I've now moved cpXXXan to its own hardware, and will be doing the same for CPANdeps soon.
It now runs a lot faster. There's enough memory that MySQL hardly ever has to touch the disk, and doesn't have to build temporary tables on disk. And when it does have to use the disk - for example when re-indexing the BackPAN - there are no other VMs actively sabotaging the scheduler. The end result is that instead of taking 30 minutes to rebuild the index for cp5.6.2an, it takes 2 minutes. During those two minutes there is hardly any disk I/O, which means that I might as well do another rebuild in parallel and make use of the second CPU (something that I just couldn't do before, as the increased I/O contention would in fact have slowed things down). While the daily update is running, the machine looks like this ...
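The "one rebuild per CPU" idea can be sketched in Python as follows. This is not how cpXXXan does it: `rebuild_index` is a hypothetical stand-in that just burns CPU, and `rebuild_in_parallel` and its parameters are names invented for this illustration. In the real setup the heavy lifting happens inside mysqld rather than in the build processes themselves.

```python
from concurrent.futures import ProcessPoolExecutor


def rebuild_index(target):
    """Hypothetical stand-in for one index rebuild: pure CPU work, no disk I/O."""
    checksum = 0
    for i in range(200_000):
        checksum = (checksum + i * i) % 1_000_003
    return target, checksum


def rebuild_in_parallel(targets, workers=2, executor_cls=ProcessPoolExecutor):
    """Run one rebuild per worker process so that each can occupy its own CPU."""
    with executor_cls(max_workers=workers) as pool:
        return dict(pool.map(rebuild_index, targets))
```

Called from under an `if __name__ == "__main__":` guard with two targets, this runs both rebuilds at once. That only pays off when the jobs are CPU-bound. If they were both waiting on the same disk, the second worker would just add contention.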
Cpu(s): 99.7%us, 0.2%sy, 0.0%ni, 0.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6837 mysql     20   0 1408m 1.1g 2648 S  200 58.0  6773:31 mysqld
21049 david     20   0 39172 6500 2740 S    0  0.3  0:00.05 build02packages
21053 david     20   0 39172 6500 2740 S    0  0.3  0:00.05 build02packages
Notice that it is wasting no time at all waiting around for I/O (0.0%wa) and that both CPUs are being used fully. It isn't quite twice as fast as running just one build02packages process on its own, probably because of contention for memory access. But combined with the 15x speed-up from eliminating almost all disk I/O (30 minutes down to 2), that's still getting on for 30 times faster.
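If you'd rather not eyeball that summary line, a few lines of Python will pull the numbers out of it. This is a minimal sketch that assumes the exact layout shown above. Other versions of top may format the line differently, and `parse_cpu_line` is just a name made up for the example.

```python
import re


def parse_cpu_line(line):
    """Turn a top 'Cpu(s):' summary line into a {label: percentage} dict."""
    fields = {}
    # Each field looks like '99.7%us': a number, a percent sign, a short label.
    for value, label in re.findall(r"([\d.]+)%(\w+)", line):
        fields[label] = float(value)
    return fields


line = "Cpu(s): 99.7%us, 0.2%sy, 0.0%ni, 0.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st"
cpu = parse_cpu_line(line)

print("user time:", cpu["us"])
print("I/O wait: ", cpu["wa"])
```

The `wa` field is the one to watch: it is the share of time the CPUs sat idle waiting for I/O to complete.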