SQLite Database Fixes

Recently Andreas alerted me to a problem with the SQLite database used to store the basic metadata for the CPAN Testers statistical database, aka cpanstats. Some queries against the database now return the error "database disk image is malformed". It's unclear exactly where the corruption occurred, but it appears to have surfaced only recently.
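For anyone who keeps a local copy and wants to check it, SQLite can verify a database file with its built-in integrity_check pragma. Here is a minimal sketch using Perl's DBI; the file name is an assumption:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Open the cpanstats SQLite file (the path is illustrative).
my $dbh = DBI->connect('dbi:SQLite:dbname=cpanstats.db', '', '',
                       { RaiseError => 1 });

# PRAGMA integrity_check walks the whole file, including the indices,
# and returns a single row 'ok' when nothing is malformed.
my $rows = $dbh->selectcol_arrayref('PRAGMA integrity_check');
if (@$rows == 1 && $rows->[0] eq 'ok') {
    print "database is intact\n";
}
else {
    print "corruption detected:\n";
    print "  $_\n" for @$rows;
}
```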

As a consequence I am now rebuilding the complete SQLite database. This means that the downloads available from the Development website will remain static until the rebuild is complete. Once it is complete and verified, the backup mechanisms will be re-enabled.
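For reference, the usual salvage route for a malformed file is to dump whatever SQLite can still read and replay it into a fresh database. A sketch along those lines, shelling out to the sqlite3 command-line tool (the file names are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Dump everything SQLite can still read from the damaged file and replay
# it into a fresh database. The sqlite3 command-line tool must be on the
# PATH; the pipeline runs via the shell.
system(q{sqlite3 cpanstats.db .dump | sqlite3 cpanstats-rebuilt.db}) == 0
    or die "rebuild failed: $?";
```

On a healthy database, a plain VACUUM achieves a similar rewrite of the file and its indices in place.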

However, there is the possibility that the database has now grown so large (over 17 million records) that the data storage, and particularly the indexing, is not being written to disk correctly. With the database currently around 5GB uncompressed, and just under 1GB compressed, reducing its size would help with efficiency, disk IO and bandwidth. So I have started to think about the alternative options.

Firstly, the easiest option would be to create a database that only includes reports for distribution releases currently on CPAN. This would reduce the size slightly, but as new reports are now being submitted at a far greater rate than the volume we would save by excluding releases that reside only on BACKPAN, the short-term gain would be lost within a month or two.
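As a rough illustration, such a filtered copy could be built with SQLite's ATTACH. The table name and the on_cpan flag below are hypothetical; the real cpanstats schema may record CPAN versus BACKPAN status differently:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Create a smaller database holding only reports for releases still on
# CPAN. 'cpanstats' and 'on_cpan' are assumed names, not the real schema.
my $dbh = DBI->connect('dbi:SQLite:dbname=cpanstats-cpan.db', '', '',
                       { RaiseError => 1 });
$dbh->do(q{ATTACH DATABASE 'cpanstats.db' AS full_db});
$dbh->do(q{CREATE TABLE cpanstats AS
           SELECT * FROM full_db.cpanstats WHERE on_cpan = 1});
$dbh->do(q{DETACH DATABASE full_db});
# Note: CREATE TABLE ... AS SELECT does not copy indices; they would
# need to be recreated on the new file.
```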

Secondly, we could store the data in snapshots covering a set period, such as a year. Those that only need updates could retrieve just the latest instances (the current plus the previous year), while those starting from scratch would first need to download all the annual snapshots. All the data would remain easily accessible, but anyone who wishes to maintain their own database would need a mechanism to update their local copy, rather than simply downloading a database that is instantly ready.
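Splitting the existing file into such snapshots could look something like the following. The cpanstats table name, the YYYYMMDDhhmm-style fulldate column and the starting year are all assumptions about the schema:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $this_year = (localtime)[5] + 1900;

# Carve the full database into one snapshot file per year.
for my $year (1999 .. $this_year) {
    my $snap = DBI->connect("dbi:SQLite:dbname=cpanstats-$year.db", '', '',
                            { RaiseError => 1 });
    $snap->do(q{ATTACH DATABASE 'cpanstats.db' AS full_db});
    # fulldate is assumed to begin with the four-digit year
    $snap->do(qq{CREATE TABLE cpanstats AS
                 SELECT * FROM full_db.cpanstats
                 WHERE substr(fulldate,1,4) = '$year'});
    $snap->do(q{DETACH DATABASE full_db});
    $snap->disconnect;
}
```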

A third option is to provide an API that returns a list of the updates in CSV format, with the ability to specify a from date and a to date, with the full SQLite database only being created on a monthly basis. This would reduce disk IO and bandwidth, but would still require a local copy to be maintained with regular updates.
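From the consumer's side, applying such updates might look like this. The URL, its parameters, the table name and the column count are all assumptions for illustration, not a published API:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use LWP::Simple qw(get);
use Text::CSV;

# Fetch CSV rows for a date range from a hypothetical updates endpoint.
my $csv_text = get('http://example.org/cpanstats/updates?from=20100101&to=20100131')
    or die "fetch failed\n";

my $dbh = DBI->connect('dbi:SQLite:dbname=cpanstats.db', '', '',
                       { RaiseError => 1 });
my $ins = $dbh->prepare('INSERT OR REPLACE INTO cpanstats VALUES (?,?,?,?)');

my $csv = Text::CSV->new({ binary => 1 });
for my $line (split /\n/, $csv_text) {
    $csv->parse($line) or next;      # skip malformed lines
    $ins->execute($csv->fields);     # placeholder count is illustrative
}
```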

A fourth option would be to switch to MongoDB (or a similar database) and enable replication. This would reduce bandwidth usage and disk IO; however, anyone wanting to maintain a local copy would have to put in a lot more work to set up and configure it, and would need to adapt any tools they currently have to the new style of database storage. In the longer term, this is possibly an idea for a Google Summer of Code project.

Of all of these, the journalling style of the third option would probably be the most practical in the short to medium term, while moving to a database that supports efficient replication would, I think, be better in the long term. All would require work and/or tools from anyone maintaining their own copy, but if we can prepare all the necessary code before switching over, the change shouldn't be too painful.

These are all only ideas at the moment, and no plans have been made to change anything. If you have other suggestions that might be viable, I'd be pleased to hear from you. I'd also like to know, if you currently download the SQLite DB, whether any of these ideas would work for you.

As mentioned, no changes are planned, and beyond the current rebuild the current DB is not going away any time soon. However, it does make sense for us to look at more efficient ways of exposing the data. Scaling for the future should not mean continually copying the same 5-10GB of data around our ecosystem every day.

Cross-posted from the CPAN Testers Blog

Comments

If you decide to go that route, take the “or similar” option: Don’t use MongoDB.
