When Perl is not applicable
The story repeats itself again and again, and there's a single conclusion I can draw from it: never do large data aggregation in SQL and Perl.
This time I had a Perl script which builds a standard business report: lots of items showing their performance by week, month, quarter and so on. So you've got a few big two-dimensional tables where one axis is time and the other one holds the data item names. Say, since the beginning of the year about 30 weeks have passed, eight months, two quarters. That gives three tables, with weeks, months and quarters on their time axes.
Basically, all you need is sum(field) and count(field) where date between date1 and date2, applied to every type of data you've got in the system. In my case it was a few million records with 10-20 parameters each.
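To give an idea of the pattern, here is a tiny sketch of the per-bucket queries the original script kept sending, one per time bucket per report. The table and column names (measurements, item, value) and the date boundaries are made up for illustration; the sketch only prints the SQL it would run:

    #include <cstdio>

    // Hypothetical period boundaries; the real script derived them
    // from the current date for every week, month and quarter.
    struct Period { const char* from; const char* to; };

    int main() {
        const Period weeks[] = {
            {"2010-01-04", "2010-01-10"},
            {"2010-01-11", "2010-01-17"},
            // ...one entry per week since the beginning of the year
        };
        for (const Period& p : weeks) {
            // One select round-trip per bucket, repeated for every
            // report table: this is where the hours go.
            std::printf("SELECT item, SUM(value), COUNT(value) "
                        "FROM measurements "
                        "WHERE date BETWEEN '%s' AND '%s' "
                        "GROUP BY item;\n",
                        p.from, p.to);
        }
        return 0;
    }

Multiply that by 30 weeks, eight months and two quarters, for every report, and the round-trips add up quickly.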
Perl works fine, and it took a couple of hours to update the latest statistics every night. Things got worse when I needed to recalculate everything, just because of some changes in the data and the algorithm. I started the script in the morning, and it only disappeared from the process list the next day. So it took about 24 hours to complete and build the whole new set of data. Fortunately, nobody even noticed the absence of the data, but it is still far too long for such a modest amount of data.
I rewrote the whole script from scratch in C++. Yes, it is a bit more difficult to manipulate strings in C++, and there's no standard DBI there, but there's a big advantage: it uses only four bytes for an integer value, so I can keep lots of data in memory during the calculations.
This allowed me to change the algorithm and remove all the unnecessary, nearly identical SQL selects with their changing where conditions. Now I pass over every piece of data only once. It works so fast that I don't even have to update the programme to let it recalculate only the latest data. The whole set from the beginning of the year is now done in 30 seconds; well, maybe two minutes with the additional calculations that were absent in the original Perl script.
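For the curious, here is a minimal sketch of the single-pass idea. The record fields and bucket counts are my illustration (matching the roughly 30 weeks, eight months and two quarters from above), not the real programme:

    #include <array>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Running sum and count for one (item, time bucket) cell.
    struct Stats {
        long long sum   = 0;
        long long count = 0;
    };

    // One raw record as the single "give me everything" select
    // returns it; the field names are made up for this sketch.
    struct Record {
        std::string item;
        int week;        // 0..29, derived from the record's date
        int month;       // 0..7
        int quarter;     // 0..1
        long long value;
    };

    int main() {
        // One accumulator table per report: items on one axis,
        // time buckets on the other.
        std::map<std::string, std::array<Stats, 30>> weekly;
        std::map<std::string, std::array<Stats, 8>>  monthly;
        std::map<std::string, std::array<Stats, 2>>  quarterly;

        // In the real programme these records stream from the DB;
        // a tiny hard-coded batch stands in for them here.
        std::vector<Record> records = {
            {"item-a", 0, 0, 0, 42},
            {"item-a", 1, 0, 0,  7},
            {"item-b", 0, 0, 0, 13},
        };

        // The whole trick: touch every record exactly once and
        // update all three tables as you go. No per-bucket where
        // conditions, no repeated selects.
        for (const Record& r : records) {
            Stats& w = weekly[r.item][r.week];
            Stats& m = monthly[r.item][r.month];
            Stats& q = quarterly[r.item][r.quarter];
            w.sum += r.value; w.count++;
            m.sum += r.value; m.count++;
            q.sum += r.value; q.count++;
        }

        // Print the weekly report table.
        for (const auto& [item, buckets] : weekly)
            for (std::size_t i = 0; i < buckets.size(); ++i)
                if (buckets[i].count)
                    std::printf("%s, week %zu: sum=%lld count=%lld\n",
                                item.c_str(), i, buckets[i].sum,
                                buckets[i].count);
        return 0;
    }

Everything stays in memory and each record is read exactly once, so the running time is dominated by a single sequential fetch from the database.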
I even decided not to use fork, and skipped multiple-value SQL inserts in some parts of the code: there's simply no need for that kind of optimization. Not to mention that the programme consumes almost no memory, which means I will be able to use it far into the future with much bigger data sets.
This is not a blog post of self-congratulation :-) It is just a reminder that summing and counting is not the best task for either Perl or the database. Just let the DB give you the raw data and count everything yourself in C++.
Even being a strong Perl lover, I find it important to think about alternatives where they are applicable. Another story connected with this is that I tried to go beyond Perl while working at Booking.com, and I was asked to leave the team and later the company because of that. Be warned :-)