When Perl is not applicable

By Andrew Shitov on August 6, 2012 8:54 PM

The story repeats again and again, and there's a single conclusion I can make out of that: never do large data aggregation in SQL and Perl.

This time I had a Perl script which builds a standard business report: lots of items show their performance by week, month, quarter etc. Thus you've got a few big two-dimensional tables where one axis is time and another one contains data item names. Say, from the beginning of the year there are about 30 weeks past today, eight months, two quarters. These are three tables with weeks, months and quarters on their time axis.

Basically, all you need is sum(field), count(field) where date between date1 and date2, applied to every type of data you've got in the system. In my case it was about a few million records with 10-20 parameters in each.

Perl works good, and it took about a couple of hours to update the latests statistics every night. Things go worse when I had a need to recalculate everything. Just because of some changes in data and algorithm. Starting a script in the morning I saw it only disappeared from the process list the next day. Thus it was something about 24 hours to complete and build the whole new set of data. Hopefully, nobody could even notice the absence of data but still it is too long for not so big array of data.

I rewrote the whole script from scratch and wrote it in C++. Yes, it is a bit more difficult to manipulate strings in C++ and there's no standard DBI there, but there's big advantage: it uses only four bytes for integer values and I can keep lots of data in memory during my calculations.

This allowed me to change the algorithm and remove all the unnecessary kind-of-repeated SQL selections with changing where conditions. Now I only come through every piece of data once. It works so fast that I even don't have to update the programme to allow it recalculate only the latest data. The whole set from the beginning of the year is now done in 30 seconds. Well, maybe two minutes for additional calculations that were absent in the original Perl script.

I even decided not to use fork and did not use multiple value SQL inserts in some parts of the code. There's just no need for this kind of optimization. Not to mention that it consumes almost no memory which means I will be able to use it in the far future with much bigger data sets.

This is not the blog post of self-honour :-) This is just a reminder that adding sums and counting is not the best task neither for Perl or the database. Just leave the DB give you raw data and count everything yourself in C++.

Being a strong Perl lover, it's so important to think about alternatives when it is applicable. Another story connected this this is that I tried to go beyond Perl when working in Booking.com and I was asked to leave the team and later the company because of that. Be warned :-)

38 comments

Tagged as:

algorithm, c++, coding, database, perl, rewrite, smart, sums, think

38 Comments

bill.ruppert.myopenid.com | August 6, 2012 10:22 PM | Reply

I feel uncomfortable posting a negative comment while knowing so little about the problem but saying "never do large data aggregation in SQL and Perl" is very provocative and forces me to comment, admittedly somewhat in ignorance.

Your data set is not particularly large if it is only in the single digit millions of rows with only 10 or 20 columns per row. Your SQL engine should do most of the work very easily, especially if indexed on date. If not, extracting to a flat file, using the system sort, and summarizing that should also be very quick, minutes not hours.

Your larger point of being careful which tools you use is very valid, but that "never do" is just wrong. Sorry.

Andrew Shitov | August 6, 2012 10:31 PM | Reply

I personally find "extracting to a flat file, using the system sort, and summarizing that" not sexy at all :-)

Alexandr | August 6, 2012 11:36 PM | Reply

Don't take anyone's shit, sometimes perl is not applicable at all.
On the other hand your company won't be able to use cheap developers to maintain your code. So I guess Booking.com did the right thing :-)

Andrew Shitov replied to comment from Alexandr | August 6, 2012 11:58 PM | Reply

There's no point at all in thinking that developers must be cheap.

Mr. Muskrat | August 7, 2012 12:34 AM | Reply

You assumed that Perl was the problem without any real proof. The way that you have described them, your reports are the sort of thing that every Perl business runs so why are yours so darn slow?

It sounds to me like you started with a bad design in the Perl** code and instead of just refactoring, you went to the extra effort of rewriting it in C++ with a major design overhaul.

If you think you need to rewrite your Perl** code in C or C++ and refactor it in the process because it is too slow, start by just refactoring the Perl** code because you will probably be surprised at how much of the original problem was due to a bad design.

** and/or DB

Alexandr | August 7, 2012 12:41 AM | Reply

I meant that there are a lot of perl developers that don't know C++ and very few that do. So relying on those few is a bad idea for booking.com.

Andrew Shitov replied to comment from Mr. Muskrat | August 7, 2012 12:50 AM | Reply

Thanks for your opinion, and I would not going to reject it. On the other hand as I said I base my thoughts on my experience. Perl is not for adding up millions of values in a reasonable time.

http://petdance.com/openid/ | August 7, 2012 2:13 AM | Reply

The big question here is how many times you have to do these calculations. Twice ever? Twice a month? Twice a day? Twice an hour?

When you have to do it twice an hour, I can see coding it in C. When you have to do it twice ever, it probably makes sense in Perl. And probably the truth lies somewhere in the middle.

Steffen Mueller | August 7, 2012 6:58 AM | Reply

Andrew, cut the crap.

I write at least some C virtually every day at work, for Booking.com and I'm a damn long way from leaving. I can even prove this. Check the my CPAN modules. Half of the XS modules that were touched since 2010 are written with my employer's support.

Phred | August 7, 2012 7:30 AM | Reply

I concur, only do *massive* data aggregation in SQL and Perl ;)

http://www.perl.com/pub/2005/05/05/aggregation.html

Andrew Shitov replied to comment from Steffen Mueller | August 7, 2012 8:57 AM | Reply

Steffen, be honest to yourself. You picked up an aggregation project and wrote it in Perl with flat file dumps.

Mithaldu | August 7, 2012 9:17 AM | Reply

The easiest way to convince people is with incontrovertible proof. Can you supply the perl script originally used, or maybe the nytprofhtml output for the hot spots?

Leon Timmermans | August 7, 2012 9:24 AM | Reply

Have you considered using PDL? It makes a lot of big data handling in Perl a tremendous amount more efficient?

Andrew Shitov replied to comment from Leon Timmermans | August 7, 2012 9:42 AM | Reply

No I did not. As well as, for example, R or S :-)

Andrew Shitov replied to comment from Mithaldu | August 7, 2012 9:48 AM | Reply

Mithaldu, there's no goal to prove something to someone or to convince people doing my way. I wrote what I think about the issue. Also, I find it is rather useful to solve the problem on a higher level, not moving the code with silly refactoring or over-profiling it.

Steven Haryanto | August 7, 2012 11:19 AM | Reply

I'd reckon that a majority of people that frequent blogs.perl.org probably already know that when it comes to adding a few million numbers, Perl (as well as Python/Ruby, for that matter) will be quite a bit slower (and not to mention memory-hungry) than C. Hence projects like PDL are born.

What I'm a bit curious with is how adding a few million numbers is also quite a bit slower when done in SQL (which, presumably, is also written in C/C++).

But since the blog author declines to provide more details, I'll just say meh, and move on to the next blog post. :)

Andrew Shitov replied to comment from Steven Haryanto | August 7, 2012 11:51 AM | Reply

Steven, I can explain why I tend to avoid adding things inside the database in this task.

Look, I've got a list of numbers, one for a day. Now, you need to receive these numbers: count(*) and sum(value) during January, count(*) and sum(value) during February etc, together with count/sum for every week of Jan, Feb etc.

Thus you need to go along the data more than once if you rely on aggregation functions in SQL.

Opposite, you can select all the rows one by one and create a hash/map/dictionary in your programming language.

Sawyer X | August 7, 2012 12:32 PM | Reply

A simple "I found an instance in which I think using a very high level language isn't proper" blog post became a pool of seething anger. Why?

Did you really have to add that venomous attack on Booking? Is it really relevant here? Was it really worth it?

I was gonna comment to Alexandr as well, but seeing his Twitter feed, it seems like this is simply the way he writes, in which case I opt out thinking it's not gonna change anything.

Andrew, I think your post would resonate more with readers if you had kept it as "very high level languages" (or "dynamic languages", or even "scripting languages") and without the Booking comment. Instead it sounds like an attack on Perl and Booking (and at least on the Perl side I know you don't mean it at all).

Andrew Shitov replied to comment from Sawyer X | August 7, 2012 1:00 PM | Reply

You are telling me how you would write a blog post. But I wrote as I wrote it. I have no idea of the performance of other high level languages in these circumstances, why should I mention them?

http://www.butteredham.com/blog/ | August 7, 2012 2:49 PM | Reply

I think we all know that C/C++ are significantly faster than Perl. Most of us have probably written routines in those languages when we needed speed for certain tasks, and that is why XS modules exist, after all.

But saying that a task which takes 30 seconds in C++ was taking 24 hours in Perl is a pretty big claim -- especially when at least part of the work is being done at the database end which wouldn't change between the two. I have a hard time believing that Perl is that much slower at anything. Even if the overhead of Perl's data structures pushes you heavily into swap, I wouldn't expect it to make a program take 2880 times as long!

So to have that big a difference, it sounds like there was something wrong with your Perl script -- that in going to C++, you didn't just change languages, but changed the way it was being done. And that's the way your post reads; for one thing, you were able to eliminate a lot of SQL calls. Fair enough: sometimes doing something in a different language will inspire a better way of doing it. But that doesn't mean you couldn't have sped the Perl script up drastically by reworking it in similar ways, and it certainly doesn't support the claim in the first sentence of this post.

No, you're under no obligation to post your code for review. But why wouldn't you? Either you'll prove your detractors wrong, or they'll be able to offer improvements that will make Perl a viable choice for you when this "story repeats again and again." What do you have to lose?

Shawn H Corey | August 7, 2012 3:11 PM | Reply

"This allowed me to change the algorithm and remove all the unnecessary kind-of-repeated SQL selections with changing where conditions."

Obviously, the fault was not in Perl.

Andrew Shitov replied to comment from Shawn H Corey | August 7, 2012 4:37 PM | Reply

No, see my comment at August 7, 2012 11:51 AM. It was not possible to keep everything in memory with Perl.

tommolesworth.myopenid.com replied to comment from Andrew Shitov | August 7, 2012 5:52 PM | Reply

Look, I've got a list of numbers, one for a day. Now, you need to receive these numbers: count(*) and sum(value) during January, count(*) and sum(value) during February etc, together with count/sum for every week of Jan, Feb etc.
Thus you need to go along the data more than once if you rely on aggregation functions in SQL.

I'd agree with the concept that the right tool for the job is the one which gets the results you need (in the timeframe you need them by), just curious about this though:

Why would it need to process the data more than once? Data aggregation is something that SQL excels at. Presumably there are some missing details, but surely:

select month, week, count(value), sum(value)
from table
group by month, week

Only a single pass required, no marshalling of raw data and transmitting it outside the database - would expect this to be quicker to implement + run than either a Perl or C++ approach. Add rollup/cube operators as required - maybe the issue here is that there isn't enough documentation around for using Perl in OLAP situations like this?

Andrew Shitov replied to comment from tommolesworth.myopenid.com | August 7, 2012 7:38 PM | Reply

select month, week, count(value), sum(value) from table group by month, week

Tell me please where do I see the number of items and their sum in June?
+------+------+----+-------- | m | w | count | sum +------+------+----+-------- | 5 | 18 | 78 | 255.717 | 5 | 19 | 21 | 751.41 | 5 | 20 | 61 | 943.39 | 5 | 21 | 52 | 487.664 | 5 | 22 | 48 | 641.14 | 6 | 22 | 28 | 606.26 | 6 | 23 | 44 | 989.24 | 6 | 24 | 62 | 192.55 | 6 | 25 | 9 | 602.93 | 6 | 26 | 59 | 47.454 | 7 | 27 | 54 | 420.060 | 7 | 28 | 28 | 191.88

Andrew Shitov replied to comment from http://www.butteredham.com/blog/ | August 7, 2012 7:47 PM | Reply

Yes, it's difficult to believe but it's just different speed when you don't rely on DB's (MySQL to be precise) aggregations and only exploit the simplest operations outside of it.

As you may guess from the comments, my previous colleagues were also trying to speed the things up, but rather failed to do it with pure Perl. Dumping SQL to flat files should be regarded going beyond Perl.

Aristotle replied to comment from Andrew Shitov | August 7, 2012 8:05 PM | Reply

SELECT month, week, COUNT(value), SUM(value)
FROM table
GROUP BY month, week WITH ROLLUP

Andrew Shitov replied to comment from Aristotle | August 7, 2012 8:10 PM | Reply

Now measure execution time and compare not on a bare table but in a real business application.

Steffen Mueller | August 7, 2012 9:14 PM | Reply

"and failed" is interesting because you're starting from a false premise. The goal was to speed things up and that was achieved. The major win was an improved algorithm, including caching of intermediate results, and optimal work scheduling. The run time of the calculations now scales much better (sub linear with set size) and the process has O(1) memory overhead. Additionally, we did not have to go far from our usual tools and instrumentation (like how we manage database access) to get there and every single one of our developers can maintain the code.

We did not fail. We smoked the problem.

Oh, and both your code and your general approach (aggregation in C) could have been combined with the improved algorithm. I suggested that to you, but you simply declined to work with me. I'll leave the rest up to the readers' imagination.

Andrew Shitov replied to comment from Steffen Mueller | August 7, 2012 9:21 PM | Reply

every single one of our developers can maintain the code

This is the most dangerous and misleading thing to believe in.

First of all, the statement is wrong. I insist that not every single one will always be able to maintain the code from every other single one. It is not about the company we mentioned here, it is about any big team of developers. People don't like to explore other people's crap and it's take time to understand "foreign" code. In terms of economical development of the modern world, there's no division of labour here.

The second and more important issue is that when manages believe that "every single one can" they put all the developers on a single level. This approach leads to the situation when the company needs nobody of their employees. Also, it's just not wise for a manager to ignore people's identity and individual talents and abilities.

Aristotle replied to comment from Sawyer X | August 8, 2012 12:24 AM | Reply

A simple “I found an instance in which I think using a very high level language isn’t proper” blog post became a pool of seething anger. Why?
Did you really have to add that venomous attack on Booking? Is it really relevant here? Was it really worth it?

Nothing to do with the attack on Booking, the stage was set in the very first paragraph. “Perl is a toy, is everyone too stupid to see that?” Write that on a Perl site, watch the fireworks. It wasn’t a simple innocent post turned sour.

Aristotle replied to comment from Andrew Shitov | August 8, 2012 12:50 AM | Reply

Now measure execution time and compare not on a bare table but in a real business application.

Why should I try? Your post is vague to the point of hand-waving about what the actual scenario and problem is, so whether I can reproduce an environment that is close enough to yours to get relevant figures out of it is anyone’s guess. I’m not into games of throw dice and make shit up. I stuck to a question with a clear answer (“how do I get monthly in addition to weekly aggregates with SQL?”).

As far as can be gleaned from your post you may well be right, but just as likely you may be clueless about how to make SQL and/or Perl work to your advantage. The knee-jerk reactions against your claim are silly too – your post is missing way too much detail for any bystander to decide whether the people at Booking were the idiots or you are. Or neither. Or both.

Shrug.

Andrew Shitov replied to comment from Aristotle | August 8, 2012 8:41 AM | Reply

There are two things in your comments that demonstrate that you want to comment the text that you wish to read, not the one that is written and posted here.

the stage was set in the very first paragraph. “Perl is a toy, is everyone too stupid to see that?”

Nobody said that here except you.

Your post is vague to the point

Again, you give the answer with an example of the exact piece of code as a response to the vague question.

:m) | August 8, 2012 9:30 AM | Reply

Perl is always applicable, unless when it is not. In which case TMTOWTDI. :-)

I have read this post and the comments with some interest. And I have gotten some clues about what to do when I should have to process really big amounts of data. This has never happened to me. Although some million datasets I once analyzed for fun and as a proof of concept. The script was surprisingly fast and it made a huge difference if I used Text::CSV or a "real" database, of course. Sadly I can't give the exact numbers.
I once even experimented with pack() to save memory, but this became obsolete after I had bought a new computer with much more RAM..

Aristotle replied to comment from Andrew Shitov | August 8, 2012 5:54 PM | Reply

Nobody said that here except you.

You didn’t say it, but everybody responded to you as if you had. The tone of the responses would have been hardly different without your attack on Booking, as SawyerX thought.

Again, you give the answer with an example of the exact piece of code as a response to the vague question.

“Tell me please where do I see the number of items and their sum in June?”, with most of the SQL already included plus a sample result output table, is not a vague question.

So what is it that you thought I wish to read? I didn’t think I had any particular agenda but I may be lacking self-awareness, so I’m curious.

Sawyer X replied to comment from Andrew Shitov | August 9, 2012 8:36 AM | Reply

You are telling me how you would write a blog post. But I wrote as I wrote it.

Of course you wrote it as you wrote it. It’s your choice. That was just my opinion and suggestion, and you’re free to ignore it if that’s what you want.

I have no idea of the performance of other high level languages in these circumstances, why should I mention them?

Because if anything (and especially if you’re right about your claims here), it’s definitely not a Perl-specific thing but it’s relating to a certain type of languages. It would be like me attacking Ruby because it’s not as fast as Assembly. It wouldn’t be just Ruby, it would be any language like Ruby. If I were to go on a Ruby forum and tell them their language sucks because it’s not as fast as Assembly, they would probably get the impression I’m attacking their language instead of any type of language that provides the same features at the cost of speed.

Do you expect Python, Ruby or even PHP to be a much better competition to C++? I don’t think so, but you gave the impression that the problem is Perl, instead of languages that have lower performance benefits in favor of features and ease of use.

Basically you gave the wrong impression, and this is also why you got such comments.

Again, this is just my opinion, feel free to ignore it as well. I won’t be offended, it’s okay. :)

Alexandr replied to comment from Sawyer X | August 9, 2012 2:50 PM | Reply

If author expected perl to be good enough for certain task and used it and failed, how is it not perl's fault? Perl and perl community gave him his expectations.

Sawyer X replied to comment from Alexandr | August 9, 2012 4:00 PM | Reply

That's just silly. If a person wants to use scissors for something and fails, assuming all scissors are of about the same quality (as all dynamic languages are about the same speed), there are two options: 1. he didn't use them correctly, 2. they weren't up for the task and he should use a different tool.

The first case hasn't yet been established as wrong, but I'll ignore that for now. Others have addressed it.

The second case is what I care about. If it's not the proper tool, then it obviously has nothing to do with scissors of a specific brand. If the person failed using scissors (while all scissors are of quality X), and he blames a specific scissors company (though he'll get the same result from any other scissors produced by any other company) then he's completely misguided and misguiding others.

And I would understand the company saying "look, you either used the scissors wrong or scissors are simply not the right tool. so quit advertising that OUR scissors are fucked up, and simply say that scissors were a wrong choice, by ANY company".

What you're doing is a simple (albeit very impolite) cop-out.

Say "dynamic languages are not suitable" (and that's still of an "if" if you were using them correctly - Andrew *did* say he changed his algorithm and didn't originally use MySQL for some of the operations as he possibly could have), don't say "YOUR dynamic language is not suitable" when speaking of something that ALL dynamic languages share alike (such as performance, in this case). You're simply misleading, and you should know better.

Andrew Shitov | October 15, 2012 5:13 PM | Reply

For the record, http://stackoverflow.com/questions/12793562/text-processing-python-vs-perl-performance

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Andrew Shitov

I blog about Perl.

More info »

Andrew Shitov