Perl memory management...
Does Perl run out of memory?
Today I got an email from someone saying “I was told by a person who used Perl for computational genomics applications that it was running out of memory, so he switched to C++. What are your thoughts on running out of memory in Perl?”
Just for posterity, here is my reply (please note I’m no expert on this sort of thing and have never had the problem):
Perl has a great garbage collector. But of course if you read a 1 GB file into memory, then you are using 1 GB, whatever language you use.
So the trick is to read the file in line by line and process only the information that is required. This isn’t always possible, but there could be a few other issues which your colleague didn’t understand. For example, you should not pass large data structures around; you should pass references to them, otherwise they get copied.
my @a_big_list = qw(lots of stuff);

bad(@a_big_list);     # every element ends up in @_
good(\@a_big_list);   # only a single reference is passed

sub bad {
    my @copy_of_list = @_;                       # copies every element into a new array
    foreach my $thing (@copy_of_list) { ... }
}

sub good {
    my $list_reference = shift;                  # just one scalar, no copying
    foreach my $thing (@$list_reference) { ... }
}
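And to illustrate the line-by-line advice above, a minimal sketch (the file name is just a placeholder):

# Slurping holds the whole file in memory at once:
#   my @lines = <$fh>;
# Reading line by line keeps only one record in memory at a time:
open my $fh, '<', 'huge_data.txt' or die "Can't open huge_data.txt: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    # ... extract and keep only the fields you actually need ...
}
close $fh;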
The other issue is using global variables all the time instead of lexically scoped variables (which are freed when they go out of scope); with globals you end up holding lots of extra memory for the lifetime of the program.
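A rough sketch of the difference (the sizes here are arbitrary, just to make the point):

use strict;
use warnings;

our @CACHE;    # package global: stays allocated for the whole run

sub with_global {
    push @CACHE, (1) x 1_000_000;    # grows on every call and is never freed
}

sub with_lexical {
    my @work = (1) x 1_000_000;      # lexical: released (reusable by perl)
    my $total = 0;                   # as soon as the sub returns
    $total += $_ for @work;
    return $total;
}

with_global()  for 1 .. 3;           # ~3 million SVs now held in @CACHE
with_lexical() for 1 .. 3;           # peak usage stays around 1 million SVs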
Check out http://www.onyxneon.com/books/modern_perl/ (free) — see the “Array References” section for more info.
Programs will run out of memory if the coder doesn’t fully understand what they are doing, no matter what language they use.
You may also be interested in checking out http://www.bioperl.org/ and asking questions on their IRC channel. Perl is used for mass data processing by these guys so they might have further insights.
I also worked on a geometry problem, with a typical size of 800,000 points, doing some reasoning about them. As it was impossible to do with Perl arrays (SVs have way too much waste), I created Tie::CArray, a tied representation of simple numeric arrays in C.
But eventually Perl was not good enough and I switched to Lisp, where I was able to stay abstract enough, perform well and use reasonable data structures.
But right now I have started working on types, which means we will soon be able to use native arrays if they are declared as such.
Abusing memory mapped files would be another idea. Not File::Map, but Tie::MmapArray, which can declare arbitrary structures as array members.
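To put a rough number on that SV overhead, here is a small sketch using Devel::Size (a CPAN module, not something mentioned above), comparing 800,000 values held as individual SVs with the same values in a single packed buffer:

use strict;
use warnings;
use Devel::Size qw(total_size);

my @as_svs = map { $_ * 0.5 } 1 .. 800_000;                # one SV per value
my $packed = pack 'd*', map { $_ * 0.5 } 1 .. 800_000;     # one flat C buffer

printf "800k values as SVs:  %d bytes\n", total_size( \@as_svs );
printf "800k values packed:  %d bytes\n", total_size( \$packed );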
Great garbage collector?
Perl uses refcounts, which makes it hard to use cyclic data structures, such as general graphs. You need weak refs instead.
Computational geometry, for example, sometimes needs general graphs. I created plausible 3D geometry from stereoscopic images.
A garbage collector would help, but it would be slower and more complicated. We would also need better iteration functions to be able to deal with cyclic data, as Lisp has.
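For completeness, the usual workaround in stock Perl is Scalar::Util::weaken; a minimal sketch of breaking a two-node cycle:

use strict;
use warnings;
use Scalar::Util qw(weaken);

# Two nodes that refer to each other: a cycle that refcounting alone
# would never reclaim.
my $node_a = { name => 'a' };
my $node_b = { name => 'b' };
$node_a->{peer} = $node_b;
$node_b->{peer} = $node_a;

# Weaken one side so it no longer counts as a reference; when $node_a
# and $node_b go out of scope, both hashes can be freed.
weaken( $node_b->{peer} );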
One way to deal with large arrays in Perl is to use PDL ("Perl Data Language").
http://pdl.perl.org/
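A minimal sketch of the idea, assuming PDL is installed:

use PDL;

# One contiguous C array of 10 million doubles (~80 MB), instead of
# 10 million separate Perl SVs.
my $values = random(10_000_000);

# Operations are vectorised and run in C, with no Perl-level loop.
my $mean  = avg($values);
my $above = sum( $values > 0.9 );    # how many values exceed a threshold
print "mean=$mean, above 0.9: $above\n";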
Except for the concern about cyclic data structures (a genuine concern), a reference-counting GC is as efficient as you can make a GC in terms of memory parsimony (without building custom hardware).
SV size is a concern in many cases though.
> Abusing memory mapped files would be another idea.
I don't see why memory mapping would be abuse.
> Not File::Map, but Tie::MmapArray which can declare arbitrary structures as array members.
T::MA seems like a niche tool to me, TBH. It could trivially be built on top of File::Map if necessary; the other way around isn't really the case.
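For what it's worth, a rough sketch of that "build it on File::Map" idea, treating a mapped file of packed records as an indexable array (the file name and record layout are invented for illustration):

use strict;
use warnings;
use File::Map qw(map_file);

my $record_len = 3 * 8;                  # say, three packed doubles per point
map_file my $map, 'points.bin', '<';     # read-only mapping, nothing copied into SVs

my $n_points = length($map) / $record_len;
for my $i ( 0 .. $n_points - 1 ) {
    my ( $x, $y, $z ) =
        unpack 'd3', substr( $map, $i * $record_len, $record_len );
    # ... work with the point here ...
}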
Or... add more memory.
I've been using a server with 128 GB for this type of work and it's wonderful to see Perl being able to use tens of gigabytes of memory and still perform nicely. Hashing short (50 bp) DNA reads, for example, I've seen Perl process around 80 MB/s. Personally I develop much more quickly in Perl than in any other language and gladly invest in good enough hardware to keep on using it.
Thanks for all the comments (very useful), I've pointed the person who emailed me to here.
http://pradeeppant.com/2011/02/23/pdl-a-way-to-deal-with-larger-arrays/ a recent post mentioning using PDL "to COMPACTLY store and SPEEDILY manipulate the large N-dimensional data".
I hit an out-of-memory problem recently when pulling in a large file all at once. I just switched from the 32-bit machine I was using to a 64-bit machine with more memory (and a 64-bit Perl) to get around the limitation!
Of course that was a limitation of the way I'd programmed the script, and I did later make it a little more efficient. Not copying the really large chunk of data helps both with memory usage and speed!
I don't think anyone should be surprised if they run out of memory when working with next-generation sequencing data (or any other large biological datasets) on the equivalent of a laptop, even with 4-8 GB of memory. Don't try shoe-horning the camel to fit through the needle.
Not to say that one can't come up with memory-efficient implementations for most analyses, but how well would they perform? As @slahond points out, memory is cheap. I myself have a 48 GB machine to handle just this kind of data. I find that as long as no circular references are generated, Perl's normal GC works fine, and if I need to I can go back and create memory-efficient code in C/C++ to do what I want (so far I haven't had to do that, fingers crossed).
"RAM is cheap" is often a cop-out. As soon as you want to run your analysis over multiple chromosomes, with a range of parameters, or with multiple pairs of species, your RAM will limit how much you can do in parallel.
Let's say you're trying to figure out the best set of BLAST parameters for your analysis. If you can make your program use 1/10 the RAM (entirely possible if we're talking Perl vs. C), you may be able to run ten jobs at once and finish in 1/10 the time. A run that took an hour now takes six minutes, so instead of pressing "return" and going to lunch, you can press "return," check your email, and get right to analyzing the results. Often, making "batch" tasks "interactive" will fundamentally change the way you work.
Re: 'RAM is cheap', yes I agree, it's a cop-out under many circumstances. BLAST in particular is an example of something that is ridiculously easy to run in parallel, particularly when running multiple queries or checking parameters; simply batch out the sequences across a cluster, running whatever variation of the parameters you need per job submission. Similarly, multiple alignments of orthologous protein groups lend themselves to easy parallel runs. Perl works very well for setting up and queuing tasks like this on a cluster. Sequence trimming and filtering is another task where many use pure Perl.
However, you would probably agree that not all tasks are easily run in parallel, nor would one want to run such tasks in pure Perl (which seems to be the original question). With the BLAST example, one would be calling out to the system BLAST (C/C++ based) program, which does the massive gruntwork; much of the Perl-related work would involve parsing and post-processing the results via BioPerl or some other means. If the latter were the source of memory issues, then there would be a concern. AFAIK there aren't memory issues with the BioPerl modules; they're pretty battle-tested, if a bit slow, but who knows?
Anyway, some tasks like genome assembly and next-gen alignments, though they can be run in parallel, are much more efficiently run on a system where everything is intentionally retained in memory so runs aren't I/O-bound. The reality is that, though still expensive, high-mem boxes for these purposes are much cheaper than they were only 2-3 years ago. My 48 GB machine cost ~$4K, assembled from scratch. Sometimes it's the only tractable solution if you need answers within days rather than a few months/years.
If you're using BioPerl, you've already given up any pretense of efficiency, so I'm assuming you haven't fallen down that rathole.
I haven't done this stuff in a couple of years, but there are plenty of 32-bit systems out there, and few graduate students can just tell their advisors to buy lots of fully-loaded 64-bit machines.
I'm a BioPerl core developer, so I don't necessarily think it's a rathole. ;)
However, I agree re: efficiency; it's not great, and there are lots of potential improvements (reducing the inheritance hierarchy, breaking up the monolithic install, etc.).
I've bitched about BioPerl before, so I won't do it again, but to summarize, my suggestions are: (1) focus on common tasks (e.g. read a FASTA or PDB file, parse BLAST output), make them easier to do with BioPerl than without it, and make them idiot-proof (e.g. detect and handle variant formats and junk); (2) get rid of the boilerplate documentation, which only helps Pod::Coverage.
It would also be awesome if BioPerl used memory-efficient data structures.