Adapting PDL to a Big Data Landscape
Note: although this article is directed at current PDL users, I would particularly appreciate the opinion of Perl users who are considering using PDL. Does my assessment seem accurate to you?
I was just watching a few of the talks on YouTube from YAPC::NA that I wanted to attend in Madison but could not because I was busy (writing my talks) or attending other talks. And it reminded me of the revelation that I had at YAPC. Although I am not looking for a job, I spoke with the sponsors at their job booths, just to get a feel for what's out there. Is it possible for a Perl programmer to get a job doing real data crunching? The answer, happily, was "yes".
Almost immediately, I began to realize that there is a whole world of data analysis that is on the horizon for which PDL is well suited. PDL was written by and for scientists, but there's no reason it couldn't be applied to the analysis of Big Data (made possible in large part due to Chris Marshall's work on fully cross-platform memory mapping and 64-bit cleanups). Analyses of large data sets are already happening at many private corporations using languages such as SAS, SPSS, S, and R. Some of them might use Matlab; a rare few might use Python or Perl. Due to our limited marketing budget (ha!), the only corporations that will choose to use Perl and PDL are those which already use Perl in some significant capacity. We PDL folks have two major things to take away from this. First, we must engage with the wider Perl community, and second, we must make it easy for PDL outsiders to learn about and use the full breadth of PDL.
Engaging the wider community
I am happy to report that my Introduction to PDL was very well attended. In other words, the Perl people care about and are interested in PDL. We PDL people simply need to make ourselves better known and accessible to the other Perl people who live and work in our midst. I highly recommend attending your local Perl Mongers. If there is no such group and you're the outgoing type, try searching on LinkedIn for other Perl folks in your neck of the woods and contact them if you can. A sysadmin who knows about PDL is one thing; a sysadmin who can put his coworker in touch with you, a PDL user that he sees once a month, is a much more powerful thing. If you're less outgoing, join the #pdl channel at irc.perl.org. If you don't have an IRC client or don't know how to use IRC, just use the in-browser mibbit client.
Making it easy for outsiders
With the release of the PDL::Book, we finally have a single comprehensive resource for learning PDL. This is great. However, both the core docs and the Book can be improved. As the need to analyze Big Data grows, new users will come to PDL needing new functionality, and they will need to be able to learn to implement that functionality. Do you understand the intricacies of PDL threading? At the very least, do you feel like you could sit down with another programmer and hack at it until you got it right? Or, going further, have you used PDL::PP? There's a chapter in the book on that, too. (I should know, I wrote it. :-) If not, read selected chapters from the book and give your feedback. (Credits are listed in the back of the book, or just email the mailing list. New users, you have to sign up to send mail.) The better we can make the book and the docs, the better we will be able to accommodate newcomers. The more people in our little community who understand these things, the more responsive we can be when newcomers arrive and ask questions, and the more of them will stay and start contributing, making PDL even better.
Finally, yes, I am talking to you, Jane PDL Hacker. I know that some of you, even some of the PDL Big Wigs, do not attend your local Perl Mongers. You should. Furthermore, only one person gave me thorough feedback on the PDL::PP chapter, and I must shamefully admit that I have yet to read most of the rest of the book. If you do not help, you should not be surprised if PDL slowly bitrots into oblivion. But if you tell others about PDL and give useful feedback on the docs, PDL will grow and improve and your efforts will pay off in the form of an even more awesome tool.
The tide of Big Data is coming. Do Your Part: help make PDL awesome, and help other Perlers discover how awesome it is, and maybe even make it better.
I have been looking at making a Perl compiler that can hopefully turn Perl code into pure C (really GCC), so that you could go into the assembly and change it up to optimize it, as well as print out the C code, so it will really be a Perl-to-C converter...
I think this is a big step for PDL, which is the reason why I'm having it built. Since Perl takes care of all the memory operations, it makes coding more productive, and then the converter would take care of the memory coding...
I hope this Perl compiler can give people the high-level prototyping they need time-wise, as well as the speed and the hardware resource conservation that C offers.
That sounds interesting, Mark. Do you have a repo I could follow? Also, are you aware of the work Reini Urban has been doing, especially the B:: modules?
@David, OK, I will shamefully admit that I have not read the book yet either. I can and will do that sometime soon. You are right, documentation is important!
Nice post, David. Perl already has great support from developers through CPAN, and I would really enjoy doing data analysis with Perl and PDL. Do you think PDL can become as popular as the R language, for example? Statistical languages like R are growing in popularity nowadays, and I don't really know why we can't work with PDL or even Perl scripts for some analyses.
@leprevost, popularity is an ever-present concern. Yes, I believe that it is possible for PDL to become as popular as R, because I believe that Perl is a better programming language than R. However, "it's possible" isn't the same as "it's true."
The real question is: how do you grow a language's popularity? Traditionally I've thought about "stealing" programmers from other languages, or hooking students before they've made up their minds. In a zero-sum game, popularity boils down to having better libraries and documentation and having a fun and inviting community. Presently we are not playing a zero-sum game but are on the horizon of a period of growth. Perl programmers from system admins to web developers will increasingly be required to write code to analyze data. The data-analysis pie is growing (radially) and PDL can grow its users without stealing anybody. Question: How do we convert that influx of new developers into actually increasing the size of our slice of the data-analysis pie? Answer: We make it easy and fun to become proficient at using PDL and we encourage newcomers to write extensions.
BTW, PDL evangelism is frequently at the top of my mind and has been one of the motivations for my work on PDL::Graphics::Prima and prima-repl.
I have not read the PDL Book but I have some experience in using PDL in Big Data projects.
Though PDL is an excellent scientific library, it lacks some features that would make it useful in Big Data analysis. For my applications, it was lacking support for sparse vectors and sparse matrices, which are crucial for working with document/term matrices.
As you know, plotting is the current "weakness" for PDL.
I know Perl folks that do their stuff in R and not Perl because it is sort of easier to do the plotting in R.
I think the Alien::Base work will make it easier to handle system library dependencies that cripple good plotting for Perl/PDL.
@Luben,
Actually, PDL has a candidate sparse matrix library called PDL::CCS::Nd. It was first added to CPAN nearly a year ago. Its current pass rate isn't great, but it's a start in the right direction.
@gizmo_mathboy,
Do me a favor and tell those folks about PDL::Graphics::Prima. Also tell them that it's young and that I am very interested in helping get new people to try it out. I'll even implement new features if they request them. :-)
Mark: I have no idea why you haven't found it yet. Search for perlcc or B::C. The compiler you want to write has existed since 5.005.
However not a lot of optimizations are currently possible. I believe PDL is currently better suited, until perl5 gets native type support.
I was in fire and flame when I found the PDL homepage. This was very interesting. I have used Perl for a lot of tasks, and combining it with numerical analysis would be great. I have, however, run into some problems. Installing on Fedora 17 was not without problems. For example, PGPLOT did not work. Being a bit frustrated, I installed Strawberry Perl on Windows 7 and also PDL. Hmm, here too some modules did not install as they should; for example, PLplot did not work.
If the installation were more "foolproof", that would make PDL more tempting to bring into the corporate world. If I ask our sysadmin to install this and he ends up spending time figuring out the installation, the race is over, I'm afraid.
I'm still in fire and flame, but I have had a small shower.