Data Science and Perl

Our company goes into many other companies and helps them build new Perl systems or fix old ones. Needless to say, we see how many companies work, and a typical example is one of our clients, whom I'll call "AlphaCorp." They use lots and lots of Perl. Their primary web site is almost entirely Perl. So when I went in to help them with their A/B testing (amongst other things), I was surprised that they also used a lot of Python. It turns out they had a specific need that Python fills and Perl does not: data science.

Because they hired so many Python developers to work in their data science area, they had more and more Python creeping into non-data science areas. Their Python devs didn't do much Perl and vice versa. Thus, while AlphaCorp said they'd rather not split themselves over multiple programming languages, they really had no choice. And that's a problem for Perl's future.

Now that Perl 6 has been renamed to Raku, many people are happy because the confusion over whether or not Perl 6 is an upgrade to Perl has been removed. However, that's not enough. We need people to use Perl, to want to write in Perl.

Python has dominated the dynamic programming language market, and one of the many reasons is simple: data science. It's no secret that corporate interest in data science has skyrocketed:

[Figure: Google Trends for "Data Science" since 2004]

I've heard repeatedly from data scientists that they don't care what tools they use so long as they can do their job, but Perl is a non-starter for them. Python, however, has tons of rich libraries that data scientists can use to do their job.

If you're not familiar with data science, it's useful to understand the difference between analysis and analytics. Though data science today tends to lump all of its work under the term "analytics" (probably because it sounds more technical), that doesn't really explain what's going on.

Analysis is breaking raw data down into discrete information you can use to understand something. In short, analysis is about what happened in the past. Companies have been doing advanced analysis for decades.

Analytics, however, is the use of tools--often AI--that take existing data and predict the future. Perl's (mostly) great for slicing and dicing and analyzing data, but Python excels at analytics because it has plenty of tools for it. There are NumPy, pandas, Matplotlib, and tons of machine learning tools. If you want to figure out how to put them all together, here's a free Python Data Science Handbook.
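
To make the "slicing and dicing" concrete, here's a minimal sketch of that kind of analysis using PDL (assuming PDL is installed; the sales figures are invented purely for illustration):

    use strict;
    use warnings;
    use PDL;

    # Invented daily sales figures, purely for illustration.
    my $sales = pdl(120, 95, 143, 80, 210, 175, 99);

    printf "Total:   %d\n",   $sales->sum;
    printf "Average: %.1f\n", $sales->avg;
    printf "Best:    %d\n",   $sales->max;

    # Vectorised selection: keep only the days that beat the average.
    my $above = $sales->where( $sales > $sales->avg );
    print "Above-average sales: $above\n";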

Short of figuring out how to put together a top-notch data science team to build the appropriate libraries in Perl (and that takes money, time, and expertise), Perl is going to continue to fall short because one of the hottest (and legitimate!) topics in software right now is an area that Perl doesn't seem to cover very well.

It probably goes without saying that AI is closely related to this and Perl falls short there, too.

How can we fix this?

9 Comments

You called out Numpy as an advantage for Python. PDL is the equivalent advantage for Perl, which actually predates Numpy and is both faster and more elegant.
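
For anyone who hasn't seen PDL, here's a tiny sketch of the NumPy-style vectorised operations it provides (assuming a reasonably recent PDL; the numbers are purely for illustration):

    use strict;
    use warnings;
    use PDL;

    # Element-wise arithmetic over whole arrays, no explicit loops.
    my $x = sequence(5);              # [0 1 2 3 4]
    my $y = $x ** 2 + 10;             # [10 11 14 19 26]
    print $y, "\n";

    # Matrix multiplication via the overloaded 'x' operator.
    my $m = pdl([ [1, 2], [3, 4] ]);
    my $v = pdl([ [5], [6] ]);
    print $m x $v, "\n";              # column vector: 1*5+2*6 = 17, 3*5+4*6 = 39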

The real dichotomy from the Perl-vs-Python ecosystem standpoint is that the Python community is far more numerous (since Python has been an introductory-CS lingua franca for 15 years now, while Perl is almost never a first language).

I have no experience with "data science" (and this becomes relevant in a moment), but perhaps analogously my experience in the realm of electronics might be illustrative.

I found that there's very little electronics-based code around in Perl, so I had to start creating things. I've created the entire Device::Chip concept for talking to specific chips and small bits of hardware, and I've begun a few experiments in the Electronics namespace for larger things like test and measurement equipment. It's slow going because as far as I can tell it's basically just me, nobody else.

The trouble is that creating such things requires someone with both language and domain experience - the Perl + electronics crossover. You'd need to find people with knowledge of both and an interest in using both of them together. And here I think comes the core of the bootstrapping problem. I've been able to begin this entire ecosystem largely because I happened to be in the right place at the right time, with that overlap of experience.

I wonder if a problem here may be that there aren't really people with experience in both Perl + data science to a sufficient degree in both simultaneously, to begin creating the tools required.

A splendid question.

May I suggest a three-legged stool metaphor.

To succeed at Perl coding (or anything else), a person would generally need:

  1. A purpose and desired outcome
  2. Good documentation
  3. Friends and mentors

To expound upon each of them: the purpose can be $work, or it can be some hobby project. Call it an itch to scratch if you like. A good tutorial may demonstrate how to scratch an itch, and good tooling (CPAN, etc.) should make itch-scratching feasible.

What good documentation comprises is subjective. However, in this context, its goal is to provide a really positive experience of coding Perl and/or data science for the first time. There is likely no one perfect document, either; tutorials on the same topic by different authors will speak to different people, giving them that "aha!" moment.

Thirdly, humans are social animals. Having someone to provide technical guidance helps in jumping over obstacles, but perhaps more importantly it provides a sense of belonging and fellowship - which is key to success. (What forms this could take is a great area for discussion and innovation for Perl.)

Please share your thoughts dear reader.

I love Perl but now work almost entirely with quick one-liners embedded in bash scripts before loading data into R or, less often, Python or Matlab.

PDL is comparable to numpy for my usage (PDL::Dataframe is not pandas, though). But that's not the issue: the Perl REPLs (pdl2, iperl, perl -d) aren't as useful as the shell, R, or even IPython. I think the interactive session tooling is what makes a language appealing for data science.

It's essential to interactively manipulate, visualize, and iterate over data.
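
For comparison, here's roughly what such a session looks like in pdl2 today; the prompt and output formatting shown here are illustrative, but grandom, stats, where, and hist are all real PDL functions:

    pdl> $x = grandom(1000)             # 1000 draws from a normal distribution
    pdl> p $x->avg                      # should be close to 0
    pdl> p $x->stats                    # mean, std dev, median, min, max, ...
    pdl> p $x->where($x > 2)->nelem     # how many samples exceed 2?
    pdl> ($bins, $counts) = hist($x)    # quick histogram, ready to plot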

I'm a data scientist who still loves Perl even though I haven't used it in a long time. I don't know if there is a fix to bring data science to Perl, or vice versa.

One of the things that made Perl successful was that it was the right tool at the right time, which was the advent of the internet. Smart, and mostly really, really nice, people gave unpaid time and energy to great CPAN modules (in addition to the core developers) that exploded beyond the bounds of just tools for the internet to damn near anything else.

The same type of people, but whose passion is in data science, put their time and love mostly into R and Python. For R, that was because the ancestors of modern machine learning came from stats, and R was the open source stats language. In addition, Hadley Wickham helped build a very welcoming community and forged really helpful extensions to the R language that made it really well matched to data science problems. For Python, well, I'm not as familiar, but it seems to have stemmed from the coincidence of Python having a burst of popularity in general, and Wes McKinney making pandas, a data package that enabled vectorization.

Right before I started committing my time to R, when I was hoping there was a Perl package that would be easier to adopt, I tried the PDA package, I think that was the name. It had vectorization too. But R had decent, though not great, native graphics; it had CRAN, like CPAN; and it had the data modeling packages built by academic statisticians. So I spent more time there, and went to Perl where I could still do regex work on text processing. But as facilities in R matured, I could start to do much of that work totally in R. So my Perl skills have atrophied.

Point is, I don't know if Perl should try to compete and reinvent that particular wheel. Hell, R is getting closer to having macro facilities as powerful as Lisp's, so I'm really liking that space. Perhaps where Perl would have some great utility would be in data engineering. That's where Scala got some share, via Spark.

This may be way out there, but perhaps developing good tools for, say, quantum computing would be a possibility. It's still fairly untapped, and has a chance of being a technology that data science will need to consume. I don't know for sure, but it's one idea.

I'm not of the caliber to make these things happen, but I'd love an excuse to come back to Perl.

A lot of those Python libraries are wrappers around C++ code, so Python is taking a lot of credit where it isn't due.

What they are doing right is writing books, blog posts, and evangelising. That is something Perl has lost momentum on.
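
To be fair, wrapping compiled code is something Perl does well too. Here's a minimal sketch using Inline::C (the dot_product function is made up for this example; the indented heredoc needs Perl 5.26+):

    use strict;
    use warnings;

    # Compile and bind a C function at load time; Inline caches the build.
    use Inline C => <<~'END_C';
        double dot_product(SV* a_ref, SV* b_ref) {
            AV* a = (AV*)SvRV(a_ref);
            AV* b = (AV*)SvRV(b_ref);
            int i, n = av_len(a) + 1;
            double sum = 0.0;
            for (i = 0; i < n; i++) {
                sum += SvNV(*av_fetch(a, i, 0)) * SvNV(*av_fetch(b, i, 0));
            }
            return sum;
        }
        END_C

    print dot_product([1, 2, 3], [4, 5, 6]), "\n";    # prints 32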

But I hear this new Raku language is really good at data science; people should give that a try!

If many Python libraries are wrappers for C++ code, wouldn't it be reasonably straightforward to develop such wrappers with Raku?

Just saw this today. I am the author of Chart::GGPlot and Alt::Data::Frame::ButMore, and I know Perl/Python/R quite well, so I might have a say here.

I actually had a similar question in my mind: given that Perl already has PDL, which is comparable to numpy, Perl could be in a better position than some other languages like Ruby or PHP in terms of the possibility of getting more involved in the data science world. In practice I found it is not an easy thing. While communicating about and advertising Chart::GGPlot (a Perl implementation of R's ggplot2) and Alt::Data::Frame::ButMore (this one could ideally evolve further into a poor man's pandas, although it's still far from that position, as my focus was more on Chart::GGPlot), I found that the number of Perl users in the data analytics area today has diminished to a very small level, and I didn't get very much effective feedback.

To really get Perl to a position where it can attract a few new users, rather than just retaining the remaining PDL users, we need many more high-quality libraries. PDL, in my point of view, is like a poor man's numpy; it needs to be improved a lot to really compete with numpy. And besides numpy/pandas/matplotlib, Python has other key libraries like scipy and Jupyter (there are similar things in Perl, like IPerl, but they are of much lower quality). Python also has other advantages: easier C/C++ interfacing, a lot of companies backing it, the fact that it's "in" for many people, and so on. Besides Python, there are other big players at the table: old players like R and Matlab, and new but strong players like Julia.

I could be pessimistic, but as I see it Perl is a long way behind. It's difficult for a few people (people who know data science and also have enthusiasm for Perl are rare) to move all those things forward...

Sadly, the commentators who point towards the lack of interest from data science people are right. If I am proficient in R, I have very little incentive to learn about Perl's adaptability for data science. Having ggplot2 and any statistical test at your fingertips is so important. However, I have used Perl to make a few simple programs for parsing HTML and preparing data for import into R, and I used WWW::Mechanize to solve some network-related problems. So maybe Perl should focus on things that can be done well via Perl - unlocking access to data on the net, becoming somewhat indispensable in that, and looking towards the next big thing - e.g. quantum computing.


About Ovid

Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/