Shortcomings of Perl

Today TIOBE rolled out its Programming Community Index for August 2013. Since it added many new search engines, Perl slid rapidly from 9th to 11th compared to last year.

I’m not surprised to see this happen. It’s 2013, not 2003. Apart from regexes, I can’t see any unique feature of Perl that dominates a common programming field. (Yes, we have CPAN, but you can’t expect a newbie to adeptly pick parts out of it and assemble them into a cannon. ;) ) The more I get to know Perl, the more I understand its shortcomings.

  1. There is no full-fledged web framework, like Rails for Ruby or WordPress for PHP. As I said above, there are too many parts on CPAN, which confuses people who are new to Perl. They walk away before they ever get to enjoy choosing.
  2. Speed. When you use modern Perl, performance takes a heavy penalty. As far as I know, Perl has been sliding from the front toward the back of the running-speed rankings these days. It’s not entirely Perl’s fault: Perl is expressive, so it has to give up many opportunities to optimize.
  3. Syntax. For historical reasons, there are many inconsistencies in Perl’s syntax, here and there. I think that’s a really big problem for people who want to learn Perl.
  4. Portability. Listing this one is bound to draw objections. Yes, as perl.org puts it, "Perl 5 runs on over 100 platforms from portables to mainframes", and portability is one of Perl’s strengths. But when you really get into Perl’s portability, you may realize that much of it assumes a shell underneath. I admire Lua’s real portability and hope Perl will have it one day.

But anyway, I will keep using Perl happily until a usable Perl 6 reaches us. With Inline, DBD, Moo, and many other modules, none of the points I listed above are big problems for me.

20 Comments

I wouldn't read so much into this. This month they added a large number of extra search engines as statistical sources, and a whole bunch of their stats jumped. Look at what happened to C and Objective-C!

If you're doing Unicode, Perl is your best tool.


And, remember that TIOBE is deeply flawed. Google that. They are completely useless rankings for any purpose.

I admit TIOBE is worthless as a ranking index, but unfortunately it still amounts to bad marketing for us.
It's all about marketing, so how can we get Perl a better rank?

I agree about Unicode support. Perl has the best Unicode support of any language and that is the top reason I use it as an internationalization developer. I don't see any other dynamic language as a viable alternative for the specific work I'm doing related to natural language processing and multilingual information retrieval. The only other language with adequate Unicode support for this work is Java.

Catalyst, Poet/Mason, Dancer, Mojolicious.

@brian d foy, @Nick Patch
Could you write specific Unicode examples where you'd have trouble in Ruby, Python or PHP but works in Perl?

How about this?

#!/usr/bin/env perl
use v5.14;
use utf8;
my $string = "café";
substr($string, -1, 1) = "e";
say $string;
# output: "cafe"

Versus:

<?php
$string = "café";
$string = substr_replace($string, "e", -1, 1);
print $string . "\n";
# output: "caf\x{c3}e"

PHP 5 strings are strictly byte strings; all the standard string manipulation functions (and thus, all the third-party libraries that use those built-in functions) assume that one character = one byte.

There is a separate set of functions for character strings in the "mbstring" extension, but this extension is not enabled by default and not widely used. (What's more, it only offers multibyte equivalents for some of PHP's built-in string functions, not all of them.)
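For contrast, here's a minimal sketch of the byte-string versus character-string distinction in Perl, using only the core Encode module (the numbers in the comments are what perl prints):

#!/usr/bin/env perl
use v5.14;
use Encode qw(decode);

my $bytes = "caf\xc3\xa9";            # the UTF-8 bytes of "café"
my $chars = decode('UTF-8', $bytes);  # decode them into a character string

say length $bytes;   # 5 -- counting bytes, like PHP's substr()
say length $chars;   # 4 -- counting characters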

Indeed, I always hear that Perl has the best Unicode support, but I haven't seen much evidence yet.
OK, there is one example for PHP above. But what about Ruby 1.8/1.9 or Python 2/3?

Ruby seems to have a different paradigm: all strings are byte strings with a known encoding. This can be treated as a performance advantage (you don't need to convert data to a Unicode "internal representation"), or it can produce a mess (a matter of taste, imho).

@nick

I hear you on Perl and UTF-8; however, I do think the subject is esoteric enough that even though Perl is ++ in that regard, it does not shine as brightly as it should. I've seen several of your talks, and we actually worked together, and I think I would still have a hard time showing why this is a great thing about Perl :(

And one downside here is that it tends to play to the Perl stereotype a bit, that it's great for text processing and related analysis.

Maybe you can convince someone at Shutterstock to toss some of that sweet post-IPO money into sponsoring you to write some sort of Enterprise UTF-8 with Perl 5 book? I think starting a book like that and managing it on GitHub, similar to the Modern Perl book, would be fantastic. I'd be willing to help a bit, if that helps.

John

Regarding web frameworks, I guess I would say Perl web development does not aim to be as full-stack as RoR. For example, with Catalyst you use it as a base and then build as needed, pulling from CPAN along the way. I do agree that for newcomers it is not easy enough to figure out what pieces to use where, but that is the general paradigm Perl programmers prefer.

Gabor, vsespb: Python's regex engine doesn't support Unicode properties (\p) or grapheme clusters (\X). The lack of either one is a complete showstopper for me. Ruby 1.9 at least supports \p but only for two properties (General_Category and Script) plus some derived properties. As with Python, the absence of \X is a showstopper. PHP is the best off of any other language mentioned here because it uses PCRE as its regex engine, which at least includes \X but has similar \p functionality as Ruby. To the best of my knowledge, what I've described here covers Python 3 and Ruby 2. These are only some of the most lacking examples but there are plenty more that just make life harder for an internationalization developer or anyone attempting to provide good Unicode support.
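To make that concrete, here's a small sketch of the two features I keep coming back to, \p{...} property matches and \X grapheme clusters, using nothing beyond core Perl (the counts in the comments are what perl prints):

#!/usr/bin/env perl
use v5.14;

# "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT: 2 code points, 1 grapheme
my $str = "cafe\x{301}";

say length $str;                             # 5 code points
my @graphemes = $str =~ /\X/g;
say scalar @graphemes;                       # 4 user-perceived characters
say 'Latin' if $str =~ /\p{Script=Latin}/;   # match on a Unicode property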

@Nick, thanks!

Toby: I wouldn't recommend underestimating PHP. Even though the mb_* functions (available since 4.0.3) aren't widely used, neither are \p and \X in Perl, and these are all very important features. The fact that they exist is what's important to these languages. Next is educating the communities to use them. What I find impressive about PHP's Unicode support is the grapheme_* functions added in 5.3.0, which don't have equivalents in the core of any other language I know (other than the Perl 6 spec), and I would use them all the time if I worked with PHP more.

john: Do you consider the topic of grapheme clusters to be esoteric? Although there's poor knowledge on the topic among programmers in general, I think it's something that every programmer who works with strings should know. I've been doing my part in attempting to educate developers at conferences like YAPC::NA, DCBPW, and Open Source Bridge. A large problem is such poor support among many languages, such as Python, Ruby, and especially JavaScript, but there's also still quite a knowledge gap among Perl and especially PHP and Java developers.

In Python, Ruby, or JavaScript, how would you solve the following problem?

1. If a string is longer than 10 characters, truncate it at 7 and append "..."
2. When counting or truncating characters, make sure that you're working with user-perceived characters and not something unknown to the user, like bytes or code points

Here are some unit tests:

use utf8;
use Test::More tests => 2;

is(
    # 9 grapheme clusters, 11 code points, 33 bytes
    trunc('สีแดงอมม่วง'), # "magenta" in Thai
    'สีแดงอมม่วง',
    'no truncation when 9 grapheme clusters'
);
is(
    # รั is 1 grapheme cluster, 2 code points, 6 bytes
    trunc('สาธารณรัฐเช็ก'), # "Czech Republic" in Thai
    'สาธารณรั...',
    'truncation with no corruption of รั'
);

Here is my solution in Perl 5.6+:

sub trunc {
    my $str = shift;
    if ($str =~ m{ \X{11} }x) {
        $str =~ s{^ ( \X{7} ) .+ $}{$1...}x;
    }
    return $str;
}

As for Unicode properties, they're not just for stereotypical Perl work like text processing and data munging. Any language that supports them provides an excellent platform for natural language processing and multilingual information retrieval.

I'll be presenting a new talk titled Unicode Programming in Modern Perl at The Internationalization and Unicode Conference (IUC) in Santa Clara this October. I'd like to think that with many speakers from Google, Adobe, Microsoft, Apple, IBM, and W3C, we aren't just a group of people discussing esoteric topics. Unicode is important and these companies know it.

I'd love to write a book on practical Unicode programming but wouldn't want to limit it to just Perl 5. Obviously Perl 5 would fill a large chapter in this book though :)

@Nick, it indeed looks like a special area of natural language processing.

imho lots of "typical" applications just need something from this list (a rough Perl sketch of a few of these follows below):

1) store/retrieve correctly to/from external storage.
2) display it correctly
3) determine length correctly
4) foldcase it correctly.
5) don't produce broken UTF-8 when truncating

6) optional: uppercase/lowercase it correctly.
7) optional: sort with respect to collations.
8) optional: deal with filesystem encoding issues
9) optional: search text correctly.

and they sometimes fail to do even that.
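To illustrate, here's that rough sketch of a few items from the list, assuming Perl 5.16+ (for fc) and the core Encode and Unicode::Collate modules (the output in the comments is what perl prints):

#!/usr/bin/env perl
use v5.16;
use utf8;
use Encode qw(decode encode);
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';

# 1) store/retrieve: decode bytes on the way in, encode characters on the way out
my $chars = decode('UTF-8', "caf\xc3\xa9");
my $bytes = encode('UTF-8', $chars);

# 3) determine length correctly (characters, not bytes)
say length $chars;                             # 4

# 4) foldcase correctly (fc is the Unicode-aware case fold)
say 'equal' if fc('STRASSE') eq fc('straße');  # equal

# 7) sort with respect to collations
my $collator = Unicode::Collate->new;
say for $collator->sort('Zebra', 'Äpfel', 'Apfel');
# Apfel, Äpfel, Zebra (a plain sort would put Äpfel last)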

btw I believe your example above can be solved for most European and Cyrillic languages by normalizing to NFC, so 1 character = 1 grapheme
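Something like this quick sketch with the core Unicode::Normalize module is what I mean (the counts in the comments are what perl prints):

use v5.14;
use utf8;
use Unicode::Normalize qw(NFC);

my $decomposed = "e\x{301}";       # 'e' + COMBINING ACUTE ACCENT: 2 code points
my $composed   = NFC($decomposed); # U+00E9 "é" as a single precomposed code point

say length $decomposed;   # 2
say length $composed;     # 1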

vsespb: I'd have to disagree about this being a "special area of natural language processing" and not a "'typical' application". Maybe Thai isn't the best example because it's so foreign to most Westerners. The reason I used it is because it's a real-world example. Shutterstock recently added support for Thai and Korean. For Thai especially, determining length or truncating simply can't be done without knowledge of grapheme clusters. There is no normalization form that will provide precomposed code points for common Thai characters because they don't exist. Even if you don't explicitly support localization in Thai, you may still need to handle strings in the Thai script, such as user-supplied names, titles, etc. Here are some examples that come to mind for non-composable grapheme clusters in Western scripts: "\x0D\x0A" (CRLF, a single user-perceived control character), "Spın̈al Tap" (Latin-based proper noun), "aку́т" or "окси́я" (Cyrillic with accents, as used in Russian dictionaries and encyclopedias). Additionally, the Unicode Consortium has stated that they don't intend to add new precomposed characters in the future since the existing ones are only included for round-trip compatibility with legacy encodings.
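Here's a small sketch to check those claims with \X, again just core Perl (the counts in the comments are what perl prints):

use v5.14;
use utf8;

# CRLF is one user-perceived control character, i.e. one grapheme cluster
my @crlf = "\x0D\x0A" =~ /\X/g;
say scalar @crlf;           # 1

# "n̈" is 'n' + U+0308 COMBINING DIAERESIS: 2 code points, 1 grapheme cluster
my $band = "Spın̈al Tap";
say length $band;           # 11 code points
my @clusters = $band =~ /\X/g;
say scalar @clusters;       # 10 grapheme clusters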

Just a small note:

The TIOBE algorithm is to search for "<language> programming".

Now Perl has a moniker that is unique to the language. There is no other item in the world that has this name and is in common use. Meanwhile Ruby and Python are named after a gemstone and a snake (or a comedy troupe).

This means that anyone talking about Perl has no reason to say "Perl programming language", while users of the other languages actually need to say "Ruby programming language" or "Python programming language".

Now I invite you to do a little experiment. TIOBE's third-ranked search engine is Wikipedia. Go on there and search for:

* perl: http://en.wikipedia.org/w/index.php?search=perl&title=Special%3ASearch&fulltext=1
* perl programming: http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=%22perl+programming%22&fulltext=Search
* python programming: http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=%22python+programming%22&fulltext=Search
* python: http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=python&fulltext=Search

You'll find that there are actually considerably fewer entries about Python on Wikipedia than about Perl, but due to a small difference in their names, Perl shows up much less often in TIOBE's searches. I'm fairly sure you'll be able to observe the same in the other search engines they use.

@Mithaldu,
+1

@Nick,
у́ and и́ in "aку́т" or "окси́я" are not part of the Russian alphabet. They're just accents and can probably be used in any language. Russians are about as likely to use accents in their text as they are to use characters from foreign alphabets (unless they are writing a dictionary article). The same probably applies to "n̈" (it's not on the keyboard, according to Wikipedia).
CRLF, imho, should be processed in a different layer, unless you're writing a word processor.

It's of course very interesting, and it should indeed be used in applications that have to process Unicode 100% correctly, but if I had reported a bug like "the text truncation function drops the accent from 'у́'" to any company I've worked at in the past, they would probably have ignored it or put it at the bottom of the bug queue.

btw, if a dictionary article gets truncated in the middle of a word with accents, the accent is probably useless anyway and can be dropped.

I don't understand the problem with Unicode.

Still on Perl v5.8.8:

echo "café" | perl -MEncode -lpe'
$_=Encode::decode("utf8",$_);
substr($_, -1, 1) = "e";
Encode::encode("utf8",$_)'

# output: cafe

Is this not possible in Ruby or Python? I'm under the table.

Ruby 1.9.x: possible.

$ echo café | ruby 5.rb
cafe

$ cat 5.rb
# encoding: utf-8
puts STDIN.read.gsub(/é/, 'e')

> I don't understand the problem with Unicode.

see the comments above from @Nick about Unicode properties and grapheme clusters


About xiaoyafeng

I blog about Perl.