A widespread and broken Perl idiom

The following code uses a widespread Perl idiom that takes advantage of features designed for one-liners:

my $content = do { local ( @ARGV, $/ ) = ("$file"); <> };

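The idiom sets @ARGV to the name of the file, so that the diamond operator <> opens and reads it. Another trick here is the array assignment, which slurps all elements of the list into @ARGV and leaves $/ set to undef; that way, the diamond operator reads the entire file at once.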
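The list-assignment part can be seen in isolation with ordinary variables. In this minimal sketch, @args and $sep are hypothetical stand-ins for @ARGV and $/, and the file names are made up:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for @ARGV and $/ (not the real special variables).
my $sep = "\n";
my @args;

# In a list assignment, the array swallows every element of the list,
# so any scalar listed after it is assigned undef.
( @args, $sep ) = ( 'a.txt', 'b.txt' );

print "args: @args\n";
print "sep is ", ( defined $sep ? 'defined' : 'undef' ), "\n";
```

The same thing happens in the idiom: @ARGV takes the whole list, and $/ ends up undefined for the duration of the do block.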

I've been using and publicizing this local @ARGV idiom for years. And I spent several days last week chasing a bug caused by this line of code.

The bug was hard to find

I'm using wallflower to generate a static version of a dynamic site. This site has several URLs that can't be reached from the root of the site, so I use the --filter option to feed it all the non-crawlable URLs.

In filter mode, the code ends up reading all the files given on the command line (or STDIN if no files are given) using the same local @ARGV idiom:

local @ARGV = @_;
while (<>) {
    ...
}

In my case, the pages were not generated, even when explicitly listed. In fact, only the first link from the first file fed to the script was actually read (and recursively crawled).

It took me a while to figure out that Perl believed the ARGV filehandle (the one read by <>) had reached end-of-file, and stopped the while loop after reading a single line. (I have to thank the Perl debugger for this, especially its { command. It allowed me to keep track of the value of eof ARGV while stepping through my program.)

I fixed the issue in a small commit. (And yes, that means the --filter feature has been broken from the moment it was introduced, in October 2012.)

The cause is well documented

The issue is that the magical <> touches all three of @ARGV, $ARGV and the ARGV filehandle, so localizing @ARGV alone is not enough.

The inner code that slurped an entire file using <> and an undef $/ left ARGV at end-of-file. This caused the outer while loop to stop far too early. Talk about action at a distance!
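The effect can be reproduced in a few lines. This is only a sketch, not the wallflower code: make_file and slurp_broken are hypothetical helpers, and the temporary files stand in for the URL list and for whatever file the inner code slurps.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: create a temporary file holding $text, return its name.
sub make_file {
    my ($text) = @_;
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    print {$fh} $text;
    close $fh or die "close: $!";
    return $name;
}

# The idiom under discussion: only @ARGV and $/ are localized,
# while $ARGV and the ARGV filehandle stay global.
sub slurp_broken {
    my ($file) = @_;
    return do { local ( @ARGV, $/ ) = ($file); <> };
}

my $list  = make_file("one\ntwo\nthree\n");
my $other = make_file("unrelated content\n");

my @seen;
{
    local @ARGV = ($list);
    while ( my $line = <> ) {
        chomp $line;
        push @seen, $line;
        my $content = slurp_broken($other);    # reads through the shared ARGV handle
    }
}
print "lines seen by the outer loop: ", scalar(@seen), "\n";
```

The list file has three lines, but the loop body runs only once: by the time control returns to the outer while, the shared ARGV filehandle is already at end-of-file.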

This bit from perlop about <> says it all:

It really does shift the @ARGV array and put the current filename into the $ARGV variable. It also uses filehandle ARGV internally. <> is just a synonym for <ARGV>, which is magical.

perlvar has the details about those three variables:

  • @ARGV: The array @ARGV contains the command-line arguments intended for the script. $#ARGV is generally the number of arguments minus one, because $ARGV[0] is the first argument, not the program's command name itself. See $0 for the command name.
  • $ARGV: Contains the name of the current file when reading from <>.
  • ARGV: The special filehandle that iterates over command-line filenames in @ARGV. Usually written as the null filehandle in the angle operator <>. Note that currently ARGV only has its magical effect within the <> operator; elsewhere it is just a plain filehandle corresponding to the last file opened by <>. In particular, passing "*ARGV" as a parameter to a function that expects a filehandle may not cause your function to automatically read the contents of all the files in @ARGV.
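To watch these variables move together, here is a small sketch that records $ARGV and the size of @ARGV for each line read through <>. The make_file helper and the file contents are assumptions made for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: create a temporary file holding $text, return its name.
sub make_file {
    my ($text) = @_;
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    print {$fh} $text;
    close $fh or die "close: $!";
    return $name;
}

my $first  = make_file("alpha\n");
my $second = make_file("beta\n");

# Pretend both files were given on the command line.
@ARGV = ( $first, $second );

my @trace;
while ( my $line = <> ) {
    chomp $line;
    # $ARGV names the file being read; @ARGV holds the files not yet opened.
    push @trace, [ $line, $ARGV, scalar @ARGV ];
    printf "%-5s from %s, %d file(s) left in \@ARGV\n", $line, $ARGV, scalar @ARGV;
}
```

Each time <> moves on to the next file, it shifts a name off @ARGV and stores it in $ARGV, so @ARGV is empty by the time the last file is being read.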

The fix is simple

Whenever you are about to use local @ARGV in combination with <>, localize the entire *ARGV glob instead, so that @ARGV, $ARGV and the ARGV filehandle are all restored when the scope ends.
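A minimal sketch of what that can look like, written as separate statements (localize the glob first, then fill @ARGV). The names slurp and make_file are hypothetical helpers, and the temporary files are only there to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: create a temporary file holding $text, return its name.
sub make_file {
    my ($text) = @_;
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    print {$fh} $text;
    close $fh or die "close: $!";
    return $name;
}

# Slurp a file through <> without disturbing the caller's ARGV state.
sub slurp {
    my ($file) = @_;
    local *ARGV;        # fresh @ARGV, $ARGV and ARGV filehandle for this scope
    @ARGV = ($file);
    local $/;           # undef record separator: read the whole file at once
    my $content = <>;
    return $content;
}

my $list  = make_file("one\ntwo\nthree\n");
my $other = make_file("unrelated content\n");

my ( @seen, @slurped );
{
    local *ARGV;
    @ARGV = ($list);
    while ( my $line = <> ) {
        chomp $line;
        push @seen, $line;
        push @slurped, slurp($other);    # no longer affects the outer loop
    }
}
print "lines seen by the outer loop: ", scalar(@seen), "\n";
```

With the whole glob localized in slurp, the outer loop gets its own filehandle back after each call and reads all three lines.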

6 Comments

So could you show us the correct idiom, please?

I became aware of this as far back as my PerlMonks days, so at least 15 years ago. Namely, I memorised that I should be writing the idiom like this:

my $content = do { local ( *ARGV, $/ ) = [ ... ]; <> };

This differs from the incorrect version by exactly 3 character swaps.

Sadly, this elegant approach doesn't solve the original problem.

Localized assignment to a typeglob only localizes the slot being assigned to. The rest of the typeglob remains unlocalized, which means the magic <> still messes up the global $ARGV and the global *ARGV{IO} filehandle.

For example:

sub report_ARGV {
    use Data::Dump 'pp';
    say shift;
    say '  $ARGV       = ', pp $ARGV;
    say '  @ARGV       = ', pp @ARGV;
    say '  %ARGV       = ', pp %ARGV;
    say '  *ARGV->tell = ', *ARGV->tell;
}
report_ARGV('before:');
my $content = do { local ( *ARGV, $/ ) = [ __FILE__ ]; <> };
report_ARGV('after:');

produces:

before:
  $ARGV       = undef
  @ARGV       = ()
  %ARGV       = ()
  *ARGV->tell = -1
after:
  $ARGV       = "/tmp/slurp_demo.pl"
  @ARGV       = ()
  %ARGV       = ()
  *ARGV->tell = 429

Oh, wow.

I had to go all the way to

$_ = [ __FILE__ ] for local *ARGV;

to make it work in a single statement. Even something like

*{ \local *ARGV } = [ __FILE__ ];

wouldn’t work, despite the fact that circumfix deref is effectively a do { } block with a whole separate inner scope (e.g. *{ my $x = 'hi'; \local *ARGV } = [ __FILE__ ]; say $x; is a strict vars violation).

I don’t think local is that intelligent, so what’s going on must be something like this: local only schedules the localisation, which doesn’t actually happen until the next point at which the temps stack gets cleaned up… or something like that. (I’m not actually a guts hacker, unfortunately. It would help to read the actual implementation of localisation…)

PS: There is of course no point in writing it that way once it becomes that subtle and that much of a mouthful. The two-statement version is simple and obvious.

About BooK

Pink.