How Not To Write A Subroutine

By Ovid on April 15, 2010 1:18 PM

It's been pointed out to me that many programmers have issues with my blog because I'm writing about allomorphism, exceptions as flow control and other topics that are, frankly, not something a beginner programmer is going to warm to. I think that's a fair point and I'm going to try to start including information aimed at those new to Perl and those new to programming.

I'll start with subroutines. Specifically, an old, horrible example.

Back in the days of CGI being king, you often saw a variant of the following routine to parse CGI data:

 1 # don't use this
 2 sub parse_cgi {
 3   my @pairs;
 4   if ( $ENV{'REQUEST_METHOD'} eq 'GET' ) {
 5     @pairs = split( /&/, $ENV{'QUERY_STRING'} );
 6
 7   }
 8   elsif ( $ENV{'REQUEST_METHOD'} eq 'POST' ) {
 9     read( STDIN, $buffer, $ENV{'CONTENT_LENGTH'} );
10     @pairs = split( /&/, $buffer );
11
12   }
13   else {
14     print "Content-type: text/html\n\n";
15     print "<P>Use Post or Get";
16   }
17   my %formdata;
18   foreach my $pair (@pairs) {
19     my ( $key, $value ) = split( /=/, $pair );
20
21     # Convert plusses to spaces
22     $key =~ tr/+/ /;
23
24     # Convert Hex values to ASCII
25     $key   =~ s/%([a-fA-F0-9] [a-fA-F0-9])/pack("C", hex($1))/eg;
26     $value =~ tr/+/ /;
27     $value =~ s/%([a-fA-F0-9] [a-fA-F0-9])/pack("C", hex($1))/eg;
28
29     $value =~ s/<!--(.|\n)*-->//g;
30
31     if ( $formdata{$key} ) {
32       $formdata{$key} .= ", $value";
33     }
34     else {
35       $formdata{$key} = $value;
36     }
37   }
38   return %formdata;
39 }

There are, frankly, so many things wrong with this (I won't touch on all of them because I could write for hours), but I'll touch on some of the highlights.

Note: this is very, very old code and we generally avoid CGI today, but I still see this and variants cropping up (fortunately, less than I used to).

Black Boxes

The first thing to remember about a subroutine is that it should be a "black box". You pass data in and get data out. Here you would actually want a second routine which reads the form data and and then hands it to the parse_cgi code. Why? See the POST handling? When you read from STDIN, you empty STDIN. That data cannot be reread without jumping through serious hoops (and who wants to explain how to tie STDIN?). What happens if you want to debug that? If you try to print the STDIN data after this subroutine, the data have been read and you have nothing to print. Print the data before this subroutine and the subroutine has no data to read. Oops. You lose.

So a better strategy would be to call it like this:

my %formdata = parse_cgi(read_form_data());

And at the top of the parse_cgi() code:

sub parse_cgi {
    my $raw_form_data = shift;
    my @pairs = split /&/, $raw_form_data;

That allows you to capture the form data before this call and do other things with it, if needed. Of course, it also allows you to do this, if you want:

my %formdata = parse_cgi(read_form_data_from_file($filename));

You might not need to do that, but by separating these behaviors and treating your subroutines as black boxes, you get a lot of flexibility.

Cohesion

Subroutines and methods should generally do one thing and do it well (we call this cohesion). The code in question both reads form data and attempts to parse it. In "Black Boxes", I showed how passing arguments to a subroutine makes it more flexible. This was achieved by separating the reading of data from the parsing of it. However, there's still another classic mistake here:

29     $value =~ s/<!--(.|\n)*-->//g;

I used to see this a lot in code. Aside from the fact that this regular expression is broken badly, what's it doing there? I've never seen anyone explain it, but my guess is that this was some strange attempt to strip server side includes (SSIs) from the data. If so, this is a terribly misguided attempt to provide some security for incoming data (there are much better ways of dealing with this, but I won't touch on that here).

If that's the case, what do you do if you want HTML comments and SSIs? Maybe you want to log the raw data, but since the substitution is poor, it has a good chance of mangling said data and you won't know because you tried to do too much in one routine (separating the reading and parsing can work around this). I can't imagine why you would want to keep them, but if you do, this code won't let you.

If you really do want to properly scrub your incoming data, though, there's a lot more than SSIs you want to eliminate. Thus, you should have dedicated code written for scrubbing the data, not trying to hack it into parsing code.

Understand Your Data

See this?

31     if ( $formdata{$key} ) {
32       $formdata{$key} .= ", $value";
33     }
34     else {
35       $formdata{$key} = $value;
36     }

There are two serious flaws here. First, let's look at a valid query string on a URL (and note that ';' is a valid separator, but the ampersand is all the code split on):

http://www.example.com/?sport=football;sport=baseball

It's perfectly OK to have multiple values for a single key. The venerable CGI.pm would allow you to fetch them with this:

my @sports = $cgi->param('sport');

This code, however, would return this (complete with extra space):

%formdata = (
    sport => "football, baseball",
);

Presumably this is to allow you to split on commas (plus that extra space tucked in there). But what if the value contains a comma? Oops. We have multiple values. That should be an array reference, not a string.

Now let's look at another valid URL:

http://www.example.com/?sport=football;sport=sport=baseball

Guess what your data structure has?

%formdata = (
    sport => "baseball",
);

Why it contains this is an exercise for the reader.

If you don't really understand the nature of your data, you're going to write bad code like this.

Reuse Code

Even for this relatively simple task (and there are plenty of other issues I haven't touched on), it's easy to misunderstand what's going on. That's why, when CGI was so popular, the Perl community enjoined everyone to use the CGI.pm module. We don't do that any more -- there are far better ways of approaching this problem -- but we still don't recommend that you write your own code.

Summary

Subroutines should be black boxes that do one thing well. Don't rely on globals and don't do too much. They'll be more flexible that way.

These "new user" friendly blog posts will be an "on again/off again" sort of thing, but if they turn out to be popular, I'll write more.

12 comments

Tagged as:

newbies, subroutines

12 Comments

Ed Avis | April 15, 2010 3:00 PM

Another glaring bug in the 'formdata' code: what happens if the URI has this?

?sport=0;sport=baseball

anthony.derobert.net | April 15, 2010 6:48 PM

There is another great benefit to splitting that subroutine into the read & parse pieces as shown in "Black Boxes":

You can then test the parser.

A novice can figure out how to call parse_cgi on a bunch of test strings and confirm the result is as expected; doing it on the combined form is much harder (tie or kluges involving pipe come to mind).

Mark Stosberg | April 15, 2010 7:32 PM

This post relates the code snippet with CGI and CGI.pm. To be clear, this code is not in CGI.pm.

While CGI.pm may be old, and there are other good options, CGI.pm generally works securely, correctly and with good performance and documentation.

Roman Hubacek | April 15, 2010 7:58 PM

Thanks for nice post.

In exercise for reader, suppose the line 5 and 10 split is modified to /[&;]/, I am rather getting

{ sport => "football, sport" }

instead of what you write. Am I missing something?

Anonymous | April 16, 2010 1:33 AM

What are the alternatives to CGI.pm? I'm an old fart who hasn't done CGI stuff in over a decade.

brian d foy | April 16, 2010 5:12 AM

To be fair, if you're saying this because of our discussion in Vienna, it's not that many programmers have issues with your blog or that the people who said something similar in that conversation meant that you should do anything different.

I said that you had a limited audience (in a non-bad way) because your readers had to actually know quite a bit of craft to follow you. That's not a bad thing. The Perl world needs that too. There has to be something for people to follow once they learn how the basics of programming or testing.

Please keep bringing the other posts with the big words, etc. :)

Ovid | April 16, 2010 7:48 AM

@brian: our discussion in Vienna was the catalyst, but not the initial cause. This has been kicking around in my brain for a while. I just keep remembering how hard it was when I was leading Portland.pm to get material aimed at new people. We had (have) a group of very strong Perl programmers, some of the best known in the world, and trying to get a new programmer to stand in front of them was hard. Likewise, the Portland pros tended to have talks on OO programming, Parrot, prototyped-based programming, hacking Perl internals and other topics which a new programmer just won't get.

I want to try and encourage conversation at all levels, so maybe my nifty debugger hacks aren't helping newer programmers :) I also considered starting a second blog here, one aimed at developers coming to Perl. That would make it easier for some folks to filter which content they wanted to read.

Ovid replied to comment from Roman Hubacek | April 16, 2010 8:00 AM

@Roman: you are correct about the about. That split statement is much improved. I'd probably write something closer to this, though (assuming a modern Perl):

sub parse_cgi {
    my $raw_form_data = shift;
    my @pairs = split /[&;]/ => $raw_form_data;

    my %form_data;
    foreach my $pair (@pairs) {
        my ($k,$v) = split /=/ => $pair;
        foreach ($k,$v) {
            tr/+/ /;
            s/%([[:xdigit:]]{2})/pack("C", hex($1))/eg;
        }
        $form_data{$k} //= [];
        push @{ $form_data{$k} } => $v;
    }
    return \%form_data;
}

(There might be bugs in that. I wrote it off the top of my head)

By using an array reference instead of a string, I'm guaranteed to never worry about whether or not there's an embedded comma in a value. The following would break the original code:

www.example.com/?book=Hobbit,The

As a side note: by returning a hash reference, you never have to worry about whether you're called in scalar or list context.

Mark Stosberg | April 16, 2010 2:20 PM

CGI::Simple is a fork of CGI.pm that is written in a clearer style, removing some legacy cruft. But note the entries that remain in the bug tracker for it. CGI::Simple has far fewer users and there are sometimes in bugs in CGI::Simple that have been fixed in CGI.pm.

Today I'm thinking of a fix multi-line header parsing which will appear in the next release of CGI::Simple:

http://github.com/AndyA/CGI--Simple/commit/87d77e57cea951d8df59a6805166d8567d93245e

For something more radically different, a number of frameworks provide their own HTTP request and response handling including Plack, Catalyst and Mojo.

Roman Hubacek replied to comment from Ovid | April 18, 2010 7:17 AM

Thanks for reply and nice example.

Completely agree with returning the reference. Just recently I've been bitten by returning list -- I returned list of arrayrefs and sent that to Template Toolkit. Worked well until there was only one item in the list (TT traversed the arrayref as top-level list). Took me quite some time to figure out.

Btw, thanks for your testing series. I started using test just about half year ago and it really helping me find better ways how to test various scenarios.

Marco | April 21, 2010 1:22 AM

given that this is supposedly a 'beginner's post' I wonder if you could explain why you tend to use => instead of , in the push/split invocations; is it for visibility? some other reason I haven't yet divined in my quest towards perl monkdom?

Thanks for the post and the blog in general, it's been very useful in stretching my brain towards a greater perl understanding.

Ovid replied to comment from Marco | April 21, 2010 6:51 AM

@Marco: I can't say it's a particularly good practice (particularly since it's going to break in Perl 6), but early on in my Perl programming, I found I preferred the visual distinction between the subject and predicate in certain constructions for which this is appropriate:

my $result = join "\n"  => @subject;
@subject   = split /\n/ => $result;

I simply found that clearer than the alternative:

my $result = join "\n",  @subject;
@subject   = split /\n/, $result;

Separating the subject and predicate this way made sense to me, but it's something I should probably stop since it gives people pause and confusing newer programmers is not a good thing. (It's not a bad thing, either, but only in the sense that there's a difference between confusing code being a side-effect or a result of deliberate practice).

About Ovid

Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/

More info »

Ovid