August 2017 Archives

C comments and regular expressions

The C programming language has two kinds of comments, ones with a start and end marker of the form /* comment */, and another one which starts off with two slashes, //, and goes to the end of the line, like Perl comments. The /* */ kind are the original kind, and the // kind were borrowed from C++.[1]

Let's suppose you need to match the original kind of C comments. A simple regex might look something like this:

qr!/\*.*\*/!

Here we've escaped the asterisks in the comment with a backslash, \, and used exclamation marks, !, to delimit the start and end of the regex, so that we don't have to escape / with a backslash.

However, C comments have the feature that they can extend over multiple lines:

/*
  Comment
*/

which means that the above regex doesn't work. The problem is that the dot, ., doesn't match newlines. If we add the s flag to the end of the regex, the match succeeds:

qr!/\*.*\*/!s

The s flag alters . so that it matches newlines.
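As a quick check of the effect of the s flag, here is a minimal sketch which tries both forms of the regex on the multi-line comment above. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $comment = "/*\n  Comment\n*/";

# Without the s flag, "." cannot cross the newline after "/*".
my $without_s = qr!/\*.*\*/!;
# With the s flag, "." matches newlines too.
my $with_s = qr!/\*.*\*/!s;

for my $pair (['without s', $without_s], ['with s', $with_s]) {
    my ($name, $re) = @$pair;
    my $result = $comment =~ $re ? 'matches' : 'does not match';
    print "$name: $result\n";
}
```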

Unfortunately, this still doesn't solve the problem of matching C comments. Here is an example program where it will fail:

int x; /* x coordinate */
int y; /* y coordinate */

Can you see why? The problem is that the .* in the regular expression always matches as much as it can, so it will swallow up the first */ in the program and go on consuming until it reaches the last one.
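The greedy behaviour is easy to see by printing what the regex actually captures. This is a small sketch; first_match is a helper made up for the example.

```perl
use strict;
use warnings;

my $c = "int x; /* x coordinate */\nint y; /* y coordinate */\n";

# Return the first thing in $text which $re matches, or undef.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ /($re)/ ? $1 : undef;
}

my $greedy = qr!/\*.*\*/!s;
my $match = first_match ($greedy, $c);
# One single match which runs from the first "/*" to the last "*/".
print "[$match]\n";
```

The match includes "int y;", which is code and not a comment.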

One way to solve this problem is to use something like [^*], "match anything except an asterisk".

qr!/\*[^*]*\*/!

This seems to work, but there is a flaw. Although

/*
 * comment 
 */

is OK as far as C is concerned, the regex refuses to match it because of the extra asterisk on the middle line, so now we have to add an extra clause to match a run of asterisks followed by something which is neither an asterisk nor a slash:

qr!/\*([^*]|\*+[^*/])*\*/!

Excluding the asterisk as well as the slash stops this clause from splitting a run of asterisks and matching straight past a comment end like **/.

But even this still has a problem. With a comment like

/***** comment ******/

it can't match the final */, so we need to change that to

qr!/\*([^*]|\*+[^*/])*\*+/!

with a + after the final * so that it can match multiple asterisks before the final slash.
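Here is a rough sketch which tries a regex of this kind on the awkward cases. It writes the inner class as [^*/], so that a run of asterisks cannot be split and matched past the end of a comment, and it uses a non-capturing group, (?:...), which is a choice made for this example only. The function name and the test strings are invented.

```perl
use strict;
use warnings;

# Only character classes and repetition, no non-greedy operators.
my $trad_comment_re = qr!/\*(?:[^*]|\*+[^*/])*\*+/!;

# Return a list of all the /* */ comments found in $c.
sub find_trad_comments {
    my ($c) = @_;
    my @found;
    while ($c =~ /($trad_comment_re)/g) {
        push @found, $1;
    }
    return @found;
}

my @tests = (
    "/*\n * comment\n */",
    "/***** comment ******/",
    "int x; /* x **/ int y; /* y */",
);
for my $test (@tests) {
    my @found = find_trad_comments ($test);
    printf "%d comment(s) found\n", scalar @found;
}
```

The last test string is the interesting one, since a comment ending in **/ must not run on into the following comment.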

If you're using something like lex, that's the best you can do,[2] but fortunately Perl regexes have a few more useful abilities. The one which is useful here is the non-greedy matching ?, which changes .* from matching as much as possible to matching as little as possible.

qr!/\*.*?\*/!s

matches all kinds of original C comments without swallowing the ends of them.
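As a sketch of the non-greedy version at work, this collects every comment from a short piece of made-up C, including the multi-line and many-asterisk forms from earlier.

```perl
use strict;
use warnings;

my $trad_comment_re = qr!/\*.*?\*/!s;

my $c = <<'EOF';
int x; /* x coordinate */
int y; /* y coordinate */
/*
 * multi-line
 */
/***** banner ******/
EOF

# In list context, a match with the g flag returns every capture.
my @comments = $c =~ /($trad_comment_re)/g;
print "Found: $_\n" for @comments;
```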

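The other kind of comments look much easier to match - there are no multiple lines, and the end marker is the end of the line. With the m flag, which makes $ match at the end of every line rather than only at the end of the string, a regex like

qr!//.*$!m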

should be enough to match them all.
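One thing to watch when the whole program is held in a single string is the meaning of $. The following sketch compares the regex with and without the m flag on two lines of made-up C. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $c = "int x; // x coordinate\nint y; // y coordinate\n";

# Without m, "$" only matches at the end of the whole string.
my $without_m_re = qr!//.*$!;
# With m, "$" matches at the end of each line.
my $with_m_re = qr!//.*$!m;

my @without_m = $c =~ /($without_m_re)/g;
my @with_m = $c =~ /($with_m_re)/g;
printf "%d without m, %d with m\n", scalar @without_m, scalar @with_m;
```

Without the m flag only the comment on the final line is found.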

Let's consider matching all comments in a C program. Suppose we have two regexes, $trad_comment_re for the first kind of comment and $cxx_comment_re for the second kind. Naively we might write something like

 $c =~ m/$trad_comment_re|$cxx_comment_re/

Can you guess the pitfall? The problem is false positives with things which look like comments but aren't:

 char * c = "/* this is not a comment. */";

That's not a comment but a string. Although that one might seem unlikely, you'll also get false positives with things like

 const char * web_address = "https://www.google.com";

because of the // in the URL.[3]
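To see the false positives, here is a sketch which runs the naive combination over the two lines above. The definitions of the two regexes are assumptions, the non-greedy form for /* */ and an m-flagged form for //.

```perl
use strict;
use warnings;

my $trad_comment_re = qr!/\*.*?\*/!s;
my $cxx_comment_re = qr!//.*$!m;

my $c = <<'EOF';
char * c = "/* this is not a comment. */";
const char * web_address = "https://www.google.com";
EOF

# Neither line contains a comment, yet both produce a "match".
my @false_positives;
while ($c =~ m/($trad_comment_re|$cxx_comment_re)/g) {
    push @false_positives, $1;
}
print "False positive: $_\n" for @false_positives;
```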

A regex to match a C string looks like this:

 qr/"(\\.|[^"\\])*"/

C strings start and end with double quotes, and they can also include double quotes after a backslash, hence we need to also match \".
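The following sketch checks a string regex against a string containing escaped quotes. It uses the form "(?:\\.|[^"\\])*", which repeatedly consumes either a backslash plus the character after it or any character other than a quote or backslash, so that an escaped quote cannot end the match early. The sample line of C is made up.

```perl
use strict;
use warnings;

my $string_re = qr/"(?:\\.|[^"\\])*"/;

# Single quotes, so the backslashes reach the regex exactly as in C source.
my $c = 'printf ("She said \"hi\" /* not a comment */\n"); /* real */';

if ($c =~ /($string_re)/) {
    print "String: $1\n";
}
```

The match runs from the opening quote to the true closing quote, past both of the \" escapes.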

Let's say we want to match all comments. Then we also need to match C strings and discard them, something like this:

while ($c =~ /($string_re|$trad_comment_re|$cxx_comment_re)/g) {
    my $comment = $1;
    # Anything which starts with a double quote is a string, so skip it.
    if ($comment =~ /^"/) {
        next;
    }
    # Now we have valid comments.
}
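Putting the pieces together, a self-contained sketch of the loop might look like the following. The three regex definitions are assumptions based on the discussion above, and find_comments is a name invented for the example.

```perl
use strict;
use warnings;

my $string_re = qr/"(?:\\.|[^"\\])*"/;
my $trad_comment_re = qr!/\*.*?\*/!s;
my $cxx_comment_re = qr!//.*$!m;

# Return all the comments in $c, ignoring anything inside a string.
sub find_comments {
    my ($c) = @_;
    my @comments;
    while ($c =~ /($string_re|$trad_comment_re|$cxx_comment_re)/g) {
        my $token = $1;
        # Strings are matched only so that they can be skipped.
        next if $token =~ /^"/;
        push @comments, $token;
    }
    return @comments;
}

my $c = <<'EOF';
int x; /* x coordinate */
char * c = "/* this is not a comment. */";
const char * web_address = "https://www.google.com"; // a real comment
EOF

print "Comment: $_\n" for find_comments ($c);
```

This sketch does not handle character constants such as '"', which would be mistaken for the start of a string.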

If this all sounds like too much work, try my module C::Tokenize, which offers all of these regular expressions. The function strip_comments also takes into account some features of the C language itself, such as the fact that

int/* comment */x;

is a valid C declaration, by inserting a space in place of the comment.
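The idea of putting a space where the comment was can be sketched in plain Perl as follows. This is not the code of C::Tokenize, just an illustration of the technique. The function name strip_comments_sketch and the regex definitions are assumptions.

```perl
use strict;
use warnings;

my $string_re = qr/"(?:\\.|[^"\\])*"/;
my $trad_comment_re = qr!/\*.*?\*/!s;
my $cxx_comment_re = qr!//.*$!m;

# Replace each comment with one space, leaving strings untouched.
sub strip_comments_sketch {
    my ($c) = @_;
    # $1 is defined only when the string alternative matched.
    $c =~ s{($string_re)|$trad_comment_re|$cxx_comment_re}
           {defined ($1) ? $1 : ' '}ge;
    return $c;
}

print strip_comments_sketch ("int/* comment */x;"), "\n";
```

Replacing the comment with an empty string instead would glue "int" and "x" together into "intx".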

It can even be used to strip C-style comments from JSON, since JSON strings are identical to C strings for the purpose of matching.

[1] According to Dennis Ritchie, the // comments were the comment style of BCPL, a predecessor of C, and were resurrected by C++.

[2] See my C parser cfunctions for an example of lex regexes.

[3] Apparently these were a mistake which Sir Tim Berners-Lee only noticed when he tried to match C comments using a regex.

Check compression in web page

This module offers a way to check your web pages for correct gzip compression. It not only checks that your web page is compressed properly when required, but also checks that the web page is not accidentally compressed when not required, and that the compression actually does something useful in terms of reducing the page size. I wrote it because I couldn't find anything to do that on CPAN.

It's compatible with Test::More if you want to run the compression checks automatically.

New ways to include images in CPAN modules

The latest release of Test::podimage, version 0.05, shows a few interesting experimental ways to include images in CPAN modules.

It seems there is a way to show image files from the distribution on metacpan by using a leading slash, which I'd never heard of until now. There's also a new "=for image" tag. Oddly enough, although this was proposed only five days ago, the CPAN grep search site tells us that this tag already appears in some CPAN modules, such as this one from 2010. The above don't actually display on metacpan, though, perhaps because there is no leading slash in the image name.

A data URL can also be used.

Most of the image formats don't work on search.cpan.org.

Split a .pm into a .pod and a .pm

I searched on CPAN but was unable to find a way to split a .pm into a .pod and a .pm, so I made this script:

https://www.lemoda.net/perl/split-pod-from-pm/index.html

It's proved quite handy so far. Recently I took over maintenance of an old module called Net::IPv6Addr as part of the CPAN day. Today I upgraded the documentation a little so that the synopsis example is machine readable:

https://metacpan.org/source/BKB/Net-…

Script to update some modules

I have some modules which I need to periodically install on a web server, and cannot use cpan or cpanm to do this. One of the problems with this is that the local copies I made of the modules sometimes get out of date with the CPAN version. The following script updates the local copies of the modules. This uses make_regex from Convert::Moji to make a matching regex for a list of modules, but you can use list2re from Data::Munge in place of that.

# Check whether there are newer versions of the modules on the web
# site.

use warnings;
use strict;
use utf8;
use FindBin '$Bin';
my $updatedir = '/some/dir/or/another';
# This just runs "system" and checks the return value.
use Deploy 'do_system';
use Convert::Moji 'make_regex';
use File::Slurper 'read_text';
use version;
my @modules = <$updatedir/*.tar.gz>;
my %mods;
for my $module (@modules) {
    $module =~ s!.*/!!;
    my $mod;
    my $version;
    if ($module =~ /([\w-]+)-([0-9\.]+)\./) {
        $mod = $1;
        $version = $2;
    }
    else {
        warn "no version in $module";
        next;
    }
    $mod =~ s/-/::/g;
    $version = version->declare ($version)->numify ();
    $mods{$mod} = $version;
}
my $re = make_regex (keys %mods);
my $file = '02packages.details.txt';
if (! -f $file || -M $file > 1) {
    do_system ("wget http://www.cpan.org/modules/$file.gz;gzip -d $file.gz");
}
my $cpan = read_text ($file);
my @cpan = split /\n/, $cpan;
for (@cpan) {
    if (/^$re\s+([0-9\._-]+)\s+(\S+)/) {
        my $match = $1;
        my $cpanv = $2;
        my $download = $3;
        $cpanv = version->declare ($cpanv)->numify ();
        print "Found $match version $cpanv\n";
        if ($cpanv > $mods{$match}) {
            print "UPDATE THIS ONE $_.\n";
            chdir $updatedir or die $!;
            do_system ("wget https://cpan.metacpan.org/authors/id/$download");
            chdir $Bin or die $!;
        }
    }    
}
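The two pattern-matching steps of the script can be tried out without any downloads. In this sketch, names_regex is a core-Perl stand-in for make_regex or list2re, not the implementation of either. It wraps the alternation in a capturing group so that $1 holds the module name. The sample line only imitates the layout of 02packages.details.txt.

```perl
use strict;
use warnings;

# Turn "Text-Fuzzy-0.26.tar.gz" into a module name and a version.
sub parse_tarball {
    my ($file) = @_;
    $file =~ s!.*/!!;
    if ($file =~ /([\w-]+)-([0-9\.]+)\./) {
        my ($mod, $version) = ($1, $2);
        $mod =~ s/-/::/g;
        return ($mod, $version);
    }
    return;
}

# Make one capturing regex from a list of names, longest first.
sub names_regex {
    my @names = sort { length ($b) <=> length ($a) } @_;
    my $alternatives = join '|', map { quotemeta ($_) } @names;
    return qr/($alternatives)/;
}

my ($mod, $version) = parse_tarball ('/some/dir/Text-Fuzzy-0.26.tar.gz');
print "Local copy: $mod $version\n";

my $re = names_regex ('Text::Fuzzy', 'Net::IPv6Addr');
# An illustrative line, not real index data.
my $line = 'Text::Fuzzy    0.27  B/BK/BKB/Text-Fuzzy-0.27.tar.gz';
if ($line =~ /^$re\s+([0-9\._-]+)\s+(\S+)/) {
    print "CPAN copy: $1 $2 at $3\n";
}
```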

Use STRLEN not int for SvPV

Obscure bugs occur with the following type of code:

 unsigned int len;
 c = SvPV (sv, len);

The bugs typically occur on 64-bit systems. They happen because unsigned int may be a 32-bit integer, but the second argument to SvPV should be STRLEN, which is an unsigned integer the size of size_t, and so usually 64 bits wide on such systems. Giving a pointer to a 32-bit integer where a 64-bit integer is expected causes some very odd bugs, and may even crash the interpreter. So always write it like this:

 STRLEN len;
 c = SvPV (sv, len);

and never use anything which is not STRLEN type.

I have a collection of more weird and wonderful XS bugs, found through CPAN testers, here:

https://www.lemoda.net/perl/perl-xs-cpan-testers/index.html

Despite having known about this for years, I just found another instance in my own module, thanks to the warning messages from clang, in Text::Fuzzy:

https://metacpan.org/source/BKB/Text-Fuzzy-0.26/Fuzzy.xs#L51

I've just now updated it:

https://metacpan.org/source/BKB/Text-Fuzzy-0.27/Fuzzy.xs#L51

Perhaps it would be worth making some kind of automated checker to go through XS code and make sure the second argument to SvPV is always STRLEN.
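As a very rough idea of what such a checker could look like, here is a text-based sketch. It is only a heuristic. It ignores scope, comments and macros, only looks at plain SvPV calls with a simple first argument, and check_svpv_lengths is a made-up name.

```perl
use strict;
use warnings;

# Return the names of variables used as the second argument of SvPV
# which are not declared as STRLEN anywhere in the text of $xs.
sub check_svpv_lengths {
    my ($xs) = @_;
    # Collect names declared like "STRLEN len;" or "STRLEN a, b = 0;".
    my %strlen;
    while ($xs =~ /\bSTRLEN\s+([^;()]+);/g) {
        for my $name (split /\s*,\s*/, $1) {
            $name =~ s/\s*=.*//s;       # Drop any initializer.
            $name =~ s/^\s+|\s+$//g;    # Trim whitespace.
            $strlen{$name} = 1;
        }
    }
    my @bad;
    while ($xs =~ /\bSvPV\s*\(\s*[^,()]+,\s*(\w+)\s*\)/g) {
        push @bad, $1 unless $strlen{$1};
    }
    return @bad;
}

my $xs = <<'EOF';
unsigned int len;
STRLEN ok_len;
c = SvPV (sv, len);
d = SvPV (sv2, ok_len);
EOF

print "Suspicious length variable: $_\n" for check_svpv_lengths ($xs);
```

A real checker would need to parse the C properly, since the same variable name can be declared with different types in different functions.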

About Ben Bullock

Perl user since about 2006, I have also released some CPAN modules.