C comments and regular expressions
The C programming language has two kinds of comments, ones with a start and end marker of the form /* comment */, and another one which starts off with two slashes, //, and goes to the end of the line, like Perl comments. The /* */ kind are the original kind, and the // kind were borrowed from C++.[1]
Let's suppose you need to match the original kind of C comments. A simple regex might look something like this:
qr!/\*.*\*/!
Here we've escaped the asterisks in the comment with a backslash, \, and used exclamation marks, !, to delimit the start and end of the regex, so that we don't have to escape / with a backslash.
However, C comments have the feature that they can extend over multiple lines:
/*
Comment
*/
which means that the above regex doesn't work. The problem is that the dot, ., doesn't match newlines. If we add the s flag to the end of the regex, the match succeeds:
qr!/\*.*\*/!s
The s flag alters . so that it matches newlines.
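A quick way to see the effect of the s flag is to run both versions against a comment which spans several lines. This is only a minimal sketch, and the variable names are made up for illustration:

```perl
use strict;
use warnings;

# A traditional comment spread over three lines.
my $multi_line = "/*\nComment\n*/";

my $without_s = qr!/\*.*\*/!;
my $with_s    = qr!/\*.*\*/!s;

# Without s, the dot stops at the first newline, so the match fails.
my $plain_matches = $multi_line =~ $without_s ? 1 : 0;
# With s, the dot also matches newlines, so the whole comment matches.
my $s_matches = $multi_line =~ $with_s ? 1 : 0;

print "without s: $plain_matches, with s: $s_matches\n";
```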
Unfortunately, this still doesn't solve the problem of matching C comments. Here is an example program where it will fail:
int x; /* x coordinate */
int y; /* y coordinate */
Can you see why? The problem is that the .* in the regular expression always matches as much as it can, so it will swallow up the first */ in the program and go on consuming until it reaches the second one.
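The over-matching is easy to reproduce. The following sketch, with an illustrative variable name, captures what the greedy regex matches in the two-line program above:

```perl
use strict;
use warnings;

my $c = "int x; /* x coordinate */\nint y; /* y coordinate */\n";

# Greedy .* with the s flag runs on to the last */ in the text.
my ($greedy_match) = $c =~ m!(/\*.*\*/)!s;

print "$greedy_match\n";
```

The captured text starts at the first /* and ends at the second */, so the declaration of y is swallowed along with the two comments.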
One way to solve this problem is to use something like [^*], "match anything except an asterisk".
qr!/\*[^*]*\*/!
This seems to work, but there is a flaw. Although
/*
* comment
*/
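is OK as far as C is concerned, the regex refuses to match it because of the extra asterisk on the middle line. So now we have to add an extra clause to match a run of asterisks, except where it is followed by a slash:
qr!/\*([^*]|\*+[^*/])*\*/!
The character class after \*+ has to exclude the asterisk as well as the slash. With [^/] alone, the clause could consume the asterisk of a closing */ after backtracking, and something like /***/ x */ would be matched as a single comment.
But even this still has a problem. With a comment like
/***** comment ******/
it can't match the final */, so we need to change that to
qr!/\*([^*]|\*+[^*/])*\*+/!
with a + after the final * so that it can match multiple asterisks before the final slash.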
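Here is a small check of this style of regex against the awkward cases discussed so far. It is a sketch rather than a complete test suite: the variable names are invented, the group is written as non-capturing with (?: ), and the second branch uses [^*/] so that a run of asterisks cannot absorb the asterisk which closes the comment.

```perl
use strict;
use warnings;

# Pattern which needs neither the s flag nor non-greedy matching.
my $portable_re = qr!/\*(?:[^*]|\*+[^*/])*\*+/!;

my @tests = (
    "int x; /* x coordinate */ int y; /* y coordinate */",
    "/*\n * comment\n */",
    "/***** comment ******/",
    "/***/ not a comment */",
);

# Collect the first comment found in each test string.
my @found;
for my $text (@tests) {
    my ($first) = $text =~ /($portable_re)/;
    push @found, $first;
}

print "[$_]\n" for @found;
```

The last test string is the interesting one: the match should stop at /***/ rather than running on to the later */.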
If you're using something like lex, that's the best you can do,[2] but fortunately Perl regexes have a few more useful abilities. The one which is useful here is the non-greedy modifier ?, which changes .* from matching as much as possible to matching as little as possible.
qr!/\*.*?\*/!s
matches all kinds of original C comments without swallowing the ends of them.
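To illustrate, this sketch pulls every traditional comment out of a short piece of C using the non-greedy regex with the g flag. The sample text and variable names are invented for the example:

```perl
use strict;
use warnings;

my $c = "int x; /* x coordinate */\nint y; /*\n * y coordinate\n */\n";

# In list context, a match with the g flag returns every captured comment.
my @comments = $c =~ m!(/\*.*?\*/)!gs;

print scalar(@comments), " comments found\n";
print "[$_]\n" for @comments;
```

Each comment stops at its own */, including the one which spans several lines and has an asterisk in the middle.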
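The other kind of comments look much easier to match: there are no multiple lines, and the end marker is the end of the line, so a regex like
qr!//.*$!m
should be enough to match them all. The m flag makes $ match at the end of every line, not just at the end of the whole string, which matters when the regex is run over an entire program rather than a single line.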
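One thing to watch when the text being searched contains more than one line is that $ only matches at the end of each line if the m flag is present. The following sketch, using made-up variable names, compares the regex with and without the flag:

```perl
use strict;
use warnings;

my $c = "int x; // x coordinate\nint y; // y coordinate\nint z;\n";

my $without_m = qr!//.*$!;
my $with_m    = qr!//.*$!m;

# Without m, $ only matches at the very end of the text, and neither
# comment is on the last line, so nothing is found.
my $found_without_m = $c =~ $without_m ? 1 : 0;

# With m, $ matches before every newline, so both comments are found.
my @comments = $c =~ /($with_m)/g;

print "found without m: $found_without_m\n";
print "[$_]\n" for @comments;
```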
Let's consider matching all comments in a C program. Suppose we have two regexes, $trad_comment_re for the first kind of comment and $cxx_comment_re for the second kind. Naively we might write something like
$c =~ m/$trad_comment_re|$cxx_comment_re/
Can you guess the pitfall? The problem is false positives with things which look like comments but aren't:
char * c = "/* this is not a comment. */";
That's not a comment but a string. Although that one might seem unlikely, you'll also get false positives with things like
const char * web_address = "https://www.google.com";
because of the // in the URL.[3]
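A regex to match a C string looks like this:
qr/"(?:\\.|[^"\\])*"/
C strings start and end with double quotes, and they can also include double quotes after a backslash. So between the quotes the regex repeatedly matches either a backslash followed by any character, which covers \", or any character other than a double quote or a backslash.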
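As a check, here is a sketch which runs a string regex over a line of C containing escaped quotes. It assumes a variant of the regex which repeats the alternation and keeps backslashes out of the character class, so that an escaped quote cannot end the match early. The sample line is invented.

```perl
use strict;
use warnings;

my $string_re = qr/"(?:\\.|[^"\\])*"/;

# Single quotes keep the backslashes exactly as they would appear in C.
my $c = 'char * c = "say \"hi\" /* not a comment */"; /* real */';

my ($string) = $c =~ /($string_re)/;

print "$string\n";
```

The match runs from the opening quote to the real closing quote, passing over both escaped quotes and the comment-like text inside the string.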
To match all the comments, then, we also need to match C strings and then discard them, something like this:
while ($c =~ /($string_re|$trad_comment_re|$cxx_comment_re)/g) {
    my $match = $1;
    # Anything which starts with a double quote is a string, so skip it.
    next if $match =~ /^"/;
    # Now $match holds a genuine comment.
}
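Putting the pieces together, a complete version of the loop might look like the sketch below. The three regexes are written out here as assumptions so that the example runs on its own (a string regex which handles backslash escapes, the non-greedy traditional comment regex, and the // regex with the m flag), and the sample C text is invented.

```perl
use strict;
use warnings;

my $string_re       = qr/"(?:\\.|[^"\\])*"/;
my $trad_comment_re = qr!/\*.*?\*/!s;
my $cxx_comment_re  = qr!//.*$!m;

my $c = <<'EOF';
const char * web_address = "https://www.google.com"; // home page
char * s = "/* this is not a comment. */";
int x; /* x coordinate */
EOF

my @comments;
while ($c =~ /($string_re|$trad_comment_re|$cxx_comment_re)/g) {
    my $match = $1;
    # Strings are matched only so that their contents get skipped.
    next if $match =~ /^"/;
    push @comments, $match;
}

print "[$_]\n" for @comments;
```

Because the strings are consumed as whole matches, the // in the URL and the /* */ inside the second string never get the chance to start a comment match.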
If this all sounds like too much work, try my module C::Tokenize, which offers all the regular expressions. The function strip_comments also takes into account some features of the C language itself, such as that
int/* comment */x;
is a valid C declaration, by inserting a space in place of the comment.
It can even be used to strip C-style comments from JSON, since JSON strings are identical to C strings for the purpose of matching.
[1] According to Dennis Ritchie, the // comments were the comment style of BCPL, a predecessor of C, and were resurrected by C++.
[2] See my C parser cfunctions for an example of lex regexes.
[3] Apparently these were a mistake which Sir Tim Berners-Lee only noticed when he tried to match C comments using a regex.
Currently I am using this for getting the C multiline comments with lookahead. Maybe the regex is worth a try, too?
Your problem is the greediness of the .* here:
qr!/\*.*\*\/!s
Just use a lazy version:
qr!/\*.*?\*\/!s