A tour of perl-5.18.1 with c2ast, a Marpa-powered C parser

The section on reserved names in the GCC documentation gives several recommendations that the vast majority of C programs in the world probably do not follow.

Nevertheless, these are all good practices, and an interesting exercise was to check how the latest stable perl source code fares against these recommendations. As of this writing, that is perl-5.18.1.

The analysis was done with the c2ast.pl tool from the MarpaX::Languages::C::AST package, a Marpa::R2-powered C parser previously announced in this blogs.perl.org entry by Jeffrey Kegler.

The detailed list of "violations" is at this gist, generated on Linux with gcc version 4.8.2 (Debian 4.8.2-10). It is interesting to note that the 1316 hits fall into 8 categories, 4 of which account for nearly 99% of them (1302 hits):

Hits Message
------------
466  The header file sys/stat.h reserves names prefixed with 'st_'
     and 'S_'
465  The header file fcntl.h reserves names prefixed with 'l_', 'F_',
     'O_', and 'S_'
238  Names that begin with either 'is' or 'to' followed by a lowercase
     letter may be used for additional character testing and conversion
     functions.
133  Names beginning with 'str', 'mem', or 'wcs' followed by a lowercase
     letter are reserved for additional string and array functions
  9  Names that end with '_t' are reserved for additional type names
  2  Names beginning with a capital 'E' followed by a digit or uppercase
     letter may be used for additional error code names
  2  The header file limits.h reserves names suffixed with '_MAX'
  1  The header file dirent.h reserves names prefixed with 'd_'

The top 10 files are:

Top 10 files (number of hits)
-----------------------------
 161 re_comp.c           
 136 regcomp.c           
 105 toke.c              
  96 op.c                
  95 ir_04_t.c           
  63 sv.c                
  57 pp_ctl.c            
  56 byte_t.c            
  50 re_exec.c           
  49 regexec.c           

This latter statistic does not account for identifiers that generate more than one hit: the unit being counted is a message generated as per the GCC recommendations, not an identifier.
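
The counting rule is easier to see with a small sketch. The C functions below are a hypothetical re-implementation of the eight name patterns from the table above, not the code that c2ast.pl actually runs. The function names (count_reserved_hits, has_prefix, and so on) are invented for illustration, and the sketch ignores which headers a file really includes. It returns how many messages a single identifier would generate.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers: these names are illustrative, not part of c2ast. */

/* True when name starts with prefix. */
static int has_prefix(const char *name, const char *prefix) {
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* True when name starts with prefix and the next character is a lowercase letter. */
static int prefix_then_lower(const char *name, const char *prefix) {
    size_t n = strlen(prefix);
    return strncmp(name, prefix, n) == 0 && islower((unsigned char)name[n]);
}

/* True when name ends with suffix. */
static int has_suffix(const char *name, const char *suffix) {
    size_t ln = strlen(name), ls = strlen(suffix);
    return ln >= ls && strcmp(name + ln - ls, suffix) == 0;
}

/* Number of messages (hits) one identifier would generate: one per matching rule. */
int count_reserved_hits(const char *name) {
    int hits = 0;

    /* sys/stat.h: 'st_' and 'S_' */
    if (has_prefix(name, "st_") || has_prefix(name, "S_")) hits++;
    /* fcntl.h: 'l_', 'F_', 'O_' and 'S_' */
    if (has_prefix(name, "l_") || has_prefix(name, "F_") ||
        has_prefix(name, "O_") || has_prefix(name, "S_")) hits++;
    /* 'is' or 'to' followed by a lowercase letter */
    if (prefix_then_lower(name, "is") || prefix_then_lower(name, "to")) hits++;
    /* 'str', 'mem' or 'wcs' followed by a lowercase letter */
    if (prefix_then_lower(name, "str") || prefix_then_lower(name, "mem") ||
        prefix_then_lower(name, "wcs")) hits++;
    /* names ending with '_t' */
    if (has_suffix(name, "_t")) hits++;
    /* capital 'E' followed by a digit or an uppercase letter */
    if (name[0] == 'E' && (isdigit((unsigned char)name[1]) ||
                           isupper((unsigned char)name[1]))) hits++;
    /* limits.h: names ending with '_MAX' */
    if (has_suffix(name, "_MAX")) hits++;
    /* dirent.h: 'd_' */
    if (has_prefix(name, "d_")) hits++;

    return hits;
}
```

Under these assumptions, an identifier starting with 'S_' (say a made-up name such as S_regnode) matches both the sys/stat.h rule and the fcntl.h rule, so it counts as two hits. That overlap may be why those two categories have almost identical totals.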

Marpa::R2 is now a robust piece of software, and c2ast is just one example of what can be achieved with it.

5 Comments

A couple of points:
o) The numbers at the start of each line are hits, right? So I'd add a heading: 'Hits Message'.

o) I'm wondering if the phrase "The header file sys/stat.h reserves names prefixed with 'st_' and 'S_'" might be better expressed as "The header file sys/stat.h uses a reserved name, i.e. one prefixed with 'st_' or 'S_'".

@Ron: Actually the issue is that sys/stat.h is entitled to use names prefixed with st_, and reserves all of them for future use. The problem is that the Perl source is using one of those reserved names. So Jean-Damien's message is correct as it stands.

Note this is not just a matter of GNU recommendations: C90 and C99 reserve these names and threaten that bad things will happen if the C code trespasses on them. Trespass, however, is what even the best C coders often do these days, and with impunity. The temptation to ignore the reservations is strong: they are overbroad and annoying, and there has been no good tool to spot violations. Jean-Damien has at least fixed this last part of the problem.

This is a nice demonstration. But ...


"Nevertheless, these are all good practices"

Debatable. POSIX doesn't allow me to have a typedef benbullock_t or a function called strike_me_with_a_hammer () in my program since it may clash with the future POSIX something or other. These restrictions are just a little ridiculous, and whatever problems do actually occur could be fixed when they do actually occur, not by hogging gigantic spaces of identifiers.
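
For concreteness, here is a minimal sketch of the two declarations mentioned above. Only the names benbullock_t and strike_me_with_a_hammer come from the comment; the struct field and the function body are invented for illustration. Current compilers and C libraries should accept this without complaint, yet both names fall inside the reserved spaces listed in the table: one ends with '_t', the other begins with 'str' followed by a lowercase letter.

```c
/* Ends with '_t': flagged as reserved for additional type names. */
typedef struct {
    int hammer_blows; /* hypothetical field, for illustration only */
} benbullock_t;

/* Begins with 'str' followed by a lowercase letter: flagged as reserved
   for additional string and array functions. */
int strike_me_with_a_hammer(benbullock_t *victim) {
    victim->hammer_blows += 1;
    return victim->hammer_blows;
}
```

Nothing breaks today. The risk being debated is only that a future header could declare the same names with a different meaning.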

How is it debatable? My understanding is that it's not impossible that some future compiler will fail to compile the source because of this. Is that true? If so, then this work is valuable and we should at least have it on the list.

I'm not a C/C++ programmer and don't know the pains here, but it seems that with a tool like this one we can detect and solve them. Obviously the work is a pain in the ass to deal with, but I guess that's the result of using an open standard like C.


About Jean-Damien Durand

About::Me::And::Perl