Rewriting Gumbo Binding - A GPTrixie 'demo'

I originally wanted to write a short history of GPTrixie, but it would probably be boring, and you can look at the commit history to get an idea of how it evolved. Instead, we will see how to rewrite my Gumbo binding using GPTrixie.


What is Gumbo?

Gumbo is a standalone C99 library that parses HTML5. It is heavily tested and is a project endorsed by Google. See Gumbo on GitHub.


What is GPTrixie?

GPTrixie is a tool that extracts definitions from a C header and transforms them into their Perl 6 NativeCall counterparts. This description is partially false, since it actually extracts the C definitions from an XML file produced by GCCXML. Parsing C is something a compiler like clang or GCC will do far better than I can with my poor compiler knowledge. Anyway, you can find it at GPTrixie on GitHub or just install it with panda install App::GPTrixie

Be careful with GCCXML: a project named CastXML is supposed to replace it. It is based on clang/LLVM, but sadly it does not support modern C (C99 or C11, for example). Some distributions install CastXML in place of gccxml (Debian, for example, but you still get the real gccxml as a gccxml.real binary). You can set the GPT_GCCXML environment variable to point to the right gccxml.

GPTrixie stands for The Great and Powerful Trixie, a fictional character from the show My Little Pony who is a stand-up magician. If you don't like the name, I invite you to find me one that is less boring than NativeCall Generator :)


What we will rewrite, and the goal of this post

Obviously, writing a tool that also writes the logic behind the library is not realistically possible. The goal of this post is to rewrite the Gumbo::Binding component of my Gumbo binding module, which holds the definitions of the Gumbo functions and structures.

Anyone who writes a binding for a C library knows how tedious it can be: it is easy to overlook a field in a structure or to write the wrong type (think of all the pre-Christmas bindings that used int instead of int32, leading to weird bugs on 32-bit x86). My manually written Binding.pm6 took a lot of trial and error, and it is not complete.

Originally I wrote GPTrixie as a tool that you run over a header, then copy and paste the output into your file and change what does not suit you (because it can't know whether a char * is a string, a buffer, or raw data, for example). Now the goal is to have a file holding some options and specifications that make GPTrixie generate a 'ready to use' Binding.pm6 file.



First steps

I recommend going through a few steps with gptrixie before writing the configuration file.

First, run gptrixie on your header file without options. It will give you something like:

root@testperl6:~/piko/cpp/gptrixie# perl6 -I lib bin/gptrixie /usr/local/include/gumbo.h 
Calling GCCXML : gccxml /usr/local/include/gumbo.h -fxml=plop.xml[]
Parsing the XML file
Doing magic
Times -- gccxml: 0.7278153 sec; xml parsing: 3.4344711 sec; magic: 0.3243911
Number of things founds
-Types: 93
-Structures: 10
-Unions: 1
-Enums: 6
-Functions: 12
-Variables: 4
-Files: 3
The last part is interesting. Let me explain a bit. The number of types is the number of C types found. The value is not really relevant, because something like const char * in C generates 3 types (char is one, const char is another, and the pointer to const char is a third). Structures, Unions, Enums, Functions and Variables are quite self-explanatory (Variables are the exported extern variables).
The number of files is important to consider. Let's use the --list-files option to get the list.

f0    : /usr/share/gccxml-0.9/GCC/4.9/gccxml_builtins.h    - Functions(0), Enums(0), Structures(0)
f1    : /usr/local/include/gumbo.h                         - Functions(12), Enums(6), Structures(10)                                                      
f2    : /usr/lib/gcc/i586-linux-gnu/4.9/include/stddef.h   - Functions(0), Enums(0), Structures(0)

f0, f1 and f2 are the ids used by gccxml and gptrixie. The first file is irrelevant. We can see that gumbo.h holds the interesting stuff. stddef.h is there because gumbo.h includes it for standard types such as size_t. In the case of Gumbo we are lucky, because a single file (gumbo.h) holds everything we need. That is not necessarily the case with every library. Here is the output with mysql.h:

f0    : /usr/share/gccxml-0.9/GCC/4.9/gccxml_builtins.h    - Functions(0), Enums(0), Structures(0)
f1    : /usr/include/mysql/mysql.h                         - Functions(104), Enums(5), Structures(16)
f2    : /usr/include/i386-linux-gnu/bits/time.h            - Functions(0), Enums(0), Structures(1)
f3    : /usr/include/i386-linux-gnu/sys/types.h            - Functions(0), Enums(0), Structures(0)
f4    : /usr/include/i386-linux-gnu/bits/pthreadtypes.h    - Functions(0), Enums(0), Structures(5)
f5    : /usr/include/mysql/typelib.h                       - Functions(7), Enums(0), Structures(1)
f6    : /usr/include/i386-linux-gnu/bits/byteswap.h        - Functions(0), Enums(0), Structures(0)
f7    : /usr/include/i386-linux-gnu/bits/sigset.h          - Functions(0), Enums(0), Structures(1)
f8    : /usr/include/i386-linux-gnu/sys/select.h           - Functions(2), Enums(0), Structures(1)
f9    : /usr/include/mysql/my_list.h                       - Functions(7), Enums(0), Structures(1)
f10   : /usr/include/mysql/mysql_com.h                     - Functions(30), Enums(6), Structures(6)
f11   : /usr/include/i386-linux-gnu/bits/types.h           - Functions(0), Enums(0), Structures(1)
f12   : /usr/include/time.h                                - Functions(0), Enums(0), Structures(1)
f13   : /usr/include/mysql/my_alloc.h                      - Functions(0), Enums(0), Structures(2)
f14   : /usr/include/i386-linux-gnu/bits/select2.h         - Functions(0), Enums(0), Structures(0)
f15   : /usr/include/mysql/mysql_time.h                    - Functions(0), Enums(1), Structures(1)
f16   : /usr/include/i386-linux-gnu/sys/sysmacros.h        - Functions(3), Enums(0), Structures(0)
f17   : /usr/lib/gcc/i586-linux-gnu/4.9/include/stddef.h   - Functions(0), Enums(0), Structures(0)

Here we can see that a lot of files are involved: some are part of MySQL, some come from the standard C library. By default GPTrixie generates everything from all the files involved, which can be a nightmare if pthread is pulled in (it's something like 200 functions/structures).

Let's be crazy and generate everything from gumbo.h: we run gptrixie with the options --all and --files=gumbo.h. The --files option takes a list of file basenames (or you can use @f1 if you want to work with the file id).

The output is available here


Analysing the output

The output looks great at first glance; we could probably use it directly as our Binding.pm6. Let's look at some snippets.

enum GumboAttributeNamespaceEnum is export (
   GUMBO_ATTR_NAMESPACE_NONE => 0,
   GUMBO_ATTR_NAMESPACE_XLINK => 1,
   GUMBO_ATTR_NAMESPACE_XML => 2,
   GUMBO_ATTR_NAMESPACE_XMLNS => 3
);
A standard enumeration.


class GumboText is repr('CStruct') is export {
        has Str                           $.text; # const char* text
        HAS GumboStringPiece              $.original_text; # GumboStringPiece original_text
        HAS GumboSourcePosition           $.start_pos; # GumboSourcePosition start_pos
}
Here is a structure. You can see that gptrixie adds the original C definition of each field as a comment. char * is translated to Str; in some cases you will probably want to add the encoding, but Gumbo mainly works with Unicode, so we are safe.


#-From /usr/local/include/gumbo.h:104
#/**
# * Compares two GumboStringPieces, and returns true if they're equal or false
# * otherwise.
# */
#bool gumbo_string_equals(
#    const GumboStringPiece* str1, const GumboStringPiece* str2);
sub gumbo_string_equals(Pointer[GumboStringPiece]     $str1 # const GumboStringPiece*
                       ,Pointer[GumboStringPiece]     $str2 # const GumboStringPiece*
                        ) is native(LIB) returns bool is export { * }

Since functions can be tricky, GPTrixie tries to get the original function definition from the header to help you see whether you need to change something. Here it also catches the Doxygen documentation, a fortunate side effect of how it gets the function definition (GCCXML gives us the end line of the definition, not the start line, so it works backward).

The generated code looks fine at first glance, but there are some issues:

- The first enumeration (GumboTag) is too big. It's not really an issue per se, but the API provides a function that gives you the tag name, and if you look at the Gumbo source code, this list is mostly generated. I decided that I don't want it.

- Some structures are named GumboInternalxx. That is the real name of the structure, but each of them has an associated typedef that gives a more user-friendly way to write the type. By default, GPTrixie does not make the link between a structure and a typedef that may be associated with it.

- You will see a lot of Pointer[GumboInternalNode]. That is because the generator is dumb: everything that is a pointer to something is written this way, with the exception of char *. NativeCall actually works mainly with pointers when dealing with structs, so a void foo(struct mystruct*) can be translated to sub foo(mystruct).
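
To make the last point concrete, here is a minimal sketch that does not involve Gumbo at all. It assumes a Linux-like system where gettimeofday is reachable from the running process and where struct timeval is made of two long fields; both are assumptions of this example, not something GPTrixie generated.

```raku
use NativeCall;

# Mirrors C's `struct timeval` (field layout assumed for Linux/glibc).
class timeval is repr('CStruct') {
    has long $.tv_sec;
    has long $.tv_usec;
}

# C prototype: int gettimeofday(struct timeval *tv, void *tz);
# The struct parameter is declared with the CStruct class itself,
# not with Pointer[timeval]: NativeCall passes it by pointer anyway.
sub gettimeofday(timeval $tv, Pointer $tz) returns int32 is native { * }

my $tv = timeval.new;
my $rc = gettimeofday($tv, Pointer);   # the Pointer type object is passed as NULL
say "rc=$rc, seconds since the epoch: {$tv.tv_sec}";
```

The C function fills the structure through the pointer it receives, and the Perl 6 side reads the fields directly from the object, which is exactly what we would like the generated Gumbo signatures to allow.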


Writing the .gpt file

We can now create the .gpt file. The .gpt file is simply a Perl 6 hash that gets evaluated.

(
module-name => 'Gumbo::Binding',
env-name => 'PERL6_GUMBOLIB',
clib-name => 'gumbo',
clib-abiversion => v1,
merge-struct-typedef => True,
files => ['gumbo.h'],
exclude-enums => ['GumboTag']
);

Some options are quite self-explanatory. merge-struct-typedef makes gptrixie replace a structure name with its associated typedef (and removes the typedef from its known types), fixing the GumboInternalNode issue. exclude-enums takes the list of enums to exclude.
env-name, clib-name and clib-abiversion are used to define the LIB constant. env-name names an environment variable that lets a user manually specify the library file to use.
You can run gptrixie --gptfile gumbo.gpt /usr/local/include/gumbo.h and it will generate a Gumbo-Binding.pm6 file. At this point we should have a usable file.
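
How these three options could combine into a library specification can be sketched as follows. This is only an illustration: find-lib is a hypothetical helper written for this post, and the code GPTrixie really generates for LIB may look different.

```raku
# Hypothetical helper, NOT the code GPTrixie generates.
sub find-lib(Str $env-name, Str $clib-name, Version $abi) {
    # A library file given explicitly through the environment wins...
    return %*ENV{$env-name} if %*ENV{$env-name};
    # ...otherwise fall back to a (name, ABI version) pair, the same
    # information as in `is native('gumbo', v1)`.
    return ($clib-name, $abi);
}

my $lib = find-lib('PERL6_GUMBOLIB', 'gumbo', v1);
say $lib;
```

With PERL6_GUMBOLIB unset, the result is the name and ABI version, which NativeCall turns into a platform-specific file name; with the variable set to a path, that path is used as-is.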

Fixing the Pointer[MyStruct] issue, or a glimpse of GPTrixie internals
There is no option for this, so we need to modify how gptrixie generates the Perl 6 string representing a type in the default (and only) generator. The default generator is dumb (that is literally its name: DumbGenerator). But it looks like an easy fix: when we find a pointer to a structure, we generate the type as if it were the structure itself and ignore the pointer.
Let's have a look at the existing code (here) and at how it turns a pointer to char into a Str.

return 'Str' if ($t.ref-type ~~ FundamentalType and $t.ref-type.name eq 'char') ||
      ($t.ref-type ~~ QualifiedType and $t.ref-type.ref-type ~~ FundamentalType and $t.ref-type.ref-type.name eq 'char');

It looks a bit lengthy because it handles 2 cases: char * and const char *. GPTrixie keeps types much as they appear in gccxml, where a complex type is generally a combination of types. A FundamentalType is char, int, void and so on; a QualifiedType is a const.
The fix looks quite easy: when we find a pointer to a structure, we just call the function itself on the structure type. Sadly, it translates into something more complicated because of typedef and const:
return resolve-type($t.ref-type, $cpt + 1) if ($t.ref-type ~~ StructType) ||
      ($t.ref-type ~~ QualifiedType and $t.ref-type.ref-type ~~ StructType) ||
      ($t.ref-type ~~ TypeDefType and $t.ref-type.ref-type ~~ StructType);
That seems to be enough for this.
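
To see the two rules work together, here is a self-contained sketch on a toy type model. The classes below are simplified stand-ins that only borrow the names used above; the real GPTrixie classes carry more information, and the real resolve-type takes an extra counter argument that is left out here. PointerType and the small %native table are assumptions of this sketch.

```raku
# Simplified stand-ins for GPTrixie's internal type classes (assumption).
class FundamentalType { has Str $.name }                    # char, int, void...
class StructType      { has Str $.name }
class QualifiedType   { has $.ref-type }                    # const T
class TypeDefType     { has Str $.name; has $.ref-type }
class PointerType     { has $.ref-type }                    # T*

# Tiny illustrative C-to-NativeCall mapping, not GPTrixie's real table.
my %native = 'char' => 'int8', 'int' => 'int32';

multi resolve-type(FundamentalType $t) { %native{$t.name} // $t.name }
multi resolve-type(StructType $t)      { $t.name }
multi resolve-type(TypeDefType $t)     { $t.name }
multi resolve-type(QualifiedType $t)   { resolve-type($t.ref-type) }
multi resolve-type(PointerType $t) {
    my $r = $t.ref-type;
    # char * and const char * become Str
    return 'Str' if ($r ~~ FundamentalType and $r.name eq 'char')
        or ($r ~~ QualifiedType and $r.ref-type ~~ FundamentalType
            and $r.ref-type.name eq 'char');
    # pointer to a struct (possibly const or typedef'ed): resolve the pointee
    return resolve-type($r) if $r ~~ StructType
        or ($r ~~ QualifiedType and $r.ref-type ~~ StructType)
        or ($r ~~ TypeDefType   and $r.ref-type ~~ StructType);
    # anything else stays a typed Pointer
    'Pointer[' ~ resolve-type($r) ~ ']'
}

my $char  = FundamentalType.new(name => 'char');
my $piece = StructType.new(name => 'GumboStringPiece');

say resolve-type(PointerType.new(ref-type => $char));                                   # Str
say resolve-type(PointerType.new(ref-type => QualifiedType.new(ref-type => $char)));    # Str
say resolve-type(PointerType.new(ref-type => QualifiedType.new(ref-type => $piece)));   # GumboStringPiece
say resolve-type(PointerType.new(ref-type => FundamentalType.new(name => 'int')));      # Pointer[int32]
```

The third call is the const GumboStringPiece* case from gumbo_string_equals: with the new rule it resolves to the plain struct name instead of Pointer[GumboStringPiece].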

Using the file in Gumbo
It's now time to use the generated file to replace my manually written binding file. I just copied the Gumbo-Binding.pm6 file in place of lib/Gumbo/Binding.pm6. The first step is to change the names of the Gumbo types in the files that use the binding, since I used camel case when writing the original binding. The second is to remove some nativecast calls, because I had declared some functions as returning Pointer instead of the proper structs.
Let's run a test. After fixing some other compilation errors it finally runs, but it segfaults. After some investigation, it appears that exporting a cglobal does not work. The output generated by gptrixie does not work, and even after changing the assignment (=) to a binding (:=) it gives me an (Any) value for the variable. Looking back at the old binding, I had already run into this issue (the code is commented out), and I had to use cglobal directly in the code that uses the binding.
This made me change the file generated by GPTrixie so that I can reuse the LIB value that defines how to find the Gumbo library.
our $GUMBO_LIB is export = LIB;
You may notice that I used a $-sigiled variable and not a constant: currently Rakudo misapplies the is export trait on a constant and produces an error.

Conclusion
Generating the Binding/Raw file for the C library you want to use with GPTrixie works quite well. I am surprised it worked with so little additional work on my side. One can argue that Gumbo is an easy case, with only one header file and no weird outside definitions, but I think it's a good example and also a nice validation for me that GPTrixie works the way I imagined it.

Does it end here? Well, for the Gumbo binding, probably. But there is still a lot of work to do in GPTrixie. The part that generates the file is a bad copy and paste of the main code that uses the DumbGenerator to produce output, and I will probably want to write a generator that allows finer control over the Perl 6 code generated for functions and structures. I also want ways for the generator to do extra work, like adding the $GUMBO_LIB variable for me, so that I never have to touch the generated file.



About Sylvain Colinet

I blog about Perl 6.