Learning XS - Regular Expressions
Over the past year, I’ve been self-studying XS and have now decided to share my learning journey through a series of blog posts. This eighth post introduces you to Perl regular expressions in XS.
We should all know what a regular expression is?
But if you don’t, here’s a quick refresher: A regular expression (regex) is a sequence of characters that forms a search pattern. They can be used for string matching, searching, and manipulation. Regular expressions are widely used in programming languages, including Perl, to perform complex text processing tasks.
In Perl regular expressions can be precompiled and executed using the 'qr//' operator, which allows you to create a regex object that can be reused multiple times. This is particularly useful for performance when the same regex is used repeatedly. You can also use the 'm//' operator to match a regex against a string, and the 's///' operator to perform substitutions.
In XS, we can use the Perl regular expression engine to perform regex operations. The XS interface provides access to the Perl regex engine, allowing us to compile and execute regular expressions directly from C code. Here is a summary of the REGEXP related functions we are going to cover today:
Function/Macro | Description |
---|---|
pregcomp | Compiles a regular expression and returns a REGEXP pointer. |
pregexec | Executes a compiled regular expression against a string. |
pregfree | Frees a compiled regular expression. |
For more details, see perlreguts.
Now lets write an example XS module that uses regular expressions to match a string against a pattern. We will create a new module called 'Regex::Match' that will export three functions 'match_any', 'match_all' and 'match_count'. These functions will take a string or a compiled regex as the first argument and a list of 'strings' as the remaining. The 'match_any' function will return the first match if any of the strings match the regex, 'match_all' will return all matches if any strings match the regex, and 'match_count' will return the number of matches. In perl this looks like:
package Regex::Match;
use 5.006;
use strict;
use warnings;
use parent 'Exporter';
our @EXPORT = qw/match_any match_all match_count/;
sub match_any {
my ($reg, @items) = @_;
my $match;
for (@items) {
next if ref $_;
if ($_ =~ $reg) {
$match = $_;
last;
}
}
return $match;
}
sub match_all {
my ($reg, @items) = @_;
my @matches;
for (@items) {
next if ref $_;
if ($_ =~ $reg) {
push @matches, $_;
}
}
return \@matches;
}
sub match_count {
my ($reg, @items) = @_;
return scalar @{ match_all($reg, @items) };
}
1;
Now lets create the XS code that implements these functions. We will create a new distribution called 'Regex::Matched'.
module-starter --module="Regex::Match" --author="Your Name" --email="your email"
Then update the Makefile.PL to include XSMULTI and open the default generated 'Regex/Match.pm' file and update to the following boilerplate:
package Regex::Match;
use 5.006;
use strict;
use warnings;
our $VERSION = '0.01';
require XSLoader;
XSLoader::load('Regex::Match', $VERSION);
1;
Now create a new file called 'Regex/Match.xs' and add the following code:
#define PERL_NO_GET_CONTEXT
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
MODULE = Regex::Match PACKAGE = Regex::Match
PROTOTYPES: ENABLE
void
import(...)
CODE:
char *pkg = HvNAME((HV*)CopSTASH(PL_curcop));
int pkg_len = strlen(pkg);
STRLEN retlen;
int i = 1;
for (i = 1; i < items; i++) {
char * ex = SvPV(ST(i), retlen);
int name_len = pkg_len + retlen + 3;
char *name = (char *)malloc(name_len);
snprintf(name, name_len, "%s::%s", pkg, ex);
// Register the XS functions for the import
free(name);
}
We have extended our default boilerplate to include the 'import' function, which will be used to register our XS functions. Now we can implement the three functions 'match_any', 'match_all', and 'match_count'. We have a couple of options on how to do this, but today we will create a reusable c function that accepts three arguments an SV containing our regex and an AV containing our parameters, and a boolean to identify whether or not to return early. We will start with match_any, first lets add a new test file called 't/01_match_any.t' with the following content:
use Test::More;
use Regex::Match qw(match_any);
my @emails = qw/
foo@perl.org
bar@perl.com
baz@perl.co.uk
pow@perl.com
zap@perl.org
/;
is(match_any("perl.com", @emails), 'bar@perl.com');
is(match_any("nope.com", @emails), undef);
is(match_any(qr/perl.com/, @emails), 'bar@perl.com');
is(match_any(qr/nope.com/, @emails), undef);
done_testing();
We test twice the same checks, once with a string and once with a compiled regex object. Now open the 'Regex/Match.xs' file and add the following code above the package definition:
AV * regex_match(SV *pattern_sv, AV *input_av, int match_all) {
dTHX;
STRLEN retlen;
REGEXP *rx;
if (!SvROK(pattern_sv)) {
rx = pregcomp(pattern_sv, 0);
} else if (SvTYPE(SvRV(pattern_sv)) == SVt_REGEXP) {
SvREFCNT_inc(pattern_sv);
rx = (REGEXP *)SvRV(pattern_sv);
} else {
croak("Invalid pattern not a string or regex object");
}
AV *matches = newAV();
int len = av_len(input_av) + 1;
for (int i = 0; i < len; i++) {
SV *input_sv = *av_fetch(input_av, i, 0);
if (SvROK(input_sv)) croak("Input must be a string, not a reference");
char *input = SvPVutf8(input_sv, retlen);
I32 nmatch = pregexec(rx, input + 0, input + retlen, input, 0, input_sv, 0);
if (nmatch > 0) {
av_push(matches, input_sv);
if (!match_all) break;
}
}
return matches;
}
We will step through this code, first we check if the 'pattern_sv' is not a reference aka a string, if it is then we compile it using 'pregcomp'. If it is a reference and is of type 'SVt_REGEXP', we increment the reference count to ensure it is not freed while we are using it, else we croak as the SV does not contain a string or regex. Next, we create a new AV to hold our matches and iterate over the input AV. For each input SV, we extract the string and execute the regex using 'pregexec'. If there is a match, we push the SV onto our matches AV. If 'match_all' is false, we break out of the loop after the first match. We then return the matches AV. You will notice a new type definitions REGEXP, which is documented here and two functions we use 'pregcomp' to compile the regex pattern and 'pregexec' to execute the regex against the input string. The 'pregexec' function returns the number of matches found, which we use to determine if we should add the input string to our matches AV. It accepts the regex, a pointer to the start of the input string, a pointer to the end of the input string, the input string iteself, the min length of the match, thd input SV aka the SV to match against and some flag I always set to 0
Now we can implement the 'match_any' function in XS. Add the following code above the import XSUB.
SV *
match_any(...)
CODE:
AV * array = av_make(items, MARK+1);
SV * regex = av_shift(array);
AV * matches = regex_match(regex, array, 0);
RETVAL = av_shift(matches);
OUTPUT:
RETVAL
This is pretty straightforward, however av_make might be new to you. It creates a new AV with the specified number of elements, in this case, the number of items passed to the function. We then shift the first item off the array, which is our regex pattern, and call our 'regex_match' function with the regex and the rest of the items in the array. Finally, we return the first match found as a scalar. With that code in place you will need to extend your import XSUB to export the new function:
if (strcmp(ex, "match_any") == 0) {
newXS(name, XS_Regex__Match_match_any, __FILE__);
}
And then test:
make
make test
Now lets implement the 'match_all' function, its pretty straight forward as most of the work is already done. First add a test file called 't/02_match_all.t' with the following content:
use Test::More;
use Regex::Match qw(match_all);
my @emails = qw/
foo@perl.org
bar@perl.com
baz@perl.co.uk
pow@perl.com
zap@perl.org
/;
is_deeply(match_all('perl.com', @emails), ['bar@perl.com', 'pow@perl.com']);
is_deeply(match_all('nope.com', @emails), []);
is_deeply(match_all(qr/perl.com/, @emails), ['bar@perl.com', 'pow@perl.com']);
is_deeply(match_all(qr/nope.com/, @emails), undef);
And then extend the XS code in 'Regex/Match.xs' with the following XSUB:
SV *
match_all(...)
CODE:
AV * array = av_make(items, MARK+1);
SV * regex = av_shift(array);
AV * matches = regex_match(regex, array, 1);
RETVAL = newRV_noinc((SV*)matches);
OUTPUT:
RETVAL
Update your import XSUB to export the new function:
else if (strcmp(ex, "match_all") == 0) {
newXS(name, XS_Regex__Match_match_all, __FILE__);
}
And you can test again. Finally we repeat the step for match_count. Create a new test file called 't/03_match_count.t' with the following content:
use Test::More;
use Regex::Match qw(match_count);
my @emails = qw/
foo@perl.org
bar@perl.com
baz@perl.co.uk
pow@perl.com
zap@perl.org
/;
is(match_count('perl.com', @emails), 2);
is(match_count('nope.com', @emails), 0);
is(match_count(qr/perl.com/, @emails), 2);
is(match_count(qr/nope.com/, @emails), 0);
extend the XS code in 'Regex/Match.xs' with the following XSUB:
SV *
match_count(...)
CODE:
AV * array = av_make(items, MARK+1);
SV * regex = av_shift(array);
AV * matches = regex_match(regex, array, 1);
RETVAL = newSViv(av_len(matches) + 1);
SvREFCNT_dec(matches);
OUTPUT:
RETVAL
And update your import XSUB to export the new function:
else if (strcmp(ex, "match_count") == 0) {
newXS(name, XS_Regex__Match_match_count, __FILE__);
}
Now you can test again. If all goes well, you should see all tests pass.
That concludes our introduction to using regular expressions in XS. We have created a simple XS module that allows us to match strings against regex patterns, both as strings and compiled regex objects. This is a powerful feature of XS that allows us to leverage the full power of Perl's regex engine directly from C code.
Until next time, happy coding!
Leave a comment