Benchmarking string trimming

Clever Regexps vs Multiple Simple Regexps:

In reading some code I ran across the expression s/^\s*|\s*$//g, which is a trim function. It is not the optimal way to write this. The optimal way is two simpler expressions: s/^\s+//; s/\s+$//. Justification follows.


  • Use of + instead of * means regexps that would do no effective work also fail to match. Failing to match when the work would be useless yielded some 3x to 4x improvement.

  • Use of multiple simpler patterns like s/^...//;s/...$// instead of compound patterns like s/^...|...$//g enabled boundary checking optimizations.
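The two styles being compared can be wrapped up as trim functions; this sketch (the sub names are mine, not from the original) shows both side by side:

```perl
use strict;
use warnings;

# The recommended style: two simple anchored substitutions.
# Each uses + so it fails fast when there is no whitespace to remove.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;    # strip leading whitespace
    $s =~ s/\s+$//;    # strip trailing whitespace
    return $s;
}

# The compound-alternation style found in the original code.
sub trim_compound {
    my ($s) = @_;
    $s =~ s/^\s*|\s*$//g;
    return $s;
}
```

Both produce the same result; the difference is purely in how much work the regexp engine does along the way.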


String length:

long:  more than 80 chars
short: fewer than 80 chars


pre/post: "  string  "
pre:      "  string"
post:     "string  "
base:     "string"

Coding styles:

*g: s/^\s*|\s*$//g
+g: s/^\s+|\s+$//g
2*: s/^\s*//; s/\s*$//
2+: s/^\s+//; s/\s+$//

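A harness for producing numbers like those below can be built on Perl's core Benchmark module. This is a sketch, not the original script: the test strings and iteration count are my assumptions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative test strings (the original lengths are not specified
# beyond "more/fewer than 80 chars").
my $short = '  ' . ( 'x' x 10 )  . '  ';   # short pre/post
my $long  = '  ' . ( 'x' x 200 ) . '  ';   # long pre/post

# Compare the four coding styles on the short string.
cmpthese( 200_000, {
    'short *g' => sub { my $s = $short; $s =~ s/^\s*|\s*$//g },
    'short +g' => sub { my $s = $short; $s =~ s/^\s+|\s+$//g },
    'short 2*' => sub { my $s = $short; $s =~ s/^\s*//; $s =~ s/\s*$// },
    'short 2+' => sub { my $s = $short; $s =~ s/^\s+//; $s =~ s/\s+$// },
    'long 2+'  => sub { my $s = $long;  $s =~ s/^\s+//; $s =~ s/\s+$// },
} );
```

cmpthese prints both raw rates (ops/sec) and a percentage comparison matrix; the rates below came from a run of this kind.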
Calculated results:

>>  short pre 2+      1638810/s
>>  short base 2+     1622457/s
>>  short post 2+     1351812/s
>>  short pre/post 2+ 1152253/s
>>  long base 2+       564477/s
>>  long pre 2+        534890/s
    short base +g      532709/s
    short post +g      502626/s
>>  long post 2+       501015/s
    short pre +g       479683/s
    short pre/post +g  465137/s
>>  long pre/post 2+   463741/s
    short base 2*      462448/s
    short pre 2*       456719/s
    short pre/post 2*  450081/s
    short post 2*      449661/s
    short base *g      394226/s
    short pre *g       384360/s
    short post *g      367736/s
    short pre/post *g  367624/s
    long post 2*       114832/s
    long base 2*       113787/s
    long pre 2*        110305/s
    long pre/post 2*   110169/s
    long post +g       100847/s
    long base +g        99830/s
    long pre +g         98871/s
    long pre/post +g    98331/s
    long base *g        87066/s
    long post *g        86520/s
    long pre *g         84080/s
    long pre/post *g    81429/s


I usually prefer the two statements because I find such code easier to maintain. It's interesting to see there is a concrete speed benefit as well, though even the slowest style clocked in at over 81,000 operations per second. I wouldn't expect that to be a bottleneck in many situations.

This approach is documented in Jeffrey Friedl's Mastering Regular Expressions (pp. 199-200):

"I use the language's equivalent of s/^\s+//; s/\s+$// because it's almost always fastest, and is certainly the easiest to understand."


About Josh ben Jore: I blog about Perl.