Benchmarking string trimming

By Josh ben Jore on August 16, 2010 11:10 AM

Clever Regexps vs Multiple Simple Regexps:

While reading some code I ran across the expression s/^\s*|\s*$//g, which is a trim function. It is not the optimal way to write this. The optimal way is two simpler expressions: s/^\s+//; s/\s+$//. Justification follows.
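As a minimal sketch of the two forms side by side (the sub names trim and trim_compound are made up for illustration and do not come from the code being reviewed):

```perl
use strict;
use warnings;

# Hypothetical helper: trim using the two simple substitutions.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;    # strip leading whitespace, if there is any
    $s =~ s/\s+$//;    # strip trailing whitespace, if there is any
    return $s;
}

# Hypothetical helper: the compound form, for comparison.
sub trim_compound {
    my ($s) = @_;
    $s =~ s/^\s*|\s*$//g;
    return $s;
}

print "[", trim("  string  "), "]\n";             # prints [string]
print "[", trim_compound("  string  "), "]\n";    # prints [string]
```

Both return the same result; the difference is only in how much work the regexp engine does to get there.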

Conclusion:

Use of + instead of * means that a regexp which would do no effective work also fails to match. Failing to match when the work would be useless yielded some 3x to 4x improvement.
Use of multiple simpler patterns like s/^...//;s/...$// instead of compound patterns like s/^...|...$//g enabled boundary checking optimizations.
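The first point can be seen from the return value of s///, which is the number of substitutions made (false when there were none). This is a small sketch with made-up variable names, not code from the benchmark:

```perl
use strict;
use warnings;

my $plus = "string";    # no leading whitespace to remove
my $star = "string";

# With +, there is nothing to match, so no substitution happens.
my $n_plus = ($plus =~ s/^\s+//) || 0;

# With *, the pattern matches the empty string at position 0,
# so a substitution is performed even though it changes nothing.
my $n_star = ($star =~ s/^\s*//) || 0;

print "+ made $n_plus substitution(s), * made $n_star\n";
# prints: + made 0 substitution(s), * made 1
```

In both cases the string is left as "string"; the * version simply does a useless replacement to get there.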

Testing:

String length:

long:  more than 80 chars
short: fewer than 80 chars

Pre/postfixes:

pre/post: "  string  "
pre:      "  string"
post:     "string  "
base:     "string"

Coding styles:

*g: s/^\s*|\s*$//g
+g: s/^\s+|\s+$//g
2*: s/^\s*//
    s/\s*$//
2+: s/^\s+//
    s/\s+$//
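The post does not show its harness. A comparable run could be sketched as follows, assuming the core Benchmark module; the input strings, the iteration count and the hash names are assumptions, and only two of the eight input combinations are shown. The style labels follow the results table.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed inputs: "short" is under 80 chars, "long" is over.
my %input = (
    'short pre/post' => "  string  ",
    'long pre/post'  => "  " . ("string " x 20) . "string  ",
);

# The four coding styles, each working on a copy of its argument.
my %style = (
    '*g' => sub { my $s = shift; $s =~ s/^\s*|\s*$//g; $s },
    '+g' => sub { my $s = shift; $s =~ s/^\s+|\s+$//g; $s },
    '2*' => sub { my $s = shift; $s =~ s/^\s*//; $s =~ s/\s*$//; $s },
    '2+' => sub { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; $s },
);

# Build one benchmark case per input/style pair.
my %case;
for my $in (sort keys %input) {
    for my $st (sort keys %style) {
        my ($text, $code) = ($input{$in}, $style{$st});
        $case{"$in $st"} = sub { $code->($text) };
    }
}

# Fixed iteration count; a negative value would mean CPU seconds instead.
cmpthese(100_000, \%case);
```

Absolute rates will differ by machine and perl version, so only the relative ordering is worth comparing with the figures below.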

Calculated results (operations per second, fastest first; >> marks the 2+ style):

>>  short pre 2+      1638810/s
>>  short base 2+     1622457/s
>>  short post 2+     1351812/s
>>  short pre/post 2+ 1152253/s
>>  long base 2+       564477/s
>>  long pre 2+        534890/s
    short base +g      532709/s
    short post +g      502626/s
>>  long post 2+       501015/s
    short pre +g       479683/s
    short pre/post +g  465137/s
>>  long pre/post 2+   463741/s
    short base 2*      462448/s
    short pre 2*       456719/s
    short pre/post 2*  450081/s
    short post 2*      449661/s
    short base *g      394226/s
    short pre *g       384360/s
    short post *g      367736/s
    short pre/post *g  367624/s
    long post 2*       114832/s
    long base 2*       113787/s
    long pre 2*        110305/s
    long pre/post 2*   110169/s
    long post +g       100847/s
    long base +g        99830/s
    long pre +g         98871/s
    long pre/post +g    98331/s
    long base *g        87066/s
    long post *g        86520/s
    long pre *g         84080/s
    long pre/post *g    81429/s

Tagged as:

benchmark, profiling, regexp, string trimming

3 Comments

cbt | August 16, 2010 12:40 PM

I usually prefer the two statements because I find such code easier to maintain. Interesting to see there is a concrete benefit as well, though even the slowest clocked in at 81,000 operations per second. I wouldn't expect that to be a bottleneck in many situations.

Erez Schatz | August 16, 2010 1:10 PM

This approach is documented in Jeffrey Friedl's Mastering Regular Expressions (pp. 199-200):
"I use the language's equivalent of

s/^\s+//;

s/\s+$//;

because it's almost always fastest, and is certainly the easiest to understand."

Josh ben Jore | August 18, 2010 9:24 AM

The point is, the way to get optimal performance is to know what sorts of optimizations and work-avoidances your regexp engine provides and, in extremis, to look at the compiled form of your regexp.
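One way to look at the compiled form is the core re pragma in debug mode. This is only a sketch: the variable names are made up, and the dump format varies between perl versions.

```perl
use strict;
use warnings;

my ($simple, $compound);
{
    # Lexically scoped: regexps compiled inside this block have their
    # compiled program dumped to STDERR.
    use re 'debug';
    $simple   = qr/^\s+/;
    $compound = qr/^\s*|\s*$/;
}
```

The same dump is available from the command line with perl -Mre=debug -e '...'.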

