Benchmarking string trimming
Clever Regexps vs Multiple Simple Regexps:
While reading some code I ran across the expression s/^\s*|\s*$//g, which is a trim function. It is not the optimal way to write this. The optimal way is two simpler expressions: s/^\s+//; s/\s+$//. Justification follows.
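As a minimal sketch, the two-statement form can be wrapped in a small helper. The name trim and the choice to work on a copy are assumptions made for illustration; they are not from the code that prompted this post.

```perl
use strict;
use warnings;

# Hypothetical helper: trims a copy of its argument and returns it.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;   # strip leading whitespace; fails fast if there is none
    $s =~ s/\s+$//;   # strip trailing whitespace; fails fast if there is none
    return $s;
}

print '[', trim("  string  "), "]\n";   # prints [string]
```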
Conclusion:
Use of + instead of * means that regexps which would do no effective work also fail to match. Failing to match when the work would be useless yielded some 3x to 4x improvement.
Use of multiple simpler patterns like s/^...//; s/...$// instead of compound patterns like s/^...|...$//g enabled boundary checking optimizations.
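The first point can be checked directly, since s/// returns a true value only when it matched. The sketch below uses the "base" case, where there is nothing to trim; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $base = "string";   # nothing to trim

# With *, the pattern matches the empty string, so a (useless) substitution still runs.
my $copy = $base;
my $star_matched = ($copy =~ s/^\s*//) ? 1 : 0;

# With +, at least one whitespace character is required, so the match simply fails.
$copy = $base;
my $plus_matched = ($copy =~ s/^\s+//) ? 1 : 0;

print "star: $star_matched, plus: $plus_matched\n";   # prints star: 1, plus: 0
```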
Testing:
String length:
long: +80 chars
short: -80 chars
Pre/postfixes:
pre/post: " string "
pre: " string"
post: "string "
base: "string"
Coding styles:
*g: s/^\s*|\s*$//g
+g: s/^\s+|\s+$//g
2*: s/^\s*//; s/\s*$//
2+: s/^\s+//; s/\s+$//
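The original benchmark script is not shown, so the following is only a sketch of how such a comparison could be set up with the core Benchmark module. The input strings, the iteration count, and the hash names are assumptions; the style labels follow the results listing.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The four coding styles; each trims a copy of its argument and returns it.
my %styles = (
    '*g' => sub { my $s = shift; $s =~ s/^\s*|\s*$//g; $s },
    '+g' => sub { my $s = shift; $s =~ s/^\s+|\s+$//g; $s },
    '2*' => sub { my $s = shift; $s =~ s/^\s*//; $s =~ s/\s*$//; $s },
    '2+' => sub { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; $s },
);

# Assumed sample inputs: one short, one longer than 80 characters.
my %inputs = (
    'short pre/post' => "  string  ",
    'long pre/post'  => "  " . join(' ', ('string') x 20) . "  ",
);

# A fixed count keeps this sketch quick; a negative count such as -3
# tells Benchmark to run each case for at least that many CPU seconds.
my $count = 200_000;

for my $label (sort keys %inputs) {
    my $input = $inputs{$label};
    my %tests;
    for my $name (keys %styles) {
        my $f = $styles{$name};
        $tests{$name} = sub { $f->($input) };
    }
    print "$label\n";
    cmpthese($count, \%tests);
}
```

Absolute rates will differ from the figures reported here depending on hardware and Perl version; only the relative ordering is worth comparing.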
Calculated results (sorted fastest first; >> marks the 2+ rows):
>> short pre 2+ 1638810/s
>> short base 2+ 1622457/s
>> short post 2+ 1351812/s
>> short pre/post 2+ 1152253/s
>> long base 2+ 564477/s
>> long pre 2+ 534890/s
short base +g 532709/s
short post +g 502626/s
>> long post 2+ 501015/s
short pre +g 479683/s
short pre/post +g 465137/s
>> long pre/post 2+ 463741/s
short base 2* 462448/s
short pre 2* 456719/s
short pre/post 2* 450081/s
short post 2* 449661/s
short base *g 394226/s
short pre *g 384360/s
short post *g 367736/s
short pre/post *g 367624/s
long post 2* 114832/s
long base 2* 113787/s
long pre 2* 110305/s
long pre/post 2* 110169/s
long post +g 100847/s
long base +g 99830/s
long pre +g 98871/s
long pre/post +g 98331/s
long base *g 87066/s
long post *g 86520/s
long pre *g 84080/s
long pre/post *g 81429/s
I usually prefer the two statements because I find such code easier to maintain. Interesting to see there is a concrete benefit as well, though even the slowest clocked in at 81,000 operations per second. I wouldn't expect that to be a bottleneck in many situations.
This approach is documented in Jeffrey Friedl's Mastering Regular Expressions (pp 199-200): "I use the language's equivalent of s/^\s+//; s/\s+$// because it's almost always fastest, and is certainly the easiest to understand."
The point is, the way to get optimal performance is to know what sorts of optimizations and work-avoidances your regexp engine provides and, in extremis, to look at the compiled form of your regexp.
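One way to look at the compiled form is the re pragma's debug mode, which dumps the compiled program and the matching trace to STDERR. This is a minimal sketch; the exact dump format varies between Perl versions, so no output is shown.

```perl
use strict;
use warnings;
use re 'debug';   # lexically scoped: regexps compiled below are dumped to STDERR

my $s = "string";

# Compare the dump for this pattern with the one for s/^\s*// to see
# what the engine can and cannot rule out before it starts matching.
$s =~ s/^\s+//;
```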