Three Sort Functions
Everyone (I’d
have to assume) has written the two basic sort functions
# sort strings
case-insensitively
sub insensitive {
return uc $a cmp uc $b || $a cmp $b;
}
and
# sort values
numerically
sub numeric {
return $a <=> $b;
}
Here are three
fun—and hopefully useful—sort functions.
This first one
sorts strings with
numeric suffixes (such as room numbers like “White 102”) first by the string
part, then numerically by the suffix. The function assumes the data has been vetted;
i.e. matches $rgx_strnum. Obviously
you may need to tweak $rgx_strnum to meet your specific
needs; you can also add calls to uc to sort the
string part case-insensitively.
our $rgx_strnum = qr/^([A-Za-z]+)\s*(\d+)$/;
sub strnum {
my ($aS, $aN) = $a =~ $rgx_strnum;
my ($bS, $bN) = $b =~ $rgx_strnum;
return $aS cmp $bS || $aN <=> $bN;
}
This function sort
strings ASCIIbetically except blank strings sort to the end—I am always
surprised how often this comes up. As above, you can add calls to uc for a case-insensitive blanks-last sort.
sub blanks_last {
return [[$a cmp $b, -1], [1, 0]]->[$a eq
""][$b eq ""];
}
Finally, this
function sorts strings that consist of dot-delimited numbers (like IP addresses
or RCS version numbers). Again, the function assumes the data has been vetted
and all values have the same number of fields (i.e. don’t try to sort data that
includes both IP addresses and RCS version numbers).
sub
dotted_nums {
my @bits = split m/\./, "$a.$b";
return (map { $bits[$_] <=> $bits[$_ +
@bits/2] || () }
(0 .. @bits/2-1))[0] || 0;
}
Most of these sort functions should not be used directly, but instead should be recoded to use the Shwarzian Transform, or the Guttman Rosler Transform.
Here is an example of a complex sort function that is used in perl core (in modern perls) to sort manifest files into a sane order, which uses the ST for efficiency.
Do not miss Sort::Key. It’s faster and leaves the code more readable than if you try to write a faster sort in pure Perl.
Certainly using a Schwartzian Transform (let alone a GRT) will be faster; in fact my testing indicates they are ~75% and ~90% faster, respectively.
This post is just intended to be some sort functions that I hope are interesting and maybe useful to someone. I'm going for more of an educational/interesting vibe on this blog.
My other comment is "know your data". In my environment, I primarily use Perl on data sets with less than 1,000 elements. (I do work with huge data sets but most of that processing is done in C/C++ or SQL.) The above experiment reveals that the impressive-sounding 75% savings of a Schwartzian Transform is cashed in for only about 0.1 seconds on my typical data. That's just not noticeable to a human.
I generally believe that the programmer's time, including maintenance time, is more valuable than the CPU's time, especially considering that CPUs are getting faster all the time, and programmers, as a whole, are not. So IMHO, when an inline sort function is easy to implement and leads to more-readable code, the CPU time savings of going with an ST/GRT are rarely worth the trade-off.
Certainly if you are working with enormous data--and want to stick with Perl--efficiency becomes way more important.