Three Sort Functions

By morandimus on January 27, 2017 2:40 PM under useful?

Everyone (I’d have to assume) has written the two basic sort functions

# sort strings case-insensitively
sub insensitive {
return uc $a cmp uc $b || $a cmp $b;
}

and

# sort values numerically
sub numeric {
return $a <=> $b;
}

Here are three fun—and hopefully useful—sort functions.

This first one sorts strings with numeric suffixes (such as room numbers like “White 102”) first by the string part, then numerically by the suffix. The function assumes the data has been vetted; i.e. matches $rgx_strnum. Obviously you may need to tweak $rgx_strnum to meet your specific needs; you can also add calls to uc to sort the string part case-insensitively.

our $rgx_strnum = qr/^([A-Za-z]+)\s*(\d+)$/;
sub strnum {
  my ($aS, $aN) = $a =~ $rgx_strnum;
  my ($bS, $bN) = $b =~ $rgx_strnum;
  return $aS cmp $bS || $aN <=> $bN;
}

This function sort strings ASCIIbetically except blank strings sort to the end—I am always surprised how often this comes up. As above, you can add calls to uc for a case-insensitive blanks-last sort.

sub blanks_last {
return [[$a cmp $b, -1], [1, 0]]->[$a eq ""][$b eq ""];
}

Finally, this function sorts strings that consist of dot-delimited numbers (like IP addresses or RCS version numbers). Again, the function assumes the data has been vetted and all values have the same number of fields (i.e. don’t try to sort data that includes both IP addresses and RCS version numbers).

sub dotted_nums {
  my @bits = split m/\./, "$a.$b";
  return (map { $bits[$_] <=> $bits[$_ + @bits/2] || () }
              (0 .. @bits/2-1))[0] || 0;
}

3 comments

Tagged as:

sort, useful

3 Comments

demerphq | January 28, 2017 11:58 PM | Reply

Most of these sort functions should not be used directly, but instead should be recoded to use the Shwarzian Transform, or the Guttman Rosler Transform.

Here is an example of a complex sort function that is used in perl core (in modern perls) to sort manifest files into a sane order, which uses the ST for efficiency.

# Try to get a sane sort. case insensitive, more or less
# sorted such that path components are compared independently,
# and so that lib/Foo/Bar sorts before lib/Foo-Alpha/Baz
# and so that lib/Foo/Bar.pm sorts before lib/Foo/Bar/Alpha.pm
# and so that configure and Configure sort together.
sub sort_manifest {
  return
  # case insensitive sorting of directory components independently.
  map { $_->[0] } # extract the full line
  sort {
    $a->[1] cmp $b->[1] || # sort in order of munged filename
    $a->[0] cmp $b->[0]    # then by the exact text in full line
   }
   map {
     # split out the filename and the description
     my ($f) = split /\s+/, $_, 2;
     # lc the filename so Configure and configure
     # sort together in the list
     my $m= lc $f; # $m for munged
     # replace slashes by nulls, this makes short
     # directory names sort before
     # longer ones, such as "foo/" sorting before "foo-bar/"
     $m =~ s!/!\0!g;
     # replace the extension (only one) by 
     # null null extension.
     # this puts any foo/blah.ext before any 
     # files in foo/blah/
     $m =~ s!(\.[^.]+\z)!\0\0$1!;
     # return the original string, 
     # and the munged filename
     [ $_, $m ];
  } @_;
}

Aristotle | January 29, 2017 1:38 PM | Reply

Do not miss Sort::Key. It’s faster and leaves the code more readable than if you try to write a faster sort in pure Perl.

morandimus replied to comment from demerphq | January 30, 2017 11:59 AM | Reply

Certainly using a Schwartzian Transform (let alone a GRT) will be faster; in fact my testing indicates they are ~75% and ~90% faster, respectively.

This post is just intended to be some sort functions that I hope are interesting and maybe useful to someone. I'm going for more of an educational/interesting vibe on this blog.

My other comment is "know your data". In my environment, I primarily use Perl on data sets with less than 1,000 elements. (I do work with huge data sets but most of that processing is done in C/C++ or SQL.) The above experiment reveals that the impressive-sounding 75% savings of a Schwartzian Transform is cashed in for only about 0.1 seconds on my typical data. That's just not noticeable to a human.

I generally believe that the programmer's time, including maintenance time, is more valuable than the CPU's time, especially considering that CPUs are getting faster all the time, and programmers, as a whole, are not. So IMHO, when an inline sort function is easy to implement and leads to more-readable code, the CPU time savings of going with an ST/GRT are rarely worth the trade-off.

Certainly if you are working with enormous data--and want to stick with Perl--efficiency becomes way more important.

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About morandimus

My real name is Jeremy Holland. I've been a programmer for 25 years, using primarily Perl since 2000.

More info »

Thoughts::Perl::Random