Perl Weekly Challenge 62: Sort Email Addresses
These are some answers to Week 62 of the Perl Weekly Challenge organized by Mohammad S. Anwar.
Task 1: Sort Email Addresses
Write a script that takes a list of email addresses (one per line) and sorts them first by the domain part of the email address, and then by the part to the left of the @ (known as the mailbox).
Note that the domain is case-insensitive, while the mailbox part is case sensitive. (Some email providers choose to ignore case, but that’s another matter entirely.)
If your script is invoked with arguments, it should treat them as file names and read them in order, otherwise your script should read email addresses from standard input.
Bonus
Add a -u option which only includes unique email addresses in the output, just like sort -u.
Example:
If given the following list:
name@example.org
rjt@cpan.org
Name@example.org
rjt@CPAN.org
user@alpha.example.org
Your script (without -u) would return:
user@alpha.example.org
rjt@cpan.org
rjt@CPAN.org
Name@example.org
name@example.org
With -u, the script would return:
user@alpha.example.org
rjt@CPAN.org
Name@example.org
name@example.org
Sorting Email Addresses in Raku
The first idea that came to my mind was to write a cmp-addr comparison subroutine, to be passed as a parameter to the sort built-in function, leading to something like this:
sub cmp-addr (Str $addr1, Str $addr2) {
    my ($box1, $domain1) = split /'@'/, $addr1;
    my ($box2, $domain2) = split /'@'/, $addr2;
    $domain1.uc leg $domain2.uc or $box1 leg $box2;
}
my @sorted = sort &cmp-addr, @unsorted;
This subroutine takes two email addresses as parameters, splits them into mailbox and domain components, and returns the result of the lexicographic comparison between the uppercased domains; if the domains are the same, it returns the result of the lexicographic comparison between the mailbox components. When the first parameter to the sort function is a subroutine taking two parameters, sort uses that subroutine as a comparator to decide which of the two items should come first in the sorted list.
This works fine, but when thinking about the -u option, I figured that I would need to do more or less the same thing again to remove duplicate addresses from the list. So I wanted to see whether I could use the same subroutine to take care of the comparison both for the ordering and for the duplicate removal. The great thing with Raku is that, when you're looking for something like that, there is usually a solution. When the first parameter to the sort function is a subroutine that accepts only one argument, sort uses that subroutine to canonicalize or normalize the items before performing a standard comparison (implicitly using the cmp smart comparison operator on the normalized values). In addition, the normalized values are cached, so that normalization occurs only once per item in the input list. The unique built-in can also take a subroutine as a parameter (via the :as named argument) to normalize the input when looking for duplicates. As a result, the normalize subroutine in the code below can be used both for sorting and for removing duplicates.
use v6;

sub normalize (Str $addr) {
    my ($box, $domain) = split /'@'/, $addr;
    $domain.uc ~ '@' ~ $box;
}
sub process-addresses (@addresses, Bool $unique) {
    @addresses = @addresses.unique(:as(&normalize)) if $unique;
    my @sorted = sort &normalize, @addresses;
    .say for @sorted;
}
multi sub MAIN (Bool :u(:$unique) = False, *@files) {
    my @addresses = @files.map(*.IO.lines).flat;
    process-addresses @addresses, $unique;
}
multi sub MAIN (Bool :u(:$unique) = False) {
    my @addresses = $*IN.lines;
    process-addresses @addresses, $unique;
}
Using the following addresses.txt input file:
name@example.org
rjt@cpan.org
Name@example.org
rjt@CPAN.org
rjt@cpan.org
user@alpha.example.org
rjt@cpan.org
we get the following results:
$ perl6 sort_addresses.p6 addresses.txt
user@alpha.example.org
rjt@cpan.org
rjt@CPAN.org
rjt@cpan.org
rjt@cpan.org
Name@example.org
name@example.org
$ perl6 sort_addresses.p6 -u addresses.txt
user@alpha.example.org
rjt@cpan.org
Name@example.org
name@example.org
The same tests using the content of the addresses.txt input file as standard input:
$ cat addresses.txt | perl6 sort_addresses.p6
user@alpha.example.org
rjt@cpan.org
rjt@CPAN.org
rjt@cpan.org
rjt@cpan.org
Name@example.org
name@example.org
$ cat addresses.txt | perl6 sort_addresses.p6 -u
user@alpha.example.org
rjt@cpan.org
Name@example.org
name@example.org
Note that, since we're sorting the addresses anyway, we could use the squish built-in method instead of unique: squish only removes adjacent duplicates, but that is equivalent to unique when the input is sorted. The process-addresses subroutine would be rewritten as follows:
sub process-addresses (@addresses, Bool $unique) {
    my @sorted = sort &normalize, @addresses;
    @sorted = @sorted.squish(:as(&normalize)) if $unique;
    .say for @sorted;
}
This produces the same output as earlier. With large input, squish should presumably be faster than unique, but I haven't bothered to run benchmarks with large input files.
Sorting Email Addresses in Perl
In Perl, the sort built-in function is slightly less expressive than in Raku, so we will use a technique known as the Schwartzian Transform to do the sort:
my @sorted = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] }
             map  { [$_, normalize $_] } @addresses;
This code is a data pipeline and has to be read from right to left and from bottom to top. It starts with the input data at the bottom right, @addresses. The map on the bottom line transforms each item into an array reference containing the original item and a normalized version of it. Those array references are fed into the sort function on the previous line, which uses the second element (i.e. the normalized form) of each array reference as the sort key. The output of the sort is fed to another map, whose role is simply to extract the original values back from the sorted array references. It may be argued that the technique used here is a variation of the Schwartzian transform known as the Guttman-Rosler transform, but it is really a kind of combination of the Schwartzian and Guttman-Rosler transforms. Removing duplicates (when -u is requested) is done with a grep after the sort.
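For comparison, a pure Guttman-Rosler transform would avoid the comparator block entirely: pack the normalized key and the original address into a single string, sort with the default string comparison, then strip the key back off. Here is a minimal sketch with a hypothetical three-address sample, assuming addresses never contain a NUL byte (used here as the key/value separator):

```perl
use strict;
use warnings;
use feature qw/say/;

# Same normalization idea as in the main program:
# uppercased domain first, then the case-sensitive mailbox
sub normalize {
    my ($box, $domain) = split /@/, shift;
    return uc($domain) . '@' . $box;
}

my @addresses = ('name@example.org', 'rjt@cpan.org', 'Name@example.org');

# Guttman-Rosler transform: pack "key\0original", use the default
# (comparator-less) sort, then unpack the original value
my @sorted = map  { (split /\0/)[1] }
             sort
             map  { normalize($_) . "\0" . $_ } @addresses;

say for @sorted;
```

Because the default sort compares plain strings, this variant avoids dereferencing array refs in the comparator, which is where the GRT usually gains speed over the classic Schwartzian transform.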
use strict;
use warnings;
use feature qw/say/;

sub normalize {
    my $addr = shift;
    my ($box, $domain) = split /@/, $addr;
    # Uppercase only the domain; the mailbox part stays case-sensitive
    return uc($domain) . '@' . $box;
}

my ($unique, @files);
my $first_param = shift;
if (defined $first_param and $first_param eq "-u") {
    $unique = 1;
    @files = @ARGV;
} else {
    $unique = 0;
    @files = ($first_param, @ARGV);
}
my @addresses;
for my $file (@files) {
    open my $IN, "<", $file or die "Unable to open $file: $!";
    push @addresses, <$IN>;
}
chomp @addresses;
my @sorted = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] }
             map  { [$_, normalize $_] } @addresses;
if ($unique) {
    my $last = "";
    @sorted = grep {
        my $norm = normalize $_;
        my $keep = $norm ne $last;
        $last = $norm;
        $keep;
    } @sorted;
}
say for @sorted;
Running this program (with the same input file as in Raku) displays the following output:
$ perl sort_addresses.pl -u addresses.txt
user@alpha.example.org
rjt@cpan.org
Name@example.org
name@example.org
$ perl sort_addresses.pl addresses.txt
user@alpha.example.org
rjt@cpan.org
rjt@CPAN.org
rjt@cpan.org
rjt@cpan.org
Name@example.org
name@example.org
Note that I was running late, so I didn't have time to implement a version using standard input. For the same reason, I wasn't able to perform task 2 (N-queens).
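That said, Perl's diamond operator would make a standard-input version almost free: <> reads from the files named in @ARGV, or from STDIN when @ARGV is empty. Here is one possible sketch (untested against the challenge file; the temporary sample file below is purely illustrative and stands in for a real command-line argument):

```perl
use strict;
use warnings;
use feature qw/say/;
use File::Temp qw/tempfile/;

sub normalize {
    my ($box, $domain) = split /@/, shift;
    return uc($domain) . '@' . $box;
}

# Take a leading -u flag off the argument list, if present;
# <> then reads the files left in @ARGV, or STDIN when @ARGV is empty
my $unique = @ARGV && $ARGV[0] eq '-u';
shift @ARGV if $unique;

# Self-contained demonstration: a temporary file with sample
# addresses stands in for a real command-line argument
my ($fh, $file) = tempfile();
print $fh "name\@example.org\nrjt\@cpan.org\nName\@example.org\n";
close $fh;
@ARGV = ($file) unless @ARGV;

chomp(my @addresses = <>);
my @sorted = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] }
             map  { [$_, normalize $_] } @addresses;
if ($unique) {
    my $last = "";
    @sorted = grep { my $n = normalize $_; my $k = $n ne $last; $last = $n; $k } @sorted;
}
say for @sorted;
```

With <>, the files-versus-STDIN requirement of the task is handled by Perl itself, with no explicit open needed.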
Wrapping up
The next Perl Weekly Challenge is due to start soon. If you want to participate in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 23:59 BST (British summer time) on Sunday, June 7, 2020. And, please, also spread the word about the Perl Weekly Challenge if you can.