Perl Weekly Challenge 62: Sort Email Addresses

These are some answers to the Week 62 of the Perl Weekly Challenge organized by Mohammad S. Anwar.

Task 1: Sort Email Addresses

Write a script that takes a list of email addresses (one per line) and sorts them first by the domain part of the email address, and then by the part to the left of the @ (known as the mailbox).

Note that the domain is case-insensitive, while the mailbox part is case sensitive. (Some email providers choose to ignore case, but that’s another matter entirely.)

If your script is invoked with arguments, it should treat them as file names and read them in order, otherwise your script should read email addresses from standard input.

Bonus

Add a -u option which only includes unique email addresses in the output, just like sort -u.

Example:

If given the following list:

name@example.org
rjt@cpan.org
Name@example.org
rjt@CPAN.org
user@alpha.example.org

Your script (without -u) would return:

user@alpha.example.org
rjt@cpan.org
rjt@CPAN.org
Name@example.org
name@example.org

With -u, the script would return:

user@alpha.example.org
rjt@CPAN.org
Name@example.org
name@example.org

Sorting Email Addresses in Raku

The first idea that came to my mind was to write the cmp-addr comparison subroutine, to be passed as a parameter to the sort built-in function, leading to something like this:

sub cmp-addr (Str $addr1, Str $addr2) {
    my ($box1, $domain1) = split /'@'/, $addr1;
    my ($box2, $domain2) = split /'@'/, $addr2;
    $domain1.uc leg $domain2.uc or $box1 leg $box2;
}
my @sorted = sort &cmp-addr, @unsorted;

This subroutine takes two email addresses as parameters, splits them into mailbox and domain components, and returns the result of the lexicographic comparison between the uppercase domains; if the domains are the same, then it returns the the result of the lexicographic comparison between the mailbox components. When the first parameter to the sort function is a subroutine taking two parameters, then sort uses that subroutine as a comparator to know which of the two parameters should come first in the sorted list.

This works fine, but when thinking about the -u option, I figured that I would need to do more or less the same thing again do remove duplicate addresses from the list. So, I wanted to see if I could try to use the same subroutine to take care of the comparison for the ordering and the duplicate removal. The great thing with Raku is that, when you’re looking for something like that, there is usually a solution. When the first parameter to the sort function is a subroutine that accepts only one argument, then sort uses that subroutine to canonicalize or normalize the parameters before performing a standard comparison (using implicitly the cmp smart comparison operator between the normalized parameters. In addition, the normalized values are cached, so that normalization occurs only once per item in the input list. In addition, the unique built-in can also use a subroutine as a parameter to perform input normalization in order to find duplicates. As a result, our normalize subroutine in the code below can be used both for sorting and for removing duplicates.

use v6;

sub normalize (Str $addr) {
    my ($box, $domain) = split /'@'/, $addr;
    $domain.uc ~ '@' ~ $box;
}
sub process-addresses( @addresses, Bool $unique) {
    @addresses = @addresses.unique(:as(&normalize)) if $unique;
    my @sorted = sort &normalize, @addresses;
    .say for @sorted;
}
multi sub MAIN (Bool :u(:$unique) = False, *@files) {
    my @addresses = @files.IO.lines;
    process-addresses @addresses, $unique;
}
multi sub MAIN (Bool :u(:$unique) = False) {
    my @addresses = $*IN.lines;
    process-addresses @addresses, $unique;
}

Using the following addresses.txt input file:

name@example.org
rjt@cpan.org
Name@example.org
rjt@CPAN.org
rjt@cpan.org
user@alpha.example.org
rjt@cpan.org

we get the following results:

$  perl6 sort_addresses.p6  addresses.txt
user@alpha.example.org
rjt@cpan.org
rjt@CPAN.org
rjt@cpan.org
rjt@cpan.org
Name@example.org
name@example.org

$  perl6 sort_addresses.p6 -u addresses.txt
user@alpha.example.org
rjt@cpan.org
Name@example.org
name@example.org

The same tests using the content of the addresses.txt input file as standard input:

$ cat addresses.txt | perl6 sort_addresses.p6
user@alpha.example.org
rjt@cpan.org
rjt@CPAN.org
rjt@cpan.org
rjt@cpan.org
Name@example.org
name@example.org

$ cat addresses.txt | perl6 sort_addresses.p6 -u
user@alpha.example.org
rjt@cpan.org
Name@example.org
name@example.org

Note that, since we’re sorting the addresses anyway, we could use the squish built-in method instead of unique. squish only removes adjacent duplicates, but is equivalent to unique with sorted input. The process-addresses subroutine would be rewritten as follows:

sub process-addresses( @addresses, Bool $unique) {
    my @sorted = sort &normalize, @addresses;
    @sorted = @sorted.squish(:as(&normalize)) if $unique;
    .say for @sorted;
}

This produces the same output as earlier. With large input, squish should presumably be faster than unique, but I haven’t bothered to run benchmarks with large input files.

Sorting Email Addresses in Perl

In Perl, the sort built-in function is slightly less expressive than in Raku, so we will use a technique known as the Schwartzian Transform to do the sort:

my @sorted = map { $_->[0] }
    sort { $a->[1] cmp $b->[1] }
    map { [$_, normalize $_] } @addresses;

This code is a data pipeline and has to be read from right to left and from bottom up. It starts with the input data at the bottom right, @addresses. The map on the bottom line transforms the array items into array references containing the original data item and a normalized version of it. Those array references are fed into the sort function on the previous line, which uses the second item (i.e. the normalized form) of each array reference as the sort key. The output of the sort is fed to another map, whose role is simply to extract back the original values from the sorted array references. It may be argued that the technique used here is a variation of the Schwartzian transform known as the Guttman Rosler transform, but it is really a kind of combination of the Schwartian and Guttman-Rosler transforms.

Removing duplicates (when unique is required) is done with a grep after the sort.

use strict;
use warnings;
use feature qw /say/;

sub normalize {
    my $addr = shift;
    my ($box, $domain) = split /@/, $addr;
    return (uc $domain . "@" . $box);
}
my ($unique, @files);
my $first_param = shift;
if ($first_param eq "-u") {
    $unique = 1;
    @files = @ARGV;
} else { 
    $unique = 0;
    @files = ($first_param, @ARGV);
}
my @addresses;
for my $file(@files) {
    open my $IN, "<", $file or die "Unable to open $file";
    push @addresses, <$IN>;
}
chomp @addresses;
my @sorted = map { $_->[0] }
    sort { $a->[1] cmp $b->[1] }
    map { [$_, normalize $_] } @addresses;
if ($unique) {
    my $last = "";
    @sorted = grep { my $bool = $last ne normalize $_; $last = normalize $_; $bool} @sorted;
}
say for @sorted;

Running this program (with the same input file as in Raku) displays the following output:

$ perl sort_addresses.pl -u addresses.txt
user@alpha.example.org
rjt@cpan.org
name@example.org

$ perl sort_addresses.pl  addresses.txt
user@alpha.example.org
rjt@cpan.org
rjt@CPAN.org
rjt@cpan.org
rjt@cpan.org
name@example.org
Name@example.org

Note that I was running late, so I don’t have time to implement a version using standard input. For the same reason, I wasn’t able to perform task 2 (N-queens).

Wrapping up

The next week Perl Weekly Challenge is due to start soon. If you want to participate in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 23:59 BST (British summer time) on Sunday, June 7, 2020. And, please, also spread the word about the Perl Weekly Challenge if you can.

Leave a comment

About laurent_r

user-pic I am the author of the "Think Perl 6" book (O'Reilly, 2017) and I blog about the Perl 5 and Raku programming languages.