Perl Weekly Challenge 81: Frequency Sort

These are some answers to Week 81 of the Perl Weekly Challenge organized by Mohammad S. Anwar.

Frequency Sort

You are given a file named input.

Write a script to find the frequency of all the words.

It should print the result so that the first column of each line is the frequency, followed by all the words of that frequency arranged in lexicographical order. Also sort the lines in ascending order of frequency.

Input file

West Side Story

The award-winning adaptation of the classic romantic tragedy “Romeo and Juliet”. The feuding families become two warring New York City gangs, the white Jets led by Riff and the Latino Sharks, led by Bernardo. Their hatred escalates to a point where neither can coexist with any form of understanding. But when Riff’s best friend (and former Jet) Tony and Bernardo’s younger sister Maria meet at a dance, no one can do anything to stop their love. Maria and Tony begin meeting in secret, planning to run away. Then the Sharks and Jets plan a rumble under the highway—whoever wins gains control of the streets. Maria sends Tony to stop it, hoping it can end the violence. It goes terribly wrong, and before the lovers know what’s happened, tragedy strikes and doesn’t stop until the climactic and heartbreaking ending.

Note

For the sake of this task, please ignore the following in the input file:

. " ( ) , 's --

Output

1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York adaptation any anything at award-winning away become before begin best classic climactic coexist control dance do doesn't end ending escalates families feuding form former friend gains gangs goes happened hatred heartbreaking highway hoping in know love lovers meet meeting neither no one plan planning point romantic rumble run secret sends sister streets strikes terribly their two under understanding until violence warring what when where white whoever wins with wrong younger

2 Bernardo Jets Riff Sharks The by it led tragedy

3 Maria Tony a can of stop

4 to

9 and the

Frequency Sort in Raku

We slurp the full contents of the file into a string, remove the:

. " ( ) , 's --

characters or pairs of characters, use the words method to split the string into words and dump these words into a $h (for histogram) Bag. We then use the %summary hash to dispatch the words according to their frequency. We then iterate over the sorted keys of the hash and output the frequencies and the sorted words for each frequency.

my Str $str = slurp "./WestSideStory.txt";
$str ~~ s:g/<[."(),]>+//;
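$str ~~ s:g/\'s//;      # drop possessive and contracted 's
$str ~~ s:g/'--'/ /;    # a double hyphen separates two words (highway--whoever)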
my $h = bag $str.words; # histogram by words
my %summary;    # histogram by values
push %summary{$h{$_}}, $_ for $h.keys;
for %summary.keys.sort(+*) -> $k {    # hash keys are strings: sort them numerically
  say "$k ", %summary{$k}.sort.join(" ");
}

With the West Side Story summary provided above, the following output is displayed:

$ raku frequency-sort.raku
1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York adaptation any anything at award-winning away become before begin best classic climactic coexist control dance do doesn't end ending escalates families feuding form former friend gains gangs goes happened hatred heartbreaking highway hoping in know love lovers meet meeting neither no one plan planning point romantic rumble run secret sends sister streets strikes terribly their two under understanding until violence warring what when where white whoever wins with wrong younger
2 Bernardo Jets Riff Sharks The by it led tragedy
3 Maria Tony a can of stop
4 to
9 and the

Frequency Sort in Perl

We need to explicitly read the file. We slurp the file into an array and then convert the resulting array into a single string and finally process that string in a way quite similar to what we just did in Raku, except that we use a hash instead of a Bag.

use strict;
use warnings;
use feature "say";

my $input = "WestSideStory.txt";
open my $IN, "<", $input or die "Unable to open $input $!";
my @in = <$IN>;
chomp @in;
my $str = join " ", @in;
$str =~ s/[."(),]+//g;
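$str =~ s/'s//g;     # drop possessive and contracted 's
$str =~ s/--/ /g;    # a double hyphen separates two words (highway--whoever)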
my %histogram;
$histogram{$_}++ for split /\s+/, $str;
my %summary;
push @{$summary{$histogram{$_}}}, $_ for keys %histogram;
for my $k (sort {$a <=> $b} keys %summary) {
    say "$k ", join " ", sort @{$summary{$k}};
}

With the West Side Story summary provided above, this program displays the following output:

$ perl frequency-sort.pl
1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York adaptation any anything at award-winning away become before begin best classic climactic coexist control dance do doesn't end ending escalates families feuding form former friend gains gangs goes happened hatred heartbreaking highway hoping in know love lovers meet meeting neither no one plan planning point romantic rumble run secret sends sister streets strikes terribly their two under understanding until violence warring what when where white whoever wins with wrong younger
2 Bernardo Jets Riff Sharks The by it led tragedy
3 Maria Tony a can of stop
4 to
9 and the
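
To check the logic on something smaller than the full summary, the same steps can be wrapped in a function that takes the text as a string and returns the report lines. This is only a sketch under stated assumptions: frequency_lines is a hypothetical name that does not appear in the script above, and the code assumes that a double hyphen separates two words, so it replaces it with a space rather than deleting it.

```perl
use strict;
use warnings;
use feature "say";

# Hypothetical helper (not part of the original script): takes the text as a
# string and returns the report lines instead of printing them.
sub frequency_lines {
    my ($str) = @_;
    $str =~ s/[."(),]+//g;    # drop the punctuation the task says to ignore
    $str =~ s/'s//g;          # drop 's
    $str =~ s/--/ /g;         # assumption: a double hyphen separates two words
    my %histogram;            # word => number of occurrences
    $histogram{$_}++ for grep { length } split /\s+/, $str;
    my %summary;              # number of occurrences => list of words
    push @{ $summary{ $histogram{$_} } }, $_ for keys %histogram;
    my @lines;
    for my $k (sort { $a <=> $b } keys %summary) {
        push @lines, "$k " . join(" ", sort @{ $summary{$k} });
    }
    return @lines;
}

say for frequency_lines(q{Tony's sister (Maria) met Tony. "Maria" ran--Tony ran.});
```

With this sample sentence, the sketch prints three lines: "1 met sister", "2 Maria ran" and "3 Tony". Uppercase words come first within a line because Perl's default sort compares strings by character code.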

Wrapping up

Next week's Perl Weekly Challenge will start soon. If you want to participate in this challenge, please check https://perlweeklychallenge.org/ and make sure you answer the challenge before 23:59 BST (British summer time) on Sunday, October 18, 2020. And, please, also spread the word about the Perl Weekly Challenge if you can.

About laurent_r

I am the author of the "Think Perl 6" book (O'Reilly, 2017) and I blog about the Perl 5 and Raku programming languages.