Perl101: finding words in words

By Ovid on October 27, 2010 9:47 AM

Recently a friend of mine back in the US mentioned an art project she was working on. She was looking for words which are composites of two words. She didn't necessarily want obvious composites like "bluebird", but less obvious composites that can be worked into her art project. For example, "homeless" could be "home less", or "garbage" to "garb age". A few people struggled to come up with examples. I came up with /usr/share/dict/words and a few lines of Perl. I use a few nifty idioms that every Perler should be familiar with.

First, I made a few assumptions which may or may not be appropriate for you. I'm using Perl 5.10.1 in this example, but it's easy enough to tweak for an older Perl.

I start with the following two lines:

#!/usr/bin/env perl
use Modern::Perl;

By using /usr/bin/env perl in my shebang instead of /usr/bin/perl, the program will be run with whatever perl you get when you type perl -e on the command line. This is useful if you don't necessarily know (or care) about the path to your Perl executable.

The use Modern::Perl; bit enables strict, warnings, and 5.10 features (plus the C3 MRO which is useless for this example). You generally want these enabled for all code and when you don't, you should be able to explain why.

Next, we have the following:

use constant MIN_WORD_LENGTH => 2;

@ARGV = '/usr/share/dict/words' unless @ARGV;

my %is_word;
while (<>) {
    chomp;
    next unless length >= MIN_WORD_LENGTH * 2 && /^[[:lower:]]/;
    $is_word{$_} = 1;
}

The first thing I did was assume that the mininum word length was 2, because I thought doubles like "irate = i + rate" were not interesting. I made it a constant, but you might want to look at Getopt::Long or similar module to make it controllable from the command line.

Next, you'll see that we set @ARGV to a default value if it's not already set. Since @ARGV merely contains whatever arguments are in the command line, if you want to use a different wordlist, you just have to do this:

./find_double_words.pl /path/to/my/words

But that implies you understand the while (<>) {...} construct. From perldoc perlopentut, we have this:

One of the most common uses for open is one you never even notice. When you process the ARGV filehandle using <ARGV>, Perl actually does an implicit open on each file in @ARGV.

Then it goes on to show that this means (for most cases), that you just need this to process each line of each file in @ARGV:

while (<>) {
    # do something with $_
}

What I did with it in my code is to ensure that each word can be a doubled word (min length times two) and isn't a proper noun (starts with a lower case letter). When I store the word in the %is_word hash for easy lookup later. As I only have a quarter million entries in my word file, this is fine. A different strategy would be useful if you had a much larger file to process (or were running this on an antique computer).

Then, to find the words in question:

for my $word ( sort keys %is_word ) {
    for my $i ( MIN_WORD_LENGTH .. length($word) - MIN_WORD_LENGTH ) {
        my $first = substr $word, 0, $i;
        my $second = substr $word, $i;
        if ( $is_word{$first} and $is_word{$second} ) {
            say "$word = $first + $second";
        }
    }
}

We sort the word list because I wanted results in alphabetical order. The inner for loop is more interesting, though. What we are doing here is finding a place to split a word into two parts and slide forward, breaking the word into two parts each time, checking that each part is a word.

So take the obsolete word "withinside" (a synonym for "inside"). That has a length of 10 characters, so our index ($i) will range from 2 to 8. We get the following word breakdowns:

# $first    $second
# wi      thinside
# wit      hinside
# with      inside *
# withi      nside
# within      side *
# withins      ide
# withinsi      de

As you can see, we have two ways of breaking this word down. As we slide down the $first and $second words, you'll note why we named the hash as we did:

if ( $is_word{$first} and $is_word{$second} ) {
    # ...
}

This reads very naturally and anyone should be able to see, at a glance, what's going on here. Variable naming is extremely important. If I had simply been lazy, we might have had this:

if ( $dict{$first} and $dict{$second} ) {
    # ...
}

That's not nearly as clear. What are the values? Definitions? Further, because I prefixed the variable name with "is_", it should be clear that it's boolean in nature. Some would argue that properly we should have done this:

if ( exists $is_word{$first} and exists $is_word{$second} ) {
    # ...
}

Understanding the exists keyword is very important and merely checking for "truth" may be foolish. For example:

if ( $age_in_days{$part_no} ) {
    # item is in inventory
}

What if it's the first day that $part_no has been here and the value is 0 (zero)? Zero evaluates as false and there's a good chance that you have a bug here. Don't adhere to any hard and fast rules here. You'll have to exercise good judgment and decide for yourself what's the best approach. A truly defensive programmer would have used exists in my code. An experienced programmer hacking a quick, one-off script is more likely to focus on getting the job done.

We have everything in place to actually run the code, but when I ran it, the end of the output had things like this:

zoonerythrin = zoon + erythrin
zoonotic = zoon + otic
zootomically = zootomic + ally
zwitterionic = zwitter + ionic
zymotically = zymotic + ally

That didn't seem helpful to me, so after loading the dictionary but before entering the nested for loops, I added the following:

my @to_ignore = qw/
  acetyl
  erythrin
  lessness
  ness
  ology
  ological
  ologist
  pfund
  prescutal
  zibet
  zinco
  zoll
  zoogeographic
  zoographic
  zoon
  zwitter
  zymotic
  /;

delete @is_word{@to_ignore};

(I had a lot more words to ignore, but this just gives you a quick idea).

You probably want to read up on delete if you're not familiar with what I've done. I've deleted from a hash slice, not just a single scalar element. When accessing a hash, you can tell it's a hash because it uses curly braces ({}) instead of square braces ([]):

my @words = @is_word{@some_vals}; # must be a hash
my @words = @is_word[@some_vals]; # must be an array

By using an at sign (@) for the sigil, you're saying that you want to access several elements at a time. If you had a dollar sign ($) for the sigil, you'd only access one element at a time:

my $word = $is_word{$some_val}; # must be a hash
my $word = $is_word[$some_val]; # must be an array

This is confusing to many new Perl programmers, but it's one of the first things that you must understand about the language. It's been changed in Perl 6, but this is about Perl 5, so you'll have to learn to live with it.

What this means is that you could have written the above delete as follows:

delete $is_word{$_} foreach @to_ignore;

Or:

foreach my $ignore (@to_ignore) {
    delete $is_word{$ignore};
}

There's a lot more you could do with this simple script, but it was a fun way to help out a friend's art project and is a quick and easy demonstration of simple hacking in Perl.

6 comments

Tagged as:

perl101

6 Comments

jnareb.openid.pl | October 27, 2010 4:54 PM

You should limit your words to check for length longer than MINWORDLENGTH * 2 in a for loop, not in while loop when reading words.txt file. As it currently stands your program considers only splits into parts each of which is 4 characters or more ($is_word{$first} can be only true if $first is 4 letter word).

Limbic~Region | October 27, 2010 7:29 PM

Check out Challenge: "Words" In A String

I later developed it into a full application with options like minimum word length, allow non-word tokens, etc.

If interested, let me know.

A. Sinan Unur | October 28, 2010 1:03 AM

This is somewhat related to an interesting problem from Stackoverflow which I blogged about.

Do you have any suggestions for that problem?

David Cantrell | October 28, 2010 11:38 AM

I've come across at least one platform where /usr/bin/env didn't exist: on Unicos, env is somewhere else in the $PATH, but I forget exactly where. I've found this to be more portable, at the cost of being exceedingly ugly, and harder to explain:

#!/bin/sh

perl -x "$0"

exit 0

#!perl

print "hello world, I'm $^X\n";

David Cantrell | October 28, 2010 11:40 AM

or if you care about the exit status of the perl script, exec perl and don't exit from the shell script

Marcus | December 27, 2010 10:42 AM

Did you consider the following?

for my $first (keys %isword) { for my $second (keys %isword) { if ($is_word{$first . $second})

About Ovid

Freelance Perl/Testing/Agile consultant and trainer. See http://www.allaroundtheworld.fr/ for our services. If you have a problem with Perl, we will solve it for you. And don't forget to buy my book! http://www.amazon.com/Beginning-Perl-Curtis-Poe/dp/1118013840/

More info »

Ovid