Sets operations

By Flavio Poletti on March 24, 2012 6:19 PM under Perl

To help some coworkers I whipped up a program to perform set operations in Perl. It's quite basic but it's been pretty effective so far and it's on github.

Sets are assumed to be files where each line is a different element. It is assumed that equal lines are either not present or can be filtered out with no consequence. The inner working assumes that at a certain point the input files are sorted, and in general the external sort program is used automatically, which limits the applicability in some platforms.

The three basic operations that are supported are union, intersection and difference.

# intersect two files, also with "intersect", "i",
# "I" (uppercase "i") and "^"
sets file1 & file2

# union of two files, also with "union", "u", "U",
# "v", "V" and "|"
sets file1 + file2

# subtraction of second file from first one, also
# with "minus", "less" and "\"
sets file1 - file2

Other operations, e.g. symmetric difference, can be obtained with a combination of the predefined ones. Operations can be grouped and in general the expression will be provided as a single string to avoid the shell to creep in:

# symmetric difference, alternative 1
sets '(file1 - file2) + (file2 - file1)'

# symmetric difference, how you probably saw it somewhere
sets '(file1 + file2) - (file1 & file2)'

Operations associate from left to right, so the first group above is not needed. Anyway I usually prefer to be explicit.

As anticipated, at some point the program needs to work with a sorted input. The basic motivation for the program is handling operations on files with a few million elements, so putting all the stuff in memory is not an option; on the other hand, sort is quite efficient and reinventing the wheel is not an option as well!

Sorting is usually handled automatically with a call to the external sort utility (with the -u option, because sets are assumed to not contain duplicates); anyway, this can be a time consuming activity that is not necessary if you already know that your inputs are sorted, so you can tell the program when this is actually the case:

sets -s sorted-file1 ^ sorted-file2

When sorting is performed, it is usually done on the fly without saving the intermediate sorted files. They can be useful for following sets operations, or when the same input is used multiple times (e.g. in the case of the symmetric difference examples above), so it is possible to save the sorted files with the same name and a suffix appended:

sets -S .sorted '(file1 - file2) + (file2 - file1)'

If the sorted version of a file is found (i.e. file1.sorted and/or file2.sorted in the example above) it is used with no further sorting, speeding things up automatically.

Sometimes inputs might come from different platforms, so the line terminator would be different. In our case we don't need leading or trailing whitespaces, so there is a trimming options to avoid problems:

sets -t file1-unix - file2-dos

If you think that it can be useful for you, it's possible to download a bundled version that does not need external modules installed anywhere: enjoy sets!

5 comments

Tagged as:

sets perl

5 Comments

Gábor Szabó - גאבור סבו | March 26, 2012 8:51 AM | Reply

Nice but doesn't sort itself read the whole file in memory anyway?
What about platforms that have no sort commands? I am thinking about Windows of course.

Nitpicking: why have you used lower case s in the name of the module and not called it App::Sets ?

Farhad | March 26, 2012 11:06 AM | Reply

It sounds very interesting script, but by running it I got this Error:

./sets --help

String found where operator expected at /loader/0x798108/App/sets.pm line 306, near "LOGDIE "parse error at char $pos --> $offending\n""
(Do you need to predeclare LOGDIE?)
String found where operator expected at /loader/0x798108/App/sets.pm line 504, near "LOGDIE "unknown $_""
(Do you need to predeclare LOGDIE?)
syntax error at /loader/0x798108/App/sets.pm line 306, near "LOGDIE "parse error at char $pos --> $offending\n""
syntax error at /loader/0x798108/App/sets.pm line 504, near "LOGDIE "unknown $_""
BEGIN not safe after errors--compilation aborted at /loader/0x798108/App/sets.pm line 508.
Compilation failed in require at ./sets line 1833.
BEGIN failed--compilation aborted at ./sets line 1833.

Dario Meloni | March 26, 2012 12:40 PM | Reply

I was looking at the code but I can't understand if this brings any improvement with respect to the command 'comm'...

Flavio Poletti | March 26, 2012 11:08 PM | Reply

@Gabor: sort relies upon temporary files for its operation, in order to be able to work on big files without going out of memory. And of course relying upon it is one of the major assumptions of the program, unless there's an efficient way to perform this kind of operation in the Win32 platform. I hate to say "patches welcome" but I use Win32 only for the Office suite and do all the rest in Linux (although I'm willing to do some testing).

@Farhad: thanks for pointing out, there was a bug involved with packing all the modules in one single file but now it should be fixed.

@Dario: first of all I'm a bit ashamed that I didn't know about comm. I see that it assumes to receive a sorted file in input, which is handled automatically in sets case; anyway, I understand that you can handle this easily with some scripting. One additional feature of sets is its parsing of a complex operation over multiple sets, which would require multiple invocations and temporary files. Anyway thanks for the hint about comm!

@Gabor, take 2 (from Perl Weekly comment): my coworkers have to deal with big lists of phone numbers (some millions) in order to generate other lists for migrating users from one system to another one. sets has been useful for filtering these lists, checking whether they intersect, filtering out some numbers in a file from another file, etc.

fibo | March 29, 2012 5:36 PM | Reply

Hi Flavio, seems really usefull. Can you put it on CPAN ?

Name

Email Address

URL

Remember personal info?

Comments (You may use HTML tags for style)

About Flavio Poletti

More info »

polettix

Sets operations

Tagged as:

5 Comments

Leave a comment

About Flavio Poletti

Search this blog