Sets operations

To help some coworkers I whipped up a program to perform set operations in Perl. It's quite basic but it's been pretty effective so far and it's on github.

Sets are assumed to be files where each line is a different element. It is assumed that equal lines are either not present or can be filtered out with no consequence. The inner working assumes that at a certain point the input files are sorted, and in general the external sort program is used automatically, which limits the applicability in some platforms.

The three basic operations that are supported are union, intersection and difference.

# intersect two files, also with "intersect", "i",
# "I" (uppercase "i") and "^"
sets file1 & file2

# union of two files, also with "union", "u", "U",
# "v", "V" and "|"
sets file1 + file2

# subtraction of second file from first one, also
# with "minus", "less" and "\"
sets file1 - file2

Other operations, e.g. symmetric difference, can be obtained with a combination of the predefined ones. Operations can be grouped and in general the expression will be provided as a single string to avoid the shell to creep in:

# symmetric difference, alternative 1
sets '(file1 - file2) + (file2 - file1)'

# symmetric difference, how you probably saw it somewhere
sets '(file1 + file2) - (file1 & file2)'

Operations associate from left to right, so the first group above is not needed. Anyway I usually prefer to be explicit.

As anticipated, at some point the program needs to work with a sorted input. The basic motivation for the program is handling operations on files with a few million elements, so putting all the stuff in memory is not an option; on the other hand, sort is quite efficient and reinventing the wheel is not an option as well!

Sorting is usually handled automatically with a call to the external sort utility (with the -u option, because sets are assumed to not contain duplicates); anyway, this can be a time consuming activity that is not necessary if you already know that your inputs are sorted, so you can tell the program when this is actually the case:

sets -s sorted-file1 ^ sorted-file2

When sorting is performed, it is usually done on the fly without saving the intermediate sorted files. They can be useful for following sets operations, or when the same input is used multiple times (e.g. in the case of the symmetric difference examples above), so it is possible to save the sorted files with the same name and a suffix appended:

sets -S .sorted '(file1 - file2) + (file2 - file1)'

If the sorted version of a file is found (i.e. file1.sorted and/or file2.sorted in the example above) it is used with no further sorting, speeding things up automatically.

Sometimes inputs might come from different platforms, so the line terminator would be different. In our case we don't need leading or trailing whitespaces, so there is a trimming options to avoid problems:

sets -t file1-unix - file2-dos

If you think that it can be useful for you, it's possible to download a bundled version that does not need external modules installed anywhere: enjoy sets!

5 Comments

Nice but doesn't sort itself read the whole file in memory anyway?
What about platforms that have no sort commands? I am thinking about Windows of course.

Nitpicking: why have you used lower case s in the name of the module and not called it App::Sets ?

It sounds very interesting script, but by running it I got this Error:

./sets --help

String found where operator expected at /loader/0x798108/App/sets.pm line 306, near "LOGDIE "parse error at char $pos --> $offending\n""
(Do you need to predeclare LOGDIE?)
String found where operator expected at /loader/0x798108/App/sets.pm line 504, near "LOGDIE "unknown $_""
(Do you need to predeclare LOGDIE?)
syntax error at /loader/0x798108/App/sets.pm line 306, near "LOGDIE "parse error at char $pos --> $offending\n""
syntax error at /loader/0x798108/App/sets.pm line 504, near "LOGDIE "unknown $_""
BEGIN not safe after errors--compilation aborted at /loader/0x798108/App/sets.pm line 508.
Compilation failed in require at ./sets line 1833.
BEGIN failed--compilation aborted at ./sets line 1833.

I was looking at the code but I can't understand if this brings any improvement with respect to the command 'comm'...

Hi Flavio, seems really usefull. Can you put it on CPAN ?

Leave a comment

About Flavio Poletti

user-pic