Sets operations
To help some coworkers I whipped up a program to perform set operations in Perl. It's quite basic but it's been pretty effective so far and it's on github.
Sets are assumed to be files where each line is a different element. It is assumed that equal lines are either not present or can be filtered out with no consequence. The inner working assumes that at a certain point the input files are sorted, and in general the external sort
program is used automatically, which limits the applicability in some platforms.
The three basic operations that are supported are union, intersection and difference.
# intersect two files, also with "intersect", "i",
# "I" (uppercase "i") and "^"
sets file1 & file2
# union of two files, also with "union", "u", "U",
# "v", "V" and "|"
sets file1 + file2
# subtraction of second file from first one, also
# with "minus", "less" and "\"
sets file1 - file2
Other operations, e.g. symmetric difference, can be obtained with a combination of the predefined ones. Operations can be grouped and in general the expression will be provided as a single string to avoid the shell to creep in:
# symmetric difference, alternative 1
sets '(file1 - file2) + (file2 - file1)'
# symmetric difference, how you probably saw it somewhere
sets '(file1 + file2) - (file1 & file2)'
Operations associate from left to right, so the first group above is not needed. Anyway I usually prefer to be explicit.
As anticipated, at some point the program needs to work with a sorted input. The basic motivation for the program is handling operations on files with a few million elements, so putting all the stuff in memory is not an option; on the other hand, sort
is quite efficient and reinventing the wheel is not an option as well!
Sorting is usually handled automatically with a call to the external sort
utility (with the -u
option, because sets are assumed to not contain duplicates); anyway, this can be a time consuming activity that is not necessary if you already know that your inputs are sorted, so you can tell the program when this is actually the case:
sets -s sorted-file1 ^ sorted-file2
When sorting is performed, it is usually done on the fly without saving the intermediate sorted files. They can be useful for following sets operations, or when the same input is used multiple times (e.g. in the case of the symmetric difference examples above), so it is possible to save the sorted files with the same name and a suffix appended:
sets -S .sorted '(file1 - file2) + (file2 - file1)'
If the sorted version of a file is found (i.e. file1.sorted
and/or file2.sorted
in the example above) it is used with no further sorting, speeding things up automatically.
Sometimes inputs might come from different platforms, so the line terminator would be different. In our case we don't need leading or trailing whitespaces, so there is a trimming options to avoid problems:
sets -t file1-unix - file2-dos
If you think that it can be useful for you, it's possible to download a bundled version that does not need external modules installed anywhere: enjoy sets!
Nice but doesn't sort itself read the whole file in memory anyway?
What about platforms that have no sort commands? I am thinking about Windows of course.
Nitpicking: why have you used lower case s in the name of the module and not called it App::Sets ?
It sounds very interesting script, but by running it I got this Error:
./sets --help
String found where operator expected at /loader/0x798108/App/sets.pm line 306, near "LOGDIE "parse error at char $pos --> $offending\n""
(Do you need to predeclare LOGDIE?)
String found where operator expected at /loader/0x798108/App/sets.pm line 504, near "LOGDIE "unknown $_""
(Do you need to predeclare LOGDIE?)
syntax error at /loader/0x798108/App/sets.pm line 306, near "LOGDIE "parse error at char $pos --> $offending\n""
syntax error at /loader/0x798108/App/sets.pm line 504, near "LOGDIE "unknown $_""
BEGIN not safe after errors--compilation aborted at /loader/0x798108/App/sets.pm line 508.
Compilation failed in require at ./sets line 1833.
BEGIN failed--compilation aborted at ./sets line 1833.
I was looking at the code but I can't understand if this brings any improvement with respect to the command 'comm'...
@Gabor: sort relies upon temporary files for its operation, in order to be able to work on big files without going out of memory. And of course relying upon it is one of the major assumptions of the program, unless there's an efficient way to perform this kind of operation in the Win32 platform. I hate to say "patches welcome" but I use Win32 only for the Office suite and do all the rest in Linux (although I'm willing to do some testing).
@Farhad: thanks for pointing out, there was a bug involved with packing all the modules in one single file but now it should be fixed.
@Dario: first of all I'm a bit ashamed that I didn't know about comm. I see that it assumes to receive a sorted file in input, which is handled automatically in sets case; anyway, I understand that you can handle this easily with some scripting. One additional feature of sets is its parsing of a complex operation over multiple sets, which would require multiple invocations and temporary files. Anyway thanks for the hint about comm!
@Gabor, take 2 (from Perl Weekly comment): my coworkers have to deal with big lists of phone numbers (some millions) in order to generate other lists for migrating users from one system to another one. sets has been useful for filtering these lists, checking whether they intersect, filtering out some numbers in a file from another file, etc.
Hi Flavio, seems really usefull. Can you put it on CPAN ?