To help some coworkers I whipped up a program to perform set operations in Perl. It's quite basic but it's been pretty effective so far and it's on github.
Sets are assumed to be files where each line is a different element. It is assumed that equal lines are either not present or can be filtered out with no consequence. The inner working assumes that at a certain point the input files are sorted, and in general the external
sort program is used automatically, which limits the applicability in some platforms.
The three basic operations that are supported are union, intersection and difference.
# intersect two files, also with "intersect", "i", # "I" (uppercase "i") and "^" sets file1 & file2 # union of two files, also with "union", "u", "U", # "v", "V" and "|" sets file1 + file2 # subtraction of second file from first one, also # with "minus", "less" and "\" sets file1 - file2
Other operations, e.g. symmetric difference, can be obtained with a combination of the predefined ones. Operations can be grouped and in general the expression will be provided as a single string to avoid the shell to creep in:
# symmetric difference, alternative 1 sets '(file1 - file2) + (file2 - file1)' # symmetric difference, how you probably saw it somewhere sets '(file1 + file2) - (file1 & file2)'
Operations associate from left to right, so the first group above is not needed. Anyway I usually prefer to be explicit.
As anticipated, at some point the program needs to work with a sorted input. The basic motivation for the program is handling operations on files with a few million elements, so putting all the stuff in memory is not an option; on the other hand,
sort is quite efficient and reinventing the wheel is not an option as well!
Sorting is usually handled automatically with a call to the external
sort utility (with the
-u option, because sets are assumed to not contain duplicates); anyway, this can be a time consuming activity that is not necessary if you already know that your inputs are sorted, so you can tell the program when this is actually the case:
sets -s sorted-file1 ^ sorted-file2
When sorting is performed, it is usually done on the fly without saving the intermediate sorted files. They can be useful for following sets operations, or when the same input is used multiple times (e.g. in the case of the symmetric difference examples above), so it is possible to save the sorted files with the same name and a suffix appended:
sets -S .sorted '(file1 - file2) + (file2 - file1)'
If the sorted version of a file is found (i.e.
file2.sorted in the example above) it is used with no further sorting, speeding things up automatically.
Sometimes inputs might come from different platforms, so the line terminator would be different. In our case we don't need leading or trailing whitespaces, so there is a trimming options to avoid problems:
sets -t file1-unix - file2-dos
If you think that it can be useful for you, it's possible to download a bundled version that does not need external modules installed anywhere: enjoy sets!