Day 9: Set operations with files (App::setop)
About the series: perlancar's 2014 Advent Calendar: Introduction to a selection of 24 modules which I published in 2014. Table of contents.
Have you ever encountered one or all of these situations?
1) You have two or ten log files, and want to know which IP addresses occur in all of them.
2) You have two SQL dump files, and want to know which tables are in the first but not in the second.
3) You have several files, each containing a list of usernames to create. You want to create all the users in order, but don't want to create any user twice.
Sure you have. These are quite common, and they are all basic set operations (intersection, difference, union). However, the solutions offered by the standard Unix tools are not optimal (read: not simple enough). In the third case you can sometimes get away with something like cat FILES | sort | uniq, if preserving order is not absolutely important. But if you search StackOverflow questions, the answers usually involve comm (which requires sorted input and is limited to two files at a time) or a combination of awk and sed (in which case, if you are a Perl programmer, you might just as well use perl).
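For comparison, here is roughly what the standard-tools approach looks like for two files. This is only a sketch: the file names (a.txt, b.txt, a.sorted, b.sorted) and their contents are made up for illustration and do not come from any real log or dump.

```shell
# Hypothetical sample inputs (invented for this example)
printf '%s\n' carol alice bob alice > a.txt
printf '%s\n' bob dave carol > b.txt

# Union, order preserved, duplicates removed:
# awk prints a line only the first time it sees it.
awk '!seen[$0]++' a.txt b.txt
# prints: carol, alice, bob, dave (one per line)

# Intersection with comm: both inputs must be sorted first,
# so the original line order is lost, and only two files fit.
sort -u a.txt > a.sorted
sort -u b.txt > b.sorted
comm -12 a.sorted b.sorted
# prints: bob, carol (one per line)
```

Note how each operation needs a different tool and a different trick, and how the comm variant stops scaling as soon as a third file shows up.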
setop (short for "set operation", from the App-setop distribution) will help make your life easier. It supports the set operations union (return lines from the first file and the others, duplicates removed, order preserved), intersect (return common lines found in every file, order preserved), diff (return lines found in the first file but not in the others, duplicates removed, order preserved), and symdiff (short for "symmetric difference", return lines found in exactly one of the files, duplicates removed, order preserved). Note that all operations support multiple files and preserve order.
So, to find lines in A not in B and C:
% setop --diff A B C
To find lines found in A and B and C (common lines):
% setop --intersect A B C
To find lines found in A or B or C (union):
% setop --union A B C
To find lines found in exactly one of A/B/C (symmetric difference):
% setop --symdiff A B C
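To get a feel for what these one-liners replace, here is a rough hand-rolled awk equivalent of the intersection case for any number of files. It is a sketch of the behaviour described above (common lines, in the order of the first file), not setop's actual implementation; the function name intersect_files, the sample file names, and the choice to print each common line only once are my own assumptions.

```shell
# Hypothetical helper: print lines common to all given files,
# in the order they first appear in the first file.
# Assumes each file name is passed only once.
intersect_files() {
  awk '
    # First time this line is seen in this particular file
    !seen[FILENAME, $0]++ {
      count[$0]++                              # number of files containing the line
      if (FILENAME == ARGV[1]) order[++n] = $0 # remember first-file order
    }
    END {
      for (i = 1; i <= n; i++)
        if (count[order[i]] == ARGC - 1)       # present in every file
          print order[i]
    }
  ' "$@"
}

# Hypothetical sample inputs (invented for this example)
printf '%s\n' x y z y > A.txt
printf '%s\n' z y w   > B.txt
printf '%s\n' y q z   > C.txt

intersect_files A.txt B.txt C.txt
# prints: y, z (one per line)
```

It works, but it is exactly the kind of thing you do not want to retype, or debug, every time the question comes up.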
And you can, like me, just forget about how to construct a pipeline of perl/awk/sed/sort/uniq/comm and go about your busy work day.