Day 2: When uniq is not unique enough (App::nauniq)
About the series: perlancar's 2014 Advent Calendar: Introduction to a selection of 24 modules which I published in 2014. Table of contents.
Any Unix user probably knows about the uniq command, most often used in combination with sort. For example, to get distinct lines from a file:
% sort somefile.txt | uniq
sort is usually needed because uniq only collapses duplicate lines that are adjacent. In other words, it keeps no memory of earlier lines. This is probably because most traditional Unix utilities were created in an era when memory was very scarce (think of an age when the saying "640K ought to be enough for anybody" was considered a truism).
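To make the adjacency limitation concrete, here is a small sketch using only printf, sort and uniq. The sample data and the helper name sample are made up for illustration.

```shell
#!/bin/sh
# Sample input: "apple" appears twice, but the two copies are not adjacent.
sample() { printf 'apple\nbanana\napple\n'; }

# uniq compares each line only with the line before it,
# so both copies of "apple" survive.
unsorted=$(sample | uniq)

# After sorting, the duplicates sit next to each other and uniq collapses them,
# but the original order (apple, banana, apple) is lost.
sorted=$(sample | sort | uniq)

printf 'uniq only:\n%s\n\nsort | uniq:\n%s\n' "$unsorted" "$sorted"
```

The first result still has three lines, the second has two.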
If you want to print the distinct lines of a file while keeping their original order, this is surprisingly cumbersome to do with standard Unix utilities. That's why I wrote nauniq (distributed in App-nauniq). With this utility, you can simply do:
% nauniq somefile.txt
The main difference from uniq is that nauniq remembers past lines. To help keep memory usage low, there are several options, such as --num-entries (remember only a certain number of lines), -w (remember only a certain number of characters of each line), and --md5 (remember only each line's MD5 hash instead of its content).
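The trade-off behind an option like -w can be sketched with awk: remember a short key per line instead of the whole line. This is only an approximation of the idea, not nauniq's actual implementation. The helper name dedup_prefix is hypothetical, and I am assuming that lines sharing the same first N characters count as duplicates.

```shell
#!/bin/sh
# dedup_prefix N: print a line unless an earlier line had the same first N
# characters. Only the N-character key is kept in memory, not the whole line.
# Hypothetical helper for illustration; not part of nauniq.
dedup_prefix() {
    awk -v n="$1" '{ key = substr($0, 1, n) } !seen[key]++'
}

# With a 5-character key, "alpha-2" is treated as a duplicate of "alpha-1".
printf 'alpha-1\nbeta-1\nalpha-2\nbeta-1\n' | dedup_prefix 5
```

A shorter key saves memory at the price of false duplicates: with a key of 7 characters the same input would keep alpha-2 as well.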
One place where I use this utility regularly is in this cron entry:
( echo -n '# '; date -R; grep-search-query-in-url ~/.opera/global_history.dat ) | nauniq -a - ~/logs/opera-search.log
The above command maintains a history of the search queries made on my computer, in their original chronological order and with duplicates removed. More generally, you can use nauniq to append new (non-duplicate) lines to a file.
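For readers without nauniq installed, the append-only-new-lines behaviour can be roughly emulated with awk. The helper name append_new is hypothetical, and this sketch makes no attempt to limit memory use the way nauniq's options do.

```shell
#!/bin/sh
# append_new LOGFILE: read lines from stdin and append to LOGFILE only those
# that are not already in it (and not repeated earlier on stdin).
# Hypothetical helper built from standard tools; not how nauniq is implemented.
append_new() {
    logfile=$1
    awk -v logfile="$logfile" '
        BEGIN {
            # Load every existing line of the log before producing any output.
            while ((getline line < logfile) > 0) seen[line] = 1
            close(logfile)
        }
        !seen[$0]++   # print a line only the first time it is seen
    ' >> "$logfile"
}

# Usage: the second run adds only "sed", and only once.
log=$(mktemp)
printf 'perl\nawk\n' | append_new "$log"
printf 'awk\nsed\nperl\nsed\n' | append_new "$log"
cat "$log"
```

The existing log is read completely in the BEGIN block, so the lines appended by the same run are never fed back into the comparison.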
I took this as a bit of a challenge. I think standard Linux tools do the trick just fine:
I think the easiest way to keep the first occurrence of each line is one of:
$ awk '!a[$0]++' input.txt
or
$ perl -ne'$h{$_}++||print' input.txt
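To keep the last occurrence instead, it is enough to run tac on input.txt before the awk or perl one-liner (and to pipe the result through tac again if you want the lines back in their original order).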
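A minimal sketch of that keep-last variant, assuming GNU tac is available. The helper name keep_last is made up for illustration.

```shell
#!/bin/sh
# keep_last: keep only the last occurrence of each line.
# Reverse the input, keep first occurrences, then reverse back.
keep_last() {
    tac | awk '!seen[$0]++' | tac
}

# The first "a" and the first "b" are dropped; the later copies are kept.
printf 'a\nb\na\nc\nb\n' | keep_last
```

The surviving lines stay in the relative order of their last occurrences.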
Yup, that is what nauniq essentially does (plus some options).