Day 2: When uniq is not unique enough (App::nauniq)

About the series: perlancar's 2014 Advent Calendar, an introduction to a selection of 24 modules which I published in 2014.

Any Unix user probably knows about the uniq command, most often used in combination with sort. For example, to get distinct lines from a file:

% sort somefile.txt | uniq

sort is usually needed because uniq only collapses duplicate lines that are adjacent. In other words, it does not keep a memory of past lines. This probably has to do with most traditional Unix utilities having been created in a decade when memory was very scarce (think of an age when the saying "640K ought to be enough for anybody" was considered a truism).
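A quick way to see the adjacency behavior for yourself. This is only an illustration; the sample function and its data are made up for the demo:

```shell
# Hypothetical sample input: "apple" appears on lines 1, 3 and 4.
sample() { printf '%s\n' apple banana apple apple cherry; }

# uniq alone only collapses the adjacent pair (lines 3 and 4),
# so "apple" still shows up twice in the output.
sample | uniq

# Sorting first makes every duplicate adjacent, so each line appears once,
# but the original order is lost.
sample | sort | uniq
```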

If you want to print the distinct lines of a file while keeping their original order, it is surprisingly cumbersome to do using standard Unix utilities. That's why I wrote nauniq (distributed in App-nauniq). With this utility, you can simply do:

% nauniq < somefile.txt

The main difference from uniq is that nauniq remembers past lines. To help keep memory usage low, there are several options, such as --num-entries (to only remember a certain number of lines), -w (to only remember a certain number of characters for each line), and --md5 (to only remember a line's MD5 hash instead of its content).
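To make the idea behind -w concrete, here is a rough sketch using awk. It is not how nauniq is implemented, and the function name first_n_uniq is made up for this example; it only shows the trade-off of remembering a prefix of each line instead of the whole line:

```shell
# Hypothetical helper: two lines count as duplicates when their first N
# characters match, so only N characters per distinct line are kept in memory.
first_n_uniq() {
    awk -v n="$1" '!seen[substr($0, 1, n)]++'
}

# "foobar" and "foobaz" share the 4-character prefix "foob",
# so only the first of the two is printed.
printf '%s\n' foobar foobaz quux | first_n_uniq 4
```

The price of the smaller memory footprint is that lines which differ only after the remembered prefix are treated as duplicates.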

One place where I use this utility regularly is in this cron entry:

( echo -n '# '; date -R; grep-search-query-in-url ~/.opera/global_history.dat ) | nauniq -a - ~/logs/opera-search.log

The above command maintains a history of search queries done on my computer, in their original chronological order and with duplicates removed. In general you can use nauniq to append new (non-duplicate) lines to a file.
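If you are curious what "append only new lines" amounts to, a minimal sketch with standard tools might look like the following. The function name append_new is hypothetical, and this is not nauniq's implementation: it rescans the log for every input line, so it is only reasonable for small files.

```shell
# Hypothetical helper: append each stdin line to the log file given as $1,
# unless an identical line is already present there.
append_new() {
    log=$1
    touch "$log"    # make sure the log exists before grep reads it
    while IFS= read -r line; do
        # -F: fixed string, -x: whole-line match, -q: quiet, -e: pattern may start with "-"
        grep -Fxq -e "$line" "$log" || printf '%s\n' "$line" >> "$log"
    done
}

# Usage (the file name is only an example):
#   some-command | append_new ~/logs/example.log
```

Because each new line is appended immediately, a line repeated within the same input is also written only once.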

3 Comments

I took this as a bit of a challenge. I think standard Linux tools do the trick just fine:

perl -ne 'print qq($. $_)' < input.txt | \
      sort -k2 -k1,1n                  | \
      uniq -f1                         | \
      sort -n                          | \
      sed 's/^[^ ]* //'

I think the easiest way to keep first occurrence of each line is one of:

$ awk '!a[$0]++' input.txt

or

$ perl -ne'$h{$_}++||print' input.txt

To keep the last occurrence instead, it is enough to run tac on input.txt before the awk or perl one-liner.
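A small sketch of that idea, assuming GNU tac is available. The function name keep_last is made up here; note the second tac, which puts the surviving lines back into their original order:

```shell
# Hypothetical helper: keep only the last occurrence of each line.
# Reverse the input, keep the first occurrence of each line, then reverse again.
keep_last() {
    tac | awk '!seen[$0]++' | tac
}

# "a" appears on lines 1 and 3; only the one on line 3 survives.
printf '%s\n' a b a c | keep_last
```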
