Keeping a Reasonable Number of Incremental Backups

My servers are backed up using a little script that works much like Apple's Time Machine. Every day a backup job moves the latest backup to a folder named after its creation date and creates a new backup in the folder latest. Rsync has a nifty feature (--link-dest) that hardlinks files to their counterparts in a different folder when they are unchanged. That means you don't create a full copy every time but only store the differences, i.e. an incremental backup.
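
A minimal sketch of that rotation, not the actual backup script: it assumes the hardlink feature is rsync's --link-dest option and that the previous backup's date is passed in by the caller. The function name rotate_and_backup, the RUN dry-run hook and the paths in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch only: rotate "latest" and take a new hardlinked backup.
# RUN is a hypothetical dry-run hook: with RUN=echo the commands are
# printed instead of executed.
RUN=${RUN:-}

rotate_and_backup() {
    src=$1      # directory to back up
    root=$2     # backup root, e.g. /mnt/local/backup
    stamp=$3    # date the previous backup was created, e.g. 20131125

    if [ -d "$root/latest" ]; then
        # The previous backup gets its dated name and becomes the
        # reference that unchanged files are hardlinked against.
        $RUN mv "$root/latest" "$root/$stamp"
        $RUN rsync -a --delete --link-dest="$root/$stamp" "$src/" "$root/latest/"
    else
        # First run: nothing to link against, so this is a full copy.
        $RUN rsync -a --delete "$src/" "$root/latest/"
    fi
}

# Usage (hypothetical paths):
#   rotate_and_backup /etc /mnt/local/backup 20131125
```

Setting RUN=echo before the call is a cheap way to check which commands would run before letting the job touch real backups.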

All in all that works quite well and I can go back to any date to see the state of the machine. The script has been in production for quite a while now, and I ended up running out of space because I hadn't figured out a good way to get rid of old backups. A listing for one of my servers currently looks like this:


20130521  20130606  20130906  20130922  20131008  20131024  20131114
20130522  20130607  20130907  20130923  20131009  20131025  20131115
20130523  20130608  20130908  20130924  20131010  20131026  20131116
20130524  20130609  20130909  20130925  20131011  20131027  20131117
20130525  20130610  20130910  20130926  20131012  20131102  20131118
20130526  20130611  20130911  20130927  20131013  20131103  20131119
20130527  20130612  20130912  20130928  20131014  20131104  20131120
20130528  20130613  20130913  20130929  20131015  20131105  20131121
20130529  20130614  20130914  20130930  20131016  20131106  20131122
20130530  20130615  20130915  20131001  20131017  20131107  20131123
20130531  20130617  20130916  20131002  20131018  20131108  20131124
20130601  20130618  20130917  20131003  20131019  20131109  20131125
20130602  20130619  20130918  20131004  20131020  20131110  latest
20130603  20130902  20130919  20131005  20131021  20131111
20130604  20130903  20130920  20131006  20131022  20131112
20130605  20130905  20130921  20131007  20131023  20131113

My idea was to keep a number of daily backups and make the backups sparser the further back in time they go:

  1. Delete all backups older than a year (I think that's reasonable; I keep log files on the server for longer than that, so I don't actually lose them, since they are still on the live system)
  2. Keep a daily backup for the last seven days (if someone, most probably me, screws up and deletes a bunch of stuff, I will know within seven days and will have a good chance to restore without losing too much data)
  3. Keep weekly backups for the last month (there's no need to keep daily backups for such old data)
  4. Keep monthly backups for the last year
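
The four rules can be written down as one small decision function, which makes the policy easy to check before anything is deleted. This is only a sketch: the name retention_decision and its arguments are hypothetical, and the choice of Sunday as the weekly and monthly anchor and the 7/28/366 day cut-offs are assumptions taken from the one-liners further down. It also treats "first Sunday of the month" as a calendar rule (a Sunday on day 1 to 7), which is slightly stricter than picking the first Sunday that happens to exist in the listing.

```shell
#!/bin/sh
# Sketch of the thinning policy as a pure function.
# Arguments: age in whole days, weekday (0 = Sunday), day of the month
# (1-31, without a leading zero). Prints "keep" or "delete".
retention_decision() {
    age=$1
    weekday=$2
    day=$3

    if [ "$age" -gt 366 ]; then
        echo delete          # rule 1: older than a year
    elif [ "$age" -le 7 ]; then
        echo keep            # rule 2: dailies for the last seven days
    elif [ "$weekday" -ne 0 ]; then
        echo delete          # older than a week: only Sundays survive
    elif [ "$age" -le 28 ]; then
        echo keep            # rule 3: weeklies for the last month
    elif [ "$day" -le 7 ]; then
        echo keep            # rule 4: first Sunday of the month
    else
        echo delete          # a later Sunday in an older month
    fi
}

# Usage:
#   retention_decision 60 0 3     # a Sunday, the 3rd, 60 days old
```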

I ended up writing a couple of bash one-liners that do the job for me.

Deleting backups older than a year was easy, nothing fancy:

find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +366 | xargs rm -rf
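
Since rm -rf is unforgiving, it can help to look at the selection before deleting anything. A small sketch of such a dry run: expired_backups is a hypothetical helper name, the -type d restriction is an addition of this sketch, and the default of 366 days mirrors the one-liner above.

```shell
#!/bin/sh
# Sketch: list backup folders older than a cut-off without deleting them.
expired_backups() {
    dir=$1
    days=${2:-366}    # same cut-off as the one-liner unless overridden
    # Only direct children of the backup root, directories only.
    find "$dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" | sort
}

# Usage: review the list first, then delete in a second step.
#   expired_backups /mnt/local/backup
#   expired_backups /mnt/local/backup | xargs rm -rf
```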

Keeping daily backups for seven days and weekly backups after that is done by simply leaving the last seven days out of the selection. The printf outputs the day of the week, a tab and then the full path to the backup folder. Grep drops the Sunday lines (weekday 0) so that only backups not created on a Sunday remain in the list, awk extracts the second field (the full path) and xargs finally takes care of deleting them.

find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +7 -printf "%Tw\t%p\n" | grep -v '^0\s' | awk '{print $2}' | xargs rm -r
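
The pipeline above takes the weekday from each folder's modification time. Because the folders are already named YYYYMMDD, the weekday can also be computed from the name, which may be handy if the timestamps are ever not trustworthy. The following sketch uses Sakamoto's day-of-week method in plain shell arithmetic; weekday_of is a hypothetical helper, and it assumes the folder name really is the backup date (so latest has to be skipped by the caller).

```shell
#!/bin/sh
# Sketch: weekday (0 = Sunday ... 6 = Saturday) for a folder named
# YYYYMMDD, computed without calling an external date command.
weekday_of() {
    name=${1##*/}        # strip any leading path
    y=${name%????}       # first four digits: year
    md=${name#????}      # remaining four digits: MMDD
    m=${md%??}
    d=${md#??}
    m=${m#0}             # drop leading zeros so 08/09 are not
    d=${d#0}             # mistaken for octal numbers

    # Month offsets from Sakamoto's table, January first.
    set -- 0 3 2 5 0 3 5 1 4 6 2 4
    shift $((m - 1))
    t=$1

    # January and February count as part of the previous year.
    if [ "$m" -lt 3 ]; then
        y=$((y - 1))
    fi
    echo $(( (y + y / 4 - y / 100 + y / 400 + t + d) % 7 ))
}

# Usage:
#   weekday_of /mnt/local/backup/20130526    # prints 0 (a Sunday)
```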

Keeping one backup per month was a little trickier, and this is where Perl comes into play. The printf was modified to print the year and month, a tab, the day of the month, the weekday and finally the full path. This is then passed to sort, which gives me a list of backups nicely sorted by date. I want to keep the first backup of each month that falls on a Sunday, since the script above gets rid of all other days of the week. The Perl one-liner does just that: it goes through all backups, remembers the first Sunday it sees for each month and prints every path except that one. Finally xargs takes care of deleting whatever was printed.

find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +28 -printf "%TY%Tm\t%Td\t%Tw\t%p\n" | sort | perl -E 'my %m; while(<>) { my @s = split(/\s/); $m{$s[0]} ||= $s[3] unless $s[2]; say $s[3] unless $m{$s[0]} eq $s[3] }' | xargs rm -rf
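
For anyone who would rather stay out of Perl, the same filter can be sketched in awk. It expects the tab-separated find output from above on stdin (year and month, day, weekday, path) and prints every path except the first Sunday of each month. The name monthly_delete_candidates is hypothetical, and this is meant as an equivalent under those assumptions, not a drop-in replacement that has run in production.

```shell
#!/bin/sh
# Sketch: print the paths that the monthly thinning would delete.
# Input lines: YYYYMM <tab> DD <tab> weekday <tab> path
monthly_delete_candidates() {
    sort | awk -F '\t' '
        # Remember the first Sunday (weekday 0) seen for each month.
        $3 == 0 && !($1 in keep) { keep[$1] = $4 }
        # Print everything that is not the remembered Sunday.
        !(($1 in keep) && keep[$1] == $4) { print $4 }
    '
}

# Usage: inspect the list, then pipe it to xargs rm -rf once it looks right.
#   find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +28 \
#       -printf "%TY%Tm\t%Td\t%Tw\t%p\n" | monthly_delete_candidates
```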

4 Comments

That looks very much like http://www.rsnapshot.org/

I've been using it for years and it's very very simple.

I have a similar app which I've been using personally and on production servers: File::RsyBak (also includes a command-line interface, rsybak). There are multiple similar modules on CPAN. Apparently the rsync hardlink feature gets used a lot in backup settings.

For the problem you mentioned, File::RsyBak uses backup history levels, e.g. 7 daily + 4 weekly + 4 monthly. Selected backups from the lower levels (e.g. daily) will be promoted to a higher level (e.g. weekly). The POD of File::RsyBak describes the process in more detail.

I started backing my repository and databases up to Amazon's S3 storage. It's cheap and you can configure the lifecycle of data from the AWS console to either delete the files after a certain period of time or move them to Glacier.

Essentially I tar up the current development environment for my partner and me, plus the source repo, and use an S3 client to send the tarballs to S3. I then dump all the MySQL (prod) databases (application, bugzilla, etc), tar them up and do the same. Once a week I tar up our wiki site and do the same.

I set the lifecycle specs on the CVS repo and other files to delete after 30 days, since presumably a repo has all the history anyway. Databases that have been dumped and tarred are moved to Glacier after 120 days. I then have 3 months of database snapshots in S3 available to me, and perpetual snapshots on Glacier. I sleep like a baby.

Glacier is so incredibly cheap that it begs the question why delete things? It may take you a few days to get things restored from Glacier, but if you're a pack rat, it's the perfect solution. ;-)

Anyone interested in the bash script I use to get an idea how easy this is, just email me and I'd be happy to share.

I use ZFS and a small script -- https://github.com/abh/zfs-snapshot-cleaner -- to clean them up.

(I also have a version that supports btrfs, but I guess I didn't put that up on github).

Ask


About mo

I blog about Perl.