Keeping a Reasonable Number of Incremental Backups
My servers are backed up using a little script that works much like Apple's Time Machine. Every day a backup job moves the latest backup into a folder named after the date of its creation and creates a new backup in the folder latest. Rsync has a nifty feature (--link-dest) that hard-links to files in a reference folder whenever a file is unchanged, meaning you don't create a full backup every time but only sync the differences, i.e. an incremental backup.
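For reference, the daily job boils down to something like this (a minimal sketch, not the actual script; the remote host and the way the dated folder name is derived are assumptions):

#!/bin/sh
# Minimal sketch of a Time Machine-style rsync job (host and date handling assumed).
DEST=/mnt/local/backup
YESTERDAY=$(date -d yesterday +%Y%m%d)   # GNU date; name the folder after the day the backup was taken
mv "$DEST/latest" "$DEST/$YESTERDAY"
mkdir -p "$DEST/latest"
rsync -a --delete --link-dest="$DEST/$YESTERDAY" root@server:/ "$DEST/latest/"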
All in all that works quite well, and I can go back to any date to see the state of the machine. The script has been in production for quite a while now, and I ended up running out of space because I hadn't figured out a good way to get rid of old backups. A listing for one of my servers currently looks like this:
20130521 20130606 20130906 20130922 20131008 20131024 20131114
20130522 20130607 20130907 20130923 20131009 20131025 20131115
20130523 20130608 20130908 20130924 20131010 20131026 20131116
20130524 20130609 20130909 20130925 20131011 20131027 20131117
20130525 20130610 20130910 20130926 20131012 20131102 20131118
20130526 20130611 20130911 20130927 20131013 20131103 20131119
20130527 20130612 20130912 20130928 20131014 20131104 20131120
20130528 20130613 20130913 20130929 20131015 20131105 20131121
20130529 20130614 20130914 20130930 20131016 20131106 20131122
20130530 20130615 20130915 20131001 20131017 20131107 20131123
20130531 20130617 20130916 20131002 20131018 20131108 20131124
20130601 20130618 20130917 20131003 20131019 20131109 20131125
20130602 20130619 20130918 20131004 20131020 20131110 latest
20130603 20130902 20130919 20131005 20131021 20131111
20130604 20130903 20130920 20131006 20131022 20131112
20130605 20130905 20130921 20131007 20131023 20131113
My idea was to keep a number of daily backups and make the backup density sparser the further back in time we go:
- Delete all backups older than a year (I think that's reasonable; I keep log files on the live system for longer than that anyway, so deleting these backups doesn't actually get rid of anything)
- Keep a daily backup for the last seven days (if someone, most probably me, screws up and deletes a bunch of stuff, I will notice within seven days and have a good chance to restore without losing too much data)
- Keep weekly backups for the last month (there's no need to keep daily backups for data that old)
- Keep monthly backups for the last year
I ended up writing a couple of bash one-liners that do the job for me.
Deleting backups older than a year was easy, nothing fancy:
find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +366 | xargs rm -rf
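If a folder name could ever contain whitespace, a null-delimited variant of the same command is safer (not from the original post, just a common precaution):

find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +366 -print0 | xargs -0 rm -rf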
Keeping a daily backup for seven days and then weekly backups is simply done by excluding the last seven days from the selection. The printf outputs the day of the week, a tab and then the full path to the backup folder. Grep filters out every line created on a Sunday (day 0), so those backups survive; for the remaining lines awk extracts the second field (the full path) and xargs finally takes care of deleting it.
find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +7 -printf "%Tw\t%p\n" | grep -v '^0\s' | awk '{print $2}' | xargs rm -r
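To make the pipeline easier to follow, the intermediate output of the printf looks like this (illustrative lines; 0 means Sunday, so those lines are dropped by the grep and the corresponding backups survive):

0	/mnt/local/backup/20131110
1	/mnt/local/backup/20131111
2	/mnt/local/backup/20131112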
Keeping one backup per month was a little more tricky, and this is where Perl comes into play. The printf was modified to print the year and month, the day of the month, the weekday and finally the full path, separated by tabs. That is passed to sort, which gives me a nicely sorted list of backups by date. I want to keep the first backup of each month that falls on a Sunday, since the script above already got rid of all the other weekdays. The Perl one-liner does just that: it walks through all the backups and prints every path except the first Sunday of each month, and xargs then deletes everything that was printed.
find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +28 -printf "%TY%Tm\t%Td\t%Tw\t%p\n" | sort | perl -E 'my %m; while (<>) { my @s = split /\s/; $m{$s[0]} ||= $s[3] unless $s[2]; say $s[3] unless $m{$s[0]} eq $s[3] }' | xargs rm -rf
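For readability, here is the same pipeline spread over several lines with comments (identical logic, nothing new):

# printf emits: YYYYMM <tab> day-of-month <tab> weekday (0 = Sunday) <tab> path
find /mnt/local/backup/ -maxdepth 1 -mindepth 1 -mtime +28 \
    -printf "%TY%Tm\t%Td\t%Tw\t%p\n" |
sort |                                       # oldest first, so a month's first Sunday comes first
perl -E '
    my %m;                                   # first Sunday seen per month, keyed by YYYYMM
    while (<>) {
        my @s = split /\s/;                  # [0]=YYYYMM [1]=day [2]=weekday [3]=path
        $m{$s[0]} ||= $s[3] unless $s[2];    # weekday 0 is false: remember the month's first Sunday
        say $s[3] unless $m{$s[0]} eq $s[3]; # print every path that is not that Sunday
    }' |
xargs rm -rf                                 # ...and delete everything that was printed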
That looks very much like http://www.rsnapshot.org/
I've been using it for years and it's very very simple.
I have a similar app which I've been using personally and on production servers: File::RsyBak (also includes a command-line interface, rsybak). There are multiple similar modules on CPAN. Apparently the rsync hardlink feature gets used a lot in backup settings.
For the problem you mentioned, File::RsyBak uses backup history levels, e.g. 7 daily + 4 weekly + 4 monthly. Selected backups from a lower level (e.g. daily) are promoted to the next higher level (e.g. weekly). The POD of File::RsyBak describes the process in more detail.
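The promotion idea, independent of that particular module, could look roughly like this for the daily-to-weekly step (a hypothetical sketch assuming daily/ and weekly/ subfolders; this is not File::RsyBak's actual code):

#!/bin/sh
# Hypothetical sketch of level promotion: keep 7 dailies; before an older daily
# is discarded, move it to weekly/ unless that ISO week is already covered there.
BACKUPS=/mnt/local/backup                    # assumed layout: $BACKUPS/daily/YYYYMMDD, $BACKUPS/weekly/YYYYMMDD
mkdir -p "$BACKUPS/weekly"
for d in $(ls "$BACKUPS/daily" | sort | head -n -7); do
    week=$(date -d "$d" +%G-%V)              # ISO year-week of this backup (GNU date)
    if ls "$BACKUPS/weekly" | while read -r w; do date -d "$w" +%G-%V; done | grep -qx "$week"; then
        rm -rf "$BACKUPS/daily/$d"                    # week already covered at the weekly level
    else
        mv "$BACKUPS/daily/$d" "$BACKUPS/weekly/$d"   # promote the oldest daily of that week
    fi
done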
I started backing my repository and databases up to Amazon's S3 storage. It's cheap and you can configure the lifecycle of data from the AWS console to either delete the files after a certain period of time or move them to Glacier.
Essentially I tar up the current development environment for my partner and me, plus the source repo, and use an S3 client to send the tarballs to S3. I then dump all the MySQL (prod) databases (application, bugzilla, etc.), tar them up and do the same. Once a week I tar up our wiki site and do the same.
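A rough sketch of that kind of job (hypothetical bucket, paths and filenames; the commenter's actual script isn't shown here):

#!/bin/sh
# Hypothetical sketch: tar things up and push them to S3; lifecycle rules on the
# bucket (configured separately in the console) handle expiry / transition to Glacier.
STAMP=$(date +%Y%m%d)
tar czf "/tmp/repo-$STAMP.tar.gz" /var/lib/cvs               # source repository (path assumed)
mysqldump --all-databases | gzip > "/tmp/db-$STAMP.sql.gz"   # production databases
aws s3 cp "/tmp/repo-$STAMP.tar.gz" s3://example-backups/repo/
aws s3 cp "/tmp/db-$STAMP.sql.gz" s3://example-backups/db/
rm -f "/tmp/repo-$STAMP.tar.gz" "/tmp/db-$STAMP.sql.gz"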
I set the lifecycle specs on the CVS repo and other files to delete after 30 days, since presumably a repo has all the history anyway. Databases that have been dumped and tarred are moved to Glacier after 120 days. I then have 3 months of database snapshots in S3 available to me, and perpetual snapshots on Glacier. I sleep like a baby.
Glacier is so incredibly cheap that it raises the question: why delete anything at all? It may take you a few days to restore things from Glacier, but if you're a pack rat, it's the perfect solution. ;-)
If anyone is interested in the bash script I use, to get an idea of how easy this is, just email me and I'd be happy to share.
I use ZFS and a small script -- https://github.com/abh/zfs-snapshot-cleaner -- to clean them up.
(I also have a version that supports btrfs, but I guess I didn't put that up on github).