Backing up private data securely
My situation: I work on a PC at the office and on a laptop elsewhere. Almost all of my data is already in git repositories; photos and videos are currently left out due to their size. I use SSDs in both computers, and my partitions are encrypted.
Some of the data is public, like my open source Perl projects. Some is work-related, living in our company git repos. And some is private, like personal notes, address book, agendas/todos, ledger files, etc.
Since I've read that SSDs usually die without any warning, I am quite paranoid and have installed a post-commit hook to back up data after each commit (though when the lag from the backup becomes annoying, I disable it temporarily).
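A minimal sketch of such a hook, assuming the backup target is just another git remote (the "backup" remote name is illustrative):

    #!/bin/sh
    # .git/hooks/post-commit -- push the current branch to a local backup remote
    # ("backup" is an illustrative name, e.g. a bare repo on another drive)
    git push --quiet backup HEAD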
For public repos, I've got it covered. There are services like GitHub, Bitbucket, and others that you can just push your work to.
What about private repos though? In the PC I've installed a secondary (magnetic) hard drive, to which I also push after each commit. On the laptop, I back up after each working session to an external USB hard drive using Git::Bunch, which I also use to synchronize repos between the laptop and the PC.
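Pushing to the secondary drive amounts to having a bare repo there as an extra remote, roughly like this (paths are illustrative):

    # one-time setup: create a bare repo on the secondary drive
    git init --bare /mnt/hdd/backup/myrepo.git

    # register it as a remote in the working repo, then push to it
    # (e.g. from the post-commit hook above)
    git remote add backup /mnt/hdd/backup/myrepo.git
    git push backup --all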
Sometimes I forget to bring my external HDD though, so to ease my mind I also want to back up over the Internet. The requirement is that the data must be stored encrypted on the server side. Aside from free accounts on github/bitbucket/etc, I also have SSH access to a company server and a VPS.
The first alternative is to tarball and encrypt each git repo with GnuPG (.tar.pgp) and send it to the server, but this is not incremental and very taxing bandwidth-wise.
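Something along these lines (recipient and paths are illustrative):

    # full, non-incremental backup: archive, encrypt, then copy to the server
    tar cf - myrepo | gpg --encrypt --recipient me@example.com > myrepo.tar.pgp
    scp myrepo.tar.pgp user@server:backup/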
The second alternative is to set up a network block device (NBD) on the server and mount it as a cryptoloop container on the client. Then you can rsync your git repos to the container efficiently. Except that it's not efficient either, because you operate at a low level (blocks) and the rsync protocol will also need to read lots of files on the remote crypto container during checksumming. This might be OK over a high-speed LAN, but certainly not over regular Wi-Fi or mobile.
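A rough sketch of the client side, using dm-crypt/LUKS as the modern stand-in for cryptoloop (host, port, device, and mount point are illustrative, and the nbd-client invocation varies between versions):

    # attach the server's exported block device, unlock it, mount it, then rsync
    nbd-client server.example.com 10809 /dev/nbd0
    cryptsetup luksOpen /dev/nbd0 nbdbackup
    mount /dev/mapper/nbdbackup /mnt/backup
    rsync -a --delete ~/repos/ /mnt/backup/repos/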
A third alternative that I've thought of is to create a "mirror" git repo for each repo that I have, where each file is encrypted. This is also not very efficient, because git will not be able to diff the revisions very well due to encryption. The filenames and directory structure are not encrypted either, and it looks cumbersome to me. But at least with this you can still push to your free GitHub account, if you don't mind exposing your filenames.
The fourth alternative is the one that I use because it is the most efficient in terms of bandwidth: dump each commit diff to a separate file and encrypt it using PGP. I name each patch with the repo name, timestamp, and commit hash. I then just need to rsync the files over to the server. Should my SSD get toasted, I can rebuild from the most recent backup on the external/backup HDD and reapply the commits.
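A sketch of the per-commit step (the exact naming scheme, paths, and recipient here are illustrative):

    # dump the latest commit as a patch, encrypt it, name it by repo/time/hash
    repo=myrepo
    hash=$(git rev-parse --short HEAD)
    ts=$(date +%Y%m%d-%H%M%S)
    git format-patch -1 --stdout HEAD | gpg --encrypt --recipient me@example.com \
        > ~/backup-patches/$repo-$ts-$hash.patch.pgp

    # ship the accumulated encrypted patches to the server
    rsync -av ~/backup-patches/ user@server:backup-patches/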
For data not residing in git repos, the second method is worth a try. There's also a recent HN thread which I haven't really read.
What are your strategies?
Check out BitTorrent Sync. I used to keep my work and home projects synchronized with git, but I found that I kept pushing incomplete work to my repo just so I could work on it at home. If you keep one machine on 24/7, BitTorrent Sync is a perfect personal cloud solution. It uses encrypted transport with no intermediate cloud storage. I *highly* recommend it.
I had used http://www.tarsnap.com/ some time back.
Bitbucket offers private repositories on their free accounts. I often use these for worky things. The only limitation is the number of collaborators you can add on a free private repo.
Regarding your third alternative, it should not be too difficult to also encrypt and decrypt the file names. Something along the lines of this.
I've heard of, but never used, https://github.com/blake2-ppc/git-remote-gcrypt
I second tarsnap - I use it for all the most important stuff (besides full-disk copies of everything to a home server with RAID-Z).
You can export a Git repo to a single file using the "git bundle" command, and do this incrementally if you only have one branch.
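For example (branch and tag names are illustrative):

    # initial full bundle of everything
    git bundle create full.bundle --all

    # later: an incremental bundle with only the commits since the last backup
    git bundle create incr.bundle lastbackup..master
    git tag -f lastbackup master   # remember where this backup ended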
Thanks for mentioning tarsnap. Why did you stop using it?
Forgive me for oversimplifying, I'm just trying to get the gist: after reading about tarsnap, duplicity, and the mentioned HN thread, I think in general these incremental dir backup tools work like this:
1) First-time backup: encrypt the whole thing, then send it to the server. Copy to a local cache dir (which will contain the last/most recent backup).
2) Subsequent backups: compare the local cache dir (containing the last backup) with the source dir. Produce a binary diff (e.g. rdiff/xdelta). Encrypt that, send it to the server.
If that's how it works, I would prefer something assembled from good ol' tools like duplicity (rdiff/rsync + OpenPGP + any SSH/FTP/scp server) over tarsnap (CMIIW, only the client portion is open source, single vendor), unless the performance/efficiency difference is significant.
Also, from the cost structure I think I'd prefer rsync.net + duplicity ($0.32/GB/month storage before discounts + unlimited free data transfer) over tarsnap ($0.30/GB/month storage + $0.30/GB data transfer).
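For reference, the duplicity workflow seems to be roughly this (target URL and key ID are illustrative):

    # first run does a full backup, later runs are incremental;
    # data is GPG-encrypted before it leaves the machine
    duplicity --encrypt-key MYKEYID ~/repos sftp://user@server/backup/repos

    # restore
    duplicity restore sftp://user@server/backup/repos ~/repos-restored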
I'm aware of BitBucket offering free private repos, but I don't think they encrypt them on the servers.
Thanks for pointing out encrypting the filenames. Yes, it's doable, but in the end that makes the third method look even more complicated :)
Thanks, looks interesting and might just be what one needs in this case. Will be checking it out!
Thanks for pointing out about "git bundle". I'd guess it's more convenient (+faster) than dumping + reapplying patch, since it stores all the original metadata (like commit hash and date/time) intact.
Hi David,
Does BitTorrent Sync handle conflicts? Do you use a tablet/phone? Because if the answer to both questions is no, then I'm hard-pressed to find its advantage over standard tools like rsync or unison.
As for encrypted transport, there's always SSH.
If you want to encrypt something and find GPG inconvenient in your workflow, you can try "encfs --reverse": it presents your directory in encrypted form, with encrypted filenames. You can then sync it incrementally with existing sync tools.
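For example (paths are illustrative):

    # present ~/data as an encrypted view at ~/data.enc
    mkdir -p ~/data.enc
    encfs --reverse ~/data ~/data.enc

    # then sync the encrypted view with any existing tool
    rsync -av ~/data.enc/ user@server:backup/data.enc/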
Also, if you ever decide to use Amazon Glacier as a backup backend, you can try my tool written in Perl: https://github.com/vsespb/mt-aws-glacier
(And it seems people use it with "encfs --reverse": https://github.com/vsespb/mt-aws-glacier/issues/43#issuecomment-25765930 )
Thanks. I should've mentioned EncFS too in my blog post. Will certainly check out your module as I plan to write a Glacier remote backup backend for my control panel.
I use https://github.com/shadowhand/git-encrypt to selectively encrypt part of one repo, though it can easily be told to do the whole thing.
Another option is http://git-annex.branchable.com/ which supports encrypted backends. In fact, your use cases sound like what it was designed for.
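For example, an encrypted rsync special remote can be set up roughly like this (remote name, URL, and key ID are illustrative):

    # create an encrypted special remote and copy annexed files to it
    git annex initremote myserver type=rsync rsyncurl=user@server:/backup/annex \
        encryption=hybrid keyid=MYKEYID
    git annex copy . --to myserver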
Sure, feel free to submit a GitHub ticket if you have questions, or contact me either way.
Re: git-encrypt: It's cool that git can be configured to allow things like this. But I would probably prefer something like encrypted remotes (e.g. git-remote-gcrypt from another comment above), which requires less configuration in my repos.
Re: git-annex: I did evaluate git-annex a while back when looking for options to back up media files, but it seemed big and complex (adds another layer of complexity) so I didn't use it. Perhaps it's time to take another look.