Backing up private data securely

My situation: I work on a PC at the office and on a laptop elsewhere. All of my data is already in git repositories--well, almost all; photos and videos are currently excluded due to their size. I use SSDs in both computers, and my partitions are encrypted.

Some of the data is public, like open source Perl projects. Some is work-related, kept in our company git repos. And some is private, like personal notes, address book, agendas/todos, ledger files, etc.

Since I've read that SSDs usually die without any warning, I am quite paranoid and install a post-commit hook to back up data after each commit (though when the lag from the backup becomes annoying, I disable it temporarily).
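A minimal sketch of such a hook, assuming each repo already has a remote named "backup" pointing at the backup copy:

    #!/bin/sh
    # .git/hooks/post-commit -- push every new commit to a "backup" remote.
    # Assumes "git remote add backup <path-or-url>" was done beforehand.
    git push --quiet backup HEAD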

For public repos, I've got it covered. There are services like github, bitbucket, and so on that you can just push your work to.

What about private repos, though? On the PC, I installed a secondary (magnetic) hard drive that I also push to after each commit. On the laptop, I back up after each working session to an external USB hard drive using Git::Bunch, which I also use to synchronize repos between the laptop and the PC.
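Setting up the secondary-drive target itself is just plain git; something like this, with a made-up mount point:

    # Create a bare repo on the secondary (magnetic) drive and register it as a remote.
    git init --bare /mnt/backupdrive/repos/notes.git
    cd ~/repos/notes
    git remote add backup /mnt/backupdrive/repos/notes.git
    git push backup --all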

Sometimes I forget to bring my external HDD, though, so to ease my mind I also want to back up over the Internet. The requirement is that the data must be stored encrypted on the server. Aside from free accounts on github/bitbucket/etc., I also have SSH access to a company server and a VPS.

The first alternative is to tarball and encrypt each git repo with GnuPG (.tar.gpg) and send it to the server, but this is not incremental and is very taxing bandwidth-wise.
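Roughly like this (the recipient key and server are placeholders):

    # Tar and encrypt the whole repo, then copy the blob to the server.
    tar czf - ~/repos/notes \
      | gpg --encrypt --recipient you@example.com \
      > notes.tar.gz.gpg
    scp notes.tar.gz.gpg user@server.example.com:backups/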

The second alternative is to set up a network block device (NBD) on the server and mount it as a cryptoloop container on the client. Then you can rsync your git repos to the container. Except that it's not efficient either: you operate at a low level (blocks), and rsync will also need to read lots of files on the remote crypto container during checksumming. This might be OK over a high-speed LAN, but certainly not over regular Wi-Fi or mobile.
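A rough sketch of that setup, using dm-crypt/LUKS instead of the old cryptoloop driver (the device name, export name, and mount point are all assumptions):

    # On the client: attach the exported block device, unlock it, and mount it.
    sudo nbd-client server.example.com /dev/nbd0 -name backup
    sudo cryptsetup luksOpen /dev/nbd0 backupcrypt     # container was LUKS-formatted earlier
    sudo mount /dev/mapper/backupcrypt /mnt/backup

    # Sync the repos into the container, then tear everything down.
    rsync -a --delete ~/repos/ /mnt/backup/repos/
    sudo umount /mnt/backup
    sudo cryptsetup luksClose backupcrypt
    sudo nbd-client -d /dev/nbd0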

A third alternative that I've thought of is to create a "mirror" git repo for each repo that I have, where each file is encrypted. This is also not very efficient, because git will not be able to diff the revisions very well due to the encryption. And the filenames and directory structures are not encrypted. And it looks cumbersome to me. But at least with this you can still push to your free github account, if you don't mind exposing your filenames.
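A naive sketch of what that mirroring might look like (the paths and the whole helper script are hypothetical; each tracked file is re-encrypted wholesale, which is exactly why git can't diff it well):

    # mirror-encrypt.sh (hypothetical): re-encrypt tracked files into a mirror repo.
    src=~/repos/notes
    dst=~/repos-encrypted/notes        # an ordinary git repo holding only *.gpg files
    cd "$src"
    git ls-files | while read -r f; do
        mkdir -p "$dst/$(dirname "$f")"
        gpg --batch --yes --encrypt --recipient you@example.com \
            --output "$dst/$f.gpg" "$f"
    done
    cd "$dst" && git add -A && git commit -m "mirror snapshot $(date +%F)"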

The fourth alternative is the one I use, because it is the most efficient in terms of bandwidth: dump each commit's diff to a separate file and encrypt it with PGP. I name the patches with the repo name, timestamp, and commit hash. Then I just need to rsync the files over to the server. Should my SSD get toasted, I can rebuild from the most recent backup on the external/backup HDD and reapply the commits.
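Something along these lines, which can also be run from the post-commit hook (the key, paths, and server are placeholders):

    # Dump the latest commit as a patch, encrypt it, name it repo-timestamp-hash.
    repo=$(basename "$(git rev-parse --show-toplevel)")
    hash=$(git rev-parse --short HEAD)
    stamp=$(date +%Y%m%d-%H%M%S)
    git format-patch -1 HEAD --stdout \
      | gpg --encrypt --recipient you@example.com \
      > ~/patch-backups/"$repo-$stamp-$hash.patch.gpg"

    # Ship the accumulated patches; rsync only transfers files it hasn't seen.
    rsync -a ~/patch-backups/ user@server.example.com:patch-backups/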

For data not residing in git repos, the second method is worth a try. There's also a recent HN thread which I haven't really read.

What are your strategies?

16 Comments

Check out BitTorrent Sync. I used to keep my work and home projects synchronized with git, but I found that I kept pushing incomplete work to my repo just so I could work on it at home. If you keep one machine on 24/7, BitTorrent Sync is a perfect personal cloud solution. It uses encrypted transport with no intermediate cloud storage. I *highly* recommend it.

I used http://www.tarsnap.com/ some time back.

Bitbucket offers private repositories on their free accounts. I often use these for worky things. The only limitation is the number of collaborators you can add on a free private repo.

Regarding your third alternative, it should not be too difficult to also encrypt and decrypt the file names. Something along the lines of this.

I second tarsnap - I use it for all the most important stuff (besides full-disk copies of everything to a home-server with raid-z).

You can export a Git repo to a single file using the "git bundle" command. And you can do this incrementally if you only have one branch.
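For example (using a tag to remember where the previous backup ended is just one way to do the incremental part):

    # Full bundle of all refs, then an incremental bundle since the last backup.
    git bundle create notes-full.bundle --all
    git bundle create notes-incr.bundle last-backup..master
    git tag -f last-backup master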

If you want to encrypt something and find GPG inconvenient in your workflow, you can try "encfs --reverse": it presents your directory in encrypted form, with encrypted filenames. You can then sync it incrementally with existing sync tools.
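Roughly (the mount points are placeholders):

    # Mount an encrypted *view* of the plaintext directory, then sync that view.
    encfs --reverse ~/private ~/private-encrypted-view
    rsync -a ~/private-encrypted-view/ user@server.example.com:backups/private/
    fusermount -u ~/private-encrypted-view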

Also, if you ever decide to use Amazon Glacier as a backup backend, you can try my tool written in Perl: https://github.com/vsespb/mt-aws-glacier
(And it seems people use it with "encfs --reverse": https://github.com/vsespb/mt-aws-glacier/issues/43#issuecomment-25765930 )

I use https://github.com/shadowhand/git-encrypt to selectively encrypt part of one repo, though it can easily be told to do the whole thing.

Another option is http://git-annex.branchable.com/, which supports encrypted backends. In fact, your use cases sound like what it was designed for.
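For instance, a GPG-encrypted rsync special remote can be set up roughly like this (the remote name, server path, and key ID are placeholders):

    # Inside an existing git-annex repo: add an encrypted rsync special remote.
    git annex initremote myvps type=rsync rsyncurl=user@server.example.com:/srv/annex \
        encryption=hybrid keyid=0xDEADBEEF
    git annex copy --to myvps      # upload annexed file contents, encrypted with GPG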

Sure, feel free to submit a github ticket if you have questions, or contact me either way.


About Steven Haryanto

A programmer (mostly Perl 5 nowadays). My CPAN ID: SHARYANTO. I'm sedusedan on perlmonks. My twitter is stevenharyanto (but I don't tweet much). Follow me on github: sharyanto.