Last time I posted about git-annex, I introduced it and described the basics of my set up. Over the past year, I’ve added quite a bit of data to my main git-annex. It manages just over 100G of data for me across 9 repositories. Here’s a few bits of information that may be useful to others considering git-annex (or who are already knee deep in).
Archive, not backup
The website for git-annex explicitly states that it is not a backup system. An alternate description, that’s more appropriate, is that it’s part of an archival system. An archival system is somewhat concerned with backups of data, but it also deals with cataloging and retrieval.
I imagine that it’s a library system (books, not code) with the ability to do instantaneous inter-library loans. I have one repository (by the name of ‘silo’) that contains copies of all my data. I then have linked repositories on each computer that I use regularly that have little or no data in them, just git-annex style symlinks. If I find that I need something from the main repository on one of those computers, I can query where that file is with
git annex whereis:
$ git annex whereis media/pictures/2002-02-08-olympics.tgz whereis media/pictures/2002-02-08-olympics.tgz (4 copies) 8314baa2-4193-8d77-bb7f-489bd73e7db4 -- calvin_dr 8b22886e-14f2-98f0-31ec-6770b0a08f22 -- silo f8ec3d60-47bf-a392-4739-b39dd609d554 -- hobbes_dr ok
(I actually have three full copies of my data, in the *_dr repositories, but that’s a story for another day. Suffice it to say that calvin_dr and hobbes_dr are two identical external drives.)
I can retrieve the contents with
git annex get. git-annex is smart enough to know that the silo remote is over a network connection and the ‘calvin_dr’ is local, so it copies the data from there:
$ git annex get media/pictures/2002-02-08-olympics.tgz get media/pictures/2002-02-08-olympics.tgz (from calvin_dr...) SHA256E-s48439263--67c0de0e883c5d5d62a615bb97dce624370127e5873ae22770b200889367ae1c.tgz 48439263 100% 25.10MB/s 0:00:01 (xfer#1, to-check=0/1) sent 48445343 bytes received 42 bytes 19378154.00 bytes/sec total size is 48439263 speedup is 1.00 ok (Recording state in git...)
git annex whereis shows the file contents are local as well:
$ git annex whereis media/pictures/2002-02-08-olympics.tgz whereis media/pictures/2002-02-08-olympics.tgz (5 copies) 8314baa2-4193-8d77-bb7f-489bd73e7db4 -- calvin_dr 8b22886e-14f2-98f0-31ec-6770b0a08f22 -- silo f8ec3d60-47bf-a392-4739-b39dd609d554 -- hobbes_dr ae7e4cde-0023-1f1f-b1e2-7efd2954ec01 -- here (home_laptop) ok
And I can view the contents of the file like normal:
$ tar -tzf media/pictures/2002-02-08-olympics.tgz | head 2002-02-08-olympics/ 2002-02-08-olympics/p2030001.jpg 2002-02-08-olympics/p2030002.jpg ...
Then, when I’m done, I can just
git annex drop the file to remove the local copy of the data. git-annex, in good form, checks to make sure that there’s another copy before deleting it.
$ git annex drop media/pictures/2002-02-08-olympics.tgz drop media/pictures/2002-02-08-olympics.tgz ok (Recording state in git...)
All along the way, git-annex is tracking which repositories have each file, making it easy to find what I want. This sort of quick access and query-ability means that I know where my data is and I can access it when I need it.
Transporting large files
My work laptop used to be my only laptop, and so it had a number of my personal files, mostly pictures. I’ve transfered most of those off of that system, but every once in a while, I come across some personal data that I need to transfer to my home repository.
I usually add it to the local git-annex repository on my work laptop and then use
git annex move to move it to my home server. However, if it’s a significant amount of data and I don’t feel like waiting for the long transfer over my slow DSL line, I can copy the data to my external drive at work and then copy it off when I get home. Doing this manually can get tedious if there are more than a few files, but git-annex makes it a cinch. First, I can query what files are not on my home server and then copy those to the calvin_dr drive.
work-laptop$ git annex add huge-file1.tgz huge-file2.tgz huge-file3.tgz work-laptop$ git annex sync work-laptop$ git annex copy --not --in silo --to calvin_dr
Then, when I get home, I attach the drive to my personal laptop and run
git annex copy to copy the files to the server:
personal-laptop$ git annex copy --to silo --not --in silo
Many of my backups are the “snapshot” style, where I rsync’d a tree of files to another drive or server in an attempt to make sure that data was safe. The net effect of this strategy is that I have several mostly-identical backups of the same data. So, when I find a new copy of data that I’ve previously added to my git-annex system, I don’t know if I can safely delete it just based on the top level directory name.
For example, if I discover a tree of pictures that are organized by date and event:
$ find pictures -type d pictures pictures/2002-02-08-olympics pictures/2002-04-20-tahoe pictures/2004-11-18-la-zoo
And, checking in my git-annex repo, I can see that there are three files that correspond to those directories:
$ find backup/pictures -type l backup/pictures/2002-02-08-olympics.tgz backup/pictures/2002-04-20-tahoe.tgz backup/pictures/2004-11-18-la-zoo.tgz
I can probably remove the found files, but I might have modified the pictures in this set and I’d like to know before I toss them. After running into this scenario a few times, I wrote a little utility called archdiff that I can use to get an overview of the differences between two archives (or directories). It’s just a fancy wrapper around
diff -r --brief that automatically handles unpacking any archives found. For example:
$ archdiff 2002-04-20-tahoe/ ~/backup/pictures/2002-04-20-tahoe.tgz $
Since there was no output, the directory has the same contents as the archive and can be safely deleted. Here’s another example:
$ archdiff 2002-02-08-olympics/ ~/backup/pictures/2002-02-08-olympics.tgz Files 2002-02-08-olympics/p2030001.jpg and 2002-02-08-olympics.tgz-_RhD/2002-02-08-olympics/p2030001.jpg differ $
One of the files in this directory has modifications, so I can now take the time to look at the two files and see if I want to keep it or not.
Archdiff behaves like a good UNIX program and its exit code reflects whether or not differences were found, so it’s possible to script the checking of multiple directories. Here’s an example script that would check the above three directories:
#!/bin/bash cd ~/backup/pictures for dir in ~/pictures/*; do basedir=$(basename $dir) echo "checking $dir" # retrieve the file from another git-annex repo git annex get $basedir.tgz if archdiff $dir $basedir.tgz; then echo "$dir is the same, removing" rm -rf $dir # drop the git-annex managed file, we no longer need it git annex drop $basedir.tgz fi done
Once this is done, the only directories left will be those with differences and the tarball will still be present in the git-annex repository for investigation. I end up writing little scripts like this as I go through old backups to help me process large amounts of data quickly.
That’s it for now. If you have any questions about this or git-annex in general, tweet at me @ndj.