endot

eschew obfuscation (and espouse elucidation)

My Tmux Configuration, Refined

When I wrote about tmux for the first time, I was just getting into the idea of nesting sessions. I ran a local tmux session that wrapped remote tmux sessions for more than a year before I switched it up again.

I added another level.

Background

I originally started nesting tmux sessions so that I wouldn’t have to use tabs in Terminal to keep track of different remote tmux sessions. This allowed me to connect to my work machine from home and get my entire working session instantly. While that worked well, I began to see a few issues with that approach:

  1. At work, I ran my top level tmux session on my work laptop. The downside of this is that I had to leave my laptop open and running all the time to be able to access it remotely. This also necessitated some tricky SSH tunnels that I wasn’t entirely comfortable leaving open.
  2. The top level tmux session at home was on my home server, which made it convenient to connect to from work, but if I connected to that session from inside my top level work session, the key bindings would conflict.

Solution

I solved the first issue by running my top level work session on a server at work. This allowed me to close my laptop when I wasn’t in the office and it afforded me a location to run things that weren’t specific to a particular system but that I didn’t want to live and die with my laptop.

I solved the second issue by adding a new level of tmux. I called this new level uber and assigned it the prefix C-q to differentiate it from the other levels [1].

With that in place, I would start the uber session on my laptop and then connect to both my home and work mid-level sessions, and via those, the leaf tmux sessions. Then, I could choose what level I wanted to operate on just by changing the prefix that I used.
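For reference, the prefix change itself is tiny. A minimal sketch of the relevant lines (the actual .tmux.uber linked below has a bit more in it):

# sketch: give the uber level its own prefix
set -g prefix C-q
unbind C-b
bind C-q send-prefix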

Multiple sockets

Another thing that I wanted to do from time to time was run two independent tmux sessions on my local laptop. I could have used the built-in multi-session support in tmux, but I also wanted the ability to nest sessions locally, and tmux doesn’t support that natively. In looking for a solution, I stumbled on the idea of running each level on its own server socket. With that change, I can run all three levels on the same system, and running two independent tmux sessions is as easy as running two different levels in separate windows. Plus, I can still use the native multi-session support within each level.
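Under the hood, the separation comes from tmux’s -L flag, which selects the server socket to talk to. Roughly (socket and session names here are illustrative; the wrapper scripts below pick them for me):

$ tmux -L uber new-session -s uber       # top level, on its own server
$ tmux -L master new-session -s home     # mid level, separate server
$ tmux -L main new-session -s work       # leaf level, yet another server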

Sharing sessions

The most recent modification I made was to add easy support for sharing a tmux session between two Terminal windows. This allows me to treat my local Terminal windows as viewports into my tmux session tree, attaching wherever I need without necessarily detaching another Terminal window.

To enable this, I added an optional command line flag to the session start scripts that makes tmux start a new view of the session instead of detaching other clients. I also enabled aggressive-resize so that the size of a tmux session isn’t limited to the smallest Terminal window unless more than one client is looking at the exact same tmux window.
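Both pieces are small. The aggressive-resize setting is one line in the shared config, and the sharing flag essentially amounts to attaching without the -d that would kick off other clients (names illustrative):

# in .tmux.shared: only size a window down for clients actually viewing it
setw -g aggressive-resize on

# attach a second view of an existing session without detaching the first
$ tmux -L master attach -t home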

How it all looks

[Diagram: tmux sessions]

It can look a little overwhelming, but in reality it’s quite simple to use. Most of my time is spent in the leaf node sessions, and that interaction is basically vanilla tmux.

Installing this for yourself

Configuration

The configuration for my setup is available in my dotfiles repository on GitHub:

  1. .tmux.shared - contains shared configuration and bindings that are common to all levels
  2. .tmux.uber - configuration unique to the top-level session
  3. .tmux.master - configuration unique to mid-level tmux sessions
  4. .tmux.conf - configuration unique to the lowest-level (leaf) sessions

Wrapper scripts

The heart of the wrapper scripts is tmux-sess. It holds all the logic for setting the socket and sharing sessions.

The rest of the scripts are thin wrappers around tmux-sess. For instance, here is tmux-uber:

#!/bin/sh

tmux-sess -s uber -f ~/.tmux.uber $*

The other level scripts are tmux-home for the mid-level session and tmux-main for the lowest-level.
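The real tmux-sess lives in my dotfiles, but its core is roughly the following sketch (option names and defaults here are illustrative, not the exact interface):

#!/bin/sh
# rough sketch of tmux-sess: one server socket per level, optional sharing

SESSION=main
CONFIG="$HOME/.tmux.conf"
SHARE=0

while getopts "s:f:n" opt; do
    case "$opt" in
        s) SESSION="$OPTARG" ;;
        f) CONFIG="$OPTARG" ;;
        n) SHARE=1 ;;          # share: add another view instead of stealing the session
    esac
done
shift $((OPTIND - 1))

# each level gets its own server socket, named after the session
if ! tmux -L "$SESSION" has-session -t "$SESSION" 2>/dev/null; then
    tmux -L "$SESSION" -f "$CONFIG" new-session -d -s "$SESSION"
fi

if [ "$SHARE" -eq 1 ]; then
    exec tmux -L "$SESSION" attach -t "$SESSION" "$@"
else
    exec tmux -L "$SESSION" attach -t "$SESSION" -d "$@"
fi

The only real difference between sharing and the normal behavior is whether -d is passed to attach.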

Wrapping up

I hope that this information is helpful. If you have any questions, please ask me on Twitter.

Enjoy.

  [1] I also quickly decided that this uber level didn’t need to have its own status line. That would be crazy.

Talk Notes: February 2014

I was out of town two of the Fridays this month, so I was only able to get two talks in:

  • Clojure core.async - Continuing in my fascination with Clojure, I picked this talk to explore the non-Java techniques for handling concurrency. I’m familiar with CSP from my Go experience, and it was interesting to hear Clojure’s take on the same foundation. core.async also provides a macro that turns the spaghetti code of callbacks into sequential-looking code that still runs asynchronously.
  • Inventing on Principle - Several people have recommended this talk to me, and I finally got around to watching it. It’s worth the watch just for the amazing demos that he built, but the deeper notion that there could be an underlying principle that guides your life is thought provoking. It also makes me want to play with Light Table.

Enjoy.

Setting Up Vim for Clojure

I’ve been experimenting with Clojure lately. A few of my coworkers had begun the discovery process as well, so I suggested that we have a weekly show-and-tell, because a little accountability and audience can turn wishes into action.

Naturally, I looked around for plug-ins that would be of use in my editor of choice. Here’s what I have installed:

These are all straightforward to install, as long as you already have a Pathogen or Vundle setup going. If you don’t, you really should, because nobody likes a messy Vim install.
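For example, with Pathogen, installing a plug-in is just a clone into ~/.vim/bundle. Here’s how two of the plug-ins referenced below could be installed (assuming a stock Pathogen layout):

$ cd ~/.vim/bundle
$ git clone https://github.com/kien/rainbow_parentheses.vim.git
$ git clone https://github.com/tpope/vim-fireplace.git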

All of these plug-ins automatically work when a Clojure file is opened, with the exception of rainbow parentheses. To enable those, a little .vimrc config is necessary:

au BufEnter *.clj RainbowParenthesesActivate
au Syntax clojure RainbowParenthesesLoadRound
au Syntax clojure RainbowParenthesesLoadSquare
au Syntax clojure RainbowParenthesesLoadBraces

Now, once that’s all set up, it’s time to show a little bit of what this setup can do. I have a little Clojure test app over here on GitHub. After cloning it (and assuming you’ve already installed Leiningen):

  1. Open up dev.clj and follow the instructions to set up the application in a running repl.
  2. Then open testclj/core.clj and make any modification, such as changing “Hello” to “Hi”.
  3. Then after a quick cpr to reload the namespace in the repl, you can reload your web browser to see the updated code.

This setup makes for a quick dev/test cycle, which is quite useful for experimentation. Of course, there are many more features of each of the above plugins. I’ve barely scratched the surface and I’m already very impressed.

Enjoy.

Introducing Talk Notes

In the course of my work and my online reading and research, I often come across videos of talks that I want to watch. I rarely take the time to watch those videos, mostly because of the time commitment; I usually only have a few minutes to spare.

Lately, I’ve done something to change that. I’m taking a little bit of time out of my Friday schedule each week to watch a talk that looks interesting. I also try and focus on the talk. Rather than checking my email or chatting while the talk is playing, I take notes, sometimes including screenshots of important slides.

Over the course of the past month, I’ve had some success with this strategy, and I was able to watch three talks. Here are links to my notes:

  • Using Datomic With Riak - I picked this talk because we’ve used a bit of Riak at work and a buddy of mine keeps raving about Datomic. This talk is actually a great overview of the philosophy and design behind Datomic.
  • Raft - the Understandable Distributed Protocol - CoreOS’s etcd has been getting some mention lately, and Raft is the consensus algorithm used to keep all of its data consistent. At the end of watching this talk, I found another one (by one of the Raft authors), and it balanced the practicality of the first with some more of the theory.
  • React - Rethinking Best Practices - The functional programming paradigm is gathering steam, and Facebook’s React JavaScript library is a fascinating take on building modern web UIs in a functional manner.

I really enjoyed the process of taking notes in this way, and I hope to continue this as the year progresses.

Oh, and if you know of a good talk, please let me know on Twitter.

Using Docker to Generate My Octopress Blog

When I originally set up Octopress, I set it up on my Mac laptop using rvm, as recommended at the time. It worked very well for me until just a few minutes after my last post, when I decided to sync with the upstream changes.

After merging in the changes, I tried to generate my blog again, just to make sure everything worked. Well, it didn’t, and things went downhill from there. The rake generate command failed because a new gem was required. So, I ran bundle install to get the latest gems. That failed when a gem required ruby 1.9.3. Then installing ruby 1.9.3 failed in rvm because I needed a newer version of rvm. After banging on that problem for a few minutes, I decided to take a break and come back to the problem later.

Docker to the rescue

Fast forward a few weeks, and I came up with a better idea. I decided to dockerize Octopress. This keeps all the dependencies sanely bottled up in an image that I can run like a command.

Here is the code:

FROM ubuntu:12.10
MAINTAINER  Nate Jones <nate@endot.org>

# install system dependencies and ruby
RUN apt-get update
RUN apt-get install git ruby1.9.3 build-essential language-pack-en python python-dev -y

# make sure we're working in UTF8
ENV LC_ALL en_US.utf8

# add the current blog source
ADD . /o
WORKDIR /o

# install octopress dependencies
RUN gem install bundler
RUN bundle install

# set up user so that host files have correct ownership
RUN addgroup --gid 1000 blog
RUN adduser --uid 1000 --gid 1000 blog
RUN chown -R blog.blog /o
USER blog

# base command
ENTRYPOINT ["rake"]

How to use it

To use this Dockerfile, I put it at the root of my blog source and ran this command:

$ docker build -t ndj/octodock .

Then, since rake is set as the entry point, I can run the image as if it were a command. I use the -v switch to overlay the current blog source over the one cached in the image and the -rm switch to throw away the container when it’s done.

$ docker run -rm -v `pwd`:/o ndj/octodock generate
## Generating Site with Jekyll
   remove .sass-cache/
   remove source/stylesheets/screen.css
   create source/stylesheets/screen.css
Configuration from /o/_config.yml
Building site: source -> public
Successfully generated site: source -> public

A few notes

  • I had to force the UTF8 locale in order to get Ruby to stop complaining about non-ASCII characters in the blog entries.
  • I add a user called blog with the same UID/GID as my system user, so that files generated by commands in the container aren’t owned by root. I look forward to proper user namespaces so that I won’t have to do this.
  • Deploying the blog doesn’t use my SSH key, as the ‘blog’ user in the image is doing the rsync, not my host system user. I’m ok with typing my password in or just rsync’ing the data directly.

Docker is a great piece of technology, and I keep finding new uses for it.

Enjoy.

Git-annex Tips

Last time I posted about git-annex, I introduced it and described the basics of my setup. Over the past year, I’ve added quite a bit of data to my main git-annex. It manages just over 100G of data for me across 9 repositories. Here are a few bits of information that may be useful to others considering git-annex (or who are already knee-deep in).

Archive, not backup

The website for git-annex explicitly states that it is not a backup system. A more appropriate description is that it’s part of an archival system. An archival system is somewhat concerned with backups of data, but it also deals with cataloging and retrieval.

I think of it as a library system (books, not code) with the ability to do instantaneous inter-library loans. I have one repository (by the name of ‘silo’) that contains copies of all my data. I then have linked repositories on each computer that I use regularly that have little or no data in them, just git-annex style symlinks. If I find that I need something from the main repository on one of those computers, I can query where that file is with git annex whereis:

$ git annex whereis media/pictures/2002-02-08-olympics.tgz
whereis media/pictures/2002-02-08-olympics.tgz (4 copies)
        8314baa2-4193-8d77-bb7f-489bd73e7db4 -- calvin_dr
        8b22886e-14f2-98f0-31ec-6770b0a08f22 -- silo
        f8ec3d60-47bf-a392-4739-b39dd609d554 -- hobbes_dr
ok

(I actually have three full copies of my data, in the *_dr repositories, but that’s a story for another day. Suffice it to say that calvin_dr and hobbes_dr are two identical external drives.)

I can retrieve the contents with git annex get. git-annex is smart enough to know that the silo remote is over a network connection and that calvin_dr is local, so it copies the data from there:

$ git annex get  media/pictures/2002-02-08-olympics.tgz
get media/pictures/2002-02-08-olympics.tgz (from calvin_dr...)
SHA256E-s48439263--67c0de0e883c5d5d62a615bb97dce624370127e5873ae22770b200889367ae1c.tgz
    48439263 100%   25.10MB/s    0:00:01 (xfer#1, to-check=0/1)

sent 48445343 bytes  received 42 bytes  19378154.00 bytes/sec
total size is 48439263  speedup is 1.00
ok
(Recording state in git...)

Then, running git annex whereis shows the file contents are local as well:

$ git annex whereis media/pictures/2002-02-08-olympics.tgz
whereis media/pictures/2002-02-08-olympics.tgz (5 copies)
    8314baa2-4193-8d77-bb7f-489bd73e7db4 -- calvin_dr
    8b22886e-14f2-98f0-31ec-6770b0a08f22 -- silo
    f8ec3d60-47bf-a392-4739-b39dd609d554 -- hobbes_dr
    ae7e4cde-0023-1f1f-b1e2-7efd2954ec01 -- here (home_laptop)
ok

And I can view the contents of the file like normal:

$ tar -tzf media/pictures/2002-02-08-olympics.tgz | head
2002-02-08-olympics/
2002-02-08-olympics/p2030001.jpg
2002-02-08-olympics/p2030002.jpg
...

Then, when I’m done, I can just git annex drop the file to remove the local copy of the data. git-annex, in good form, checks to make sure that there’s another copy before deleting it.

$ git annex drop media/pictures/2002-02-08-olympics.tgz
drop media/pictures/2002-02-08-olympics.tgz ok
(Recording state in git...)

All along the way, git-annex is tracking which repositories have each file, making it easy to find what I want. This sort of quick access and query-ability means that I know where my data is and I can access it when I need it.

Transporting large files

My work laptop used to be my only laptop, and so it had a number of my personal files, mostly pictures. I’ve transferred most of those off of that system, but every once in a while, I come across some personal data that I need to transfer to my home repository.

I usually add it to the local git-annex repository on my work laptop and then use git annex move to move it to my home server. However, if it’s a significant amount of data and I don’t feel like waiting for the long transfer over my slow DSL line, I can copy the data to my external drive at work and then copy it off when I get home. Doing this manually can get tedious if there are more than a few files, but git-annex makes it a cinch. First, I can query what files are not on my home server and then copy those to the calvin_dr drive.

work-laptop$ git annex add huge-file1.tgz huge-file2.tgz huge-file3.tgz
work-laptop$ git annex sync
work-laptop$ git annex copy --not --in silo --to calvin_dr

Then, when I get home, I attach the drive to my personal laptop and run git annex copy to copy the files to the server:

personal-laptop$ git annex copy --to silo --not --in silo

Detecting duplicates

Many of my backups are the “snapshot” style, where I rsync’d a tree of files to another drive or server in an attempt to make sure that data was safe. The net effect of this strategy is that I have several mostly-identical backups of the same data. So, when I find a new copy of data that I’ve previously added to my git-annex system, I don’t know if I can safely delete it just based on the top level directory name.

For example, if I discover a tree of pictures that are organized by date and event:

$ find pictures -type d
pictures
pictures/2002-02-08-olympics
pictures/2002-04-20-tahoe
pictures/2004-11-18-la-zoo

And, checking in my git-annex repo, I can see that there are three files that correspond to those directories:

$ find backup/pictures -type l
backup/pictures/2002-02-08-olympics.tgz
backup/pictures/2002-04-20-tahoe.tgz
backup/pictures/2004-11-18-la-zoo.tgz

I can probably remove the found files, but I might have modified the pictures in this set and I’d like to know before I toss them. After running into this scenario a few times, I wrote a little utility called archdiff that I can use to get an overview of the differences between two archives (or directories). It’s just a fancy wrapper around diff -r --brief that automatically handles unpacking any archives found. For example:

$ archdiff 2002-04-20-tahoe/ ~/backup/pictures/2002-04-20-tahoe.tgz
$ 

Since there was no output, the directory has the same contents as the archive and can be safely deleted. Here’s another example:

$ archdiff 2002-02-08-olympics/ ~/backup/pictures/2002-02-08-olympics.tgz
Files 2002-02-08-olympics/p2030001.jpg and 2002-02-08-olympics.tgz-_RhD/2002-02-08-olympics/p2030001.jpg differ
$ 

One of the files in this directory has modifications, so I can now take the time to look at the two files and see if I want to keep it or not.
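The real archdiff handles a few archive formats and cleans up after itself, but the basic idea fits in a handful of lines. A rough sketch (tarballs only, temp directory cleanup omitted):

#!/bin/bash
# rough sketch of archdiff: diff two things, unpacking tarballs as needed

unpack() {
    # directories pass through untouched; tarballs get unpacked to a temp dir
    if [ -d "$1" ]; then
        echo "$1"
        return
    fi
    local tmp
    tmp=$(mktemp -d "$(basename "$1")-XXXX")
    tar -xzf "$1" -C "$tmp"
    # if the archive holds a single top-level directory, compare against that
    local entries=("$tmp"/*)
    if [ ${#entries[@]} -eq 1 ] && [ -d "${entries[0]}" ]; then
        echo "${entries[0]}"
    else
        echo "$tmp"
    fi
}

left=$(unpack "$1")
right=$(unpack "$2")

diff -r --brief "$left" "$right"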

Archdiff behaves like a good UNIX program and its exit code reflects whether or not differences were found, so it’s possible to script the checking of multiple directories. Here’s an example script that would check the above three directories:

#!/bin/bash

cd ~/backup/pictures

for dir in ~/pictures/*; do
    basedir=$(basename "$dir")
    echo "checking $dir"

    # retrieve the file from another git-annex repo
    git annex get "$basedir.tgz"

    if archdiff "$dir" "$basedir.tgz"; then
        echo "$dir is the same, removing"
        rm -rf "$dir"

        # drop the git-annex managed file, we no longer need it
        git annex drop "$basedir.tgz"
    fi
done

Once this is done, the only directories left will be those with differences and the tarball will still be present in the git-annex repository for investigation. I end up writing little scripts like this as I go through old backups to help me process large amounts of data quickly.

All done

That’s it for now. If you have any questions about this or git-annex in general, tweet at me @ndj.

Enjoy.

Remotecopy, Two Years Later

It’s been over two years since I wrote remotecopy and I still use it every day.

The most recently added feature is the -c option, which removes the trailing newline from the copied data if it only contains one line. I found myself writing little scripts that would output a single line with the intent of using that output to build a command line on a different system, and the extra newline at the end often messed up the new command. The -c option solves this problem.

For instance, I have git-url, which outputs the origin URL of the current git repository. This makes it easy to clone the repo on a new system (rc is my alias for remotecopy -c):

firsthost:gitrepo$ git url | rc
Input secret:
rc-alaelifj3lij2ijli3ajfwl3iajselfiae

Now the clone URL is in my clipboard, so I just type git clone and then paste to clone on a different system:

secondhost:~$ git clone git@github.com:justone/gitrepo.git
Cloning into 'gitrepo'...
...
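git-url itself is nothing fancy; something along these lines does the job (my actual script lives in my dotfiles):

#!/bin/sh
# print the origin URL of the current git repository
git config --get remote.origin.url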

No tmux pbcopy problems

Most OS X tmux users are familiar with the issues with pbcopy and the current workarounds.

Since remotecopy works by accessing the server over a TCP socket, it’s immune to these problems. I just use remotecopy on my local system as if I were on a remote system.

LA Perl Mongers

At the latest LA Perl Mongers meeting, the talks were lightning talks, so I threw together a presentation about remotecopy. The interesting source bits are up on GitHub, including a PDF copy of the slides.

For the presentation, I used the excellent js-sequence-diagrams to make this diagram, which hopefully helps show the data flow in a remotecopy interaction.

[Diagram: remotecopy data flow]

Enjoy.

A Script to Ease SCP Use

Since I work on remote systems all the time, I use SCP repeatedly to transfer files around. One of the more cumbersome tasks is specifying the remote file or directory location.

So I wrote a helper script to make it easier. It’s called scptarget, and it generates targets for SCP, either the source or the destination.

For instance, if I want to copy a file down from a remote server, I run scptarget like this and copy the output:

$ scptarget file.pl
endot.org:/home/nate/file.pl

Then it’s easy to paste it into my SCP command on my local system:

$ scp endot.org:/home/nate/file.pl .
...

I usually use remotecopy (specifically remotecopy -c) to copy it so that I don’t even have to touch my mouse.

Examples

Here are a few example uses.

First, without any arguments, it targets the current working directory. This is useful when I want to upload something from my local system to where I’m remotely editing files.

$ scptarget
endot.org:/home/nate

Specifying a file targets the file directly.

$ scptarget path/to/file.pl
endot.org:/home/nate/path/to/file.pl

Absolute paths are handled correctly:

$ scptarget /usr/local/bin/file
endot.org:/usr/local/bin/file
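The core of scptarget is just gluing the host’s name to an absolute path. A stripped-down sketch (the real script also handles the -v and -i flags shown below):

#!/bin/bash
# minimal sketch of scptarget: print host:/absolute/path for the given target

target="${1:-$PWD}"

# resolve to an absolute path
if [ -d "$target" ]; then
    abs=$(cd "$target" && pwd)
else
    abs="$(cd "$(dirname "$target")" && pwd)/$(basename "$target")"
fi

echo "$(hostname -f):$abs"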

Vim SCP targets

Vim supports editing files over SCP, so passing -v generates a target that it can use:

$ scptarget -v path/to/file.pl
scp://endot.org//home/nate/path/to/file.pl

And to edit, just pass that in to Vim:

$ vim scp://endot.org//home/nate/path/to/file.pl

IP based targets

Sometimes I need the target to use the IP of the server instead of its hostname. This usually happens with development VMs (a la Vagrant), which are only addressable via IP. Passing -i to scptarget causes it to behave this way. Under the hood, it uses getip, which is a script I wrote that prints out the first IP of the current host. If there is no non-private IP, then it will return the first private IP. (I am fully aware that there may be better ways of doing the above. Let me know if you have a better script.)

$ scptarget -i path/to/file.pl
64.13.192.60:/home/nate/path/to/file.pl
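getip is only a few lines as well. Roughly (the address parsing here is simplified and assumes ifconfig-style output):

#!/bin/sh
# rough sketch of getip: print the first non-private IPv4 address,
# falling back to the first private one

ips=$(ifconfig | awk '/inet / { sub(/^addr:/, "", $2); print $2 }' | grep -v '^127\.')

for ip in $ips; do
    case "$ip" in
        10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) ;;  # private, skip
        *) echo "$ip"; exit 0 ;;
    esac
done

# no non-private address found, so print the first private one
set -- $ips
[ -n "$1" ] && echo "$1"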

That’s it. I find it incredibly useful and I hope you do too.

Enjoy.

Seeing the Shuttle

Launch

A little over thirteen years ago, I embarked on a cross-country trip with one of my college buddies. I’ll elaborate more on the trip in another post, but the pertinent part of that story is that we happened to be in Florida in late May, 2000.

We’d originally planned to see certain sights along the way, but by the time we reached the east coast we had grown quite good at adding extra stops to the itinerary. When we stopped in Orlando, we quickly added a trip to the Kennedy Space Center, as we are both great fans of NASA. While we were there, we learned that in a few days a shuttle (Atlantis) was going to launch, so we quickly rearranged the next leg of our trip so that we could be back in the area and then purchased tickets.

Since it was an early AM launch window, they let us into the main building of the space center just before three in the morning. Most of the exhibits were open and since the only people there were the ones going to see the launch, there were no crowds. We’d spent most of our previous visit in the other buildings on site, so it was quite a treat to wander around uninhibited. One of the theaters that usually shows documentary style films was showing live video of the close out crew getting the astronauts into the shuttle while a staff person up in front answered questions from the dozen or so people in the audience. I remember sitting in that room for some time, intently watching the video and enjoying every minute.

When the time came for us to head out to the launch site, we loaded into shuttles that took us out to where NASA Parkway East crosses the Banana River. The causeway over the river is the closest the public can get to a shuttle launch at just over six miles away. We waited out there for about two hours before the final nine minute countdown began, and when the clock struck zero it lifted off, almost effortlessly. From our vantage point it was silent until a few seconds later when the shock wave rolled across the water and hit us. It was an experience like none other.

Retirement

Shortly before the shuttle program ended a couple years ago, NASA announced which museums around the country would receive a retired orbiter and we were lucky enough to get the Endeavour for the California Science Center.

Over the holiday break, I was able to visit it with my family. It’s on display in a purpose-built hangar while they work on a permanent home. It was great to see it up close, but the hangar and the pre-exhibit room were packed with holiday crowds.

Then, this past week, I was able to return for a second visit with another college friend and his family. This time, there were only a few schoolchildren to maneuver around while looking up at the orbiter. While my friend and his family wandered around, I was able to just sit and study the vehicle itself.

When I saw it thirteen years ago, it was a speck on the horizon. This time it was so big that I couldn’t take it all in at once. I noticed where the black heat tiles begin and the other locations (beside the underbelly) where they’ve been placed. I could appreciate the enormity of the engine nozzles at the back and the texture of the thermal blankets that cover most of the top half. I counted the maneuvering thrusters on the nose and tail and could see the backwards flag on the right side. Again, it was an experience like none other.

There’s a lot to learn about the shuttle program and about Endeavour in particular. For instance, I learned that the reason for Endeavour’s British spelling is that it was named for the HMS Endeavour, the ship that Captain Cook explored Australia and New Zealand with. Also, I learned that Endeavour was built as the replacement for Challenger, and 22 years after the Challenger disaster it was Endeavour who took the first teacher into space.

If you’re in the LA area and are a fan of space flight, then don’t miss seeing the Endeavour. I’ll definitely be going back.

[Photos: Endeavour]

Managing Backups With Git-annex

My Situation

I have backups. Many backups. Too many backups.

I use Time Machine to back up my Macs, but that only covers the systems that I currently run. I have archives of older systems, some for nostalgic reasons, some for reference. I also have a decent set of digital artifacts (pictures, videos and documents) that I’d rather not lose.

So I keep backups.

Unfortunately, I’m not very organized. When I encounter data that I want to keep, I usually rsync it onto one or another external drive or server. However, since the data is not organized, I can’t tell how much of it can simply be deleted instead of backed up again. The actual amount of data that should be backed up is probably less than half of the amount of data that exists on the various internal and external drives both at home and at work. This also means that most of my hard drives are at 90% capacity and I don’t know what I can safely delete.

I really needed a way of organizing the data and getting it somewhere that I can trust.

git-annex

I initially heard of git-annex a while ago, when I was perusing the git wiki. It seemed like an interesting extension but I didn’t take another look at it until the creator started a Kickstarter project to extend it into a Dropbox replacement.

git-annex is great. It’s an extension to git that allows managing files with git without actually checking their contents in. git-annex does this by replacing each file with a symlink that points to the real content in the .git/annex directory (named after a checksum of the file’s contents). Only the symlink gets checked into git.

To illustrate, here’s how to get from nothing to tracking a file with git-annex:

$ mkdir repo && cd repo
$ git init && git commit -m initial --allow-empty
Initialized empty Git repository in /Users/nate/repo/.git/
[master (root-commit) c8562e6] initial
$ git annex init main
init main ok
(Recording state in git...)
$ mv ~/big.tar.gz .
$ ls -lh
-rw-r--r--  1 nate  staff    10M Dec 23 15:31 big.tar.gz
$ git annex add big.tar.gz
add big.tar.gz (checksum...) ok
(Recording state in git...)
$ ls -lh
lrwxr-xr-x  1 nate  staff   206B Dec 23 15:32 big.tar.gz -> .git/annex/objects/PP/wZ/SHA256E-s10485760--7c8fdf649d2b488cc6c545561ba7b9f00c52741a5db3b0130a8c9de8f66ff44f.tar.gz/SHA256E-s10485760--7c8fdf649d2b488cc6c545561ba7b9f00c52741a5db3b0130a8c9de8f66ff44f.tar.gz
$ git commit -m 'adding big tarball'
...

When the repository is cloned, only the symlink exists. To get the file contents, run git annex get:

$ cd .. && git clone repo other && cd other
Cloning into 'other'...
done.
$ git annex init other
init other ok
(Recording state in git...)
$ file -L big.tar.gz
big.tar.gz: broken symbolic link to .git/annex/objects/PP/wZ/SHA256E-s10485760--7c8fdf649d2b488cc6c545561ba7b9f00c52741a5db3b0130a8c9de8f66ff44f.tar.gz/SHA256E-s10485760--7c8fdf649d2b488cc6c545561ba7b9f00c52741a5db3b0130a8c9de8f66ff44f.tar.gz
$ git annex get big.tar.gz
get big.tar.gz (merging origin/git-annex into git-annex...)
(Recording state in git...)
(from origin...) ok
(Recording state in git...)
$ file -L big.tar.gz
big.tar.gz: data

By using git-annex, not every clone has to have the data for every file. git-annex keeps track of which repositories contain each file (in a separate git branch that it maintains) and provides commands to move file data around. Every time file content is moved, git-annex updates the location information. This information can be queried to figure out where a file’s content is and to limit the data manipulation commands.
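For example, the location information makes queries and targeted transfers like these possible (file and repository names illustrative):

$ git annex whereis big.tar.gz             # which repositories have this file's content?
$ git annex find --in here                 # files whose content is in this clone
$ git annex copy big.tar.gz --to backup    # push content to another repository
$ git annex drop big.tar.gz                # remove the local copy (only if it's safe)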

There is (much) more info in the walkthrough on the git-annex site.

My Setup

What I have is a set of git repositories that are linked like this:

[Diagram: git annex map]

[git-annex has a subcommand to generate a map, but it requires that all hosts are reachable from where it’s run, and that’s not possible for me. I quickly gave up when trying to make my own Graphviz chart and ended up using Lekh Diagram on my iPad (thanks Josh).]

My main repository is on a machine at home (which started life as a mini thumper and is now an Ubuntu box), and there are clones of that repository on various remote machines. To add a new one, all I need to do is clone an existing repository and run git annex init <name> in that repository to register it in the system.
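For example, setting up a new machine looks roughly like this (host and repository names illustrative):

$ git clone ssh://homeserver/home/nate/annex annex
$ cd annex
$ git annex init new_laptop
$ git annex sync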

This has allowed me to start organizing my backup files in a simple directory structure. Here is a sampling of the directories in my repository:

  • VMs - VM images that I don’t want to (or can’t) recreate.
  • funny - Humorous files that I want to keep a copy of (as opposed to trusting the Internet).
  • media - Personal media archives, currently mostly tarballs of pictures going back ten years.
  • projects - Archives of inactive projects.
  • software - Downloaded software for which I’ve purchased licenses.
  • systems - Archives of files from systems I no longer access.

There are other directories, and these directories may change over time as I add more data. I can move the symlinks around, even without having the actual data on my system, and when I commit, git-annex will update its tracking information accordingly. Every time I add data or move things around, all I need to do is run git annex sync to synchronize the tracking data.

Here is the simple workflow that I go through when changing data in any git-annex managed repository:

$ git annex sync
$ # git annex add ...
$ # git annex get ...
$ # git annex drop ...
$ git annex sync

With this in place, it’s easy to know where to put new data since everything is just directories in a git repo. I can access files from anywhere because my home backup server is available as an SSH remote. More importantly, I can grab just what I want from there, because git-annex knows how to fetch the contents of a single file.

One caveat to this system is that using git and git-annex means that certain file attributes, like permissions and create/modify/access times, are not preserved. To work around this, for files that I want to preserve completely, I just tarball them up and add that file to the git-annex.
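In practice, that just means something like this (paths illustrative):

$ tar -czf old-system-etc.tgz /path/to/old-system/etc    # tar records owners and timestamps
$ git annex add old-system-etc.tgz
$ git annex sync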

Installing git-annex

git-annex is written in Haskell. Installing the latest version on OS X is not the most repeatable process, and the version that comes with most Linux distributions is woefully out of date. So I’ve opted for using the prebuilt OS X app (called beta) or Linux tarball.

After copying the OS X app into Applications or unpacking the Linux tarball, I run the included runshell script to get access to git-annex:

$ /home/nate/git-annex.linux/runshell bash                      # on linux
$ /Applications/git-annex.app/Contents/MacOS/runshell bash      # on OS X
$ git annex version
git-annex version: 3.20121211

I’ll share more scripts and tips in future blog posts.

Enjoy.