I was out of town two of the Fridays this month, so I was only able to get two talks in:
Clojure core.async - Continuing in my fascination with Clojure, I picked this talk to explore the non-Java techniques for handling concurrency. I’m familiar with CSP from my Go experience, and it was interesting to hear Clojure’s take on the same foundation. Clojure also implements a macro that turns the spaghetti code that is callbacks into a sequential function that still operates asynchronously.
Inventing on Principle - Several people have recommended this talk to me, and I finally got around to watching it. It’s worth the watch just for the amazing demos that he built, but the deeper notion that there could be an underlying principle that guides your life is thought provoking. It also makes me want to play with Light Table.
I’ve been experimenting with Clojure lately. A few of my coworkers had begun the discovery process as well, so I suggested that we have a weekly show-and-tell, because a little accountability and audience can turn wishes into action.
Naturally, I looked around for plug-ins that would be of use in my editor of choice. Here’s what I have installed:
These are all straightforward to install, as long as you already have a Pathogen or Vundle setup going. If you don’t, you really should, because nobody likes a messy Vim install.
All of these plug-ins automatically work when a Clojure file is opened, with the exception of rainbow parentheses. To enable those, a little .vimrc config is necessary:
au BufEnter *.clj RainbowParenthesesActivate
au Syntax clojure RainbowParenthesesLoadRound
au Syntax clojure RainbowParenthesesLoadSquare
au Syntax clojure RainbowParenthesesLoadBraces
Now, once that’s all set up, it’s time to show a little bit of what this setup can do. I have a little Clojure test app over here on GitHub. After cloning it (and assuming you’ve already installed Leiningen):
Open up dev.clj and follow the instructions to set up the application in a running repl.
Then open testclj/core.clj and make any modification, such as changing “Hello” to “Hi”.
Then after a quick cpr to reload the namespace in the repl, you can reload your web browser to see the updated code.
This setup makes for a quick dev/test cycle, which is quite useful for experimentation. Of course, there are many more features of each of the above plugins. I’ve barely scratched the surface and I’m already very impressed.
In the course of my work and my online reading and research, I often come across videos of talks that I want to watch. I rarely take the time to watch those videos, mostly because of the time commitment; I usually only have a few minutes to spare.
Lately, I’ve done something to change that. I’m taking a little bit of time out of my Friday schedule each week to watch a talk that looks interesting. I also try to focus on the talk. Rather than checking my email or chatting while the talk is playing, I take notes, sometimes including screenshots of important slides.
Over the course of the past month, I’ve had some success with this strategy, and I was able to watch three talks. Here are links to my notes:
Using Datomic With Riak - I picked this talk because we’ve used a bit of Riak at work and a buddy of mine keeps raving about Datomic. This talk is actually a great overview of the philosophy and design behind Datomic.
Raft - the Understandable Distributed Protocol - CoreOS’s etcd has been getting some mention lately, and Raft is the consensus algorithm used to keep all of its data consistent. At the end of watching this talk, I found another one (by one of the Raft authors), and it balanced the practicality of the first with some more of the theory.
I really enjoyed the process of taking notes in this way, and I hope to continue this as the year progresses.
Oh, and if you know of a good talk, please let me know on twitter.
When I originally set up Octopress, I set it up on my Mac laptop using rvm, as recommended at the time. It worked very well for me until just a few minutes after my last post, when I decided to sync with the upstream changes.
After merging in the changes, I tried to generate my blog again, just to make sure everything worked. Well, it didn’t, and things went downhill from there. The rake generate command failed because a new gem was required. So, I ran bundle install to get the latest gems. That failed when a gem required ruby 1.9.3. Then installing ruby 1.9.3 failed in rvm because I needed a newer version of rvm. After banging on that problem for a few minutes, I decided to take a break and come back to the problem later.
Docker to the rescue
Fast forward a few weeks, and I came up with a better idea. I decided to dockerize Octopress. This keeps all the dependencies sanely bottled up in an image that I can run like a command.
Here is the code:
MAINTAINER Nate Jones <email@example.com>
# install system dependencies and ruby
RUN apt-get update
RUN apt-get install git ruby1.9.3 build-essential language-pack-en python python-dev -y
# make sure we're working in UTF8
ENV LC_ALL en_US.utf8
# add the current blog source
ADD . /o
# install octopress dependencies
RUN gem install bundler
RUN bundle install
# set up user so that host files have correct ownership
RUN addgroup --gid 1000 blog
RUN adduser --uid 1000 --gid 1000 blog
RUN chown -R blog.blog /o
# base command
Then, since rake is set as the entry point, I can run the image as if it were a command. I use the -v switch to overlay the current blog source over the one cached in the image and -rm switch to throw away the container when it’s done.
$ docker run -rm -v `pwd`:/o ndj/octodock generate
## Generating Site with Jekyll
remove .sass-cache/
Configuration from /o/_config.yml
Building site: source -> public
Successfully generated site: source -> public
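Typing the full docker run line each time gets old, so a small shell function can wrap it. This is only a sketch and not part of the original setup: the function name octo and the OCTO_DRY_RUN variable are made up for illustration, and it uses the double-dash --rm spelling that current Docker releases expect.

```shell
# Hypothetical wrapper around the docker invocation shown above.
# Set OCTO_DRY_RUN=1 to print the command instead of running it.
octo() {
    local image="ndj/octodock"
    if [ -n "${OCTO_DRY_RUN:-}" ]; then
        echo docker run --rm -v "$(pwd):/o" "$image" "$@"
    else
        docker run --rm -v "$(pwd):/o" "$image" "$@"
    fi
}
```

With that in a shell startup file, octo generate run from the blog directory would stand in for the longer command.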
A few notes
I had to force the UTF8 locale in order to get ruby to stop complaining about non-ascii characters in the blog entries.
I add a user called blog with the same UID/GID as my system user, so that any commands that generate files aren’t owned by root. I look forward to proper user namespaces so that I won’t have to do this.
Deploying the blog doesn’t use my SSH key, as the ‘blog’ user in the image is doing the rsync, not my host system user. I’m ok with typing my password in or just rsync’ing the data directly.
Docker is a great piece of technology, and I keep finding new uses for it.
Last time I posted about git-annex, I introduced it and described the basics of my setup. Over the past year, I’ve added quite a bit of data to my main git-annex. It manages just over 100G of data for me across 9 repositories. Here are a few bits of information that may be useful to others considering git-annex (or who are already knee-deep in it).
Archive, not backup
The website for git-annex explicitly states that it is not a backup system. A more appropriate description is that it’s part of an archival system. An archival system is somewhat concerned with backups of data, but it also deals with cataloging and retrieval.
I imagine that it’s a library system (books, not code) with the ability to do instantaneous inter-library loans. I have one repository (by the name of ‘silo’) that contains copies of all my data. I then have linked repositories on each computer that I use regularly that have little or no data in them, just git-annex style symlinks. If I find that I need something from the main repository on one of those computers, I can query where that file is with git annex whereis:
(I actually have three full copies of my data, in the *_dr repositories, but that’s a story for another day. Suffice it to say that calvin_dr and hobbes_dr are two identical external drives.)
I can retrieve the contents with git annex get. git-annex is smart enough to know that the silo remote is over a network connection and the ‘calvin_dr’ is local, so it copies the data from there:
$ git annex get media/pictures/2002-02-08-olympics.tgz
get media/pictures/2002-02-08-olympics.tgz (from calvin_dr...)
SHA256E-s48439263--67c0de0e883c5d5d62a615bb97dce624370127e5873ae22770b200889367ae1c.tgz
48439263 100% 25.10MB/s 0:00:01 (xfer#1, to-check=0/1)
sent 48445343 bytes received 42 bytes 19378154.00 bytes/sec
total size is 48439263 speedup is 1.00
(Recording state in git...)
Then, running git annex whereis shows the file contents are local as well:
And I can view the contents of the file like normal:
$ tar -tzf media/pictures/2002-02-08-olympics.tgz | head
Then, when I’m done, I can just git annex drop the file to remove the local copy of the data. git-annex, in good form, checks to make sure that there’s another copy before deleting it.
$ git annex drop media/pictures/2002-02-08-olympics.tgz
drop media/pictures/2002-02-08-olympics.tgz ok
(Recording state in git...)
All along the way, git-annex is tracking which repositories have each file, making it easy to find what I want. This sort of quick access and query-ability means that I know where my data is and I can access it when I need it.
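The get, use, drop cycle above is regular enough to wrap in a function. The following is a rough sketch, assuming a bash-like shell; the name with_annexed is hypothetical and not something git-annex provides.

```shell
# Hypothetical helper: fetch an annexed file, run a command on it,
# then drop the local copy again. Returns the command's exit status.
with_annexed() {
    local file=$1
    shift
    git annex get "$file" || return 1
    "$@" "$file"
    local status=$?
    # git-annex refuses to drop if it cannot confirm another copy exists
    git annex drop "$file"
    return $status
}
```

For example, with_annexed media/pictures/2002-02-08-olympics.tgz tar -tzf would list the tarball and then release the local copy.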
Transporting large files
My work laptop used to be my only laptop, and so it had a number of my personal files, mostly pictures. I’ve transferred most of those off of that system, but every once in a while, I come across some personal data that I need to transfer to my home repository.
I usually add it to the local git-annex repository on my work laptop and then use git annex move to move it to my home server. However, if it’s a significant amount of data and I don’t feel like waiting for the long transfer over my slow DSL line, I can copy the data to my external drive at work and then copy it off when I get home. Doing this manually can get tedious if there are more than a few files, but git-annex makes it a cinch. First, I can query what files are not on my home server and then copy those to the calvin_dr drive.
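Those two steps might look something like this. It is a sketch under assumptions: silo is taken to be the remote name for the home server and calvin_dr the external drive, as elsewhere in this post, and the function name stage_for_home is invented for the example. It leans on git-annex’s --in and --not matching options.

```shell
# Hypothetical helper: show which files are present here but missing from
# the home server, then copy their content to the external drive.
stage_for_home() {
    local home_remote="${1:-silo}"
    local drive_remote="${2:-calvin_dr}"
    # list files whose content is here but not on the home server
    git annex find --in=here --not "--in=$home_remote"
    # copy only those files to the drive
    git annex copy --not "--in=$home_remote" --to "$drive_remote"
}
```

Once home, the same matching options can drive the copy from the drive to the server.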
Many of my backups are the “snapshot” style, where I rsync’d a tree of files to another drive or server in an attempt to make sure that data was safe. The net effect of this strategy is that I have several mostly-identical backups of the same data. So, when I find a new copy of data that I’ve previously added to my git-annex system, I don’t know if I can safely delete it just based on the top level directory name.
For example, if I discover a tree of pictures that are organized by date and event:
$ find pictures -type d
And, checking in my git-annex repo, I can see that there are three files that correspond to those directories:
$ find backup/pictures -type l
I can probably remove the found files, but I might have modified the pictures in this set and I’d like to know before I toss them. After running into this scenario a few times, I wrote a little utility called archdiff that I can use to get an overview of the differences between two archives (or directories). It’s just a fancy wrapper around diff -r --brief that automatically handles unpacking any archives found. For example:
Since there was no output, the directory has the same contents as the archive and can be safely deleted. Here’s another example:
$ archdiff 2002-02-08-olympics/ ~/backup/pictures/2002-02-08-olympics.tgz
Files 2002-02-08-olympics/p2030001.jpg and 2002-02-08-olympics.tgz-_RhD/2002-02-08-olympics/p2030001.jpg differ
One of the files in this directory has modifications, so I can now take the time to look at the two files and see if I want to keep it or not.
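To give a feel for the idea, here is a minimal bash sketch of an archdiff-like helper. It is not the actual archdiff source: the name archdiff_sketch is made up, it only understands gzipped tarballs, and it assumes that an archive with a single top-level directory should be compared from inside that directory.

```shell
# Sketch of an archdiff-like helper (not the real tool).
# Each argument is a directory or a gzipped tarball.
# Exit status follows diff: 0 if identical, 1 if differences were found.
archdiff_sketch() {
    local tmp status side i=0
    local paths=() entries=()
    tmp=$(mktemp -d) || return 2
    for side in "$1" "$2"; do
        i=$((i + 1))
        if [ -d "$side" ]; then
            paths+=("$side")
            continue
        fi
        # unpack the archive into its own scratch directory
        mkdir "$tmp/$i" && tar -xzf "$side" -C "$tmp/$i" || { rm -rf "$tmp"; return 2; }
        entries=("$tmp/$i"/*)
        if [ "${#entries[@]}" -eq 1 ] && [ -d "${entries[0]}" ]; then
            # single top-level directory: compare its contents
            paths+=("${entries[0]}")
        else
            paths+=("$tmp/$i")
        fi
    done
    diff -r --brief "${paths[0]}" "${paths[1]}"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

Because the exit status is passed straight through from diff, the function can be used directly in an if test.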
Archdiff behaves like a good UNIX program and its exit code reflects whether or not differences were found, so it’s possible to script the checking of multiple directories. Here’s an example script that would check the above three directories:
for dir in ~/pictures/*; do
    basedir=$(basename "$dir")
    echo "checking $dir"
    # retrieve the file from another git-annex repo
    git annex get "$basedir.tgz"
    if archdiff "$dir" "$basedir.tgz"; then
        echo "$dir is the same, removing"
        rm -rf "$dir"
        # drop the git-annex managed file, we no longer need it
        git annex drop "$basedir.tgz"
    fi
done
Once this is done, the only directories left will be those with differences and the tarball will still be present in the git-annex repository for investigation. I end up writing little scripts like this as I go through old backups to help me process large amounts of data quickly.
That’s it for now. If you have any questions about this or git-annex in general, tweet at me @ndj.
It’s been over two years since I wrote remotecopy and I still use it every day.
The most recently added feature is the -c option, which will remove the trailing newline from the copied data if it only contains one line. I found myself writing little scripts that would only output one line with the intent of using that output to build a command line on a different system, and the extra newline at the end often messed up the new command. The -c solves this problem.
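The rule behind -c is simple enough to sketch in shell. This is an illustration of the behaviour described above, not remotecopy’s own code, and the function name chomp_single_line is hypothetical.

```shell
# Read stdin; if it is exactly one line, emit it without the trailing
# newline. Anything with more than one line passes through unchanged.
chomp_single_line() {
    local data trimmed nl=$'\n'
    data=$(cat; echo x)      # the trailing x protects final newlines from $(...)
    data=${data%x}
    trimmed=${data%"$nl"}    # input minus one trailing newline, if any
    case "$trimmed" in
        *"$nl"*) printf '%s' "$data" ;;     # more than one line: unchanged
        *)       printf '%s' "$trimmed" ;;  # single line: newline dropped
    esac
}
```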
For instance, I have git-url, which outputs the origin url of the current git repository. This makes it easy to clone the repo on a new system (rc is my alias for remotecopy -c):
Since remotecopy works by accessing the server over a tcp socket, it’s immune to these problems. I just use remotecopy on my local system as if I were on a remote system.
LA Perl Mongers
At the latest LA Perl Mongers meeting, the talks were lightning in nature, so I threw together a presentation about remotecopy. The interesting source bits are up on github, including a pdf copy of the slides.
For the presentation, I used the excellent js-sequence-diagrams to make this diagram, that hopefully helps show the data flow in a remotecopy interaction.
Sometimes I need the target to use the IP of the server instead of its hostname. This usually happens with development VMs (a la Vagrant), which are only addressable via IP. Passing -i to scptarget causes it to behave this way. Under the hood, it uses getip, which is a script I wrote that prints out the first IP of the current host. If there is no non-private IP, then it will return the first private IP. (I am fully aware that there may be better ways of doing the above. Let me know if you have a better script.)
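The selection rule can be sketched as a small filter. This is not the getip script itself: pick_ip is a hypothetical name, it only handles IPv4, it expects one address per line on stdin, and skipping loopback addresses is an assumption of the sketch.

```shell
# Print the first non-private IPv4 address from stdin, or the first
# private one if no public address is present.
pick_ip() {
    local ip first_private=""
    while read -r ip; do
        case "$ip" in
            ""|127.*) continue ;;   # skip blanks and loopback (assumption)
            10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*)
                [ -z "$first_private" ] && first_private=$ip ;;
            *) echo "$ip"; return 0 ;;
        esac
    done
    [ -n "$first_private" ] && echo "$first_private"
}
```

How the address list is produced depends on the platform; on many Linux systems, something like hostname -I | tr ' ' '\n' | pick_ip should work.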
A little over thirteen years ago, I embarked on a cross-country trip with one of my college buddies. I’ll elaborate more on the trip in another post, but the pertinent part of that story is that we happened to be in Florida in late May, 2000.
We’d originally planned to see certain sights along the way, but by the time we reached the east coast we had grown quite good at adding extra stops to the itinerary. When we stopped in Orlando, we quickly added a trip to the Kennedy Space Center, as we are both great fans of NASA. While we were there, we learned that in a few days a shuttle (Atlantis) was going to launch, so we quickly rearranged the next leg of our trip so that we could be back in the area and then purchased tickets.
Since it was an early AM launch window, they let us into the main building of the space center just before three in the morning. Most of the exhibits were open and since the only people there were the ones going to see the launch, there were no crowds. We’d spent most of our previous visit in the other buildings on site, so it was quite a treat to wander around uninhibited. One of the theaters that usually shows documentary style films was showing live video of the close out crew getting the astronauts into the shuttle while a staff person up in front answered questions from the dozen or so people in the audience. I remember sitting in that room for some time, intently watching the video and enjoying every minute.
When the time came for us to head out to the launch site, we loaded into shuttles that took us out to where NASA Parkway East crosses the Banana River. The causeway over the river is the closest the public can get to a shuttle launch at just over six miles away. We waited out there for about two hours before the final nine minute countdown began, and when the clock struck zero it lifted off, almost effortlessly. From our vantage point it was silent until a few seconds later when the shock wave rolled across the water and hit us. It was an experience like none other.
Shortly before the shuttle program ended a couple years ago, NASA announced which museums around the country would receive a retired orbiter and we were lucky enough to get the Endeavour for the California Science Center.
Over the holiday break, I was able to visit it with my family. It’s on display in a purpose-built hangar while they work on a permanent home. It was great to see it up close, but the hangar and the pre-exhibit room were packed with holiday crowds.
Then, this past week, I was able to return for a second visit with another college friend and his family. This time, there were only a few schoolchildren to maneuver around while looking up at the orbiter. While my friend and his family wandered around, I was able to just sit and study the vehicle itself.
When I saw it thirteen years ago, it was a speck on the horizon. This time it was so big that I couldn’t take it all in at once. I noticed where the black heat tiles begin and the other locations (beside the underbelly) where they’ve been placed. I could appreciate the enormity of the engine nozzles at the back and the texture of the thermal blankets that cover most of the top half. I counted the maneuvering thrusters on the nose and tail and could see the backwards flag on the right side. Again, it was an experience like none other.
There’s a lot to learn about the shuttle program and about Endeavour in particular. For instance, I learned that the reason for Endeavour’s British spelling is that it was named for the HMS Endeavour, the ship that Captain Cook explored Australia and New Zealand with. Also, I learned that Endeavour was built as the replacement for Challenger, and 21 years after the Challenger disaster it was Endeavour that took the first teacher into space.
If you’re in the LA area and are a fan of space flight, then don’t miss seeing the Endeavour. I’ll definitely be going back.
I use Time Machine to back up my Macs, but that only covers the systems that I currently run. I have archives of older systems, some for nostalgic reasons, some for reference. I also have a decent set of digital artifacts (pictures, videos and documents) that I’d rather not lose.
So I keep backups.
Unfortunately, I’m not very organized. When I encounter data that I want to keep, I usually rsync it onto one or another external drive or server. However, since the data is not organized, I can’t tell how much of it can simply be deleted instead of backed up again. The actual amount of data that should be backed up is probably less than half of the amount of data that exists on the various internal and external drives both at home and at work. This also means that most of my hard drives are at 90% capacity and I don’t know what I can safely delete.
I really needed a way of organizing the data and getting it somewhere that I can trust.
I initially heard of git-annex a while ago, when I was perusing the git wiki. It seemed like an interesting extension but I didn’t take another look at it until the creator started a Kickstarter project to extend it into a Dropbox replacement.
git-annex is great. It’s an extension to git that allows managing files with git without actually checking them in. git-annex does this by replacing each file with a symlink that points to the real content in the .git/annex directory (named after a checksum of the file’s contents). Only the symlink gets checked into git.
To illustrate, here’s how to get from nothing to tracking a file with git-annex:
$ mkdir repo && cd repo
$ git init && git commit -m initial --allow-empty
Initialized empty Git repository in /Users/nate/repo/.git/
[master (root-commit) c8562e6] initial
$ git annex init main
init main ok
(Recording state in git...)
$ mv ~/big.tar.gz .
$ ls -lh
-rw-r--r-- 1 nate staff 10M Dec 23 15:31 big.tar.gz
$ git annex add big.tar.gz
add big.tar.gz (checksum...) ok
(Recording state in git...)
$ ls -lh
lrwxr-xr-x 1 nate staff 206B Dec 23 15:32 big.tar.gz -> .git/annex/objects/PP/wZ/SHA256E-s10485760--7c8fdf649d2b488cc6c545561ba7b9f00c52741a5db3b0130a8c9de8f66ff44f.tar.gz/SHA256E-s10485760--7c8fdf649d2b488cc6c545561ba7b9f00c52741a5db3b0130a8c9de8f66ff44f.tar.gz
$ git commit -m 'adding big tarball'
...
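The symlink target in the listing above has the shape SHA256E-s, then the file size in bytes, two dashes, the SHA-256 of the contents, and the file extension. A small sketch can rebuild a name of that shape. It assumes sha256sum from GNU coreutils is available, the function name annex_style_key is hypothetical, and the extension is passed in by hand because how git-annex itself picks the extension is not covered here.

```shell
# Build a key in the SHA256E-s<size>--<sha256><ext> shape seen above.
annex_style_key() {
    local file=$1 ext=$2 size sum
    size=$(wc -c < "$file" | tr -d ' ')        # size in bytes
    sum=$(sha256sum "$file" | cut -d' ' -f1)   # hex digest only
    echo "SHA256E-s${size}--${sum}${ext}"
}
```

Running annex_style_key big.tar.gz .tar.gz before adding the file gives a quick way to compare against the name that git-annex reports.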
When the repository is cloned, only the symlink exists. To get the file contents, run git annex get:
$ cd .. && git clone repo other && cd other
Cloning into 'other'...
$ git annex init other
init other ok
(Recording state in git...)
$ file -L big.tar.gz
big.tar.gz: broken symbolic link to .git/annex/objects/PP/wZ/SHA256E-s10485760--7c8fdf649d2b488cc6c545561ba7b9f00c52741a5db3b0130a8c9de8f66ff44f.tar.gz/SHA256E-s10485760--7c8fdf649d2b488cc6c545561ba7b9f00c52741a5db3b0130a8c9de8f66ff44f.tar.gz
$ git annex get big.tar.gz
get big.tar.gz (merging origin/git-annex into git-annex...)(Recording state in git...)(from origin...) ok
(Recording state in git...)
$ file -L big.tar.gz
With git-annex, not every clone has to hold the data for every file. git-annex keeps track of which repositories contain each file (in a separate git branch that it maintains) and provides commands to move file data around. Every time file content is moved, git-annex updates the location information. This information can be queried to figure out where a file’s content is and to limit the data manipulation commands.
What I have is a set of git repositories that are linked like this:
[git-annex has a subcommand to generate a map, but it requires that all hosts are reachable from where it’s run, and that’s not possible for me. I quickly gave up when trying to make my own Graphviz chart and ended up using Lekh Diagram on my iPad (thanks Josh).]
My main repository is on a machine at home (which started life as a mini thumper and is now an Ubuntu box), and there are clones of that repository on various remote machines. To add a new one, all I need to do is clone an existing repository and run git annex init <name> in that repository to register it in the system.
This has allowed me to start organizing my backup files in a simple directory structure. Here is a sampling of the directories in my repository:
VMs - VM images that I don’t want to (or can’t) recreate.
funny - Humorous files that I want to keep a copy of (as opposed to trusting the Internet).
media - Personal media archives, currently mostly tarballs of pictures going back ten years.
projects - Archives of inactive projects.
software - Downloaded software for which I’ve purchased licenses.
systems - Archives of files from systems I no longer access.
There are other directories, and these directories may change over time as I add more data. I can move the symlinks around, even without having the actual data on my system, and when I commit, git-annex will update its tracking information accordingly. Every time I add data or move things around, all I need to do is run git annex sync to synchronize the tracking data.
Here is the simple workflow that I go through when changing data in any git-annex managed repository:
With this in place, it’s easy to know where to put new data since everything is just directories in a git repo. I can access files from anywhere because my home backup server is available as an ssh remote. More importantly, I can just grab what I want from there, because git-annex knows how to just grab the contents of a single file.
One caveat to this system is that using git and git-annex means that certain file attributes, like permissions and create/modify/access time are not preserved. To work around this, for files that I want to preserve completely, I just tarball them up and add that file to the git-annex.
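That workaround might look like the following sketch. The function name archive_for_annex is hypothetical; it writes a gzipped tarball next to the current directory and relies on tar recording permissions and modification times inside the archive.

```shell
# Pack a directory into <name>.tgz so its file attributes travel inside
# the archive, then hand the tarball to git-annex.
archive_for_annex() {
    local dir=${1%/}
    local name
    name=$(basename "$dir")
    tar -czf "$name.tgz" -C "$(dirname "$dir")" "$name" || return 1
    git annex add "$name.tgz"
}
```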
git-annex is written in Haskell. Installing the latest version on OS X is not the most repeatable process, and the version that comes with most Linux distributions is woefully out of date. So I’ve opted for using the prebuilt OS X app (called beta) or Linux tarball.
After copying the OS X app into Applications or unpacking the linux tarball, I run the included runshell script to get access to git-annex:
$ /home/nate/git-annex.linux/runshell bash # on linux
$ /Applications/git-annex.app/Contents/MacOS/runshell bash # on OS X
$ git annex version
git-annex version: 3.20121211
I’ll share more scripts and tips in future blog posts.
I recently split dfm out into its own git repository. This should make it easier to add new features and grow the test suite without cluttering up the original dotfiles repository. I’ll sync dfm over at regular intervals, so anyone who wants to keep up to date by merging with master will be ok.
I also just finished up a major new feature: dfm can now import files. So instead of: