JASON KNIGHT

Harnessing Computers - One (Linux command line) step at a time

2013-02-21 - Reading time: 3 minutes

Again, this post is mainly for Linux command line users, but it may have more general appeal as well:

Computers make our lives easier in innumerable ways, but it is often quite a challenge to figure out the 'best' (or even a good) way to perform a single task. I'd like to share an example where I hit a sweet spot in that effort while backing up some pictures the other day.

The general problem is simple: you have multiple computers/devices and you aren't sure whether a set of pictures was backed up to your house-wide backup solution (you do have one, right?). This becomes especially difficult when you aren't the main operator of some of those devices, so it gets really confusing who took which pictures off the camera at what time and whether they were backed up.

The naive solution is to look at the set of pictures in question, then browse through your backup trying to find the same folder name and check the pictures inside. To speed this up, I used the find command in Linux, which works as follows:
find <dir> -iname <filename>
Here -iname matches the file name case-insensitively, and <dir> is the directory where the search begins (the search is recursive, so it covers all files and folders underneath).
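
As a small illustration of the case-insensitive match, here is a sketch that builds a throwaway directory and searches it. The directory layout and the second file name are made up for the demonstration; only the find invocation mirrors the one above.

```shell
# Hypothetical scratch layout, created only for this demonstration.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/2013/vacation"
touch "$demo_dir/2013/vacation/img_4071.jpg" "$demo_dir/2013/vacation/IMG_4072.JPG"

# -iname ignores case, so the lowercase file still matches the uppercase pattern.
# The search starts at demo_dir and descends into every subdirectory.
matches=$(find "$demo_dir" -iname 'IMG_4071.JPG')
echo "$matches"
```

Quoting the pattern is a good habit: it has no effect here, but it keeps the shell from expanding wildcards before find sees them.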

So after picking a file name at random from the set of pictures in question (IMG_4071.JPG), I searched and found a few results. Then, rather than browsing to each location and checking the files manually, I decided to use a little more of find's magic. We can also tell find to run a command on each file that matches:
find <dir> -iname <filename> -exec <cmd> \;
Here cmd operates on each file separately, and the escaped semicolon (\;) tells find that the command is finished. Inside cmd, the placeholder {} is replaced by the path of the current match. So I ran the following command on both the folder in question and the backup directory, and compared the results:
find <dir> -iname IMG_4071.JPG -exec md5sum {} \;
Now md5sum is a program that computes a short, fixed-length string (a hash) from the data in a file. The hash is deterministic, so identical data always gives the same string, and it is designed so that different data almost never does. Thus if the two images have matching md5sums, they contain the same image data with overwhelming probability.
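
The comparison itself can also be left to the shell instead of the eye. The following is only a sketch: the two directories are temporary stand-ins for the camera folder and the backup, and the file contents are fake. It keeps just the hash column of md5sum's output, since the paths will differ between the two locations.

```shell
# Hypothetical stand-ins for the folder in question and the backup.
src_dir=$(mktemp -d)
backup_dir=$(mktemp -d)
mkdir -p "$backup_dir/2013/02"
printf 'pretend image data' > "$src_dir/IMG_4071.JPG"
cp "$src_dir/IMG_4071.JPG" "$backup_dir/2013/02/img_4071.jpg"

# md5sum prints "<hash>  <path>"; keep only the hash (first field).
src_sum=$(find "$src_dir" -iname 'IMG_4071.JPG' -exec md5sum {} \; | awk '{print $1}')
backup_sums=$(find "$backup_dir" -iname 'IMG_4071.JPG' -exec md5sum {} \; | awk '{print $1}')

# The backup may hold several files with this name, so look for the
# source hash among all of them (-F fixed string, -x whole line, -q quiet).
if [ -n "$src_sum" ] && printf '%s\n' "$backup_sums" | grep -Fxq "$src_sum"; then
    status="already backed up"
else
    status="not found in backup"
fi
echo "$status"
```

The sketch assumes the name matches exactly one file in the source folder. With several matches there, src_sum would hold more than one hash and the check would need a loop.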

This way I could quickly determine, at a glance, whether I had already backed up an image (and by extension, a set of pictures) without having to search through hundreds of directories and tens of thousands of pictures.

And yes, this could be extended to even fancier methods, but I think it strikes a good balance between the amount of work put in (writing a single command) and what I needed from it.
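
For the curious, one such fancier method might look like the sketch below. It is not what I ran, and the function name missing_from_backup is made up. It hashes everything in the backup once, then lists every picture whose hash does not appear there, regardless of what the files are called.

```shell
# Hypothetical helper: print each file under $1 whose content hash
# does not appear anywhere under $2.
# The body runs in a subshell ( ... ) so its variables stay private.
missing_from_backup() (
    pictures_dir=$1
    backup_dir=$2
    backup_hashes=$(mktemp)

    # Hash every file in the backup once and keep only the hash column.
    find "$backup_dir" -type f -exec md5sum {} + | awk '{print $1}' | sort -u > "$backup_hashes"

    # For each picture, report its path if its hash is not in the list.
    find "$pictures_dir" -type f -exec md5sum {} + | while read -r hash path; do
        grep -Fxq "$hash" "$backup_hashes" || printf '%s\n' "$path"
    done

    rm -f "$backup_hashes"
)
```

It would be called as missing_from_backup <pictures dir> <backup dir>. File names with leading or trailing whitespace, or with newlines in them, would trip up the read loop, so treat this as a starting point only.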