The Unix Toolbox

This article originally appeared in the December 2013 issue of PHP Architect

The beauty of a Unix-based operating system is that it has a multitude of useful tools that most people don't know about. Use them on their own or chain them together, and you can do almost anything you can think of.

We'll take a look at some popular tools and some lesser known gems. As well as tools, we'll cover tips and tweaks for making your command line do all the work so that you don't have to. From search and replace to installing software without having to find it first, the command line can do it all!

The Unix Philosophy

The thing about Unix is that you can use as much or as little as you need. I think Doug McIlroy said it best:

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

That philosophy of making small apps that do one thing (well), and making them accept textual input from almost anywhere is what makes Unix great.


When I say Unix, what I actually mean is *nix. Linux, OSX, FreeBSD and more. Although development on these projects has diverged, they all still come from the same Unix core. This means that everything I'm going to talk about should at least be possible on your chosen OS (unfortunately, that doesn't include Windows without quite a bit of work).

The basics

When most people think of the command line, they think of screens full of text that don't make any sense and cryptic commands that take longer to learn than a spoken language.

Whilst that can be the case for some, a lot of the commands aren't like that. Most are contractions of the action that they perform which, once you know what the expanded version is, makes them much easier to remember.

Hands on

Let's start by taking a look at the five "must have" utilities. Feel free to bring up a terminal and try the commands out as we go through them. If you're already comfortable with the basics, feel free to skip to "Making them do more" for the juicy bits.


ls is a utility that's been around since 1971 - a pretty long time. Whilst it seems like an oddly named utility, it's actually short for "list" - the function it performs.

Running ls from the command line will show a list of everything that's visible in that directory. That's all it does, it lists a directory's contents.


mkdir is another important one. Again, it sounds like a random selection of letters, but once you know that it stands for "make directory" it starts to make a lot more sense.

So, let's make a directory. mkdir has one required parameter, the directory to create. Let's create a directory called "phparch" with mkdir phparch. If everything was successful, you should just see your prompt again - if anything went wrong it should show you an error message.

We can check if the directory we just created is there by using ls and looking at the output.


Now that we have a directory, let's do something with it. We need to change our current working directory to be our new directory. To do this we use cd, which stands for (guessed it yet?) "change directory".

So, we can use cd phparch to change directory into our new directory, then type ls again to see what's in there. As we've only just created it and not put anything inside it yet, there should be nothing shown on screen.

Half way

We're half way through the essential commands, and hopefully we can see how even with just three commands, we can do things much more efficiently than we can using a GUI. Now that we can create folders and move into them, let's take a look at creating some text files to store some notes.

vim / emacs / nano

Up until now, our commands have been quite closely named to the function that they perform. Unfortunately, that rule gets thrown out of the window when it comes to text editors.

The three main editors on the command line are vim, emacs and nano. It doesn't matter which you use (I prefer vim, personally), but you're probably going to end up using one. For now, I'd recommend using nano as it's the easiest of the bunch.

So, let's create a text file in our new "phparch" folder. To do this, you just need to type nano <filename>. In this case, we're going to create a file called "owners.txt", so we type nano owners.txt and hit Enter.

The window should change, your prompt should disappear and your cursor should be at the top of an empty window. This is nano, the text editor. Type "" into the file, then we need to save your text and exit nano.

This is where it gets tricky and feels a bit more like black magic. The commands to save and exit are documented at the bottom of the screen, in a roundabout way. You might notice that there's something that says "^O Save". This means press "Ctrl+o" to save the document. Give it a go hitting Enter when prompted.

Next, we need to quit nano. You can find the command to quit at the bottom of the screen too (hint: It's ^X), so press that and you should be back on your command line. We can see if the file was created by typing ls. You should see something called "owners.txt". If we wanted to edit that, we simply type "nano owners.txt" again. All your existing text should be there, and you can edit it as you see fit. For now though, just press "^X" to exit nano again.


Grep is probably my favourite utility. It stands for global regular expression print.

In its most basic form, it's used for searching through files in the current directory. Imagine that instead of one text file, we had 100 in this directory. Now, imagine that we need to find out which one of those 100 contained the word "musketeers". Instead of using nano to open them all up one by one, we can use grep.

The easiest way to use grep is to run grep "musketeers" *. This means search for "musketeers" in everything in the current folder. It's important to note that it's case sensitive, and will only look at files in the current directory. We can change that behaviour, but we'll come to that later on.

So, running grep "musketeers" * should return one line, containing the name of the file that the string was located in, and the entire line that it was located on for context.
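As a quick sketch of what that looks like in practice, the snippet below builds a throwaway directory with two text files and searches it. The second file and the contents of both are made up for illustration.

```shell
# Work in a throwaway directory so that nothing real is touched
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Two hypothetical files; only one of them mentions "musketeers"
printf 'Athos\nThe three musketeers\nPorthos\n' > owners.txt
printf 'nothing to see here\n' > notes.txt

# Search every file in the current directory
matches=$(grep "musketeers" *)
echo "$matches"
```

Because more than one file was searched, grep prefixes the matching line with the name of the file it came from.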


Finally, we want to delete the file we just created. Thankfully, we're back to the names of commands being easy to guess. rm is short for "remove".

Type ls to make sure that our file is there, then type rm owners.txt before running ls again. Hopefully you'll notice that whilst owners.txt was there the first time that you ran ls, it had gone the second time. Congratulations, you just deleted a file!


Ok, you got me, that was six commands. Whilst grep isn't one of the commands that you need to know to use the command line, it's far too useful for us to leave it out.

With those six commands, you should be able to hold your own on the command line, creating directories and editing text files, removing anything else you don't need.

But surely there's more you can do, right? You can do all this stuff just as quickly, if not quicker, with a GUI! Well, you're right, they can do more. Let's take a look at just what they can do.

Making them do more

All of these utilities doing one thing each is great, but if they just did one thing in one way, we'd end up with hundreds of utilities that do very similar things. To get around this, we use the concept of "flags".

You remember that the ls command is used to show the contents of a folder? Try running ls -l - you should see a list of all of the files in a folder, along with loads more information (which we don't need to worry about for now). We've done the same core action, listing the contents of a directory, but we've formatted it in a different way. There's loads of flags available for most commands. You can see a complete list of supported flags by reading the manual page for each command. Type man command to show the help page e.g. man ls to get help with the ls command.

Some commands take arguments for their flags. A good example of this is grep -C. The -C flag means "context", and the argument it takes is the number of lines to show either side of the line that matched. So, grep -C 5 "foo" * would show 5 lines either side of any line that matches "foo" within files in the current directory. This is shown in the man page as -C[num], meaning that the -C flag takes a parameter.

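Finally, you can generally combine flags into one parameter if you don't need to provide a value, e.g. ls -a -l is the same as ls -al. A flag that takes a value has to come last in the group so that its value can follow it: grep -iC 5 works, but grep -Ci 5 doesn't, as -C would try to read "i" as its value.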
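Here's a small sketch of flags in action, assuming a made-up sample.txt in a throwaway directory. It checks that separate and combined flags give the same listing, then uses a flag that takes a value.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'one\ntwo\nFOO\nthree\nfour\n' > sample.txt

# Separate flags and combined flags give identical listings
separate=$(ls -a -l)
combined=$(ls -al)

# -i ignores case; -C takes a value, here one line of context either side
context=$(grep -i -C 1 "foo" sample.txt)
echo "$context"
```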

Chaining them together

Now that we're comfortable with making the commands do what we want, wouldn't it be awesome if we could start using them in conjunction with each other?


To do this, we use something called a pipe (The | character). At the beginning of this article, I mentioned that these commands can get their input from anywhere. Pipes allow you to redirect the output of one command and use it as the input of another one.

For example, to find all of the items in a folder that I edited in December, I can run ls -l | grep "Dec", which will show the long output version of ls and search the output for the word "Dec", returning only lines that match.

That's a pretty basic example, so let me show you a few more examples that are really useful, but use commands that you've not come across yet:

Show all files in a folder, sorted by size

To do this, we use du (disk usage), specify a max depth of one (-d1) and then pass this into sort -rn, saying sort this output in reverse, sorting the strings as though they're numbers and not strings (e.g. 1, 2, 3, 10 not 1, 10, 2, 3).

du -d1 | sort -rn

Show every line of a file, removing any lines that contain the word "bob".

To do this we use cat, which outputs a file's contents, and grep which searches a text stream. We provide the -v flag which means "invert match", making it select the non-matching lines.

cat afile.txt | grep -v "bob"
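To see the pipeline at work, here's a sketch against a made-up afile.txt. It also shows two variations: grep reading the file directly, and a longer chain that counts the surviving lines with wc.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'alice\nbob\ncarol\nbob smith\n' > afile.txt

# The pipeline from the text
piped=$(cat afile.txt | grep -v "bob")

# grep can also read the file itself, which saves a process
direct=$(grep -v "bob" afile.txt)

# Pipelines can keep growing: count the lines that survive the filter
remaining=$(grep -v "bob" afile.txt | wc -l | tr -d ' ')

echo "$piped"
echo "$remaining"
```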

xargs and redirection

As well as pipes, we have two other tools in our chaining arsenal. The first is xargs. xargs allows you to pipe output from one command and use each new line as an input parameter to another command.

For example, to delete everything containing ".git" in the current path:

find . -name "*.git*" | xargs rm -r

This would find everything whose name contains ".git" and output it one item per line. Then each line will be passed in to rm -r, meaning the file is removed.

If you wanted to curate the list instead of blindly passing it into xargs, you could use file redirection to store a temporary list:

find . -name "*.git*" > findgit.txt

This would create a file called "findgit.txt" which you could open, read and edit as you see fit. Once you're done, save it and then you can use it as the input file for rm by using cat and xargs.

cat findgit.txt | xargs rm -r

Running that command would delete everything from the paths specified in the file "findgit.txt"
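One caveat: xargs splits its input on whitespace, so a path containing a space gets broken into pieces. A sketch of a more defensive version, assuming your find and xargs support -print0 and -0 (the GNU and BSD versions both do), looks like this. The directory names are made up.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir -p "project one/.git" "project two/.git" keep
touch "project one/.git/config" keep/readme.txt

# -print0 and -0 separate names with a NUL byte instead of whitespace;
# -prune stops find descending into directories that are about to be deleted
find . -name "*.git*" -prune -print0 | xargs -0 rm -r

left=$(find . -name "*.git*" | wc -l | tr -d ' ')
echo "$left"
```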

Another real world example of where xargs is useful is a find and replace across all files that meet certain criteria. For example, I want to rename the class "User" to be called "Account".

First, we start by finding all files in the current directory that contain "new User":

grep -r "new User" *

Next, we can test our substitution by piping it into sed. sed is a utility that is used for manipulating text. In this instance, we're going to use its string replace functionality.

grep -r "new User" * | sed 's/new User/new Account/g'

We should see the output we saw last time, but instead of it saying "new User" it should say "new Account".

Once we're happy that the replace works, we add a few more arguments in. By passing the -l flag to grep, we say "give me only the filenames", and by passing the -i flag to sed we say "make the change in place". So, our command becomes:

grep -lr "new User" * | xargs sed -i 's/new User/new Account/g'

After running that command, all your files that used to use the "User" class should now use the "Account" class.
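It's worth knowing that sed -i behaves differently between GNU sed and the BSD sed that ships with OSX, which expects a backup suffix. The sketch below uses the -i.bak form, which I believe both accept, and runs the whole thing against some made-up PHP files in a throwaway directory.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '<?php $u = new User();\n' > a.php
printf '<?php $u = new User(); $v = new User();\n' > b.php
printf '<?php echo "no match";\n' > c.php

# -l lists only the matching file names; sed -i.bak edits them in place
# and keeps the original of each file with a .bak suffix
grep -lr "new User" . | xargs sed -i.bak 's/new User/new Account/g'

cat b.php
```

Once you're happy with the result, the .bak files can be removed.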


Sometimes, feeding files blindly into another expression using xargs just isn't good enough. Fortunately, we have the bash programming language to help us do some more complicated things.

We're probably going to jump forward quite a lot here to keep things as short as possible. If I use a command or a flag that you want to know more about, remember that you can type man command to find out more.

Removing empty directories

Bear with me whilst I set the scene here. Imagine that we have a directory that has 10 folders in it. In each of those folders, we have another 10 folders. Then, in each of those we have 10 more. Inside an indeterminate number of those folders, at any level, there may be a file named "foobar.txt".

Now imagine that we want to delete all of the empty directories in there, but we don't know which of them contain "foobar.txt". Instead of checking them all one by one, we can use bash to construct a little program to do it for us.

I'm going to use the concept of subshells here, which are shown using the $(command) syntax. The command provided is run in another shell, and the data returned by the command is passed through to our current process.

So, we start off by finding all of the folders below our current directory and saving that list:

SEARCH_DIRS=$(find . -type d)

Then, we can loop through all of those directories and look at their contents.

for DIR in $SEARCH_DIRS; do
  ls $DIR
done

This will output the contents of every directory in our list. Now, we want to check which of them are empty and remove them. To do this, we use test.

for DIR in $SEARCH_DIRS; do
  if [[ -z "$(ls $DIR)" ]]; then
    rmdir $DIR
  fi
done

[[ expr ]] is bash's built-in, extended version of the test command (the older [ expr ] form is an alias for test itself). -z is a check that looks to see if the string provided is empty. As we're providing the output of ls $DIR as the string, if the folder is empty the string should be too.

Finally, if the test is true (i.e. the folder is empty) we remove the directory.
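Here's the whole thing as one runnable sketch, using a small made-up directory tree. It uses the portable [ ] spelling of test, and ls -A so that hidden files count as content. A single pass of the loop leaves behind directories that only became empty part way through, so the sketch finishes with find's -empty and -delete, which are extensions in GNU and BSD find rather than part of POSIX.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir -p a/b/c d/e f
touch d/e/foobar.txt

# The loop from the text, written out in full. Like the original, it assumes
# that no directory name contains whitespace.
SEARCH_DIRS=$(find . -type d)
for DIR in $SEARCH_DIRS; do
  if [ -z "$(ls -A "$DIR")" ]; then
    rmdir "$DIR"
  fi
done

# One pass leaves a/b behind: it only became empty once a/b/c was removed.
# find can do the whole job, children first, in a single command.
find . -depth -type d -empty -delete

find . -mindepth 1 | sort
```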

Useful flags

As we've seen, you can change the behaviour of most utilities by providing flags when you call them. There are a few useful flags that are common across a lot of utilities. To check what flags are available for a specific command, run man command.

The ones that I've found myself most commonly using are as follows:

  • -i (Case insensitive e.g. grep -i string *)
  • -r/-R (Recursive e.g. ls -R Downloads, grep -r string *)
  • -v/-V (verbose or version, depending on the utility)
  • -h (human readable file sizes, e.g. du -h, ls -lh)

Config options

As well as passing flags to utilities, you can configure a lot of them to have smart defaults. This is done using either a .rc file, or an environment variable.

.rc files

.rc files generally live in your home directory, and change the way that programs work. They're loaded automatically whenever you use a utility that is looking for that config file.

They could be as simple as adding "startup_message off" to .screenrc to disable the startup message, or as complex as a 350 line .vimrc that configures everything from the font size to what happens when you press an arbitrary set of keys.

Environment options

Other utilities are controlled by environment variables. The best example I have is grep, which is controlled by GREP_OPTIONS.

Unlike .rc files, they're not loaded automatically. They need to be added to a file that is loaded each time a terminal is launched such as ".bashrc" (or ".zshrc"/other depending on your shell).

When searching with grep, I usually want to ignore vendor folders when programming. By adding export GREP_OPTIONS='--exclude-dir=node_modules' to my .bashrc file, grep will never search inside a "node_modules" folder unless I run grep as GREP_OPTIONS="" grep -r "string" *, setting GREP_OPTIONS to be empty for this request only.
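Recent releases of GNU grep no longer honour GREP_OPTIONS, so on an up to date system the variable may do nothing. A sketch of an alternative is a small wrapper function in your .bashrc. The name sgrep is made up for this example, as are the files it searches.

```shell
# Hypothetical wrapper name; the function would live in .bashrc or .zshrc
sgrep() {
  grep --exclude-dir=node_modules "$@"
}

demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir -p src node_modules/lib
echo "needle" > src/app.js
echo "needle" > node_modules/lib/index.js

# Only the file outside node_modules is reported
found=$(sgrep -rl "needle" .)
echo "$found"
```

When you do want to search everything, plain grep is still there.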

More utilities

The command line doesn't stop with the built in functions. Every day, people release new utilities and frameworks to make our life easier. Here are a few of my favourites:


GRC is short for the "generic coloriser". It's a utility that allows you to pipe output through it and have it apply colours via user defined stylesheets. The patterns it matches are specified via regular expressions, and it ships with a lot of useful stylesheets.


This one's ZSH specific I'm afraid, but it's too good to leave out.

Again, this utility adds colour to your terminal via user defined rules. Out of the box, it does things like:

  • Highlights strings inside commands
  • Highlights matching brackets
  • If the command you're trying to use doesn't exist, it highlights it in red. Otherwise, it's shown in green.

In addition, I added one of the suggested rules that makes the prompt's background red and turns the text white every time I type the characters rm -rf. That visual indicator is a good reminder to double check the command before I hit enter.


Finally, let's take a look at gxpr, a script that allows us to search Google Calculator and outputs the answer to the terminal. This is a perfect example of why Unix is great. Someone decided that they used Google Calculator enough that they didn't want to use a browser each time, they wanted to do it all from the command line.

So, they set out researching how to do it and glued together a few curl commands with some perl and they had a working solution in less than 20 lines of code.

If you're interested, you can find the script at


Hopefully that wasn't too scary as an introduction to Unix. Next time you need to solve a problem, instead of running to Google to find a GUI to do it, try searching for a command line utility. Each time you learn a new one, it will magically become part of your day to day usage. Before you know it, you'll be spending most of your time on the command line, and you'll wonder how you ever lived without it.

A better git diff

Whitespace is like git diff's kryptonite: it makes tiny changes look much more complicated than they actually are. Thankfully, git comes with a few flags that you can use in conjunction with git diff to make life a bit easier.

The first option is --ignore-space-at-eol. This flag makes git diff ignore any changes to whitespace at the end of a line. Most editors have an option to automatically trim trailing whitespace, but if you're working in a team that doesn't have it enabled you might find this option useful.

  git diff --ignore-space-at-eol

The next flag is -b, which is an alias for --ignore-space-change. This is useful to use when someone goes through and converts tabs to spaces, or something similar. The whitespace hasn't been added or removed, it's just changed size. For most people reading a diff, that's not important (unless you're writing Python, that is).

  git diff -b

The final flag is -w, which is the same as --ignore-all-space. Imagine a line that has no whitespace at the beginning, which we then reindent so that it starts with spaces. git diff -b would still show this change, as whitespace has been added where there was none rather than changed in size. git diff -w will hide the change, as the content of the line hasn't changed other than whitespace.

It's worth noting that even when using -w, the addition or removal of blank lines will still show in git diff. This is because the line didn't exist previously, but now it does.
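To see the difference between the flags without needing a repository, here's a sketch that uses git diff --no-index to compare two made-up files. The second file has the same content as the first, but with one line indented and given a trailing space.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'if (a) {\nreturn b;\n}\n' > before.txt
# Same content, but the middle line is now indented and has a trailing space
printf 'if (a) {\n    return b; \n}\n' > after.txt

# --no-index compares two plain files, so no repository is needed.
# git diff exits non-zero when it finds differences, hence the "|| true".
plain=$(git diff --no-index before.txt after.txt || true)
with_b=$(git diff --no-index -b before.txt after.txt || true)
with_w=$(git diff --no-index -w before.txt after.txt || true)

echo "$with_b"
```

The plain diff and the -b diff both report the middle line, because the indentation was added where there was none before. The -w diff comes back empty.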

The final thing to do is to add this new git diff to your ~/.gitconfig file. I personally like git diff -b as it hides the majority of whitespace changes, but not quite as much as -w which could hide changes in indentation levels etc. To add git wdiff to your available git commands, add the following to your ~/.gitconfig file:

[alias]
    wdiff = !git diff -b

jq - sed for JSON

I can't remember the last time a day went by that I didn't end up working with JSON data. It's a lovely format to work with, but unfortunately due to how verbose it is it can be quite difficult to extract the specific details that you're looking for.

jq is a standalone binary for working with JSON. Just download it, put it in your $PATH and get to work. It has all kinds of options available, but I tend to use it just for filtering data.

My most common use case is this:

  echo '{"foo":"bar","bees":true}' | jq .

jq reads the JSON from stdin and runs the . filter (aka "match everything") against it. This results in a nicely formatted JSON representation for me to read.

Sometimes though, the output I'm using is a bit big to search through by hand. This is where jq filters come in useful. Using the same JSON as last time, I want to see the output of "foo".

  echo '{"foo":"bar","bees":true}' | jq .foo

Things are starting to get a bit more complicated now, but still not so complicated that I couldn't look at it by hand. Let's take a look at what happens when we get arrays of data involved. We want to go through each array item and pull out "foo". So, from our root (.), look at each array item ([]), and pull out "foo" (.foo). This makes our filter .[].foo.

  echo '[{"foo":"bar","bees":true},{"foo":"baz","bees":true},{"foo":"foo","bees":true},{"foo":"bee","bees":true}]' | jq ".[].foo"

That's pretty much the extent of my experience with jq. There's a whole host of advanced features that are explained well in the manual.
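As a small taste of those features, here's a sketch that assumes jq is installed and uses a slightly altered version of the JSON above (one item has "bees" set to false). It shows the -r flag and the select function.

```shell
json='[{"foo":"bar","bees":true},{"foo":"baz","bees":false}]'

# -r prints raw strings rather than JSON-quoted ones
all_foos=$(echo "$json" | jq -r '.[].foo')

# select() keeps only the items where the expression is true
with_bees=$(echo "$json" | jq -r '.[] | select(.bees) | .foo')

echo "$all_foos"
echo "$with_bees"
```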

There's plenty of ways to install jq - for most people it's just a case of downloading the binary. If you're on a Debian based OS it's available in apt, or if you're on OSX it's on brew.

Explain, from your shell

A few weeks ago, almost everyone on Twitter was sharing Explain Shell - with good reason. It's an awesome site that you can copy and paste a command into and it'll explain each component of it to you.

There are a few examples on the homepage, I particularly like tar xzvf archive.tar.gz and ssh -i keyfile -f -N -L host.

Now, whilst you can copy and paste a command you're working on into the site, wouldn't it be awesome if you could trigger it from your command line? Thanks to Schneems, you can! I prefer the shell version, so add the following to your .bashrc (or .zshrc etc) and then next time you want to see what a command does, just add explain to the beginning of it.

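function explain {
  # base url with first command already injected
  # $ explain tar
  #   => https://explainshell.com/explain?cmd=tar
  # NOTE: this URL pattern is an assumption, check the site for its current form
  url="https://explainshell.com/explain?cmd=$1"

  # removes $1 (tar) from arguments ($@)
  shift

  # iterates over remaining args and builds the rest of the url
  for i in "$@"; do
    url="$url+$i"
  done

  # opens url in browser (open is the OSX launcher, xdg-open is the usual Linux equivalent)
  open "$url"
}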

Vim Tips

I've been using vim for a while now, but with Vim being Vim, there's always new things to learn. Thanks to Jacek, I've learned a few more things recently.

S to change a line

Whenever I wanted to change an entire line before, I used to use 0c$ (go to the beginning of the line and change everything to the end). Since I learned about S which does the same thing, I've been trying to use it. Muscle memory's difficult to break though, so I'm still stuck using 0c$, but I'm trying to use S more.

| to go to a column

When working with Google BigQuery, if the data provided did not fit the schema it returned an error with the line and column that it found an error on. I used to use <num>gg to jump to the correct line, and then a combination of w and h/l to get the correct column. As it turns out, you can use <num>| to jump directly to a column. So, to jump to line 29, column 33 we'd use 29gg33|.

Ctrl+f to move down a page

Whilst most of the navigating I do whilst inside a file in vim is done via searching (e.g. /searchterm), sometimes it's useful to scan through a file to get a feeling of how it's structured. Previously, I just kept my finger on j to scroll through the file. ctrl+f is a much more efficient way to page through the file, one screen at a time.

:%! to run an external program

To run the contents of your buffer through an external program, you can use :%!<program>. For example, to reformat the current buffer to wrap at 80 characters, we can use the fmt command line utility. To use it through vim, we type :%!fmt -80. The % is a normal vim range, so you could use :.!fmt -80 to reformat just the current line, or :.,+5!fmt -80 to reformat the current line and the five after it.

Useful git commands

Git's a fantastic tool. If you know what you're doing, you can pretty much do anything that you can dream of.

However, to get 90% of the way there, you only need to know a handful of commands.

git add/commit/push/pull

I'm not going to cover these as you can find information about them everywhere. add/commit/push/pull are the cornerstones of git, so you need to understand them to do anything else.

git checkout

Revert a file to a certain point in time. If you specify a revision (this can be a commit hash, a tag name or a branch name) it will show the file as it looked at that commit.

My main use case however, is when I've made lots of changes to various files (generally whilst debugging) then I want to discard them all. It's as simple as typing:

git checkout .

git reset

Sometimes, I stage a file ready for committing, only to realise that I don't want to commit it with all the others. To remove a file from the index, you can use:

git reset <file>

git add -p

git add -p <file> allows you to step through each block of changed text in the specified file and say yes/no to staging that chunk for commit. This is especially useful when you fix multiple things in the same file that aren't really related and want your commits to be logically separated.

git stash

I regularly make lots of changes in a branch before realising that I want to put that development on hold and work on something else. Instead of committing half finished features, I use git stash. git stash is shorthand for git stash save <description>.

Once you've stashed changes you can do whatever work you need to do before going back to your work in progress branch and using git stash pop to apply the changes to your working tree again.

git stash is also very useful when you're merging another branch into your current one and git won't let you because your uncommitted changes would be overwritten. Run git stash, do the merge, then reapply your changes with git stash pop and resolve any conflicts at that point.
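Here's a sketch of the basic stash and pop cycle in a throwaway repository. The file name and the commit identity are made up, and the identity is set through environment variables so that the example doesn't depend on your global git config.

```shell
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
echo "stable" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Half finished work in progress
echo "half finished" >> app.txt

git stash -q                 # shelve the change
during=$(cat app.txt)        # the file is back to its committed state

git stash pop -q             # bring the change back
after=$(cat app.txt)

echo "$during"
echo "$after"
```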

git format-patch

Finally, patches. Sometimes, the history of a branch gets so messy that you don't want to mess around with git rebase or anything like that. You just want to take the differences between your feature branch and develop and apply them somewhere else in one go. git format-patch is perfect for this.

To generate a patch, use the following command:

git format-patch <branch_to_compare> --stdout > your_changes.patch

Then, checkout your base branch and apply it. I prefer git am for this over git apply as it creates a commit when applying the patch.

git am --signoff < your_changes.patch

Whilst that seems like a trivial example, patches can be used in a lot of situations. For example, we had a feature that was developed against our develop branch, but develop wasn't as stable as we wanted and couldn't be deployed. This feature was a fairly important release, so we generated a patch against develop to see only the data that we changed, then applied it to the master branch as a hotfix and deployed master.
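To make the round trip concrete, here's a sketch in a throwaway repository. The branch name "feature", the file names and the commit identity are all made up, and the base branch is looked up rather than assumed as its default name varies between git setups.

```shell
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"
base_branch=$(git symbolic-ref --short HEAD)

# Do some work on a feature branch
git checkout -q -b feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Every commit on feature that is not on the base branch, in one file
git format-patch "$base_branch" --stdout > your_changes.patch

# Back on the base branch, apply the patch and commit it
git checkout -q "$base_branch"
git am -q --signoff < your_changes.patch

subject=$(git log -1 --pretty=%s)
echo "$subject"
```

Note that git am creates one commit for each commit that went into the patch file, so a feature branch with five commits arrives as five commits.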

ssh-add and SSH Forwarding

Anyone that's ever used SSH before will tell you how awesome it is when working with remote services. If you've never used it before, have a read and then come back to this.

ssh-add is a tool that adds an SSH key to the ssh-agent for your current session. When you try and SSH into a machine this is done transparently for the duration of that connection. However, if you want to use the same key multiple times you can use ssh-add. If you just type ssh-add it will add your default SSH keys (id_rsa, id_dsa and identity) to the session. You can also supply a path to a key file to add that to the session (e.g. ssh-add /path/to/key).

Once your key is added, you'll be able to ssh into any server that uses that key without entering your passphrase again.

Now, the important part of this post. Once you've added your keys to your session, you can forward them onto any server that you connect to. For example, imagine I want to use Github to store all of my side projects, and I want to clone them onto a server somewhere. Normally, I'd either have to add an SSH key to the server, or use a HTTPS clone. Using ssh-add, I don't need to do either.

I should mention that you should only do this if you trust the admin of the machine you're connecting to.

ssh -A

Then, on the remote machine, you'll be able to perform any SSH based actions that your key has permission to do. This could be SCP-ing a file from another server, cloning a git repo or even just SSH-ing into another server.

If you want to do this every time you connect to a server, you can add it to your ~/.ssh/config file.

Host remoteserver
        ForwardAgent yes

A better git log

If you're working in a team with other developers, you probably use git log quite a lot to work out what's going on when you pull down new code. Whilst it works, it's not the greatest view of the data. You can customise this view using the --pretty flag, passing one of the following values to it (e.g. git log --pretty=oneline).

  • oneline
  • short
  • medium
  • full
  • fuller
  • email
  • raw

As well as those values, you can specify your own format using the --pretty=format:'' flag. For a list of available fields, see this useful post by Alex Fish.

Here's a few of the ones I like personally. If you want to use any of them, I'd advise adding them as an alias using:

git config --global alias.alias_name "log [options]"

e.g. To run git lg to see a nice one line summary

git config --global alias.lg "log --graph --decorate --oneline --all"

Useful git-log options

A quick one line summary of each commit:

git log --pretty=oneline --abbrev-commit


41281c5 Better cssh demo video
6f8564f Add post on cssh
5d090c9 Missing /
1938d38 Fix sed article errors
1c6f9f7 Add post about working with large files and sed

This is the one I personally use. I use the --all flag to show commits to remote branches as well as local branches.

git log --graph --decorate --oneline --all


* 5ddf436 Merge in adam's compiled changes + my code padding change
*   b597f3f Merge pull request #2 from sirbrad/styling-fixes
| * 739eed3 Fixes a few things;
* |   7adcaf7 Merge pull request #1 from AdamWhitcroft/gh-pages
|\ \  
| |/  

A colourful, super-informative view including branching history (via

log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%Creset' --abbrev-commit --date=relative

Outputs (with more colours!):

* 5ddf436 - Merge in adam's compiled changes + my code padding change (7 weeks ago)
*   b597f3f - Merge pull request #2 from sirbrad/styling-fixes (8 weeks ago)
| * 739eed3 - Fixes a few things; (8 weeks ago)
* |   7adcaf7 - Merge pull request #1 from AdamWhitcroft/gh-pages (8 weeks ago)
|\ \  
| |/  

Administering multiple ssh shells simultaneously with cssh

cssh is a tool that lets you administer multiple ssh sessions at the same time. Whilst it's not advised for production use, it's a huge timesaver for some common tasks.

The best way to understand what it does is to watch the following video. Although it's recorded on Windows, it works just the same on Linux/OSX.

If you want to install it, you need the clusterssh package.

On Debian:

sudo apt-get install clusterssh

On Redhat:

sudo yum install clusterssh

As well as providing multiple hosts, you can use shell expansion to connect to lots of machines at once:

cssh michaeldev-{01..10}
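Zero-padded ranges like {01..10} need bash 4 or newer (or zsh). If your shell doesn't expand them, a portable sketch is to build the same list of host names with seq and printf and hand that to cssh instead:

```shell
# printf reuses its format string for every argument that seq produces
hosts=$(printf 'michaeldev-%02d ' $(seq 1 10))

# Unquoted on purpose so that the shell splits the list into separate words
echo $hosts

# cssh $hosts
```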

Once all the windows appear, put your mouse over the grey box and start typing. Your characters should appear in every SSH window that cssh opened. It also supports sending control characters (e.g. ctrl+L to clear the screen) and pasting from the system's clipboard.

Working with large files with sed

This week, I've been using sed quite a lot. sed is short for stream editor, and is designed to work with large amounts of streaming text.

I've been working with Google BigQuery, inserting large amounts of data. Unfortunately, some of the data was malformed and was causing errors whilst I was trying to ingest it. Luckily, BigQuery tells you what line number + column the error occurred on.

I initially tried to open the file in vim, but I realised that the text file was probably a bit big. As sed is designed for streaming text, it's the perfect tool for the job.

sed allows you to specify a range of line numbers to apply your command to. You can specify a certain line, or a range of lines.

If we wanted to replace "Dog" with "Cat" on the 3rd line only:

sed -i '3 s/Dog/Cat/' /path/to/file

If we wanted to replace "Dog" with "Cat" on lines 1-100:

sed -i '1,100 s/Dog/Cat/' /path/to/file

By default, sed prints every line. This isn't what I wanted, so I ran all of my commands with the -n flag, meaning "don't print by default". Then, I used the p command to say "print these lines".

Imagine that my error was on line 38224. It'd be pretty hard to get to that line in a text editor, but it's really easy using sed.

sed -n '38224p' /path/to/file

That says "don't print anything by default, but print line 38224".
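As a quick sanity check of line addressing, here's a minimal sketch that runs against a throwaway ten-line file rather than a huge export. The file contents and the variable names (sample, line4, changed) are made up for illustration, and nothing is edited in place.

```shell
# Build a small throwaway file so the line numbers are easy to verify.
sample=$(mktemp)
i=1
while [ "$i" -le 10 ]; do
  echo "row $i: Dog"
  i=$((i + 1))
done > "$sample"

# -n suppresses the default output, p prints only the addressed line.
line4=$(sed -n '4p' "$sample")
echo "$line4"

# Substitute on line 3 only, then print lines 2 to 4 of the result.
changed=$(sed '3 s/Dog/Cat/' "$sample" | sed -n '2,4p')
echo "$changed"

rm -f "$sample"
```

Only the third row should come back with "Cat" in it; the rows either side are left alone.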

If you wanted to take lines 30129 to 33982 and work with them independently, you can use sed's w command to write them out to another file. The w command takes the output filename as its argument:

sed -n '30129,33982w /path/to/output' /path/to/file

Alternatively, swap w for p and pipe the output of sed into xclip or pbcopy to get only the lines you're interested in copied onto your clipboard.
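Here's a small sketch of both ways of pulling out a range: writing it to a file with w (which expects a filename after it) and printing it with p. The five-line input and the names src, out, written and printed are invented for the example.

```shell
src=$(mktemp)
out=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$src"

# Everything after "w " up to the end of the sed script is taken as the filename.
sed -n "2,4w $out" "$src"
written=$(cat "$out")

# The same lines on stdout instead, ready to pipe into xclip or pbcopy.
printed=$(sed -n '2,4p' "$src")

echo "$written"
rm -f "$src" "$out"
```

Both variables should end up holding the same three lines.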

sed is an old tool, but it still has its uses. If you ever need to work with a subset of a large amount of text data, you could do a lot worse than sed.

Substitution with sed

The most common use case that most people have for sed is text substitution. As it's a stream editor, by default it looks at one line of input at a time.

So for example, given the following input:

the number one is number one
then comes number two
and number three comes third

Running the following command would replace "number" with "NUMBER".

sed 's/number/NUMBER/'

Just like perl, substitution only affects the first match - you need to be explicit to make it replace all occurrences. In this case, that's by adding the g flag.

sed 's/number/NUMBER/g'

So, if we run:

printf 'the number one is number one\nthen comes number two\nand number three comes third\n' | sed 's/number/NUMBER/g'

We should see:

the NUMBER one is NUMBER one
then comes NUMBER two
and NUMBER three comes third
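To see the difference the g flag makes, this short sketch runs both forms over the first line of the sample input. The variable names are just for illustration.

```shell
input='the number one is number one'

# Without g, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$input" | sed 's/number/NUMBER/')

# With g, every match on the line is replaced.
every=$(printf '%s\n' "$input" | sed 's/number/NUMBER/g')

echo "$first_only"
echo "$every"
```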

Applying this to an existing set of text can be done by feeding filenames into sed and using the -i flag, saying "change in place".

So for example, to replace all occurrences of the word "Dog" with the word "Cat":

find . -type f -print0 | xargs -0 sed -i 's/Dog/Cat/g'
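Since -i rewrites files, it can be worth trying the command on a scratch directory first. This sketch assumes a sed that accepts a backup suffix attached to -i (GNU and BSD sed both do); the directory layout and file names are invented for the example.

```shell
work=$(mktemp -d)
printf 'Dog one\nDog and Dog\n' > "$work/a.txt"
printf 'no match here\n' > "$work/b.txt"

# -i.bak edits each file in place and keeps the original as <name>.bak.
find "$work" -type f -name '*.txt' -exec sed -i.bak 's/Dog/Cat/g' {} +

edited=$(cat "$work/a.txt")
backup=$(cat "$work/a.txt.bak")
echo "$edited"

rm -rf "$work"
```

Keeping the .bak copies around gives you something to diff against before you commit to the change.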


When working with git, there are quite a few different settings that you can set to change your experience. A lot of people know about the defaults of "user.name" and "user.email" for identification, but there's loads more that you can set up.

Below is a (mostly complete) copy of my ~/.gitconfig file. As well as identifying myself, I've added a few useful aliases, added colouring and configured which utilities to use for common actions.

My favourite option is probably "help.autocorrect". Without it, git suggests a command close to the one you typed, but makes you type it yourself. With this enabled, it just runs it:

$ git stats
WARNING: You called a Git command named 'stats', which does not exist.
Continuing under the assumption that you meant 'status'
in 0.1 seconds automatically...

As well as that, I force any repositories that I own to be checked out as read/write, rather than read only. It's something I don't do much, but I don't even have to worry about it now.

# Any GitHub repo with my username should be checked out r/w by default
[url ""]
insteadOf = "git://"

When looking at the alias section, you might notice that some of the commands start with a "!". This means "run the following as a shell command". As it can be anything that can be executed, you can even provide a path to a script to run. In my .gitconfig, I run a script from ~/.dotfiles/bin, using the $ZSH variable from the environment as the base path for the script. Anything you can do on the CLI, you can do in your .gitconfig.

wtf = !$ZSH/bin/git-wtf

You can find all of the git scripts in my dotfiles repository. They're the ones that start with git-.
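If you want to try a "!" alias without touching your real config, one option is git's -c flag, which sets a value for a single invocation. This is only a sketch: the alias names hello and say are made up, and it quietly does nothing if git isn't installed.

```shell
# -c sets a config value for this one invocation only,
# so nothing is written to ~/.gitconfig.
if command -v git >/dev/null 2>&1; then
  # The leading "!" hands the rest of the alias to the shell.
  alias_out=$(git -c alias.hello='!echo "run by the shell"' hello)

  # Extra arguments are appended to the shell command.
  args_out=$(git -c alias.say='!echo' say hi there)

  echo "$alias_out"
  echo "$args_out"
fi
```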

Full .gitconfig

# This is me
[user]
name = Michael Heap
email =

# Set up some useful aliases
[alias]
co = checkout
ci = commit
sl = !git shortlog -sn
lg = !git log --graph --pretty=oneline --abbrev-commit --decorate
wtf = !$ZSH/bin/git-wtf

# Add colour to ALL THE THINGS
[color]
diff = auto
status = auto
branch = auto
ui = true

[core]
# Set up a global excludes file
excludesfile = ~/.gitignore
# Which apps do I want to use with git?
editor = vim
pager = less -r

# Don't warn about whitespace conflicts when applying a patch
[apply]
whitespace = nowarn

# Autocorrect anything I typo
[help]
autocorrect = 1

# Any GitHub repo with my username should be checked out r/w by default
[url ""]
insteadOf = "git://"


For completeness, here's my personal ~/.gitignore file too


Mirror a directory with SCP

Another short one today, courtesy of Chris H.

When working in a directory, if you need to copy everything from the current directory to an identical path on another machine, you can do it as follows (user@remote-host is a placeholder for your own login and machine):

scp -r ./* user@remote-host:$PWD

This is also possible with rsync, but as the SCP syntax is much simpler it's a nice one liner to know.


I'm not too sure where this one comes from, but I found it whilst working with Lorenzo at work. extract is a general purpose tool for uncompressing archives. No longer will you have to remember that it's tar xzf to extract file.tar.gz, or just gunzip for file.gz, whilst it's tar xjf for file.tar.bz2. Just type extract <file> and it'll take care of the rest.

extract () {
  if [ -f "$1" ] ; then
    # Pick the right tool based on the file extension
    case "$1" in
      *.tar.bz2) tar xjf "$1" ;;
      *.tar.gz) tar xzf "$1" ;;
      *.bz2) bunzip2 "$1" ;;
      *.rar) unrar e "$1" ;;
      *.gz) gunzip "$1" ;;
      *.tar) tar xf "$1" ;;
      *.tbz2) tar xjf "$1" ;;
      *.tgz) tar xzf "$1" ;;
      *.zip) unzip "$1" ;;
      *.Z) uncompress "$1" ;;
      *.7z) 7z x "$1" ;;
      *) echo "'$1' cannot be extracted via extract()" ;;
    esac
  else
    echo "'$1' is not a valid file"
  fi
}
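One detail worth knowing about the case statement is that the first matching pattern wins, which is why *.tar.gz has to appear before *.gz. Here's a cut-down sketch of that behaviour; kind_of is a hypothetical helper that only names the archive type and doesn't extract anything.

```shell
kind_of() {
  case "$1" in
    # More specific patterns go first, because case stops at the first match.
    *.tar.gz) echo "gzipped tarball" ;;
    *.gz) echo "gzip file" ;;
    *) echo "unknown" ;;
  esac
}

kind_of backup.tar.gz
kind_of notes.gz
kind_of photo.png
```

If the two gzip patterns were swapped, backup.tar.gz would be reported as a plain gzip file.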

pv | mysql

As a developer, I spend a lot of time importing backups generated with mysqldump into local mysql instances for testing. Most of the time they're small enough that it's instant, but sometimes it takes quite a while if there's tables with millions of rows in there.

This is how I always used to import databases. Open up a mysql connection and stream "backup.sql" into it using input redirection.

vagrant@precise64:~$ mysql -u root db_name < backup.sql

Did you know that you can also send data to mysql through a pipe? This also works:

vagrant@precise64:~$ cat backup.sql | mysql -u root db_name

This brings us on to pv, a utility that allows us to monitor the progress of data through a pipe. You can place it anywhere in a pipeline and see information such as time elapsed, percentage completed (with progress bar), current throughput rate, total data transferred and ETA.

pv also has a useful function that will copy each supplied file in turn to standard output. This essentially turns it into a replacement for cat that also provides a progress bar.

vagrant@precise64:~$ pv backup.sql | mysql -u root db_name
128kB 0:00:02 [43.5kB/s] [==============>          ] 58% ETA 0:00:01
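If you don't have a database handy, you can convince yourself that the three approaches feed the same bytes to the consumer by swapping mysql for wc -c. This is a rough sketch: the tiny dump file is invented, wc -c is only a stand-in for mysql, and pv is used only if it's installed (with -q to silence the progress bar).

```shell
dump=$(mktemp)
printf 'CREATE TABLE t (id INT);\nINSERT INTO t VALUES (1);\n' > "$dump"

# Input redirection, as in the first mysql example.
via_redirect=$(wc -c < "$dump")

# A pipe from cat, as in the second example.
via_pipe=$(cat "$dump" | wc -c)

# pv in place of cat; fall back to the cat result if pv isn't available.
if command -v pv >/dev/null 2>&1; then
  via_pv=$(pv -q "$dump" | wc -c)
else
  via_pv=$via_pipe
fi

echo "$via_redirect $via_pipe $via_pv"
rm -f "$dump"
```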

There are many more examples for pv available on the man page; just run man pv to see them.

Weekly Breakdown - gxpr

Last week, we discovered gxpr. Using a script to make life easier is good, but understanding why it works is even better. Let's do a line by line breakdown of gxpr and try to understand how it works.

First, we add a shebang to give the shell a clue which interpreter to run it through.


Then we set up a few variables. The first is the curl command that we want to use. The second and third are URLs that we're going to pass in as parameters later on.

CURL='curl -s --header User-Agent:gxpr/1.0'

We want to grab all of the arguments passed in and use them as one string as the parameter to Google Calculator. This line's pretty complicated, so let's break it down.

EXPR=$(echo "$@" | perl -MURI::Escape -ne 'chomp;print uri_escape($_)')

We start by executing the code in a sub shell so that we don't change anything in our current shell session. We know this because the command is surrounded by $(). The data returned from this expression is assigned to EXPR for use later.


In a script $@ means all arguments that were passed in, so we echo them out (surrounded by quotes) and use that as the input for the perl command, via the pipe | character.

... echo "$@" | perl ...
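To see what echo "$@" hands to the next command, here's a tiny sketch using a made-up function, show_args, in place of the real script. Functions receive their arguments in "$@" the same way scripts do.

```shell
show_args() {
  # "$@" expands to each argument as its own word; echo joins them with spaces.
  echo "$@"
}

# $( ) runs the command and captures whatever it prints.
joined=$(show_args 3984588 bytes in mb)
echo "$joined"
```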

We call perl, passing the -M option to load the "URI::Escape" module and the -n and -e flags.

The -e flag allows you to specify the code to run as an argument, rather than passing in a filename.

The -n flag creates an implicit loop, meaning that the code you provide will run for as long as there is input, ensuring that all input is captured.

The script that we pass in uses chomp to strip the trailing newline off the end of the request, then escapes the input so that it is safe to put into a URL. The $_ is a special variable that, in this case, is implicitly assigned each line that came in via stdin.

... perl -MURI::Escape -ne 'chomp;print uri_escape($_)') ...

The next thing to do is to make the curl request and try and get our output. Again, there's quite a lot going on so let's break it down line by line.

res=$(
  $CURL "$GOOGLE?q=$EXPR" |
  perl -ne '/rhs: "?([^\[\],":]+)/ and print $1' |
  perl -pe 's/[^\x00-\x7F]//g'
)

Execute it in a sub shell so that we don't affect our current session


Make the curl request using the variables we defined earlier.

... $CURL "$GOOGLE?q=$EXPR" ...

This expands to:

... curl -s --header User-Agent:gxpr/1.0 "$EXPR" ...

Google calculator returns JSON output, so if we search for "1+1" we get the following output:

{lhs: "1 + 1",rhs: "2",error: "",icc: false}

We want to extract the "rhs" section, which we can do with the following regular expression. The regex reads as follows: "Find 'rhs: ', then an optional quotation mark. Next, capture everything until we hit one of the following characters: [ ] , " :. If the pattern matched, print out what we captured."

...perl -ne '/rhs: "?([^\[\],":]+)/ and print $1' ...

Once we have that return value, strip any non-ascii characters from it

...  perl -pe 's/[^\x00-\x7F]//g' ...

Now that we've grabbed whatever Google gave back to us, it's time to make sure that we actually got a return value.

test -z "$res" && {
    echo "google doesn't know" "$@" 1>&2
    echo "⌘ click: \033[4m$WOLFRAM?i=$EXPR\033[0m"
    exit 1
}

test is a utility that returns true or false. There are loads of options, which you can see by running man test. There is also a more commonly used alternative syntax, [ -z "$res" ] (or [[ -z "$res" ]] in bash and zsh). The -z flag means "evaluate to true if the string provided is empty".

... test -z "$res" ...

The {} is a cool trick that I didn't know about before reading the gxpr source. It means "execute all of this code as one block", meaning that if the test fails, none of it will execute.

test -z "$res" && {

If we're in this block, output that google didn't have an answer, then the original arguments. The 1>&2 means redirect this command's stdout to stderr.

echo "google doesn't know" "$@" 1>&2

We want to output a link to Wolfram Alpha, and we want the link to stand out. We use control characters (\033[4m to start the underline, \033[0m to end it) to make sure that it's underlined.

echo "⌘ click: \033[4m$WOLFRAM?i=$EXPR\033[0m"
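One thing to be aware of is that echo's handling of backslash escapes like \033 differs between shells: bash's builtin echo leaves them alone unless you pass -e, while zsh and dash interpret them. A sketch of a more predictable alternative is to use printf, which always interprets escapes in its format string. The link text here is a placeholder.

```shell
link="example link text"

# \033[4m switches underline on, \033[0m resets all attributes.
underlined=$(printf '\033[4m%s\033[0m' "$link")
printf '%s\n' "$underlined"

# Strip the escape sequences again to confirm only the text sits between them.
plain=$(printf '%s' "$underlined" | sed "s/$(printf '\033')\[[0-9;]*m//g")
echo "$plain"
```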

Next, we kill the script. We exit with an error code of 1 to show that there was an error executing the request.

exit 1

And finally if we didn't make it into the block that executes when there's no result, we echo the result onto the screen.

echo "$res"
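The test, the grouped block and the stderr redirect can be tried together without any network access. In this sketch lookup is a hypothetical function standing in for the end of the script, and it uses return rather than exit so that it doesn't close your shell.

```shell
lookup() {
  res=$1
  # If res is empty, run the whole group: complain on stderr and bail out.
  test -z "$res" && {
    echo "no answer for this query" 1>&2
    return 1
  }
  # Only reached when res is non-empty.
  echo "$res"
}

lookup "42"
lookup "" 2>/dev/null || echo "lookup failed with status $?"
```

Because the complaint goes to stderr, anything capturing the function's stdout only ever sees real answers.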

That's all: a line by line breakdown of how gxpr works. If you have any corrections or additions, I'd love to hear them in the comments :)

Bang Bang

Two quick terminal input manipulation commands today. The first is bangbang (!!). !! is a shortcut that means "substitute this with the last command run". It's most commonly used when you forget to sudo a command.

vagrant@precise64:~$ apt-get install tmux
E: Could not open lock file /var/lib/dpkg/lock - 
open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/),
are you root?

vagrant@precise64:~$ sudo !!
sudo apt-get install tmux
[sudo] password for vagrant:

If you want to make sure the command is right before executing it, you can add :p to the end to say "print this" instead of "execute this".

vagrant@precise64:~$ sudo !!:p
sudo apt-get install tmux

Then you can hit the up arrow and enter to run the command.

If you want a bit more versatility, you can use ^caret^substitution. This means "find this first string in the last command, replace it with the second one and run the command again".

vagrant@precise64:~$ sudo apt-get install tumx
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package tumx

vagrant@precise64:~$ ^tumx^tmux
sudo apt-get install tmux
Reading package lists... Done


Today's utility is a little bash script that searches Google and returns the value that Google Calculator supplies. If Google Calculator can't work out the answer, it delegates and gives you a link to Wolfram Alpha.


Whilst the script is called gxpr, I've simply called it g as it's easier to type.

# It can do simple maths
vagrant:~$ g 1+1
2

# And it can do conversions
vagrant:~$ g 3984588 bytes in mb
3.79999924 megabytes

# It can even answer questions
vagrant:~$ g speed of light in mph
670616629 mph

# But sometimes it doesn't know, and asks Wolfram
# When this happens, control/option-click the link to open it
vagrant:~$ g 1368035948 unixtime
"google doesn't know 1368035948 unixtime"
⌘ click:

The script

If you want to use g, create a file called "g" somewhere in your $PATH, copy the following script into it and run chmod a+x on it to make it executable.

CURL='curl -s --header User-Agent:gxpr/1.0'
EXPR=$(echo "$@" | perl -MURI::Escape -ne 'chomp;print uri_escape($_)')

# ask Google Calculator and keep only the ASCII part of the answer
res=$(
  $CURL "$GOOGLE?q=$EXPR" |
  perl -ne '/rhs: "?([^\[\],":]+)/ and print $1' |
  perl -pe 's/[^\x00-\x7F]//g'
)

# if we don't have a result, try wolfram alpha
test -z "$res" && {
    echo "google doesn't know" "$@" 1>&2
    echo "⌘ click: \033[4m$WOLFRAM?i=$EXPR\033[0m"
    exit 1
}

echo "$res"


gxpr on Github via pengwynn


z is a utility that allows you to jump around your machine very quickly. It allows you to change directory by specifying a partial path and it will choose the most relevant one based on your directory history automatically. It uses an algorithm that's based on "frecency". Directories are ranked by how frequently you visit them, and how recently the last visit was.

It's a bit tough to explain, so here's an example. I spend most of my time working in three directories:

  • /var/www/
  • /var/www/
  • ~/testing/storyplayer/main/src/

Instead of typing cd /var/www/ every time I want to work on "", I just need to cd into it once, and it will add it to my list of visited paths in z. The next time I want to work in that folder, I can just type z example and it will jump to the correct path.

Where it gets tricky is when I mean to go to /var/www/, but z thinks I want to go to the application folder. Fortunately, z lets you specify multiple search values, so if I type z example and it takes me to the application folder, I can just type z example test to jump to the tests folder.

If you try and move to a directory that doesn't exist, then z just won't do anything.

# We want to work on ""
vagrant:~$ z example

# We didn't mean the application (even though we normally do)
# So let's change to the tests directory
vagrant:/var/www/$ z example test

# You don't have to specify a full word, just the shortest 
# string that will match will do
vagrant:/var/www/$ z storypl

# Let's try and change to a directory that we've never been to
vagrant:~/testing/storyplayer/main/src$ z foobar

# Nothing happened, we're still in Storyplayer
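To get a feel for how "frecency" can favour a directory you visited recently over one you used a lot long ago, here's a toy scoring function. It is not necessarily z's exact formula: the function name, the time thresholds and the multipliers are all assumptions chosen for illustration.

```shell
# frecency VISITS AGE_IN_SECONDS
# Scales the visit count up for recent directories and down for stale ones.
frecency() {
  visits=$1
  age=$2
  if [ "$age" -lt 3600 ]; then
    echo $((visits * 4))      # visited within the last hour
  elif [ "$age" -lt 86400 ]; then
    echo $((visits * 2))      # visited within the last day
  elif [ "$age" -lt 604800 ]; then
    echo $((visits / 2))      # visited within the last week
  else
    echo $((visits / 4))      # older than a week
  fi
}

# 10 visits, last one a minute ago
frecency 10 60
# 40 visits, last one well over a week ago
frecency 40 1000000
```

With these made-up weights the directory with 10 recent visits scores higher than the one with 40 old ones.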


z on Github

cd -

A short one today. In keeping with the moving around theme from yesterday, I want to take a look at cd -. cd - means "take me back to the last directory I was in".

So, starting in ~/, and running cd /etc would take us to /etc. Running cd - would take us back to ~/ and running cd - again would take us back to /etc.

vagrant@precise64:~/$ cd /etc

vagrant@precise64:/etc$ cd -

vagrant@precise64:~/$ cd -
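The same back-and-forth can be checked in a script. This sketch starts from whatever directory you happen to be in and uses / as the second directory; cd - prints the directory it switches to, so the output is sent to /dev/null here.

```shell
start=$PWD

cd /
# Back to where we started
cd - >/dev/null
first=$PWD

# And back to / again
cd - >/dev/null
second=$PWD

echo "$first"
echo "$second"

# Leave the shell where we found it
cd "$start"
```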



CDPATH is an environment variable that changes how the cd command works. Normally when you try and cd into a directory, it will only try and change to the path specified from the current directory. If you spend a lot of time working in specific folders, it might be useful to assume that your search path for cd starts from one of those folders.

For example, I spend a lot of time working in /var/www. Instead of typing cd /var/www/ all the time, I want to be able to just type cd and have it know what I mean. Here's how to set that up:

export CDPATH=.:~:/var/www

Now, you can just type cd and it will look for ./ then /var/www/, stopping whenever it finds a match or runs out of search locations.
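If you'd like to try CDPATH without touching your real directories, this sketch builds a scratch tree and searches it. The directory name cdpath-demo-project is invented (and assumed not to exist in your current directory), and CDPATH is unset again at the end.

```shell
base=$(mktemp -d)
mkdir -p "$base/www/cdpath-demo-project"
start=$PWD

# Search the current directory first, then the scratch "www" directory.
CDPATH=".:$base/www"

# cd prints the resolved path when it finds a match via CDPATH, so silence it.
cd cdpath-demo-project >/dev/null
landed=$PWD
echo "$landed"

# Clean up and go back to where we started
cd "$start"
unset CDPATH
rm -rf "$base"
```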

You can have as many base search paths as you like. Here's one that includes /etc as a search path too.

export CDPATH=.:~:/var/www:/etc

vagrant@precise64:~$ pwd

vagrant@precise64:~$ cd

vagrant@precise64:/var/www/$ cd mysql

vagrant@precise64:/etc/mysql$ cd Downloads