This repository was archived by the owner on Feb 26, 2025. It is now read-only.

1. Unix Stuff

mihinduk edited this page Jul 27, 2021 · 22 revisions

Introduction

This is not meant to be an extensive introduction to Unix. Those can be found across the internet (e.g. on evomics). Instead, this is a distilled down reference for useful Unix commands (and a few other command-line tools) that are used extensively in our lab.

Other resources

There are many, many Unix tutorials available on the web. This is not meant to replace those, but just to be a 'crash course' on commonly used commands in the lab. Here are some other tutorials/resources found around the web that may be useful for extended learning:

An incredibly useful list of bioinformatics one-liners: bash one liners

If you are interested in a nicer replacement for the basic terminal supplied with Mac OS, then I would recommend trying out iterm. It has a lot of theming and windowing functions that make working in the terminal a little more tolerable.

Topics

Best practices

Files

  • Unless absolutely necessary do not use special characters or spaces in your file or directory names. For example, do not call a file my file.tsv instead call it my_file.tsv. Special characters such as @, # or ! will cause way more headaches than they are worth.
  • All lowercase all the time. cAmEl cAsE is for camels, and mixing capital and lowercase letters will cause more confusion than it is worth. Just stick to all lowercase.
  • Unix is case-sensitive so My_file.tsv will be viewed as a different file than my_file.tsv. Again, just stick with the same case as you work (all lower) to make things easier on yourself.
  • Keep filenames as short and descriptive as possible. This is a bit vague, and sometimes not possible but many tasks will be easier if you have concise filenames.
  • Hidden files. You can hide a file on Unix by preceding it with a dot (.). For example, .my_file.tsv would not be shown with ls unless you asked ls to with the -a flag.
  • File extensions such as .txt, .tsv and .csv are meaningless on Unix. However, they are useful for understanding what type of file you are working with, and more importantly for when you transfer those files to a Mac or PC, which does use file extensions. I would use them liberally to identify tables (.tsv, .csv) or text files (.txt).
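The hidden-file behavior described above is easy to demonstrate. A minimal sketch (the file and directory names are our own invention):

```shell
# Create one visible and one hidden file in a demo directory.
mkdir -p hidden_demo
touch hidden_demo/my_file.tsv hidden_demo/.my_file.tsv

ls hidden_demo      # shows only my_file.tsv
ls -a hidden_demo   # also shows ., .. and .my_file.tsv
```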

Basic bash

Bash is a Unix shell and command language. People frequently use the terms bash, command line, Unix and Linux (incorrectly) interchangeably. Most of the time someone uses one of these terms they are referring to running a command typed into a terminal. This can be on your local computer, or logged into a server. There are lots of things you can do from the command line more efficiently than you can from graphical user interfaces (GUIs).

Some of the more valuable basic commands for work in our lab include head, tail, cat, cut and sort.

Why are these commands so useful? Much of the data we are interested in lives in tables, and what we most often want to do is take quick looks at those tables (head/tail), select specific columns (cut) or combine tables together (cat). There are various ways to do this, but you can get a long way if you master these basic commands.

As an example, consider the following table:

sampleID age status treatment
001 42 diseased none
002 46 diseased antibiotics
003 32 healthy none
004 40 diseased none
005 28 healthy antibiotics

This is a simple table with 1) a header, 2) 5 data rows, and 3) 4 data columns. More often than not your data tables are going to contain many more rows and columns than this. What if you just wanted to take a quick look at the beginning or end of the file? That is where head and tail come into play.
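If you want to follow along with the commands below, here is one way to recreate the sample table as a file (tab-separated columns are our assumption; the original could equally be space-separated):

```shell
# Recreate the sample table as a tab-separated file, table.tsv.
printf 'sampleID\tage\tstatus\ttreatment\n' > table.tsv
printf '001\t42\tdiseased\tnone\n'          >> table.tsv
printf '002\t46\tdiseased\tantibiotics\n'   >> table.tsv
printf '003\t32\thealthy\tnone\n'           >> table.tsv
printf '004\t40\tdiseased\tnone\n'          >> table.tsv
printf '005\t28\thealthy\tantibiotics\n'    >> table.tsv
```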

head -n 2 table.tsv

This tells the head command to display just the first two (-n 2) rows of the table, so you would see this:

sampleID age status treatment
001 42 diseased none

Similarly, the tail command can show you the end/bottom of a file:

tail -n 2 table.tsv

Resulting in:

004 40 diseased none
005 28 healthy antibiotics

This may seem trivial given this example, but when you are working with files with thousands of rows it is incredibly useful. Try opening a 50,000-row file in Word and scrolling down to lines 49,999 - 50,000 and you will immediately see the utility of the head and tail commands.

An underappreciated use of the tail command is, ironically, removing the header of a table (or the first several lines, if you wish). You may want to remove and replace headers in large files. Opening them in a text editor can be challenging, so this simple command will do it without having to load everything into memory:

tail -n+2 table.tsv

This would result in:

001 42 diseased none
002 46 diseased antibiotics
003 32 healthy none
004 40 diseased none
005 28 healthy antibiotics

You can increase the value in -n+2 if you want to remove additional rows.
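Jumping ahead slightly to the pipe (|) introduced in the More advanced Bash section, head and tail also combine nicely to pull out a specific slice of rows. A self-contained sketch, which recreates the sample table first:

```shell
# Recreate the sample table so the example stands on its own.
printf 'sampleID\tage\tstatus\ttreatment\n001\t42\tdiseased\tnone\n002\t46\tdiseased\tantibiotics\n003\t32\thealthy\tnone\n004\t40\tdiseased\tnone\n005\t28\thealthy\tantibiotics\n' > table.tsv

# Take the first 5 lines, then keep just the last 2 of those:
# data rows 3 and 4 (samples 003 and 004).
head -n 5 table.tsv | tail -n 2
```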

We will discuss ways to add new column headers after you remove them in the More advanced Bash section below.

The sort command does pretty much what you think it would, but there are some nuances to its use. Let's say we wanted to sort by the status column in our current table:

sort -k3 table.tsv

Here we used the -k flag to specify the 3rd column (status) which will result in the following:

sampleID age status treatment
001 42 diseased none
002 46 diseased antibiotics
004 40 diseased none
003 32 healthy none
005 28 healthy antibiotics

If instead we wanted to sort by age, we need to tell sort to compare numbers rather than text strings:

sort -k2 -n table.tsv
sampleID age status treatment
005 28 healthy antibiotics
003 32 healthy none
004 40 diseased none
001 42 diseased none
002 46 diseased antibiotics

Another useful sort flag is -r, which sorts in reverse order:

sort -k2 -n -r table.tsv
sampleID age status treatment
002 46 diseased antibiotics
001 42 diseased none
004 40 diseased none
003 32 healthy none
005 28 healthy antibiotics

There can be some trickiness to the sort command. In particular, sort treats the header like any other row, so it will not necessarily stay at the top of your output. I would recommend always double-checking your results from a sort and not just trusting that it worked as you expected.
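Because sort treats the header row like any other line, a common pattern (a sketch, not the only way) is to print the header separately and sort only the data rows. Self-contained, using the sample table:

```shell
# Recreate the sample table.
printf 'sampleID\tage\tstatus\ttreatment\n001\t42\tdiseased\tnone\n002\t46\tdiseased\tantibiotics\n003\t32\thealthy\tnone\n004\t40\tdiseased\tnone\n005\t28\thealthy\tantibiotics\n' > table.tsv

# Keep the header on top, then sort only the data rows by age (column 2, numeric):
head -n 1 table.tsv
tail -n+2 table.tsv | sort -k2 -n
```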

More advanced Bash

Pipes

Piping is one of the simplest yet most powerful features of working on the command line. A simple way to think of how piping works in Unix is to literally think of a pipe used in plumbing: instead of redirecting water to specific destinations, you are redirecting output from one command to another command (or to a file).

Piping is done with the pipe character '|', which on most US keyboards is entered with Shift+\ (some keyboards differ; just look for the | symbol).

The basic syntax for piping commands is

command_1 | command_2 | command_3

In that scenario output from command_1 would be automatically provided as input for command_2 and output from command_2 would be provided as input for command_3.

As with most Unix commands, the output is automatically sent to the screen unless you indicate otherwise. A simple way to think about this is with the ls command. When you type ls in a directory it just lists the files on the screen. But what if you wanted to do something else with that file list?

A common example is taking a quick look at the files in a directory containing lots of them. Sometimes you will be in a directory with hundreds of files, and just dumping them all to the screen is next to useless. A quick way to view just some of the files is to pipe ls into head:

ls -l | head -n 5

In that example we also used the -l flag to list the files in long format rather than the wide format produced by ls alone. We piped the output to the head command instead of the screen, and told head to display only the first 5 rows using -n 5.

However, you will see that the final output destination is still the screen. If you want to redirect the output to a file, use the redirection operator > at the end of your command. For example:

ls -l | head -n 5 > top5_files.txt

If you issue this command you will not see anything on the screen. Instead you will create a new file called top5_files.txt.

Piping can become very complex and we will see some further examples of that in the sections below.
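As a small taste, the commands from the Basic bash section chain together naturally. This self-contained sketch recreates the sample table and finds the youngest participant:

```shell
# Recreate the sample table.
printf 'sampleID\tage\tstatus\ttreatment\n001\t42\tdiseased\tnone\n002\t46\tdiseased\tantibiotics\n003\t32\thealthy\tnone\n004\t40\tdiseased\tnone\n005\t28\thealthy\tantibiotics\n' > table.tsv

# Drop the header, sort numerically by age, keep the first (smallest) row:
tail -n+2 table.tsv | sort -k2 -n | head -n 1
```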

grep

grep is a command line utility that searches plain text files using regular expressions. Regular expressions can get very complicated very quickly, but the majority of the time you will just want to search for a literal string of characters. For example, if you want to find the pattern ATAC, then your regular expression is just ATAC. Nothing more special than that. Becoming familiar with the ways to build more complicated regular expressions (match only at the start of a line, case-insensitively, all numbers, etc.) will be extremely useful, but for this document we will just look at the simplest use cases.

Let's say you have a table of taxonomy output that has a mixture of viral and bacterial taxonomy called taxonomy.txt.

| Kingdom | Family |
| Bacteria | Prevotellaceae |
| Bacteria | Enterobacteriaceae |
| Virus | Myoviridae |
| Virus | Podoviridae |
| Virus | Adenoviridae |

What if you wanted to just extract the lines for taxa from the Bacteria? This is a simple task for grep

grep "Bacteria" taxonomy.txt

This would result in the following

| Bacteria | Prevotellaceae |
| Bacteria | Enterobacteriaceae |

You will notice that the header (Kingdom, Family) was dropped because it did not contain the regular expression "Bacteria". We will learn how to easily add headers back using sed below.

Similarly, you could extract the Virus lines using:

grep "Virus" taxonomy.txt

As we learned in the last section, the output will just go to the screen. If you wanted to save this to a file you would do the following:

grep "Virus" taxonomy.txt > viruses.txt

There are many, many uses of grep beyond what is shown here. For example, we could also have extracted the Virus lines using the -v flag built into grep, which selects the lines that do not match your regular expression. So this command:

grep -v "Bacteria" taxonomy.txt

will give you nearly the same results as:

grep "Virus" taxonomy.txt

with one difference: -v also keeps the header line, since "Bacteria" does not appear in it either.
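A quick way to see the difference between the two approaches is to compare the lines each returns. A self-contained sketch (which recreates taxonomy.txt with tab-separated columns, an assumption about its format):

```shell
# Recreate taxonomy.txt (tab-separated columns are our assumption).
printf 'Kingdom\tFamily\nBacteria\tPrevotellaceae\nBacteria\tEnterobacteriaceae\nVirus\tMyoviridae\nVirus\tPodoviridae\nVirus\tAdenoviridae\n' > taxonomy.txt

grep "Virus" taxonomy.txt        # 3 lines: only the viral taxa
grep -v "Bacteria" taxonomy.txt  # 4 lines: the viral taxa plus the header
```

The extra line from -v is the header, which does not contain "Bacteria" either.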

awk

AWK is a programming language capable of many, many things, particularly when it comes to parsing large tabular text files. As in the grep section above, we will focus only on the simplest and most common use, but we encourage you to take your own deep dive into what awk can do.

awk, like grep, is useful for data extraction. One of the primary distinctions from grep is that it takes in a line of text and parses it based on a set of rules you provide to it.

For example, let's use the same taxonomy.txt table from above and build on our earlier grep command.

What if you wanted to extract all lines for the viruses, but only display the Family instead of everything on each line?

We know grep will extract the lines with viruses:

grep "Virus" taxonomy.txt

Now let's pipe that into awk and tell it to print only the 2nd field, which corresponds to Family:

grep "Virus" taxonomy.txt | awk -F "\t" '{ print $2 }'

Here we are telling awk to use the tab character ("\t") as the field separator via the -F flag. If you had a csv file where each column is separated by a comma, you would issue the same command but replace "\t" with ",". If your columns were separated by spaces, you would use " ", and so on.

The '{ print $2 }' tells awk to print only the 2nd piece of data on each line, as split by the field separator given to the -F flag.

The results would look like this

Myoviridae
Podoviridae
Adenoviridae

If you replaced '{ print $2 }' with '{ print $1 }' you would get the first column:

Virus
Virus
Virus

awk is incredibly useful for slicing-and-dicing tabular data and is particularly powerful when combined with grep.
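Here is the grep-into-awk pipeline as a self-contained sketch, recreating taxonomy.txt first (tab-separated columns are our assumption):

```shell
# Recreate taxonomy.txt with tab-separated columns.
printf 'Kingdom\tFamily\nBacteria\tPrevotellaceae\nBacteria\tEnterobacteriaceae\nVirus\tMyoviridae\nVirus\tPodoviridae\nVirus\tAdenoviridae\n' > taxonomy.txt

# Viral lines only, then just the Family column:
grep "Virus" taxonomy.txt | awk -F "\t" '{ print $2 }'
```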

sed

sed is similar to awk in that it parses text; however, awk is oriented around fields (via the -F flag) while sed is not. This is a massive oversimplification of both sed and awk, as each is capable of much more, but for the simplest use cases it is probably a sufficient description.

We normally use sed in two ways. The first is to replace (substitute) characters. For example, if you have a file with columns separated by spaces (" ") and you want to convert them to tabs ("\t"), sed makes easy work of this. The basic syntax is:

sed 's/ /\t/g' filename.txt

In this command, the s tells sed to substitute (replace) whatever is in the first part of the expression (between the first two /), in this case a space, with whatever is in the second part (between the next two /), in this case \t, the shorthand for a tab. The final g tells sed to do this globally, i.e. for every occurrence on a line. The default is to substitute only the first instance on each line, so if that is what you want, just leave the g off.

To reiterate, the basic command follows this convention:

sed 's/what you want to replace/what you want to replace it with/g if you want to do so globally' nameoffile.txt

This is a really useful way to make quick changes to very large files. It is basically a super powerful and easy-to-use find-and-replace.
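A self-contained sketch of the substitution (the example data are our own; note that \t in the replacement is understood by GNU sed, while the BSD sed shipped with macOS may need a literal tab instead):

```shell
# A small space-separated file (hypothetical example data):
printf '001 42 diseased none\n002 46 diseased antibiotics\n' > spaces.txt

# Replace every space with a tab, globally, and save the result:
sed 's/ /\t/g' spaces.txt > tabs.tsv
```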

The other way we use sed is to add headers to a file. When we learned about grep, we noted that although it was useful for making separate tables of bacteria and viruses, it was somewhat destructive in that we lost the column headers. One way to add these back would be to open a text editor and type them in by hand. A more programmatic way is with sed:

grep "Virus" taxonomy.txt | sed '1i Kingdom\tFamily' > newfile.txt

That command would pipe the results of grep finding the regular expression "Virus" from the file taxonomy.txt into sed. The sed command takes that input and adds a row (first row indicated by the 1i) with Kingdom, a tab (\t), and Family to the top of the file. So you would get the following:

| Kingdom | Family |
| Virus | Myoviridae |
| Virus | Podoviridae |
| Virus | Adenoviridae |

Adding column headers with sed

As mentioned above, the tail command can be used to easily remove column headers. So what if you want to add column headers and do not want to open your file in a text editor? One simple way is using the sed command.

Let's say we removed the headers with tail -n+2 table.tsv as in the example above, so that our table looks like this:

001 42 diseased none
002 46 diseased antibiotics
003 32 healthy none
004 40 diseased none
005 28 healthy antibiotics

And now we want to add new headers: id, years, status, drugs

One can tell sed to insert a line (1i) containing those headers. Here we use the tab (\t) separator to delimit the column headers, but you could easily substitute , for a csv or any other character.

tail -n+2 table.tsv | sed '1i id\tyears\tstatus\tdrugs'

Now you should have:

id years status drugs
001 42 diseased none
002 46 diseased antibiotics
003 32 healthy none
004 40 diseased none
005 28 healthy antibiotics
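The remove-then-add steps can be chained into a single pipeline. A self-contained sketch using a small, hypothetical version of the table (GNU sed interprets \t in the inserted text; BSD sed may need literal tabs):

```shell
# A small, hypothetical table with the old headers:
printf 'sampleID\tage\tstatus\ttreatment\n001\t42\tdiseased\tnone\n002\t46\tdiseased\tantibiotics\n' > small_table.tsv

# Strip the old header and prepend the new one in one pipeline:
tail -n+2 small_table.tsv | sed '1i id\tyears\tstatus\tdrugs' > renamed.tsv
```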

Bash scripts

Now that you know about grep, sed, awk and a variety of other commands, it is useful to learn how to package them into a script so you can run a whole series of commands with a single one. This saves you from having to retype the commands (and they will get much more unwieldy than what was shown above!). One of the easier ways to tie a bunch of commands together is to place them in a bash script.

A bash script is simply a text file that tells the command line to issue a series of commands using the bash shell. In order for the operating system to know that you want to do this you need to tell it to do so. The way you tell an operating system that your text file is a bash script is to start the file with the following line:

#!/bin/bash

This line is commonly referred to as a "shebang" or "hash-bang", after the # and the !. It tells the operating system that this file is a script and which interpreter to run it with; in this case bash, which is located at /bin/bash. It is possible for bash to be located somewhere else, but that is highly unlikely.

After that it's just a matter of supplying your command line arguments in the order that you want the script to issue them. For example:

#!/bin/bash

echo "This is my first bash script!"

IN=input_file.txt
OUT=output_file.txt

grep "Virus" $IN | awk -F '\t' '{ print $2 }' | sed '1i Family' > $OUT

There are a couple of new concepts in the script above that we have yet to discuss.

First is echo which does pretty much what it says. It will quite literally echo what comes after it to the screen.

The next piece is variable assignment. Here we are creating two variables, one called IN and the other called OUT. We want these variables to refer to our input and output files, and we will call on them later in the script by preceding them with a $.

Next comes the pipeline itself, where we first select the rows we want using grep, pipe them into awk to select the second column, and then add a header with sed. Note that instead of typing the input and output filenames, we refer to their variables using $IN and $OUT.

After you create this file (using a text editor like nano, vi or emacs) you can run it directly from the command line.

It is useful (although not necessary) to give every bash script the suffix .sh. For example my_first_script.sh.

It is also useful to make the script executable. This is also not necessary but makes using the script a bit easier. To make a script executable you use the following command:

chmod +x my_first_script.sh

Once you have made your script executable, all you need to do is type its name and it will run.

One caveat: if the script is not in your $PATH (not discussed here), your operating system will not be able to find it. If you are in the directory where your script is saved, you can tell the operating system to look there by prefixing the command with ./

./my_first_script.sh

Alternatively you can move it into your path.
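Putting the pieces together, this sketch writes a minimal script to disk, makes it executable, and runs it (the filename and message are our own invention):

```shell
# Write a tiny script; the quoted 'EOF' heredoc keeps the contents verbatim.
cat > my_first_script.sh <<'EOF'
#!/bin/bash
echo "This is my first bash script!"
EOF

chmod +x my_first_script.sh   # make it executable
./my_first_script.sh          # run it from the current directory
```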

Shortcuts and Hotkeys

There are lots of hotkeys and shortcuts for working on the command line; detailed lists can be found around the web. These are likely the ones you will use most. Of note, all of these shortcuts work on Mac OS as well (except in Microsoft products, as of this writing).

  • Ctrl+a: go to the start
  • Ctrl+e: go to the end
  • Ctrl+k: delete from the cursor to the end
  • Ctrl+u: delete from the cursor to the start
  • Ctrl+l: clear the screen
  • Ctrl+c: terminate the command
  • Ctrl+z: suspend/stop the command
  • !!: Run the last command
