Here, I assume you are using a POSIX environment, which is my specialty. While I won't say that Windows-based computers are not good for scientific modeling, I believe POSIX-based (Linux, BSD, Macintosh, etc.) computers are slightly superior. Note: you could install Cygwin or other terminal-based software to emulate POSIX, but you might not get the benefit of tight coupling with the OS, the simplicity of the many package managers available on POSIX-based OSes, or the ease of compiling and supporting much open source software.
The following tools are my recommendations for a basic tool-belt. In some cases simple examples of usage are demonstrated. I provide links to either Wikipedia or the exact webpage, which can also be easily found via Google.
Remote Computers
Scientific computing often requires interfacing with computers besides the actual workstation you sit behind. Often, data is stored on a backup server, or you are using a remote computer (e.g. a supercomputer). Logging into a user account on a network-connected computer and moving data between places (in a secure manner) are two very common tasks. Two commands to know are ssh and scp. The first allows you to log into a remote machine where you have a user account. The general form of the command looks like:
ssh username@hostname.domain
e.g.
ssh jp@mycomputer.com
You will often want to run a visual/GUI program on another machine, and this requires a special feature of ssh called X11 forwarding. Two nearly identical flags enable this behavior, -X and -Y. The difference is subtle: -X treats the remote programs as untrusted, which is the more secure option, while -Y enables trusted forwarding, which is compatible with more applications. I tend to use the second. The command would look like:
ssh -Y jp@mycomputer.com
An analog to the cp command is scp, which allows you to move (more precisely, copy) files between two network-connected machines. The basic command for copying a file from a remote host to the machine where you are typing terminal commands is:
scp username@hostname.domain:/path/to/file /local/path/to/file
And the reverse also works:
scp /local/path/to/file username@hostname.domain:/remote/path/to/file
These commands work for files; to move a whole directory, the -r flag needs to be added to the command line. However, if a directory contains many files, it is better to compress them into one archive and copy that. scp works well in many cases, but it isn't great for copying large files, especially from supercomputers. It also has the drawback that an interrupted copy--say there is a dropped connection--must be restarted from scratch. There are better tools that cope with these drawbacks, which I will be discussing in a post in the near future.
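As a sketch of the compress-then-copy approach (the directory and file names here are invented for illustration, and the scp lines are commented out since they need a real remote host):

```shell
# Bundle the whole directory into a single compressed archive
mkdir -p results
printf 'data\n' > results/run1.txt
tar czf results.tar.gz results/

# Copy the one archive instead of many small files:
# scp results.tar.gz username@hostname.domain:/remote/path/

# On the far side, unpack it (simulated locally here):
mkdir -p unpacked && tar xzf results.tar.gz -C unpacked
```

One archive transfers much faster than thousands of small files, because each file incurs per-file overhead.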
Scripting and Programming
Shell scripting is a means to simply specify a set of actions and logic in the same language as the shell (command line interface) that you use. The two most common language families are sh/bash and csh/tcsh. Personally, I prefer bash, though both are relatively easy to use. Some of my favorite tutorials for bash are:
- BASH Programming - Introduction HOW-TO by Mike G.
- Advanced Bash-Scripting Guide: An in-depth exploration of the art of shell scripting by Mendel Cooper
- Bash Reference Manual
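To give a taste of what bash scripting looks like, here is a minimal script combining a loop, arithmetic, and a conditional (the specifics are just for illustration):

```shell
#!/bin/bash
# Loop over a few values, test each one, and report the result.
for i in 1 2 3
do
    if [ $((i % 2)) -eq 0 ]; then
        echo "$i is even"
    else
        echo "$i is odd"
    fi
done
```

These few constructs (for, if, arithmetic expansion) cover a surprising fraction of day-to-day scripting needs.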
In addition to shell scripting, I suggest you learn a simple programming language such as Python or Perl. While I have very little experience with Perl, it is a great tool for string manipulation. Python is pretty common in the computational sciences and supports a number of features that make it very desirable. It is easy to learn and proves to be a powerful way to prototype short programs. An interactive interpreter can be accessed by typing python into the terminal, which can be used to try out any particular command. A plethora of scientific, plotting and math libraries have been developed, including SciPy, matplotlib and NumPy, to name a few. The best resources for Python can be found here:
- The Python Tutorial
- Python Documentation (the good old)
If you plan to do code development of any kind, you will want software capable of version management. A common problem is that a programmer will write some code that works, think of a better way to write it, try to write it that way, and then break the working code. Oftentimes it is difficult to bring the code back to its original working state. Version control software provides the ability to save the exact state of the code, as well as letting users log messages about what the exact changes entail. Three commonly used systems are cvs, svn and git (well, really four, but Mercurial isn't used as often as the others). cvs has very much gone by the wayside. svn is common in many different settings, and considerable legacy support exists. git is newer and supports a number of really neat features, though I would argue it has quite a bit higher learning curve than svn.
When you find yourself working with svn for the first time, it will likely be with someone else's code. Some simple commands might get you started. When you first download the code from the svn repository--the central place that the code is stored and changes are tracked--you need to do a checkout:
svn co svn+ssh://username@server/path/to/repository
Usually a server will require a username, and the command will prompt for a password. In order to make sure the code is up to date at any given time, you can issue an update command to get the newest version of the code in the current directory and any subdirectories:
svn up
In order to check the properties of the local version of the code, there are two commands:
svn stat
and:
svn info
These can be run in any directory of the code tree (the whole directory structure under the root of the repository). The status command will tell you of any modifications, additions, deletions, moves, copies, etc. that have occurred to any of the files in the local working copy. The info command reports the revision of the code (revision numbers start at 0 when the repository is created and count up), the svn repository URL and directory, as well as the date the code was last changed.
Once you've made changes and validated that they work, the code should be committed to the repository. Changes may include newly added files which are added to the repository via the command:
svn add <filename>
An update to the most recent version is required before any changes can be added to the repository. A commit (sending the code modified/added in your working copy to the main repository) can be accomplished using:
svn commit -m "Message telling what the actual changes are. This message logs your changes."
With any luck, this will add your (hopefully correct) changes to the repository as permanent provenance. One final command that is very useful is svn log, which prints the commit messages. Any of these commands can be used on a per-file or per-directory basis. Many other functions are documented in the svn documentation.
Git is a whole different beast; maybe start with the official git tutorial.
One final program commonly used in programming is the diff command. diff lists the differences between two files on a line-by-line basis. It takes two files as input and outputs just the changed lines. The output looks something like this:
2,3c2,3
< green apple
< was very
---
> apple
> was
The first line indicates the lines that contain changes (denoted as "c"; also "a" and "d" are common for added or deleted). The numbers indicate the beginning and ending lines for the first and second file. Changes to the first file are denoted with lines that start with < and the second start with >.
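You can reproduce output like the snippet above yourself (the file contents here are invented for illustration):

```shell
# Two versions of a file that differ on lines 2 and 3
printf 'one\ngreen apple\nwas very\nripe\n' > old.txt
printf 'one\napple\nwas\nripe\n' > new.txt

# diff exits nonzero when the files differ, so allow that here
diff old.txt new.txt || true
```

This prints the "2,3c2,3" hunk: lines 2-3 of the first file were changed into lines 2-3 of the second.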
Data Manipulation and Plotting
Data often needs to be manipulated; maybe you need to convert a comma-separated file to a tab-delimited file. An ideal tool for accomplishing this task is sed, a stream editor. sed works on streams, which can be strings, output from other programs or, most commonly, files. The sed command for this example is:
sed 's/,/\t/g' commaSeparatedFile.csv > tabDelimitedFile.txt
This command says: substitute ("s") a comma ("/,/") with a tab character ("/\t/"), and do so for every instance on the line ("g"). This is a simple example, but sed can do many things, such as appending, inserting, deleting and many other operations. I mostly use it for string replacement, though. A fantastic tutorial was written by Bruce Barnett.
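A quick sanity check with a throwaway file (note that \t is understood by GNU sed; on BSD/macOS sed you may need to type a literal tab instead):

```shell
# Build a tiny CSV file, then convert commas to tabs
printf 'a,b,c\n1,2,3\n' > commaSeparatedFile.csv
sed 's/,/\t/g' commaSeparatedFile.csv > tabDelimitedFile.txt
cat tabDelimitedFile.txt
```

Because sed reads from a stream, the same substitution works just as well on piped output, e.g. `some_program | sed 's/,/\t/g'`.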
Stream processing is nice, but often you may want to perform operations on the data in a file, say multiply each value in a column by a constant/variable or calculate an average. awk is the tool for this type of manipulation. awk could be considered a scripting language, though its syntax and operation are a bit different from Python or other languages. I find it most useful when you want to apply an operation to every line in a file, for example calculating the average of the 5th column:
awk 'BEGIN{sum=0; count=0} {sum += $5; count++} END{print sum/count}' file.txt
or multiplying two columns and printing it as a third column:
awk '{print $1 "\t" $2 "\t" $1*$2}' file1.txt > file2.txt
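To see both patterns in action on a small made-up file (two columns instead of five, purely for illustration):

```shell
# Two columns of sample data
printf '1 10\n2 20\n3 30\n' > file1.txt

# Average of the second column: accumulate a sum and a line count
awk '{sum += $2; count++} END{print sum/count}' file1.txt

# Multiply the first two columns and append the product as a third column
awk '{print $1 "\t" $2 "\t" $1*$2}' file1.txt
```

The first command prints 20, the average of 10, 20 and 30; the second prints the original columns plus 10, 40 and 90.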
Often, you don't have all the columns in one file. The cat command won't accomplish the goal of putting the data into a format that is useful for a line-by-line processor such as awk, since it concatenates files one after another rather than side by side. The tool for this is the paste command. Say you have two columns of data in two files that you want to multiply together with the previous command; you can paste them (with a tab between columns) together with:
paste -d "\t" column1.txt column2.txt > file1.txt
This is really the only use of paste, but it can be very helpful.
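Here is the paste-then-awk pipeline end to end, with small invented input files:

```shell
# One column of data per file
printf '2\n3\n4\n' > column1.txt
printf '10\n20\n30\n' > column2.txt

# Stitch the columns side by side, separated by tabs
paste -d "\t" column1.txt column2.txt > file1.txt

# Now awk can see both values on each line
awk '{print $1 "\t" $2 "\t" $1*$2}' file1.txt
```

Each output line holds the two inputs and their product: 20, 60 and 120.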
Finally, sequences of numbers often come in handy, especially when shell scripting. For this the command seq is most common, that is, if you are on a standard Linux machine. For some reason I don't understand, Apple decided to replace seq with jot (It is a BSD thing). I often find myself doing something like this in a shell script:
for i in $(seq 10 20)
do
    # do something with $i
done
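That output format customization is handy for generating filenames; for example, with GNU seq (BSD jot uses -w for a similar effect) you can zero-pad the numbers:

```shell
# printf-style format string applied to each number in the sequence
seq -f "file%02g.txt" 1 3
```

This prints file01.txt, file02.txt and file03.txt, one per line, ready to feed into a loop.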
Like paste, seq is a pretty simple command, with the only real customization being the format of the output numbers. jot, on the other hand, has a few other uses, for example printing something repeatedly a number of times:
jot -b hi - 1 10
will print hi ten times. It can also print random numbers, which seq won't do, making it arguably more capable.
Quick-and-dirty, or professional plotting can both be taken care of with gnuplot. This is a large and complex topic and will be handled thoroughly in a post very shortly.
Well these are the basic tools I think. If you have any comments, leave them below.