Scientific Computing

Tid-bits, FAQs and How-To's designed to help students survive Graduate School, when the world of computers is against them.

Monday, September 23, 2013

Crash course on Gnuplot

Gnuplot may be my favorite program in the whole world.  Of all the graphs I've ever made, 95% (and every plot I've ever published) were made with gnuplot.  While I am starting to use Python's Matplotlib more here and there, gnuplot is still my favorite.  It can be used for quick-and-dirty plots and spot-checking of data, or the plots can be readily customized and beautified to publication quality.  This How-To will give you a basic working understanding of the program.  I write it because the gnuplot user guide isn't as useful as it could be (and because my favorite FAQ is no longer a webpage...).

Basics


Gnuplot can be run interactively or driven by a script containing the full set of plotting commands.  The syntax is the same either way.  To start off in interactive mode, run gnuplot from the console.  If you are on a reasonable system, the terminal (in gnuplot parlance, the place where the plot is drawn or written) will default to one that displays the plot in a GUI window.  Start off by plotting:

gnuplot> plot x**2, 20*sin(x)

Which will show the quadratic and a sine function:
The ** operator raises the left operand to the power of the right, so x**2 is x squared.  This is a pretty simple and rather arbitrary example, but most of the standard mathematical functions are available.  In addition, your own functions can be defined:

gnuplot> f(x)=a*x+b
gnuplot> a=1.5; b=5
gnuplot> set grid
gnuplot> plot f(x)

Which gives the plot:

Here I also used the "set grid" command to turn on the background grid.  The basic syntax is "set option <attributes>", where the option in this case is grid and there are no attributes (a list of attributes can follow the option).
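
For example, attributes controlling fonts, offsets and the like can follow the option (the values here are arbitrary):

gnuplot> set xlabel "Time (s)" offset 0,0.5 font "Arial,14"
gnuplot> set grid ytics lw 2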

Of course, the program wouldn't be useful unless you could also plot data from experiments and fit datasets to functions.  Take, for example, fitting an exponential to the decay of a reactant in a first-order reaction: A-->2B.  As an example, I'll use this datafile (in gnuplot, comments in datafiles begin with a # symbol):

# t (s)  A (M)  Error (M)
0  3     0.001
1  2.3   0.05
2  1.55  0.05
3  1.22  0.03
4  0.99  0.04
5  0.65  0.02
10 0.145 0.08

We can write a function and fit it to the data:

gnuplot> f(x)=A*exp(-x/tau)
gnuplot> fit f(x) 'datafile.dat' via A, tau
gnuplot> set xlabel "Time (s)"
gnuplot> set ylabel "Concentration (M)"
gnuplot> plot 'datafile.dat' using 1:2:3 with yerrorbars t "Points", '' using 1:2 w l t "Lines", f(x) t "Fit"
gnuplot> set xrange [-0.5:10.5]
gnuplot> replot


The fitting procedure reports the fitted parameters and the uncertainty in each value:

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

A               = 3.0108           +/- 0.05894      (1.957%)
tau             = 3.33788          +/- 0.1288       (3.858%)

Which gives the plot:

Now this last example was a bit more like what you may actually want to do, so I'll take you through it step by step.  We used the "set" command to specify the text for the labels.  In addition to the xlabel and ylabel, you can set the title and a number of other "labels".  Next we plot three different things: 1) the points in the datafile with error bars based on the third column (the first two columns are x and y), 2) a line connecting the data points and 3) the fit to the data.  After that, I realized that I wanted to change the range of the x-axis, so I defined a range from -0.5 to 10.5 and then issued the "replot" command, which re-executes the last "plot" command exactly.  Now we are getting somewhere!

However, as you may have noticed, the plots I have created so far look kinda crappy.  This is because I used the default settings and printed to the "png" terminal.  To make a slightly prettier plot, we can adjust each piece/attribute independently and print to a better file format.  Taking the last example one step further, I'll also plot the concentration of the product B.  This time I put the list of commands in a file, which is a nice way to make a plot reproducible and also simplifies the process once a nice graph requires quite a few commands.  It was plotted with the command gnuplot plotfile.gp, where plotfile.gp looks like:

# Define functions that fit the data
fA(x)=A*exp(-x/tau)
fit fA(x) 'datafile.dat' via A, tau

# Set up the x-axis and the left-side y-axis
set xlabel "Time (s)"
set xrange [-0.5:10.5]
set ylabel "Concentration of A (M)"
set yrange [0:3.5]

# Set up the Y-axis on the right side
set y2label "Concentration of B (M)"
set y2tics
set y2range [0:7]

# Make the points bigger
set pointsize 1.75

# Define where the legend is (i.e. below the plot)
set key below

# Set up file to plot to, make it a postscript file with
#  a pretty print (enhanced option) with an Arial font
#  at 16 points size.  Specify the file name to be
#  concentrations.ps
set terminal postscript color enhanced "Arial" 16
set output 'concentrations.ps'

# Here, "axis x1y2" means the x is the bottom scale and the y
#  is on the right side.  "yerrorbars" means that the third
#  column is the error bar spread.  Similarly, one could
#  use the xerrobars.  "pt" sets the point shape and the "lw"
#  option specifies how thick the lines are.
plot 'datafile.dat' using 1:2 w p t "Average A" pt 7,\
     'datafile.dat' using 1:2:3 w yerrorbars t "{/Symbol s}_A" pt 3 lw 3,\
     fA(x) t "Fit A" lw 3,\
     'datafile2.dat' using 1:2:3 w yerrorbars axes x1y2 t "Average and {/Symbol s} of B" pt 5 lw 3

I've included comments (beginning with the # character) so I don't have to describe each command here.  Another thing demonstrated is how to plot Greek symbols: you can see that {/Symbol s} makes a σ, which only works if you use the "enhanced" option with the postscript terminal.  The Greek symbols are indicated by the Latin letter that most closely matches their sound, and a full list (along with a few other symbols) can be found here.  The result (converted to a png using another program) looks like:
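
One more note on enhanced text before moving on: it also gives you superscripts and subscripts via the ^ and _ characters, which is handy for units and rate constants.  For example (the labels here are arbitrary):

gnuplot> set ylabel "Rate (10^{-3} M s^{-1})"
gnuplot> set title "[A]_0 e^{-t/{/Symbol t}}"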

The terminal can be set to many different output types: jpeg, png, postscript, encapsulated postscript, pdf, etc.  Virtually any font the system has can be used, and any label can have its own font or font size.  In addition, plots can be attached to different combinations of the two x- and two y-axes, for example: axes x1y2, axes x2y1 and axes x2y2.
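
For example, to write a PNG directly with nicer rendering than the basic png terminal, the pngcairo terminal (if your gnuplot was built with it) works well:

set terminal pngcairo enhanced font "Arial,14" size 800,600
set output 'concentrations.png'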

Plotting multiple images in the same file is very useful.  The coordinate system for placing subplots has its origin at the bottom left, like a standard coordinate system, and is normalized so that 1,1 is the top right.  Schematically this looks like:

Sizes of the subplots are also given in normalized units.  For example, you can plot two functions, one on top of the other, like:

# Turn on the multiplot capabilities 
set multiplot
set origin 0,0
set size 1,0.5
plot x
set origin 0,0.5
plot x*x
# Turn multiplot back off
unset multiplot

Which creates the following plot:

Note, however, that you need to specify the terminal and the output file before you do any plotting, and before you even turn multiplot on.  I find multiplotting from the interactive gnuplot prompt practically useless and only really use a plot file myself.  Multiplot easily allows, for example, three plots: two stacked on top of each other and one on the side, such as:
Which was created with:

set terminal postscript color enhanced "Arial" 16
set output 'trip.ps'

set multiplot
# Plot the first
set origin 0,0
set size 0.5,0.5
set xlabel "X Label"
set ylabel "Y Label1"
plot x
# Plot the second, this time without an x label
set origin 0.0,0.5
unset xlabel
set ylabel "Y Label1"
# Note for math functions: integer vs
#  floating point arithmetic (3/2=1 vs 3.0/2.0=1.5)
plot x**(3.0/2.0) t "x^{3/2}"

# Plot the third
set xlabel "X Label"
set ylabel "Y Label2"
set size 0.5,1.0
set origin 0.5,0.0
plot x**2 t "x^2"

More Fun



Now that I've covered the main functionality of gnuplot, I'd like to show you a few other cool things you can do.  I often plot histograms:

# Define a function that calculates which bin 
#  the x value should be in
binwidth=0.05   # width of the bin
bin(x,width)=width*floor(x/width) + width/2.0

# Prettify the plot
set xrange [-1:1.5]
set key below

# Call the bin function on the first column: $1
plot 'histDat.dat' using (bin($1,binwidth)):(1.0) smooth freq with boxes t "Gaussian: <x>=0.2, {/Symbol s}=0.2"


This creates a plot that looks like:
A few things are worth noticing here.  You can call any function on a column of a file; for example, with the same distribution I could call plot 'histDat.dat' using 1:(f($1)), which would apply the function f to every value in the first column.  You can also customize the boxes in a number of different ways: fill with patterns, solid colors, etc., as shown below.
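
For example, to give the boxes a half-dense solid fill with a border and make their width match the bins, something like this works (a minimal sketch reusing the bin function from above):

set boxwidth binwidth
set style fill solid 0.5 border -1
plot 'histDat.dat' using (bin($1,binwidth)):(1.0) smooth freq with boxes t "Filled histogram"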

A surface plot can easily be made using the splot command instead of plot; it can plot functions of two variables or a datafile with three columns.  If you use a datafile, each row of the grid must be separated from the next by a blank line.  Usually I use a heatmap to show the height instead of the wireframe; it's more descriptive.  Going back to the Gaussian from the histogram example, but now in two dimensions:

# Define functions and ranges
f(x,y)=1/sqrt(pi)*exp(-(x**2+y**2))   # pi is built in to gnuplot
set xrange [-3:3]
set yrange [-3:3]

# Set the plot to show height with color
set pm3d
# Set the number of boxes to divide the function into
set isosamples 500, 500
# Plot
splot f(x,y) with pm3d

Which makes a nice plot:
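
If you prefer a flat, top-down heatmap with no 3D perspective at all, one way to get it is to set the view to map before plotting:

set view map
splot f(x,y) with pm3d
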
Also, it is nice to see the contours for a surface plot drawn on the base of the 3D plot sometimes:

# Define a function
f(x,y)=x**2*cos(2*x+0.5*y)

# Draw the contour lines on the base of the plot
set contour base
# Set the contour line levels
set cntrparam levels incremental 0,2,20
set isosamples 25,25

set xrange [-5:5]
set yrange [-5:5]
# Turn off the legend so we don't
#  see all the multitude of contour
#  lines
unset key   

splot f(x,y)

Making the plot:

Well, those are about the best of the basics, and we have barely scratched the surface.  You can do pretty much anything you can do with Excel (gradients, bar graphs, backgrounds, surfaces, pie charts, ribbon charts) plus quite a few other things.  There are a number of great tutorials that are a bit more full-featured, such as this one by Zoltán Vörös, which is one of the more imaginative and unique tutorials online.

Thursday, September 19, 2013

Awk - Average and Standard Deviation

A quick little awk script to compute the average and standard deviation of a specified column in a file.


#!/bin/bash
# Bulletproofing
if [[ $# -lt 2 ]]; then
  echo "Usage: ./avgStd.sh <file> <column>"
  exit 1
fi

# Compute Average and Std. Dev.
avg=`awk -v var=$2 'BEGIN{count=0; avg=0; std=0} {count=count+1; avg=avg+$var} END{print avg/count}' $1`
std=`awk -v var=$2 -v av=$avg 'BEGIN{count=0; std=0} {std=std + ($var-av)*($var-av); count=count+1} END{print sqrt((std)/(count-1))}' $1`

# Print results
echo "Average:\t$avg"
echo "Std. Dev:\t$std"

You should be able to copy/paste this code to a file, run chmod +x on the file and run the command with two arguments, the file and the column.  For example:

./avgStd.sh derp 2

Computational Scientist Tool-belt: Minimum Set of Software

Having a tool-belt full of tools is very beneficial for computational modeling scientists/researchers.  But learning each tool is time consuming, and often only a small subset of the available tools is needed to handle almost any standard problem you will run into.

Here, I assume you are using a POSIX environment, which is my specialty.  While I won't say that Windows-based computers are not good for scientific modeling, I believe POSIX-based computers (Linux, BSD, Mac, etc.) are slightly superior.  Note: you could install Cygwin or other terminal software to emulate POSIX, but you might not get the benefit of tight coupling with the OS, the simplicity of the package managers available on POSIX-based OSes, or the ease of compiling and supporting much open source software.

The following tools are my recommendations for a basic tool-belt.  In some cases simple examples of usage are demonstrated.  I provide links to either Wikipedia or the exact webpage, which can also be easily found via Google.

Remote Computers

Scientific computing often requires interfacing with computers besides the actual workstation you sit behind.  Oftentimes data is stored on a backup server, or you are using a remote computer (e.g. a supercomputer).  Logging into a user account on a networked computer and moving data between places (in a secure manner) are two very common tasks.  Two commands to know are ssh and scp.  The first lets you log into a remote machine where you have a user account.  The general form of the command looks like:

ssh username@hostname.domain

e.g.

ssh jp@mycomputer.com

You will often want to run a visual/GUI program on the remote machine, and this requires a special feature of ssh called X11 forwarding.  Two nearly identical flags enable this behavior, -X and -Y.  The difference is subtle: -X requests untrusted forwarding, which places extra security restrictions on the remote programs, while -Y requests trusted forwarding, which lifts those restrictions.  I prefer the second, since some programs misbehave under the restrictions of the first, but be aware that it is the less restrictive of the two.  The command would look like:

ssh -Y jp@mycomputer.com
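
If you log into the same machine frequently, ssh also reads the file ~/.ssh/config, where you can define a short alias (the host and user names below are the made-up ones from above):

Host mycomp
    HostName mycomputer.com
    User jp
    ForwardX11Trusted yes

After that, ssh mycomp behaves like the full command above.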

An analog of the cp command is scp, which allows you to move (more precisely, copy) files between two networked machines.  The basic command for copying a file from a remote host to the machine where you are typing is:

scp username@hostname.domain:/path/to/file /local/path/to/file

And the reverse also works:

scp /local/path/to/file username@hostname.domain:/remote/path/to/file

These commands work for single files; to move a whole directory, add the -r flag to the command line.  However, if a directory contains many files, it is better to compress them into one archive and copy that (see the sketch below).  Scp works well in many cases, but isn't great for copying large files, especially from supercomputers.  Scp also has the drawback that an interrupted copy (say there is a dropped connection) must be restarted from scratch.  There are better tools that cope with these drawbacks, which I will discuss in a post in the near future.
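
For example, a directory of many small files moves much faster as a single compressed archive (the paths here are hypothetical):

# Pack the directory into one compressed archive
tar czf results.tar.gz results/
# Copy the single archive rather than many small files
scp results.tar.gz username@hostname.domain:/remote/path/
# Unpack it on the remote machine afterwards
tar xzf results.tar.gz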

Scripting and Programming

Shell scripting is a means of specifying a set of actions and logic in the same language as the shell (command-line interface) that you use.  The two most common language families are sh/bash and csh/tcsh.  Personally, I prefer bash, though both are relatively easy to use.  Some of my favorite tutorials for bash are:
In addition to shell scripting, I suggest you learn a simple programming language such as Python or Perl.  While I have very little experience with Perl, it is a great tool for string manipulation.  Python is pretty common in the computational sciences and supports a number of features that make it very desirable.  It is easy to learn and proves to be a powerful way to prototype short programs.  An interactive interpreter can be started by typing python into the terminal, which is handy for trying out any particular command.  A plethora of scientific, plotting and math libraries have been developed, including SciPy, matplotlib and NumPy to name a few.  The best resources for Python can be found here:
If you plan to do code development of any kind, you need software capable of version management.  A common problem is that a programmer will write some code that works, think of a better way to write it, try to write it that way and then break the working code.  Oftentimes it is difficult to bring the code back to its original working state.  Version control software provides the ability to save the exact state of the code, as well as letting users log messages about what the changes entail.  Three systems are commonly used: cvs, svn and git (well, really four, but Mercurial isn't used as often as the others).  Cvs has very much gone by the wayside.  Svn is common in many different settings and has considerable legacy support.  Git is newer and supports a number of really neat features, though I would argue it has a considerably higher learning curve than svn.

When you find yourself working with svn for the first time, it will likely be with someone else's code.  Some simple commands will get you started.  When you first download the code from the svn repository (the central place where the code is stored and changes are tracked), you do a checkout:

svn co svn+ssh://username@server/path/to/repository

Usually a server will require a username, and the command will prompt for a password.  In order to make sure the code is up to date at any given time, you can issue an update command to get the newest version of the code in the current directory and any subdirectories:

svn up

In order to check the properties of the local version of the code, there are two commands:

svn stat

and:

svn info

These can be run in any directory of the code tree (the whole directory structure under the root of the repository).  The status command will tell you of any modifications, additions, deletions, moves, copies, etc. that have occurred to any of the files in the local copy.  The info command reports the version of the code (revision numbers start at 0 when the repository is created and count up), the repository URL and directory, as well as the date the code was last changed locally.

Once you've made changes and validated that they work, the code should be committed to the repository.  Changes may include newly added files which are added to the repository via the command:

svn add <filename>

An update to the most recent version is required before any of your changes can be added to the repository.  A commit (pushing code that is modified/added in your working copy to the main repository) can be accomplished using:

svn commit -m "Message telling what the actual changes are.  This message logs your changes."

With any luck, this will add your (hopefully correct) changes to the repository for permanent provenance.  One final command that is very useful is svn log, which prints the commit messages.  Any of these commands can be used on a per-file/per-directory basis.  Many other functions are documented in the svn documentation.
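
For example, to see just the last five log entries for one file (the -l flag limits how many entries are printed; the path is hypothetical):

svn log -l 5 path/to/file.c
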
Git is a whole different beast; maybe start with the official git tutorial.

One final program commonly used in programming is the diff command.  Diff lists the differences between two files on a line-by-line basis.  It takes two files as input and outputs just the changed lines.  The output looks something like this:


2,3c2,3
< green apple
< was very
---
> apple
> was 

The first line indicates which lines contain changes ("c" denotes changed; "a" and "d" are common for added or deleted).  The numbers give the beginning and ending lines in the first and second file, respectively.  Lines from the first file are prefixed with < and lines from the second with >.
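
Output like the above comes from a command of the form (file names hypothetical):

diff oldFile.txt newFile.txt

The unified format printed by diff -u, which most version control tools use, is a variation on the same theme.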

Data Manipulation and Plotting

Data often needs to be manipulated; maybe you need to convert a comma-separated file to a tab-delimited one.  An ideal tool for this task is sed, a stream-processing tool.  Sed works on streams, which can be strings, output from other programs or, most commonly, files.  The sed command for this example is:

sed 's/,/\t/g' commaSeparatedFile.csv > tabDelimitedFile.txt

This command says: substitute ("s") a comma ("/,/") with a tab character ("/\t/"), replacing every instance on each line ("g" for global).  This is a simple example, but sed can do many other things such as appending, inserting and deleting, though I mostly use it for string replacement.  A fantastic tutorial was written by Bruce Barnett.
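
As a taste of the other operations, here are two more one-liners (file names hypothetical): one deletes all comment lines, the other prints only a range of lines:

# Delete every line that begins with a #
sed '/^#/d' input.txt > stripped.txt
# Print only lines 10 through 20
sed -n '10,20p' input.txt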

Stream processing is nice, but often you may want to perform operations on the data in a file, say multiply each value in a column by a constant/variable or calculate an average.  Awk is the tool for this type of manipulation.  Awk could be considered a scripting language, though the syntax and operation are a bit different from Python or other languages.  I find it most useful when you want to apply an operation to every line in a file, for example calculating the average of the 5th column:

awk 'BEGIN{avg=0; count=0}  {avg = avg + $5; count = count + 1}  END{print avg/count}' file.txt

or multiplying two columns and printing it as a third column:

awk '{print $1 "\t" $2 "\t" $1*$2}' file1.txt > file2.txt

Oftentimes, you don't have all the columns in one file.  The cat command won't put the data into a format useful for a line-by-line processor such as awk, since it concatenates files one after the other.  The tool for this is the paste command.  Say you have two columns of data in two separate files that you want to multiply together with the previous command; you can paste them together (with a tab between the columns) with:

paste -d "\t" column1.txt column2.txt > file1.txt

This is really the only use of paste, but it can be very helpful.

Finally, sequences of numbers often come in handy, especially when shell scripting.  For this the command seq is most common, that is, if you are on a standard Linux machine.  For some reason I don't understand, Apple decided to ship jot instead of seq (it is a BSD thing).  I often find myself doing something like this in a shell script:

for i in $(seq 10 20)
do
  # something with $i
  ...
done

Like paste, seq is a pretty simple command, the only real customization being the format of the output numbers.  Jot, on the other hand, has a few other uses, for example printing something a number of times:

jot -b hi - 1 10

will print "hi" ten times.  It can also print random numbers, which seq won't do, making it slightly superior.
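
For example, GNU seq can format its output with a printf-style string, and jot can produce random numbers (the names and ranges here are arbitrary):

# Print frame_001.dat, frame_002.dat, ..., frame_005.dat
seq -f "frame_%03g.dat" 1 5
# Print five random integers between 0 and 100
jot -r 5 0 100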

Quick-and-dirty, or professional plotting can both be taken care of with gnuplot.  This is a large and complex topic and will be handled thoroughly in a post very shortly.

Well these are the basic tools I think.  If you have any comments, leave them below.

Friday, September 13, 2013

Why a Blog on Scientific Simulations?

Surviving in Graduate School is hard, as I have recently discovered.  Using computers to solve the problems that come up during research, be they simple data analysis on a local workstation or complex computer simulations on national supercomputers, is often desirable (more likely, necessary).  While computers can greatly simplify or speed up the research process, when the programs/codes/tools one uses don't work, it can be the most aggravating thing.  Yelling (with cursing) often ensues, and despair easily creeps in.

I have lived this life for more than 6 years now, and while these situations have occurred often, don't despair: there is hope!  Getting by this many years was only possible by relying on those who share their knowledge online.  Stack Overflow may be one of the most useful websites around, and many blog posts can be found to help one solve a specific problem.  This blog was created to convey some of the knowledge and skills I have accrued over the years, to help others who work in the computational sciences or conduct scientific research solve common problems and learn necessary skills.

A little about my background: I have a BS in Chemistry and a BS in Computer Science.  The past six years of my life have involved computational sciences of many kinds.  Now I am pursuing a PhD in Chemistry, though the research is more realistically described as Computational Biophysics.  I have experience with a handful of programming languages and many unix tools.  With experience in computational fluid dynamics, continuum mechanics for material modeling, stochastic biological simulations, molecular dynamics and some quantum chemical modeling, my experience in computational modeling is rather diverse.  In the past I used and developed code for the Uintah Computational Framework used by many to simulate a number of materials and fluids problems. Currently, I develop code for the stochastic biochemical problem solving environment Lattice Microbes.  In solving these problems both workstations and supercomputers have had to be employed.

This blog will draw on these experiences and I intend to cover such topics as:

  • Data analysis/processing
  • Using supercomputers
  • Compiling software on different architectures
  • Math techniques in scientific computing
  • Setting up simulations
  • Programming techniques
  • and many more...
Some posts will be how-to's, some will be FAQs, but the majority will be short tid-bits of information that occur to me during the day.  I intend to make code available where possible to help facilitate understanding, and to actively respond to comments and questions.

With that, I hope you all will find this useful.