Scientific Computing

Tid-bits, FAQs and How-To's designed to help students survive Graduate School, when the world of computers is against them.

Tuesday, January 26, 2016

HowTo: Supercomputer Environment (e.g. Modules)

Most supercomputers and properly run clusters already have many of the software packages you need installed. These include simulation packages (molecular dynamics, quantum mechanics, material mechanics, etc.), scripting languages and their associated libraries (Python, Perl, etc.), and many of the libraries your software might link against (OpenMPI, OpenMP, CUDA, HDF5, BLAS, LAPACK, ATLAS, etc.).  Perhaps more importantly, several different compiler suites are usually available, often including those from Intel, GNU and PGI. Setting up your environment to point to the necessary packages, and especially making sure all of them play well together, can be a daunting task.

Modules is an environment-management tool; in other words, it makes this nightmare of a task simple.  The idea is that each library or program is packaged up together with its environment settings, and loading the module sets up your user environment automatically. When properly created, a module knows which dependencies it needs and which other modules it conflicts with, pointing out potential problems before you run into them.  But enough of the introduction, let's get into using the software.

Using the "module avail" command provides a list of the modules that can be loaded.  For example, on the Keeneland supercomputer (now defunct; oh why did you go??) this gives:



Each of the sections describes a set of modules that can be used.  To enable a module, issue the command "module load <name>".  For example, "module load gcc/4.7.3" will attempt to load version 4.7.3 of the GCC compiler suite and set up its environment.

Unloading a module is just as easy as loading a module.  For example, I will need to unload the Intel module before loading the GCC modules because they have conflicting program names. To do this, I first issue: "module unload intel/2011_sp1.11.339".

Of course, you need a way to determine which modules are currently loaded.  "module list" accomplishes this:
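On a machine set up like the sketch above, the output might look something like the following (again illustrative; only the intel and cuda modules actually appear elsewhere in this post):

Currently Loaded Modulefiles:
  1) intel/2011_sp1.11.339   2) openmpi/1.6.4   3) cuda/4.2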




Now, these are all pretty standard usage commands; module really is that easy.  "module swap" lets you switch between two versions of a package in one step; for example, "module swap cuda/4.2 cuda/5.5".  The final list looks like:




That about concludes the functionality you need as a user.  If you want to delve deeper, especially with respect to managing packages, read on.

Friday, October 18, 2013

HowTo: Launch Multiple Jobs in One Batch Submission Script, Pt 1

Many Long Jobs at Once


Recently, we have been using supercomputer resources that give higher priority to larger jobs (i.e., larger node counts).  One way to take advantage of this is to bundle many independent runs into a single large submission; good candidates are quantum chemistry calculations that explore a potential energy surface (for example, along a reaction coordinate) or many replicates of a stochastic simulation.  This proves to be an interesting yet relatively simple problem to solve.  A sample batch script (for the Portable Batch System, PBS) looks like this:


#!/bin/sh
#PBS -N [JobName]
#PBS -M [email@site.domain]
#PBS -m abe
#PBS -q [QueueName]
#PBS -l nodes=<nodes>:ppn=<processors per node>,walltime=<HH>:<MM>:<SS>

# Move to the working directory
WORKING=[working directory]
cd $WORKING

# Calculate processor variables
NODESPERJOB=<nodesperjob>
PROCSPERJOB=<ppn>

# Split the available nodes
cat $PBS_NODEFILE | uniq > allNodes.txt
split -l $NODESPERJOB allNodes.txt nodefile
ls -1 nodefile?? > nodefiles.txt

# Define the program to run
PROGRAM=/uufs/chpc.utah.edu/common/home/u0554548/Scratch/Builds/opt/bin/StandAlone/sus

count=0
for f in $(cat nodefiles.txt)
do
  mpirun -np $PROCSPERJOB --hostfile $f $PROGRAM <arguments> > outfile$count.log &
  count=`expr $count + 1`
done

wait

# Cleanup temporary files
rm allNodes.txt
rm nodefile??
rm nodefiles.txt


This script assumes that the number of nodes each job requires evenly divides the total number of nodes requested. Here, the variables denoted by <> are set to integer values and those denoted with [] should be replaced with the appropriate strings.  Next week I will write a post on how to run a whole boatload of (perhaps short) jobs on a small number of nodes using a bash queue/array approach.  Until then, hope this helps!

Thursday, October 3, 2013

How-To: Run Simulations on a Supercomputer

Working on a cluster or a supercomputer necessitates playing well with others.  The workstation you sit at is rarely shared with more than a few people, but large shared resources can serve tens to thousands of users.  The computations you run must therefore be submitted to a "scheduler", which determines the order in which "jobs" execute and when yours will run.

The Portable Batch System (PBS) is the de facto standard for NSF, DOE and many other computer systems.  The PBS system defines a set of commands and directives that can be used to control a job.  A non-exhaustive list of the main commands is: qsub, qstat, qhold, qalter and qdel.  Really, you probably only need qsub, qstat and qdel until your jobs get a little more advanced.

To describe each briefly: when creating a job, you can either specify everything on the command line, such as "qsub -l nodes=1 ...", or pass a file that describes the job, such as "qsub filename".  I prefer the second, and suggest you use a file too.  Once you've submitted a job, you can check where it is in line and what its status is using the qstat command.  And in the off chance that you want to delete a job you've already submitted, you can issue "qdel jobid", where the job ID is the one reported by qstat.

Job Submission


Submitting a job is the most important step, and having a good submission script can really make your life easy; typos in a one-line qsub submission cause a lot of aggravation, and I've spent dozens of hours debugging crappy qsub lines.  The general form of a queue script is:

<directives>

<environment setup>

<job submission>

Each of these subsections is rather important.  Directives tell the scheduler how to run the job.  They include the resources needed, the queue to use (if there are multiple queues available), the job name, and how to report errors.  I usually use a set of directives like:

#!/bin/sh
#PBS -N <job name>
#PBS -l nodes=<number of nodes>,walltime=<HH>:<MM>:<SS>
#PBS -j oe
#PBS -o <job output file>
#PBS -M <email@site.domain>
#PBS -m abe
#PBS -q <queue name>

Here, each of the variables between < > must be specified.  The first is the job name that shows up in qstat.  Next you specify the number of nodes as well as the hours, minutes and seconds the job should run; for example, 11:59:59 will run for almost 12 hours.  The -j oe directive merges the error stream into the normal output stream, and -o names the file they are written to. This output is not the normal output from the job (we define that later in the file); it is the output and error logging from the batch scheduler, and any job issues are written there.  The -M and -m directives say to email the specified address when the job begins ("b"), ends ("e"), or aborts due to abnormal execution ("a").  This option will spam your email address, and you may want to specify only the "e" or "b" option.  The final option specifies which queue to submit the job to (in the case that there are multiple queues; otherwise leave this option out).

The next two sections of the file that you may want to specify are the environment and the actual command to run the simulation.  The environment probably looks like:

# Set up environment
export PATH=/Paths/to/different/programs:$PATH
export LD_LIBRARY_PATH=/Paths/to/required/libraries:$LD_LIBRARY_PATH

module unload <list of modules>
module load <list of modules>

Generally, when you set up an environment on a supercomputer, the first two lines of the block above are desirable.  They specify where the executables used by your simulation live and where the libraries it needs are located.  The module lines are used mostly on professionally run supercomputers, such as those provided by the NSF, DOE, etc., and allow you to load specific software into your environment; these will be the topic of a later post.  Next you will want to define output files and actually run the job.

# Define Program
PROGRAM=/path/to/program/program.executable
# Define output file
OUTPUT=outfilename.log

mpirun -n <processors> $PROGRAM > $OUTPUT 2>&1

The first line is optional, but I think it makes the script more modular.  It defines the program to be executed.  Next we define the file that the output from the program will be written to.  This is different from the file specified earlier with the #PBS -o directive in that it only contains output from the program itself and not from the scheduler.  Finally, the mpirun line runs the simulation with the number of processors specified.  This of course requires that your code be written to take advantage of multiple processors/nodes with MPI, though I assume you've already taken care of that.  (NOTE: some supercomputers have their own wrapper around mpirun; for example, DOE supercomputers use the "aprun" variant which makes life a little easier, especially if you are distributing many instances of your program over many nodes.)
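Putting the three pieces together, a minimal end-to-end script might look like the sketch below.  Everything in < > and every path is a placeholder; the only line not discussed above is "cd $PBS_O_WORKDIR", which moves into the directory the job was submitted from (PBS starts jobs in your home directory).

#!/bin/sh
#PBS -N mySimulation
#PBS -l nodes=2,walltime=11:59:59
#PBS -j oe
#PBS -o scheduler.log
#PBS -M <email@site.domain>
#PBS -m e
#PBS -q <queue name>

# Set up the environment
export PATH=/path/to/programs:$PATH
export LD_LIBRARY_PATH=/path/to/libraries:$LD_LIBRARY_PATH
module load <list of modules>

# Move to the directory the job was submitted from
cd $PBS_O_WORKDIR

# Define the program and output file, then run
PROGRAM=/path/to/program/program.executable
OUTPUT=outfilename.log
mpirun -n <processors> $PROGRAM > $OUTPUT 2>&1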

Job Monitoring

Once you have submitted a job you need to monitor its progress (or, more likely, where it is in the queue). The two commands to do this are qstat and showq.  I know that qstat is standard among all PBS systems; although I have seen showq on most of the systems I've been on, it is not standard.  qstat simply gives a list of all the jobs.  The output looks like:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  87725 0.39704 d2_eq03.co user1     r     09/05/2013 16:22:36 ge3@compute-0-11.local            32        
  87826 0.30814 CoPy5H_Co3 user2    r     09/05/2013 14:23:54 ge3@compute-0-10.local             4  
...

The columns are pretty self-explanatory; the first, second and fifth are the most important.  The first column is the job ID, which is used for all job modification operations.  The priority describes how close the job is to the front of the queue; I haven't quite figured out exactly what the number means, though it appears to run from 0 to 1.  States can be r, qw or h, for "running", "queued/waiting" and "hold".  Perhaps the most useful invocation is "qstat -u username", which lists only that user's jobs; I usually alias it, e.g. alias qme='qstat -u myusername'.

I find showq more interesting in that it shows the currently running jobs, the jobs that are in the queue (with their remaining or requested walltime) and the ones that are on hold.  Up to now I haven't described what hold means.  Jobs are on hold for one of two reasons: 1) the administrators of the supercomputer put them on hold (THIS IS VERY BAD, FIND OUT WHY YOUR JOB IS ON HOLD), or 2) you put them on hold.  The second option often occurs when one simulation depends on another; for instance, you want a simulation to restart from the previous one, so you hold it until the first finishes.  I will cover how this works in a later post.

Anyway, showq gives you output (pretty self-explanatory) like this:


active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

388243             user1    Running    16     3:23:39  Thu Oct  3 16:31:59
388200             user2    Running    16     5:01:11  Thu Oct  3 18:09:31
...

45 active jobs         616 of 1880 processors in use by local jobs (32.77%)
                         77 of 235 nodes active      (32.77%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

385175             user1       Idle     4  1:00:00:00  Thu Sep 26 14:37:15
385176             user2       Idle     4  1:00:00:00  Thu Sep 26 14:37:24
388412             user3       Idle   160    23:00:00  Wed Oct  2 10:06:41

3 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

388201             user1       Hold    16     6:00:00  Tue Oct  1 18:05:25
388202             user2       Hold    16     6:00:00  Tue Oct  1 18:05:25
...

64 blocked jobs   

Total jobs:  112

Finally, one other useful command that comes with many of the supercomputers (though not all of them) is showstart.  You pass it the job id and it estimates when the job will begin in the format 'dd:hh:mm:ss' based on its priority.
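For example, using one of the job IDs from the listings above (the number itself is of course just an example):

showstart 388243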

Job Modification

Job modification comes in two forms.  You can modify the state of a job, for instance put it on hold or delete it.  These are accomplished with:

qalter -h u jobid
qdel jobid

The other way to modify a job is to have the queue script execute other scripts.  This way the actual work is not hard-coded into the queue script, and if you discover a change you need to make while the job is sitting in the queue, you can edit the called script without having to wait through the queue again.  This becomes especially important on national supercomputing resources, where you may wait in the queue for a few days.
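A sketch of that pattern is below; runSimulation.sh is a hypothetical script that holds the environment setup and the actual mpirun line, and it can be edited right up until the job starts.

#!/bin/sh
#PBS -N wrappedJob
#PBS -l nodes=1,walltime=02:00:00
#PBS -q <queue name>

# Move to the submission directory and hand off to a plain
#  shell script that does the real work; that script can be
#  edited while this job waits in the queue.
cd $PBS_O_WORKDIR
./runSimulation.sh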

Those are sort of the main commands you will need.  I'll have a few more advanced topics in a near future post.

Monday, September 23, 2013

Crash course on Gnuplot

Gnuplot may be my favorite program in the whole world.  Of all the graphs I've made (and every plot I've ever published), 95% have been made with gnuplot.  While I am starting to use Python's Matplotlib here and there, gnuplot is still my favorite.  It can be used for quick-and-dirty plots and spot checking of data, or the plots can be readily customized and beautified to publication quality.  This How-To will give you a basic working understanding of the program.  I write it because the gnuplot user guide isn't as useful as it could be (and because my favorite FAQ is no longer a webpage...).

Basics


Gnuplot can be run interactively or driven by a script describing the full set of plotting commands.  The syntax is the same either way.  To start off in interactive mode, run gnuplot from the console.  On a reasonable system the terminal (gnuplot's name for the place where the plot is drawn/written) will be set up to display the plot in a GUI window.  Start off by plotting:

gnuplot> plot x**2, 20*sin(x)

Which will show the quadratic and a sine function:
The ** is gnuplot's exponentiation operator, so x**2 is x squared.  This is a simple, rather arbitrary example, but most of the standard mathematical functions are available.  In addition, your own functions can be defined:

gnuplot> f(x)=a*x+b
gnuplot> a=1.5; b=5
gnuplot> set grid
gnuplot> plot f(x)

Which gives the plot:

Here I also used the "set grid" command to turn on the background grid.  The basic syntax is "set option <attributes>", where the option in this case is grid and there are no attributes (there can be a list of attributes after the option).

Of course, the program wouldn't be useful unless you could plot data from experiments and also fit datasets to functions.  Take, for example, fitting an exponential to the decay of a reactant in a first-order reaction: A-->2B.  I'll use the following datafile (in gnuplot, comments in datafiles begin with a # symbol):

# t (s)  A (M)  Error (M)
0  3     0.001
1  2.3   0.05
2  1.55  0.05
3  1.22  0.03
4  0.99  0.04
5  0.65  0.02
10 0.145 0.08

We can write a function and fit it to the data:

gnuplot> f(x)=A*exp(-x/tau)
gnuplot> fit f(x) 'datafile.dat' via A, tau
gnuplot> set xlabel "Time (s)"
gnuplot> set ylabel "Concentration (M)"
gnuplot> plot 'datafile.dat' using 1:2:3 with yerrorbars t "Points", '' using 1:2 w l t "Lines", f(x) t "Fit"
gnuplot> set xrange [-0.5:10.5]
gnuplot> replot


The fitting procedure gives parameters of the fit and the certainty of the value:

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

A               = 3.0108           +/- 0.05894      (1.957%)
tau             = 3.33788          +/- 0.1288       (3.858%)

Which gives the plot:

Now this last example was a bit more like what you may want to do.  I'll take you through it step by step.  We used the "set" command to specify the text for the labels.  In addition to the xlabel and ylabel, you can set the title and a number of other "labels".  Next we plot three different things: 1) the points in the datafile with error bars based on the third column (the first two are x and y), 2) a line connecting the data points and 3) the fit to the data.  After that, I realized that I wanted to change the range of the x-axis, so I defined a range from -0.5 to 10.5 and issued the "replot" command, which re-executes the last "plot" command with the new settings. Now we are getting somewhere!

However, as you may have noticed, the plots I have created so far look kinda crappy.  This is because I used default settings and printed to the "png" terminal.  To make a prettier plot, we can edit each piece/attribute independently and print to a better file format.  Taking the last example one step further, I'll also plot the product B concentration.  This time I used a file with the list of commands written out; this is a nice way to make a plot reproducible, and it simplifies things once there are quite a few commands.  The figure was produced with the command gnuplot plotfile.gp, where plotfile.gp looks like:

# Define functions that fit the data
fA(x)=A*exp(-x/tau)
fit fA(x) 'datafile.dat' via A, tau

# Set up the x-axis and the left-side y-axis
set xlabel "Time (s)"
set xrange [-0.5:10.5]
set ylabel "Concentration of A (M)"
set yrange [0:3.5]

# Set up the Y-axis on the right side
set y2label "Concentration of B (M)"
set y2tics
set y2range [0:7]

# Make the points bigger
set pointsize 1.75

# Define where the legend is (i.e. below the plot)
set key below

# Set up file to plot to, make it a postscript file with
#  a pretty print (enhanced option) with an Arial font
#  at 16 points size.  Specify the file name to be
#  concentrations.ps
set terminal postscript color enhanced "Arial" 16
set output 'concentrations.ps'

# Here, "axis x1y2" means the x is the bottom scale and the y
#  is on the right side.  "yerrorbars" means that the third
#  column is the error bar spread.  Similarly, one could
#  use the xerrobars.  "pt" sets the point shape and the "lw"
#  option specifies how thick the lines are.
plot 'datafile.dat' using 1:2 w p t "Average A" pt 7 ,\
     'datafile.dat' using 1:2:3 w yerrorbars t "{/Symbol s}_A" pt 3 lw 3,\
     fA(x) t "Fit A" lw 3,\
     'datafile2.dat' using 1:2:3 w yerrorbars axis x1y2 t "Average and {/Symbol s} of B" pt 5 lw 3

I've included comments (beginning with the # character) so I don't have to describe each command here.  Another thing demonstrated here is how to plot Greek symbols.  You can see that {/Symbol s} makes a σ, which only works if you use the "enhanced" option with the postscript terminal.  The Greek symbols are indicated by the letter that most closely matches their sound, and a full list (along with a few other symbols) can be found here. The result (converted to a png using another program) looks like:

The terminal can be set to many different output types: jpeg, png, postscript, encapsulated postscript, pdf, etc.  Virtually any font the system has can be used, and any label can be given its own font and font size.  In addition, different curves can be tied to either pair of x- and y-axes, for example: axes x1y2, axes x2y1 and axes x2y2.
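As an aside, the postscript-to-png conversion I mentioned can be done from the shell; ImageMagick is one option (the -density flag just controls the resolution of the rasterized image):

convert -density 150 concentrations.ps concentrations.png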

Plotting multiple panels in the same file is very useful.  The origin of the coordinate system is the bottom left, the same as a standard coordinate system, and the units are normalized, with 1,1 being the top right.  Schematically this looks like:

Sizes of the subplots are also normalized numbers.  For example you can plot two functions, one on top of the other like:

# Turn on the multiplot capabilities 
set multiplot
set origin 0,0
set size 1,0.5
plot x
set origin 0,0.5
plot x*x

Which creates the following plot:

Note, however, that you need to specify the terminal and the output file before you do any plotting, and even before you turn multiplot on.  I think multiplotting interactively is practically useless, and I only really use a plotfile myself.  Multiplot also easily allows, say, three plots, two stacked on the left and one filling the right side, such as:
Which was created with:

set terminal postscript color enhanced "Arial" 16
set output 'trip.ps'

set multiplot
# Plot the first
set origin 0,0
set size 0.5,0.5
set xlabel "X Label"
set ylabel "Y Label1"
plot x
# Plot the second directly above the first (no x label needed)
set origin 0.0,0.5
unset xlabel
set ylabel "Y Label1"
# Note for math functions, integer vs
#  floating point arithmetic (3/2=1 vs 3.0/2.0=1.5)
plot x**(3.0/2.0) t "x^{3/2}"

# Plot the third
set xlabel "X Label"
set ylabel "Y Label2"
set size 0.5,1.0
set origin 0.5,0.0
plot x**2 t "x^2"

More Fun



Now that I've covered the main functionality of gnuplot, I'd like to show you a few other cool things you can do.  I often plot histograms:

# Define a function that calculates which bin 
#  the x value should be in
binwidth=0.05   # width of the bin
bin(x,width)=width*floor(x/width) + width/2.0

# Pretty the plot
set xrange [-1:1.5]
set key below

# Call the bin function on the first column: $1
plot 'histDat.dat' using (bin($1,binwidth)):(1.0) smooth freq with boxes t "Gaussian: <x>=0.2, {/Symbol s}=0.2"


This creates a plot that looks like:
One thing you should notice is that you can call any function on a column of a file.  For example, with the same distribution I could call plot 'histDat.dat' using 1:(f($1)), which would plot f applied to each value in the first column.  You can customize the bars in a number of different ways: fill with patterns, solid colors, etc.

Plotting a surface can easily be accomplished using the splot command instead of plot; it plots functions of two variables or a datafile with three columns.  If you use a datafile, each block of rows sharing the same x value needs to be separated from the next block by a blank line.  Usually I use a heatmap to show the height instead of the wireframe; it's more descriptive.  Take, for example, a Gaussian function:

# Define functions and ranges
f(x,y)=1/sqrt(3.14)*exp(-(x**2+y**2))
set xrange [-3:3]
set yrange [-3:3]

# Set the plot to show height with color
set pm3d
# Set the number of boxes to divide the function into
set isosample 500, 500
# Plot
splot f(x,y) with pm3d

Which makes a nice plot:
Also, it is nice to see the contours for a surface plot drawn on the base of the 3D plot sometimes:

# Define a function
f(x,y)=x**2*cos(2*x+0.5*y)

# Draw contour lines on the base of the plot
set contour base
# Set the contour line levels
set cntrparam level incremental 0,2,20
set isosample 25,25

set xrange [-5:5]
set yrange [-5:5]
# Turn off the legend so we don't
#  see all the multitude of contour
#  lines
unset key   

splot f(x,y)

Making the plot:

Well, those are the best of the basics; we have barely scratched the surface.  You can do pretty much anything you can do with Excel--gradients, bar graphs, backgrounds, surfaces, pie charts, ribbon charts--plus quite a few other things.  There are a number of great tutorials that are a bit more fully featured, such as this one by Zoltán Vörös, which is one of the more imaginative and unique tutorials online.

Thursday, September 19, 2013

Awk - Average and Standard Deviation

A quick little awk script to compute the average and standard deviation of a specified column in a file.


#!/bin/bash
# Bulletproofing
if [[ $# -lt 2 ]]; then
  echo "Usage: ./avgStd.sh <file> <column>"
  exit 1
fi

# Compute Average and Std. Dev.
avg=`awk -v var=$2 'BEGIN{count=0; avg=0} {count=count+1; avg=avg+$var} END{print avg/count}' $1`
std=`awk -v var=$2 -v av=$avg 'BEGIN{count=0; std=0} {std=std + ($var-av)*($var-av); count=count+1} END{print sqrt(std/(count-1))}' $1`

# Print results
printf "Average:\t%s\n" "$avg"
printf "Std. Dev:\t%s\n" "$std"

You should be able to copy/paste this code into a file, make it executable with chmod u+x, and run it with two arguments, the file and the column.  For example:

./avgStd.sh derp 2
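As a quick sanity check, suppose derp contains:

a 1
b 2
c 3
d 4
e 5

Then the second column averages to 3 with a sample standard deviation of sqrt(2.5), about 1.58, and the script reports something like the following (awk's default formatting decides the exact number of digits):

./avgStd.sh derp 2
Average:        3
Std. Dev:       1.58114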

Computational Scientist Tool-belt: Minimum Set of Software

Having a tool-belt full of tools is very beneficial for computational modeling scientists/researchers.  But learning each tool is time consuming, and often only a small subset of the available tools is required to accomplish almost any standard problem you will run into.  

Here, I assume you are using a POSIX environment, which is my specialty.  While I won't say that Windows-based computers are not good for scientific modeling, I believe POSIX-based computers (Linux, BSD, Macintosh, etc.) are slightly superior. Note: you could install Cygwin or other terminal-based software to emulate POSIX, but you won't get the tight coupling with the OS, the simplicity of the package managers available on POSIX-based systems, or the ease of compiling and supporting much open-source software.

The following tools are my recommendations for a basic tool-belt.  In some cases simple examples of usage are demonstrated.  I provide links to either Wikipedia or the exact webpage, which can also be easily found via Google.

Remote Computers

Scientific computing often requires interfacing with computers besides the workstation you sit behind.  Oftentimes data is stored on a backup server, or you are using a remote computer (e.g. a supercomputer).  Logging into a user account on a networked computer and moving data between machines (in a secure manner) are two very common tasks.  Two commands to know are ssh and scp.  The first allows you to log into a remote machine where you have a user account.  The general form of the command looks like:

ssh username@hostname.domain

e.g.

ssh jp@mycomputer.com

You will often want to run a visual/GUI program on another machine, and this requires a feature of ssh called X11 forwarding.  Two nearly identical flags enable this behavior, -X and -Y.  The difference is subtle: -X applies the X11 security restrictions to the forwarded connection, while -Y treats it as trusted.  I prefer the second, since some applications misbehave under the restrictions of -X.  The command would look like:

ssh -Y jp@mycomputer.com

An analog to the cp command is scp, which allows you to move (more precisely, copy) files between two network-connected machines.  The basic command for copying a file from a remote host to the machine where you are typing is:

scp username@hostname.domain:/path/to/file /local/path/to/file

And the reverse also works:

scp /local/path/to/file username@hostname.domain:/remote/path/to/file

These commands work for single files; in order to move a whole directory, the -r flag needs to be added to the command line.  However, if a directory contains many files, it is better to compress them into one file and copy that.   Scp works well in many cases, but isn't great for copying large files, especially from supercomputers.  Scp also has the drawback that an interrupted copy--say there is a dropped connection--must be restarted.  There are better tools that cope with these drawbacks, which I will discuss in a post in the near future.
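For reference, the recursive copy and the compress-first approach look roughly like this (the paths are placeholders, as above):

# copy a whole directory recursively
scp -r username@hostname.domain:/remote/path/to/dir /local/path/

# or, for many small files, build one archive on the remote side
#  and copy that single file instead
ssh username@hostname.domain "tar -czf archive.tar.gz /remote/path/to/dir"
scp username@hostname.domain:archive.tar.gz .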

Scripting and Programming

Shell scripting is a means to simply specify a set of actions and logic in the same language as the shell (command line interface) that you use.  The two most common languages are sh/bash or csh/tcsh.  Personally, I prefer bash, though both are relatively easy to use.  Some of my favorite tutorials for bash are:
In addition to shell scripting, I suggest you learn a simple programming language such as Python or Perl.  While I have very little experience with Perl, it is a great tool for string manipulation.  Python is pretty common in the computational sciences and supports a number of features that make it very desirable.  It is easy to learn and is a powerful way to prototype short programs.  An interactive interpreter can be accessed by typing python into the terminal, which is handy for trying out any particular command.  A plethora of scientific, plotting and math libraries have been developed for it, including SciPy, matplotlib and NumPy, to name a few.  The best resources for Python can be found here:
If you plan to do code development of any kind, you will want software capable of version management.  A common problem is that a programmer writes some code that works, thinks of a better way to write it, tries that way, and breaks the working code.  Oftentimes it is difficult to bring the code back to its original working state.  Version control software saves the exact state of the code at each step and lets the user log a message describing what each change entails.  Three systems are commonly used: cvs, svn and git (well, really four, but Mercurial isn't used as often as the others).  Cvs has very much gone by the wayside.  Svn is common in many settings and has considerable legacy support.  Git is newer and supports a number of really neat features, though I would argue it has a considerably higher learning curve than svn.

When you find yourself working with svn for the first time, it will likely be with someone else's code.   Some simple commands might get you started.  When you first download the code from the svn repository--the central place that the code is stored and changes are tracked--you need to do a checkout:

svn co svn+ssh://username@server/path/to/repository

Usually a server will require a username, and the command will prompt for a password.  To make sure the code is up to date at any given time, you can issue an update command to get the newest version of the code in the current directory and any subdirectories:

svn up

In order to check the properties of the local version of the code, there are two commands:

svn stat

and:

svn info

These can be run in any directory of the code tree (the whole directory structure under the root of the repository).  The status command tells you of any modifications, additions, deletions, moves, copies, etc. that have occurred to files in the local working copy.  The info command reports the revision of the code (revision numbers start at 0 when the repository is created and count up), the svn repository and directory, as well as the date of the last change.

Once you've made changes and validated that they work, the code should be committed to the repository.  Changes may include newly added files which are added to the repository via the command:

svn add <filename>

An update to the most recent version is required before any of your changes can be added to the repository.  A commit (pushing the code you have modified/added up to the main repository) can be accomplished using:

svn commit -m "Message telling what the actual changes are.  This message logs your changes."

With any luck, this will add your (hopefully correct) changes to the repository for permanent provenance.  One final command that is very useful is svn log, which prints the commit messages. Any of these commands can be used on a per-file or per-directory basis.  Many other functions are covered in the svn documentation.
Git is a whole different beast; maybe start with the official git tutorial.
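That said, the everyday commands map over fairly directly; a rough cheat sheet (the repository URL is just a placeholder) looks like:

git clone https://server/path/to/repository.git    # svn co
git pull                                            # svn up
git status                                          # svn stat
git add <filename>                                  # svn add
git commit -m "Message describing the change"       # svn commit (records locally)
git push                                            # publish your commits to the server
git log                                             # svn log

The big conceptual difference is that a git commit only records the change in your local copy of the repository; nothing reaches the server until you push.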

One final program commonly used when programming is the diff command.  Diff lists the differences between two files on a line-by-line basis.  It takes two files as input and outputs just the changed lines.  The output looks something like this:


2,3c2,3
< green apple
< was very
---
> apple
> was 

The first line indicates which lines changed ("c" for changed; "a" and "d" appear for added or deleted lines).  The numbers give the beginning and ending lines in the first and second file.  Lines from the first file start with < and lines from the second start with >.
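For reference, a pair of files that would produce output much like the above (run as "diff file1.txt file2.txt"; the names are arbitrary) might be:

file1.txt:          file2.txt:
The                 The
green apple         apple
was very            was
tasty               tasty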

Data Manipulation and Plotting

Data often needs to be manipulated; maybe you need to convert a comma-separated file to a tab-delimited file.  An ideal tool for this task is sed, a stream editor.  Sed works on streams, which can be strings, output from other programs or, most commonly, files.  The sed command for this example is:

sed 's/,/\t/g' commaSeparatedFile.csv > tabDelimitedFile.txt

This command says: substitute ("s") a comma (",") with a tab character ("\t"), for every instance on each line ("g").  This is a simple example, but sed can also append, insert, delete and much more; I mostly use it for string replacement.  A fantastic tutorial was written by Bruce Barnett.
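A few other invocations I reach for regularly, all standard sed usage (the file names are just placeholders):

# delete every line that starts with a comment character
sed '/^#/d' input.txt > stripped.txt

# print only lines 10 through 20 of a file
sed -n '10,20p' input.txt

# do the substitution only on lines that match a pattern
sed '/Energy/s/hartree/Hartree/' input.txt > fixed.txt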

Stream processing is nice, but oftentimes you want to perform operations on the data in a file, say multiply each value in a column by a constant or calculate an average.  Awk is the tool for this type of manipulation.  Awk could be considered a scripting language, though its syntax and operation are a bit different from Python or other languages.  I find it most useful when you want to apply an operation to every line in a file, for example calculating the average of the 5th column:

awk 'BEGIN{avg=0; count=0}  {avg = avg + $5; count = count + 1}  END{print avg/count}' file.txt

or multiplying two columns and printing it as a third column:

awk '{print $1 "\t" $2 "\t" $1*$2}' file1.txt > file2.txt

Oftentimes you don't have all the columns in one file, and the cat command won't put the data into a format that is useful for a line-by-line processor such as awk.  The tool for this is the paste command.  Say you have two columns of data in two files that you want to multiply together with the previous command; you can paste them together (with a tab between the columns) with:

paste -d "\t" column1.txt column2.txt > file1.txt

This is really the only use of paste, but it can be very helpful.

Finally, sequences of numbers often come in handy, especially when shell scripting.  For this, the seq command is most common (that is, if you are on a standard Linux machine).  For some reason I don't understand, Apple decided to replace seq with jot (it's a BSD thing).  I often find myself doing something like this in a shell script:

for i in $(seq 10 20)
do
  # something with $i
  ...
done

Like paste, seq is a pretty simple command, with the only real customization being the format of the output numbers.  Jot, on the other hand, has a few other uses, for example printing something repeatedly a number of times:

jot -b hi - 1 10

will print "hi" ten times.  It can also print random numbers, which seq won't do, making it slightly superior in some ways.
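As an example of seq's formatting (GNU seq; the format string must contain a floating-point conversion such as %g), the following prints file01.txt through file05.txt, which is handy in loops like the one above:

seq -f "file%02g.txt" 1 5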

Quick-and-dirty, or professional plotting can both be taken care of with gnuplot.  This is a large and complex topic and will be handled thoroughly in a post very shortly.

Well these are the basic tools I think.  If you have any comments, leave them below.

Friday, September 13, 2013

Why a Blog on Scientific Simulations?

Surviving in Graduate School is hard, as I have recently discovered.  Using computers to solve those problems that come up during research, be they simple data analysis on a local workstation or complex computer simulations on national supercomputers, can be very desirable (more likely necessary).  While computers can greatly simplify or speed up the research process, when the programs/codes/tools that one uses don't work, it can be the most aggravating thing.  Yelling (with cursing) often ensues, and despair easily creeps in.

I have lived this life for more than 6 years now, and while these situations occur often, don't despair.  There is hope!  Getting by this many years was only possible by relying on those who share their knowledge online.  Stack Overflow may be one of the most useful websites around, and many blog posts exist to help one solve a specific problem.  This blog was created to convey some of the knowledge and skills I have accrued over the years, to help others who work in the computational sciences or conduct scientific research solve common problems and learn the necessary skills.

A little about my background: I have a BS in Chemistry and a BS in Computer Science.  The past six years of my life have involved computational sciences of many kinds.  Now I am pursuing a PhD in Chemistry, though the research is more realistically described as Computational Biophysics.  I have experience with a handful of programming languages and many unix tools.  With experience in computational fluid dynamics, continuum mechanics for material modeling, stochastic biological simulations, molecular dynamics and some quantum chemical modeling, my experience in computational modeling is rather diverse.  In the past I used and developed code for the Uintah Computational Framework used by many to simulate a number of materials and fluids problems. Currently, I develop code for the stochastic biochemical problem solving environment Lattice Microbes.  In solving these problems both workstations and supercomputers have had to be employed.

This blog will draw on these experiences and I intend to cover such topics as:

  • Data analysis/processing
  • Using supercomputers
  • Compiling software on different architectures
  • Math techniques in scientific computing
  • Setting up simulations
  • Programming techniques
  • and many more...
Some posts will be how-to's, some will be FAQs, but the majority will be short tid-bits of information that occur to me during the day.  I intend to make code available where possible, to help facilitate understanding, and to actively respond to comments and questions.

With that, I hope you all will find this useful.