HPC@UniTN, folder organization, analysis with Gromacs.

In this lecture we will finally learn how to use the cluster to run a simulation. In the second part, we will talk about folder organization and we will have a brief introduction to simulation analysis with Gromacs.

In detail:

  1. How to use the cluster.
    • Architecture.
    • Usage.
    • Send and retrieve data from the cluster.
    • Modules.
    • Write and run small scripts.
    • How to perform a benchmark.
  2. Keeping files organized.
  3. Analysis with Gromacs.

1. Using HPC@UniTN

This is a very short guide to launching simulations on HPC@UniTN. It is meant to be helpful for the master's students who attended Computational Biophysics.

You are supposed to have a basic knowledge of bash/sh (i.e. using the terminal). You can find resources on bash under the resources section.

For more info about the cluster, see the HPC website.

How to connect

In order to connect to the HPC cluster, you have to be connected to the university network:

  • you are connected to Unitn-x or a similar network (so you are in one of the university buildings);
  • if you log in from home, you need to use the university VPN (install Pulse Secure and look up the UniTN instructions on how to connect).

Once one of the aforementioned criteria is fulfilled, open a shell (Ctrl+Alt+T for Linux users) and type:

 ssh <your-unitn-username>@hpc2.unitn.it

and then your password will be requested.

NOW you are connected to the login computers. DO NOT LAUNCH JOBS THERE: those computers are meant just for job submission, since they are shared among all connected users and they have to remain available.

If you want to do something on the fly, using only a few cores for a few minutes, run an interactive job (see below).

Your home directory.

Each user has a separate home directory. Here you can read, write, and execute files. It helps to have a common structure in your home directory. A suggested one is as follows:

.
├── projects    # scientific projects (see below how to structure them)
│   ├── proj-1  # of course, you should use sensible names..
│   ├── proj-2
│   ├── ...
│   └── proj-n
├── templates   # templates for common files, if needed.
├── transfer    # anything you want to scp to/from the cluster
└── usr
    ├── bin
    ├── man
    ├── include
    ├── lib
    └── src    # external software such as miniconda, gromacs...

Use mkdir to create this folder structure.
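For reference, a minimal sketch of the commands that create this layout, starting from your home directory:

cd ~
# top-level folders
mkdir -p projects templates transfer
# the usr subtree, created in one go
mkdir -p usr/bin usr/man usr/include usr/lib usr/src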

Using the man page, what does mkdir -p do?

Setting up where to find things

You want the programs you install to be easy to find when you need them. That is why we created a usr/ directory in your home folder. That way, when you compile and install a program, you just have to specify that it should be installed in /home/<name.surname>/usr and it will automatically place everything in the correct subfolders.

To make bash recognize where you put programs and libraries, add these lines to the .bashrc file in your home directory.

export PATH=$PATH:$HOME/usr/bin:$HOME/bin
export INCLUDE=$INCLUDE:$HOME/usr/include
export LIBRARY_PATH=$LIBRARY_PATH:$HOME/usr/lib

Exercise

$ Create a script to download a pdb file automatically from www.rcsb.org

This is what you need to know.

  1. Note that all PDBs are downloaded from https://files.rcsb.org/download/<code>.pdb
  2. you have a program, wget, that downloads files (or entire websites). Use man to understand how it works.
  3. You can pass <code> as a variable, ${PDBCODE}, where PDBCODE has been already assigned, e.g. PDBCODE=1AKE.
  4. A bash script is an executable file which begins with the line #!/bin/bash.
  5. The arguments for a script are $1, $2, $3… so PDBCODE=$1 will take the first argument and assign it to PDBCODE.

The steps to create your script are:

  1. Write a bash script using the info above and save it as get-pdb.sh (a possible sketch is shown after this list).
  2. Make it executable using chmod +x get-pdb.sh
  3. Launch it with ./get-pdb.sh <PDBCODE>
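If you get stuck, a minimal sketch of get-pdb.sh could look like this (the usage message is just one possible choice):

#!/bin/bash
# get-pdb.sh: download a PDB file from the RCSB, given the PDB code as first argument
PDBCODE=$1
if [ -z "${PDBCODE}" ]; then
    echo "Usage: $0 <PDBCODE>"
    exit 1
fi
wget https://files.rcsb.org/download/${PDBCODE}.pdb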

$ Can you launch it from another directory?

  1. Move it into ~/usr/bin.

$ Now, can you launch it from another directory?

Understanding BASH

From here on, BASH will be very important. We recommend the following resources:

A minimal intro to cluster architecture

Figure: a standard cluster architecture.

A cluster usually has:

  • Some login nodes to connect to the rest of the world.
  • A file server + disks to store data.
  • Computational nodes, to perform heavy calculations.
  • Hopefully, an infiniband connection to make the nodes communicate with each other.
  • A batch scheduler (PBS for our HPC cluster) to manage the computational jobs of different users fairly.

Figure: cores, CPUs, and nodes (image adapted from Supalov et al.).

Inside a node you will have one or more CPUs, each with one or more cores. The cores of a CPU usually share some very fast cache memory, while the RAM is shared by the whole node. This will affect some of you in the future if you end up writing and optimizing your own code (hint: you might want to avoid that).

An extremely quick overview of PBS

The cluster has a batch scheduler, a piece of software that organises the execution of the jobs coming from different users with different requested resources. The allocation is based on resources: those you need, and those that are free.

Resources are:

  • Node types (some have more cores, some have GPUs, for example).
  • # Nodes.
  • # Cores.
  • Amount of memory.
  • Walltime: the maximum time you will be granted; if your job has not finished by then, it will be killed. So choose it carefully!

These resources are usually organized into queues. That is, the cluster administrators can set limits on the walltime for specific resources. You can specify a queue (better) or let the scheduler pick a default queue for you, based on the resources you request. To see the queues, you can run:

qstat -S

We remind you that you should request at most 16 cores. As far as we know, jobs requesting more than 10 cores per node almost never make it through the queue. Therefore, either you ask for 10 cores on a single node (the following line will become clear shortly):

#PBS -l select=1:ncpus=10:mpiprocs=10:mem=4GB

or you ask for 10 cores on each of two nodes (thus using 20 cores in total):

#PBS -l select=2:ncpus=10:mpiprocs=10:mem=4GB

We will see later where you can find the Template batch files for PBS.

IMPORTANT: before launching THE simulation, run small test simulations (1000 timesteps at most) asking for different numbers of cores. For example: launch 3 simulations requesting 5, 10 and 20 cores (it goes without saying, set the walltime to 10 minutes). Depending on your system, you may see only a small gain in performance using 20 cores with respect to 10. If this is your case, use only 10 cores, since it will reduce your time in the queue.

Submitting a batch job

To launch a calculation on the nodes, you need to instruct the batch scheduler about your intentions: what resources you need, and what you need to run. Let's see a small, serial test job before jumping to production scripts.

Launch this script from your ~/usr/src directory on the cluster.

#!/bin/bash
# Resources you need
#PBS -l select=1:ncpus=1
# WALLTIME
#PBS -l walltime=00:10:00
# test queue
#PBS -q test_cpuQ
# Name of the job
#PBS -N USE_AN_APPROPRIATE_NAME
# stdout redirected here
#PBS -o appropriate_name_out
# stderr redirected here
#PBS -e appropriate_name_err

# This is a comment.
# From the /home in the compute node we move to the
# directory from which we launched the job
# (and where you, hopefully, have your files)
cd $PBS_O_WORKDIR

VERSION=2019.6
NAME=gromacs-${VERSION}
wget http://ftp.gromacs.org/pub/gromacs/${NAME}.tar.gz
tar xvf ${NAME}.tar.gz

Suppose this file is called submit_me.pbs. You can submit the job with:

qsub submit_me.pbs

Then you usually wait, have a cup of coffee and pray.

$ What is the result?

$ What is inside the output and error files?

What to modify

All the lines starting with #PBS are NOT comments but instructions for the batch scheduler.

#PBS -l select=<numberOfNodes>:ncpus=<numberOfCoresPerNode>\
:mpiprocs=<sameNumberOfNcpus>:mem=<GBofRAMnecessary>

#PBS -l walltime=hours:minutes:seconds

#PBS -q <queue to be used>

#PBS -N <name to be visualised using qstat>

#PBS -o/e <file to redirect stdout and stderr>

Submitting a parallel job for a GROMACS simulation

Although we can install gromacs, we will use the version provided by the cluster for our runs.

Pay attention to the command used to run a Gromacs simulation in parallel.

#!/bin/bash
#PBS -l select=1:ncpus=10:mpiprocs=10:mem=10GB
#PBS -l walltime=00:10:00
#PBS -q short_gpuQ
#PBS -N USE_AN_APPROPRIATE_NAME
#PBS -o appropriate_name_out
#PBS -e appropriate_name_err

# This is a comment.
# From the /home in the compute node we move to the
# directory from which we launched the job
# (and where you, hopefully, have your files)

module load openmpi-3.0.0

cd $PBS_O_WORKDIR

source /apps/gromacs-2018/bin/GMXRC

mpirun -np 10 mdrun_mpi -v -deffnm nvt

The line used to load the module will become clear shortly (see the section on modules below).

Interactive jobs

On the login shell type:

qsub -I -q <queue> -l select=1:ncpus=2:mpiprocs=2:mem=10GB,walltime=00:30:00

and then you can use the same commands you use in the submit_me.pbs file.
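For example, once the interactive shell opens on a compute node, a short test could look like this (module, paths, and file names follow the examples used elsewhere in this guide):

module load openmpi-3.0.0
source /apps/gromacs-2018/bin/GMXRC
cd ~/projects/proj-1
mpirun -np 2 mdrun_mpi -v -deffnm nvt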

Check your job(s) and delete them if necessary

The command to check for resources, jobs, statistics, etc. is qstat. To see what it does:

qstat -h

To check all the jobs running for your user:

qstat -u <username>

Once you have submitted a job, you can delete it with:

qdel <jobid>

Copying to/from the cluster

See the cluster page.

You can use:

  • scp: cp over ssh protocol. scp <file(s)> <user>@<destination_address>:<destination_dir> e.g. scp myfile.txt luca.tubiana@hpc2.unitn.it:
  • rsync: rsync is useful to sync folders. rsync -avuz <files(s)> <user>@<destination_address>:<destination_dir>
  • wget to download from webservers.
  • git is available only through the http protocol.

scp works as cp: if you want to upload or download a folder you have to use the option -r.

Please note the : at the end of the command: a path written after the colon is relative to your remote home directory, and with nothing after the colon the file lands directly in your home. If you want to copy a file from your computer to another directory on the cluster, you should use (from the terminal on your computer):

$ scp file your_username@hpc2.unitn.it:path/to/a/folder
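For example, to retrieve a whole folder of results from the cluster to the current local directory (the transfer/results path is just an assumption):

$ scp -r your_username@hpc2.unitn.it:transfer/results .
$ rsync -avuz your_username@hpc2.unitn.it:transfer/results .

rsync does the same job but skips files that have not changed, which is convenient for repeated transfers.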

As usual, use the man page or the command’s help to get more info on the commands!

  • man <command>,
  • <command> -h.

Using modules

A major problem on shared systems: different users need different software, different libraries, different compilers. The solution is to isolate the “software stacks” you use from those of the system, i.e. to let you use a specific version of the libraries/software you need, on demand. There are two main solutions for this.

  1. Using modules.
  2. Using containers like docker and singularity.

Modules are supported everywhere right now. The downside is that what is available is decided by the admins. Containers can instead be set up by the user, but they are not yet a standard in academia. Docker is the standard in industry, but poses serious security problems on shared clusters. Singularity tries to alleviate them. We will give you a quick intro to modules.

  1. See available modules: module avail.
  2. Load a module to use it: module load <module_name>.
  3. List loaded modules: module list.
  4. Unload a module: module unload <module_name>
  5. Purge all loaded modules: module purge.
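For example, a typical session to set up the MPI library used in the job scripts above might look like this (the exact module names depend on what the admins provide; check module avail first):

module avail                  # see what is installed
module load openmpi-3.0.0     # load the MPI library used in the job scripts
module list                   # verify what is currently loaded
module purge                  # unload everything when you are done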

Performing a benchmark

Depending on the size and type of the simulated system, there is a specific number of cores that corresponds to the maximum efficiency of the parallel setup. In order to find this optimal number of cores, before running the production simulation it is important to perform a benchmark, namely a series of short tests that use different numbers of cores. The duration of each test can vary; in general, 10 minutes should be enough to get a good estimate of the performance. Plotting the speed of the simulation (in ns/day simulated) as a function of the number of cores gives a visual indication of how efficiently the resources are employed. An example is given in the following figure, where the performance of different MD software packages is compared.

Figure: example benchmark comparing the performance (ns/day) of different MD software packages as a function of the number of cores.

Gromacs writes at the end of the log file (produced as one of the outputs of the mdrun command) some information that is useful for benchmarking, including the number of nanoseconds that can be simulated in one day. Even if the simulation cannot be completed within the allocated walltime, this recap of the simulation performance can still be printed in the log file by making use of the -maxh option of mdrun. With a walltime of 10 minutes, setting -maxh 0.16 lets the simulation terminate cleanly just before the walltime expires, even though the total number of steps is not reached. Thanks to this proper termination, we will find the recap of the simulation statistics at the end of the log file.

$ Using the files created for the production run of the alanine dipeptide simulation, perform a set of 10-minute-long simulations; for each simulation, use a different number of cores.
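A possible skeleton for one of these test jobs, to be duplicated and adapted for each core count (the queue name, job name, and the md base name used with -deffnm are assumptions to adjust):

#!/bin/bash
#PBS -l select=1:ncpus=10:mpiprocs=10:mem=4GB
#PBS -l walltime=00:10:00
#PBS -q short_cpuQ
#PBS -N bench_10cores
#PBS -o bench_10cores_out
#PBS -e bench_10cores_err

module load openmpi-3.0.0
cd $PBS_O_WORKDIR
source /apps/gromacs-2018/bin/GMXRC

# -maxh 0.16 stops the run cleanly before the 10-minute walltime expires,
# so the performance summary (ns/day) is written at the end of the log file
mpirun -np 10 mdrun_mpi -v -deffnm md -maxh 0.16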

Restart a simulation

When performing long simulations, the walltime that had been set might be reached before the simulation is completed. In such cases, it is necessary to perform a restart of the simulation. Gromacs offers a practical way to smoothly extend the trajectory file after a restart, by making use of the checkpoint file (cpt). These files are written at regular intervals during the simulation. To restart a simulation that has been killed, use mdrun with the following command, applying the appropriate filenames:

gmx mdrun -s X.tpr -cpi X.cpt -deffnm X -append

$ This is the serial version of the command; can you write it so that it runs in parallel?

$ What do you think is the purpose of the -append flag?

Further info

For more commands and the use of the batch scheduler:

A gentle and friendly reminder

You will be able, and you are encouraged, to use GPU nodes for your simulations, especially for those related to your final project. But do not forget that the cluster is a shared machine used for actual research by the whole university.

Use it responsibly.

Just a final friendly reminder on this subject.

We will monitor your usage of the cluster.

Any inappropriate usage will be reported
to the admins and the professors.

We don’t forgive.
We don’t forget.

Homework

Using the notions learned in the last two lectures, set up a simulation of T4 lysozyme and run it on the cluster. You can use the parameter files (mdp files) from lecture 2; however, remember to change the number of timesteps as appropriate.

The necessary steps are the following:

  • Download the structure from the protein data bank (PDB ID: 3HTB)
  • Remove the small molecules, ions and water molecules present in the file, in order to keep only the protein; you can recognize the additional molecules from the record name HETATM in the first column (see the sketch after this list).
  • Create the topology and set-up the box; add water and ions.
  • Perform minimization, and equilibration in NVT and NPT.
  • Perform the production MD run.
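For the clean-up step mentioned above, a minimal sketch using grep (assuming the downloaded file is called 3htb.pdb):

# drop all HETATM records (ligand, ions, crystallographic waters), keep the rest
grep -v '^HETATM' 3htb.pdb > 3htb_protein.pdb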

2. Keeping files organized

Choose a proper folder structure

An important and often overlooked aspect of running a project is that a well-designed layout considerably reduces the amount of work needed to understand and recover things later.

A project folder tends to explode into a chaotic mess pretty quickly. This aspect is so well known that it has become a classic joke about academia (see the "Filenames" comic).

In some fortunate cases, years of frustration crystallize on top of successful codes in the form of dark humour mocking the user (looking at you, Gromacs). Since that is not always the case, let's try to think of a decent structure.

Everything in the right place

Each project should follow a very similar folder structure. This has several advantages:

  1. You always know where to find things!
  2. It is easier to automate common tasks (e.g. copying files, launching jobs).
  3. The folder structure will make you think about your workflow.
  4. Your work will be easy to share and merge with others.
  5. Other people will be able to pick up your work.

The structure

This is a template folder structure designed to partially solve the issue, provided you follow it as intended.

.
├── bin                           -> all compiled programs and scripts. Compiled programs ignored.
├── data                          -> Data folder, ignored by git
│   ├── 00_external               -> External data, if present
│   │   └── Readme.md
│   ├── 01_raw                    -> Raw data produced by simulations
│   │   └── Readme.md
│   ├── 02_processed              -> Intermediary data
│   │   └── Readme.md
│   └── 03_analyzed               -> Results after analyses. This is used to produce figures.
│       └── Readme.md
├── doc                           -> Extra documentation if needed (e.g. software docs).
├── notebooks                     -> Jupyter notebooks.
├── references                    -> Store literature here, ignored by git.
│   └── Readme.md
├── report                        -> Should eventually lead to the final paper
│   ├── imgs
│   └── report_template.tex
├── src
│   ├── analysis                  -> Analysis scripts etc.
│   ├── external                  -> Third-party software. Ignored by git.
│   ├── production                -> simulation software and scripts used to produce raw data.
│   ├── tools                     -> Utilities (scripts)
│   └── visualization             -> Visualization scripts.
├── makefile                      -> The makefile should reproduce all figures and processed data.
├── AUTHORS.md
├── conda_QCB_essentials.yml      -> conda environment
├── README.md                     -> README for the project
└── test                          -> all tests on analysis scripts and production software

We are always improving this, and if you have comments they are welcome!

Setting up the folder structure on the cluster.

The folder structure above can be automatically installed through cookiecutter. For those interested, see cookiecutter to set up the project template. Let's install everything.

I. Install cookiecutter

On your laptop

conda activate <QCB-env>
conda install cookiecutter

On the cluster

module load python-<version> ... #use the above commands to find a version of python >= 3.6!
pip3 install --user cookiecutter

II. Use cookiecutter to get the folder

cd <some_directory_I_like>
cookiecutter https://gitlab.physics.unitn.it/sbp/reproducible-comp-phys-template.git

Cookiecutter will ask you a few questions, including the project name, your name, and your email. Fill them in correctly.

You will end up with a folder like the one above.

Of course, in order to use the conda environment you need to install conda first. If you want, you can use wget to download Miniconda to your home directory on the cluster and then proceed as usual.
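A possible way to do it, assuming the official Miniconda installer URL (which may change over time):

cd ~/usr/src
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh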

3. Analysis with Gromacs

Download the folder.

The output of an MD calculation performed using Gromacs is usually the trajectory in .xtc format. As we saw in lecture 2, the trajectory is usually processed before starting the data analysis: we remove the effects of the periodic boundary conditions and of the diffusive dynamics (by re-centering the molecule) using the trjconv command:

gmx trjconv -s md_tpr.tpr -f md_trajectory.xtc -o output_trajectory_noPBC.xtc -pbc mol -center

A further useful step is to remove the rotations and translations of the protein with respect to the initial conditions:

gmx trjconv -s md_tpr.tpr -f output_trajectory_noPBC.xtc -o output_trajectory_fitted.xtc -fit rot+trans

Now we can start the data analysis using the fitted trajectory output_trajectory_fitted.xtc, which we can rename to traj_fitted.xtc.

Apply the commands above to process the 3 trajectories in the folder.

Remember to specify the name of the system in the output file names, so that the input names remain recognizable in the outputs!



RMSD

The RMSD is defined as follows:
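For a selection of N atoms fitted onto a reference structure, its standard (unweighted) form is, with r_i(t) the position of atom i at time t and r_i^ref the corresponding position in the reference (gmx rms can additionally apply mass weighting):

\mathrm{RMSD}(t) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\mathbf{r}_i(t)-\mathbf{r}_i^{\mathrm{ref}}\right\|^2}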

The RMSD can be computed using the program gmx rms, which compares two structures by computing their root mean square deviation (RMSD). Each structure from a trajectory (-f) is compared to a reference structure, taken from the structure file (-s).

gmx rms -f traj_fitted.xtc -s tpr_file.tpr -o output_file.xvg

option: -n index_file.ndx


Calculate the RMSD of the two trajectories (of the p53 protein) and plot the results.


We can plot the data using gnuplot (type gnuplot in a new shell), and then:

>plot "file.dat" w lp
>replot "file2.dat" w lp

To quit gnuplot, type q and press Enter.

What information can you extract about the systems from the plots?


Create an index file and calculate the RMSD for a subset of residues

Suggestion:
to create the index file, start by using the command gmx make_ndx:

gmx make_ndx -f tpr_file.tpr -o index_file.ndx



RMSF

The RMSF is defined as follows:
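In its standard form, for atom i averaged over the T frames of the trajectory, with ⟨r_i⟩ its mean position:

\mathrm{RMSF}_i = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left\|\mathbf{r}_i(t)-\langle\mathbf{r}_i\rangle\right\|^2}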

The RMSF (Root mean square fluctuation) can be computed by the program gmx rmsf.

gmx rmsf computes the root mean square fluctuation (RMSF, i.e. standard deviation) of atomic positions in the trajectory (supplied with -f) after (optionally) fitting to a reference frame (supplied with -s).

gmx rmsf -f trajectory_file.xtc -s tpr_file.tpr -o output_file.xvg

option: -n index_file.ndx

Calculate the RMSF of the trajectory (p53_wt) and plot the results.

What information can you extract about the system from the plot?



Radius of Gyration

The radius of gyration is calculated as:
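In the mass-weighted form computed by gmx gyrate, with m_i the mass of atom i and the positions r_i taken relative to the center of mass of the molecule:

R_g = \sqrt{\frac{\sum_i m_i\left\|\mathbf{r}_i\right\|^2}{\sum_i m_i}}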

It is a quantity that measures how compact your protein is.

gmx gyrate -f trajectory_file.xtc -s tpr_file.tpr -o output_file.xvg


Which processes can be monitored by following the radius of gyration over time?

Calculate the radius of gyration for the trajectory (GB1), and plot the results over time.

Calculate the RMSD for the trajectory (GB1), and plot the results over time.

How can we interpret these two results? Visualize the trajectory (GB1) in VMD.



Chimera

Now we can try to download another useful program for molecular visualization: Chimera.

Visualize the initial structure of each trajectory (in .pdb format)

To extract the first frame of the trajectory and save it in .pdb format, we use the gmx trjconv command:

gmx trjconv -f traj_fitted.xtc -s tpr_file.tpr -o first_frame.pdb -dump 0

Remember to specify the name of the system in the output file names, keeping the input names recognizable in the outputs!



Notes