Tutorial 10: Umbrella Sampling

In this lecture we will see how to setup, run and analyse an umbrella sampling simulation. This will be done in three (main) steps:

  1. generation of configurations sampled along a reaction coordinate through accelerated, unrealistic steered molecular dynamics (pulling);

  2. setup of umbrella sampling simulations, one for each reaction coordinate window;

  3. analysis of the results with the WHAM algorithm;

We follow Prof. Lemkul’s tutorial illustrated here. We will deal with a peculiar group of amyloid fibrils, \(A \beta_{42}\) fibrils, which are heavily linked to Alzheimer disease. In general amyloid fibrils are a stable state of proteins that is generated from misfolded conformations. These misfolded conformations tend to aggregate to form insoluble stacks that are supposedly responsible for toxicity inside the brain (Parkinson disease, Alzheimer,…).

Mechanism of formation of amyloid fibrils from misfolded states. Image taken from Adamcik and Mezzenga

Step zero: purpose of the simulation

What do we want to do? We want to investigate the unbinding free energy related to the detachment of a fibril from the amyloid complex through umbrella sampling. The reaction coordinate that we will use is the distance between center of mass of the detached chain (chain A) and that of chain B, that is the closest fibril to chain A.

Scientific question: why are these amyloid fibrils so stable in water? How to reduce their stability?

First step: system preparation

The system we are going to simulate has already been pre-processed by Prof. Lemkul and can be found here. This is the first NMR model extracted by the 2BEG pdb file, in which terminal patches have been added. 5 equal fibrils are present, and we are planning to detach one of them.

Export the pdb file to your directory on the cluster or keep it on your PC if you would rather work on your local machine.

Let’s generate the starting configuration of the system with pdb2gmx as usual.

gmx pdb2gmx -f 2BEG_model1_capped.pdb -ignh -ter -o complex.gro

-ignh ignores pdb’s hydrogens and -ter allows to interactively define the termini.

Choose the GROMOS96 53A6 parameter set, SPC water, “None” for the N-termini, and the carboxylic group “COO-“ for the C-termini for each chain.

Remark: GROMOS is not correct

The GROMOS force field is deprecated, and it should not be used in actual research projects.

If you use the 2019 or 2020 versions of GROMACS you can encounter a warning while using GROMOS. For the scope of this tutorial, which is not the accuracy of force fields, feel free to suppress the GROMOS-related warnings using the option -maxwarn 1

For your interest, here is a blog discussion where Erik Lindahl, one of the professional GROMACS developers, complains about GROMOS.

More in detail:

Solvation box

After this remark we want to tell the system that the chain B will be the reference for the following simulations. This is done by modifying the file topol_Protein_chain_B.itp, adding the following lines at the end of the file:

#ifdef POSRES_B
#include "posre_Protein_chain_B.itp"
#endif

The creation of the simulation box is non-trivial if we want to run a pulling simulation. Indeed, here we cannot simply specify a d parameter that authomatically defines the solvation box based on the desired distance from the walls. In that case the simulation would crash in few picoseconds once the pulled molecule reaches the opposite side of the box. If the simulation does not crash it generates artifacts by interacting artificially with the farthest part of the system (that it should never feel).

Moreover, minimum image convention must be continually satisfied, implying that the pull distance must always be less than one-half the length of the box vector along which the pulling is being conducted. Why? Because all distances are calculated with periodicity, so when we are computing the distance between the centers of mass of chain A and chain B we are calculating the distance between chain A and the closest image of chain B. Therefore a pulling distance larger than half of the simulation box would lead to unrealistic interactions.

To avoid this we will create a box with a \(\delta z = 12\) nm in order to pull our molecule for 5 nm with no problems:

gmx editconf -f complex.gro -o newbox.gro -center 3.280 2.181 2.4775 -box 6.560 4.362 12

The above command places the pdb in \((x = 3.280, y=2.181 z=2.4775)\) and generates a box of \((\delta x = 6.560, \ \delta y = 4.362, \ \delta z=12)\).

To check that everything went correctly load the structure in VMD and from the Tk console run:

pbc box

You should see an eloganted box towards the z axis, with the fibrils on one side.

Are you able to solvate and ionize the system with 0.15 M of NaCl?

Prior to the application of any algorithm we have to minimize and equilibrate the system. I assume that you are familiar with the basic concepts of minimization and equilibration, so we are going to skip these last, tedious steps.

The equilibrated structure, the corresponding checkpoint and the topology file can be downloaded here

Second step: Steered Molecular Dynamics (SMD)

If you cannot wait ages for Newton’s dynamics to do its job you can think of perturbing the system by adding an external force acting on certain atoms, while other atoms remain fixed. The external force is mainly expressed in terms of a spring.

Where is it useful? Which kind of events you can induce?

  • unfolding events: you force a protein to unfold and analyse the unfolding pathway

  • conformational changes : you may want to apply a force that forces a molecule to make a transition;

  • ligand unbinding: if you apply a force that makes a ligand go away from its partner molecule, at some point it will leave the binding pocket;

  • generation of topologically complex structures : biological polymers such as DNA can be actively confined, e.g. inside a virus. With a custom force you can generate such configurations and study their statistical properties. The same goes for topologically complex polymers, such as those containing knots and links (e.g. the kinetoplast DNA).

There’s no Free Lunch: Steered Molecular Dynamics generates unrealistic configurations that possess a negligible statistical weight in the unperturbed ensemble

Two types of SMD:

  1. Constant Velocity Pulling:
\[{\bf F} = - \nabla U\] \[U = \frac{1}{2} k (vt - ({\bf r}_{COM}(t) - {\bf r}_{COM}(0)) \cdot \hat{n})^2\]
  1. Constant Force Pulling: a constant force vector applied to an atom

We here employ constant velocity pulling, making use of this mdp file. We will pull chain A away from chain B since they are not connected by covalent interactions. Actually we want to investigate the disruption of the non-covalent interactions among the fibrils.

Let’s check the md_pull.mdp file and. First of all, check the first line of the file:

head md_pull.mdp

Remember the lines we added to the file topol_Protein_chain_B.itp? Here we are going to keep chain B fixed (DPOSRES_B) in order to pull away chain A away more easily.

Now let’s check the section related to pulling parameters:

pull                    = yes
pull_ncoords            = 1         ; only one reaction coordinate
pull_ngroups            = 2         ; two groups defining one reaction coordinate
pull_group1_name        = Chain_A
pull_group2_name        = Chain_B
pull_coord1_type        = umbrella  ; harmonic potential
pull_coord1_geometry    = distance  ; simple distance increase
pull_coord1_dim         = N N Y
pull_coord1_groups      = 1 2
pull_coord1_start       = yes       ; define initial COM distance > 0
pull_coord1_rate        = 0.01      ; 0.01 nm per ps = 10 nm per ns
pull_coord1_k           = 1000      ; kJ mol^-1 nm^-2

This should be kind of self-explanatory. It is important to remark a few points:

  • pull_coord1_type = umbrella has nothing to do with umbrella sampling. Indeed we are doing Steered Molecular Dynamics. But we want the external potential to be harmonic, therefore the quite misleading word umbrella;

  • pull_coord1_dim tells, in a binary way, where to apply the pulling (should be read as NO NO YES). Of course we can decide to have pulling in all three dimensions;

  • pull_coord1_rate : the pulling rate for constant velocity pulling. It’s basically the velocity at which I have to pull;

  • pull_coord1_k : the spring constant;

As usual you can check the meaning of all mdp parameters in the GROMACS website.

0.01 is a good parameter for pulling this system, but unfortunately a simulation of our system with 5 nm of pulling would require 500 ps, which is not affordable for all of you in a decent time. Therefore, in order to let you create a pulling trajectory you will perform a sort of ultra-pulling, with a superfast pulling rate.

This is not advisable, we do it just for fun

Change pull_coord1_rate to 0.05 and the number of steps of the simulation to one fifth of the current value.

Last step before pulling! Let’s make GROMACS aware of the two groups. Indeed, in the mdp file we have the names of the groups:

pull_group1_name        = Chain_A
pull_group2_name        = Chain_B

but GROMACS doesn’t know about these groups yet! Let’s create them with gmx make_ndx:

gmx make_ndx -f npt.gro

Insert the following lines to the prompt:

r 1-27
name 19 Chain_A
r 28-54
name 20 Chain_B

We are basically creating a group called chain_A containing the first 27 residues and a second group chain_B with all residues from 28 to 54. No big science here ;).

Let’s create the tpr now:

gmx grompp -f md_pull.mdp -c npt.gro -p topol.top -r npt.gro -n index.ndx -t npt.cpt -o pull.tpr

GROMACS > 2018 Users: this has likely generated a warning due to the tau_p parameter in Parrinello-Rahman barostat. This is related to situations of high pressure instability, that should not be an issue in our already equilibrated system. You can either suppress it (with -maxwarn) or decide to double the tau_p parameter of the barostat.

You are free to run the simulation now:

gmx mdrun -deffnm pull -pf pullf.xvg -px pullx.xvg

The only difference between this command and the usual one is the presence of two xvg output files that describe the pulling forces and the reaction coordinate, namely the position of the center of mass of the pulled chain.

Feel free to run the simulation on the cluster by submitting a pbs script, but please do not use more than 4-6 cores. Even on 4 cores it should not take more than 10 minutes.

Visualize the pulling trajectory with VMD

Third step: Umbrella Sampling

We generated a bunch of configurations that span the reaction coordinate landscape. It is time to identify the candidate configurations from which we will run separate umbrella sampling simulations.

We will use my trajectory, which has been run for 500 ps with one fifth of pulling velocity. Just because with a slower pulling the extracted configuration can be thought as more realistic.

Now we are using trjconv to extract all the configurations in pull.xtc in separate gro files:

gmx trjconv -s pull.tpr -f pull.xtc -o conf.gro -sep

Save the whole system with 0

You should retrieve 501 separate configurations. Now we want to extract for all of them the COM distances between chain A and chain B. In order to do so let’s use this bash script.

run the bash script and open the resulting file summary_distances.dat

The file contains the correct COM coordinates for each configuration that we extracted.

We want to identify the umbrella sampling candidate configurations such that they are separated by 0.2 nm of COM distance.

We keep conf0.gro as the first one, which has a COM distance between the chains of about 0.504 nm. The second configuration that has to be considered should possess a COM distance as close as possible to \(0.504 + \Delta x\). Remember that \(\Delta x = 0.2 nm\). The third one should have a distance as close as possible to \(0.504 + 2 \Delta x\) and so on.

Can you write a small script to perform this selection authomatically?

.
.
.
.
.
.
.
.
.
.
.
.
.
.

Ten minutes later: check your result against mine.

Ok now we have our set of configurations, each one associated to an umbrella sampling window. For each of them we impose a biasing potential centered on the COM distance of the reference configuration (0.504 for the first window etc.).

Each configuration deserves its proper equilibration and simulation

Umbrella sampling can be computationally expensive!

The production run here will not be unrestrained!

I will only give you the two mdp files, one for NPT equilibration and the other for production.

Check for their differences:

diff npt_umbrella.mdp md_umbrella.mdp

The output should be sort of clear, but ask questions if you do not understand the flow.

Here I report the commands that one should launch for every window:

gmx grompp -f npt_umbrella.mdp -c conf0.gro -p topol.top -r conf0.gro -n index.ndx -o npt0.tpr
gmx mdrun -deffnm npt0
gmx grompp -f md_umbrella.mdp -c npt0.gro -t npt0.cpt -p topol.top -r npt0.gro -n index.ndx -o umbrella0.tpr
gmx mdrun -deffnm umbrella0

I ran all these commands, obtaining a bunch of tpr files and pullx.xvg files, that you can find here. Download the compressed folder, we will need these files for the WHAM analysis.

Fourth step: analysis of results

Let’s recap a bit what we have:

  1. a set of simulations performed at different values of COM distance, each with a different biasing potential

  2. the values of the biasing force at each timestep and for each window

Let’s recap a bit what we want:

  1. use the aforementioned information to patch together the windows

  2. remove the bias and estimate the unbiased free energy of unbinding

The WHAM algorithm does what we need!

The theoretical background behind the WHAM algorithm is not discussed here. The best estimate of the unbiased probability of each value of the collective variable is found through a weighted average:

\[P({\bf s}) = \frac{\sum_{i}^{N_{sim}} h_i({\bf s})}{\sum_{i}^{N_{sim}} N_{i} e^{\beta(A_i - U^b_i ({\bf s}))}}\] \[A_i = - \frac{1}{\beta} \ ln (\sum_{\bf s} P({\bf s}) e^{- \beta U^b_i ({\bf s})})\]

where:

  • \(N_{sim}\) is the number of umbrella sampling windows;

  • \(h_i({\bf s})\) is the number of configurations of simulation \(i\) with value \({\bf s}\);

  • \(N_{i}\) is the number of configurations in simulation \(i\);

  • \(A_i\) and \(U^b_i ({\bf s})\) are the free energy shift and the bias potential of simulation \(i\);

Two unknown variables: \(P({\bf s})\) and \(A_i\)

Self consistent approach: we start from an educated guess of one of the two variables and we iterate until convergence

In order to apply WHAM we need a file (tpr-files.dat) with a list of umbrella sampling tpr files and another file (pullf-files.dat) with the list of corresponding pulling force files. Something like:

umbrella0.tpr
umbrella1.tpr
umbrella2.tpr
....

and

umbrella0_pullf.xvg
umbrella1_pullf.xvg
umbrella2_pullf.xvg
....

Create the files tpr-files.dat and pullf-files.dat

Let’s run it!

gmx wham -it tpr-files.dat -if pullf-files.dat -o -hist -unit kCal

If everything goes in the right way the output is constructed iteratively. The first result (file histo.xvg) is an histogram file containing the biased histograms:

Not all the histograms overlap sufficiently (this has to be checked always), especially histogram centered in \(s = 0.9 nm\), which is a bit “alone”. In this case we should carry on two other umbrella sampling simulations choosing the center in \(s = 0.8\) and \(s = 1.0\), exactly how we did before.

WHAM gives us the Potential of Mean Force (file profile.xvg) along the desired reaction coordinate, which approximates the unbinding free energy. This is the main result:

The experimental value for the unbinding free energy is 50.5 kcal/mol, so we are very close to it, with our asymptotic value of \(\sim 52.1\). Closer than what is obtained in Lemkul’s tutorial, where also the Potential of Mean Force has an evident discontinuity when the value of the reaction coordinate is around 0.8 nm. I’m pretty proud of it.

Try to plot your results. Use xmgrace if you have it.

Notes

Additional Material