### Homework from last week:

• Did you manage to render any figures from the systems you visualized?
• Did you encounter any specific difficulties?

## Fundamental ingredients of an MD simulation

In this session we will introduce the ingredients we need to perform a molecular dynamics simulation with Gromacs. It is a mesh up of information you can find scattered on different sources. The Gromacs getting started guide is anyway a must-read.

In this tutorial we will:

1. Discuss what a force field is.
2. Learn the basics of .pdb files.
3. Learn how to launch a very simple MD simulation with Gromacs.

There should be one folder: alanine-dip/.

### Overview

Let’s recap the so called MD machinery.

Performing a molecular dynamics simulation means solving numerically the Newton’s equations of motion for each atom in order to evolve the system at the next time step.

To do so, we usually employ the (velocity-)Verlet algorithm, described below.

%

How do we compute $F_i(t+\Delta t)$ in the above equation?

What information do we need to implement the Verlet algorithm?

suspense…

We will need to provide:

1. the initial positions of the atoms in the system;
2. the potential energy to compute the forces from (or directly the forces);
3. the masses of the atoms;
4. the time step of integration;
5. the initial velocity of atoms.

The initial position of the atoms in the system is given by the pdb file, which is the end-product of experiments.

Forces and the potential energy are calculated using a force field, which specifies how different pairs of atoms interact with each other depending on their distance, position in a molecule, etc.

The masses of the atoms are given in the pdb (and basically everywhere).

The time step should be as large as possible, in order to achieve long simulated times. However (there is always a caveat), we do not want to have huge errors due to the integration. Moreover with super-long time step, atoms will move too much and they would overlap resulting in a high repulsive force.

How is the repulsive force implemented?

The choice of the time step is based on the fastest motion in the system. As a rule of thumb, it should not be greater than 1/10 of this fastest motion. For all-atom simulation, the hydrogen vibrations in bonds with heavy atoms have a period of ~10 fs. Therefore we would use a timestep of up to 2 fs (with restraints on bond-containing H atoms).

The initial velocities are in general assigned from a Maxwell-Boltzmann distribution at the desired temperature, but they can be set to random values, extracted from a uniform distribution etc…

M-B distribution (from Wikipedia).

## Simulating an alanine dipeptide

Alanine dipeptide is a small peptide usually used to study protein backbone dynamics. It consists of an ALA in the middle equipped with two minimal peptide bonds. The N terminus NH2 gains an acetyl group CH3CO, while the the C terminus COOH obtains a methylamide CH3NH.

ACE - ALA - NME

How many carbon atoms are in the alanine dipeptide?

A pdb for the alanine dipeptide is provided.

What are the main degrees of freedom of the system?

Alanine dipeptide is an ideal tool for computational modelling since it is the simplest biomolecule that exhibits a non-trivial conformational change over long time scales. Specifically, the backbone rearranges in many diffrerent shapes that are described by the two dihedral angles of the backbone. These angles can be used to describe the overall motion, so they are called collective variables: with only two variables we can give a simplified description of a protein with more than twenty atoms!!

## Coordinates: PDB

We have already seen a pdb file in the previous lectures.

Let’s go the Protein Data Bank website and search for the structure 4pti. We won’t use this structure today, but it’s more instructive than the (too) simple alanine dipeptide.

Let’s give a brief overview of these columns

• serial: integer associated to each atom;
• name: “atom name” of each atom (see figure below);
• resname: 3-letter code for protein, it can be a 4-letter code for fancy residues;
• chain: 1-letter identifier for each chain in a pdb;
• x, y, and z: try to guess;
• occupancy: usually 1.00, it can be less than 1 if the structure is not univocally resolved;
• beta factor: the crystallographic temperature factor, related to the Root Mean Squared Fluctuation of atomic positions. Not always available;
• element, not always present.

Regarding the atom names, they follow the star convention: from $\alpha$ (of the Carbon $\alpha$) to $\beta, \gamma, \delta$

Histidine, as a constellation.

PDB files have a fixed format. Do not modify them manually if you are unsure of the outcome!

PDB files have 80 characters per line, irrespective of spaces. It is not safe to assume that coordinates or other columns will be separated by a space.

Visualise the protein using Beta as Coloring method.

Does the colour scheme fit with your intuition?

Open the pdb file related to alanine dipeptide now. Do you see evident differences?

## May the Force (field) be with you!

The idea is to solve the Newton’s equations of motion for all the atoms in the system. For each atom in 1D we can write:

We need then to know what the potential $U$ is. The set of functions and parameters that composes the $U$ is called Force Field (ff from now on).

The idea behind ffs is to mimic the experimental behaviour of proteins with a potential for atoms that has a feasible computational cost. There are several force fields available; among them, we mention:

• CHARMM, AMBER for all-atom simulations;
• Martini for coarse-grained simulations;
• oxDNA for simulations of DNA up to a few thousands bp;
• OPEP for coarse-grained simulations of amyloid protein stacking.

Each ff has its own functional form and its protocol to define parameters. These parameters are based on assumptions which are specific to the ff. Mixing data from different ffs, e.g. CHARMm parameters with the AMBER potentials, to mention two of the most popular ffs, will lead to unreliable results.

Do not mix parameters from different force fields!

Moreover, ff parameters evolve as more data from experiments and simulations are gathered.

We will mainly use Amber99sb-ildn force field for our all-atom simulations.

In general, each force field has two main parts (we are omitting the dependence on the coordinates and parameters that will be clear in the following): $$U_{ff} = U_{bonded} + U_{non~bonded}$$

Let’s have a graphical representation.

As their name suggest, we can define interactions for covalently bonded atoms as:

but we have also to take into account non-bonded interactions between molecules and within molecules:

## Simulating alanine dipeptide with Gromacs

We will perform a simulation of alanine dipeptide in water. In order to reach this goal, we cannot simply put some water around the protein, turn on the thermostat and start the Molecular Dynamics engine. A step-by-step protocol has to be followed: here we depicted the most basic workflow of a MD simulation.

The schemes below summarize what you need to do to prepare a production run. You can find a short description of the standard files used in a gromacs simulation here.

Preprocessing

First equilibration

The rest of the tutorial will be dedicated to implementing the above steps for the alanine dipeptide, running a small simulations, and analyzing some of its results.

### 0-BUILDING THE TOPOLOGY

Let’s start by using pdb2gmx to create the topology of our system. Usually the first step would be the removal of the crystallised water around the molecule, but here there are not water molecules in the pdb so we can skip that step.

gmx pdb2gmx -f dip.pdb -o dip_processed.gro

Never run Gromacs commands twice. If something went wrong delete all the files before trying again.

A command line interface should appear, in which you are supposed to decide which force field you want to use in your simulation. It is possible to choose among all the force fields that come with our version of Gromacs. Insert 6, AMBER99SB-ILDN force field. Next, we have to think about how we are going to treat water molecules: water is a highly non-trivial molecule to describe, and several different models are available. We will use the standard TIP3P water (prompt 1 in the shell).

You should see three files:

• dip_processed.gro, i.e. the pdb converted in a Gromacs-readable format;
• topol.top, i.e. the topology file;
• posre.itp, i.e. the position-restraints file

Open the file dip_processed.gro: you will see that it contains the same information present in the pdb, just displayed in a different fashion.

The posre.itp file contains information used to restrain the positions of heavy atoms. Why to do this?

The topology file describes all the ingredients of the system and how to treat them:

1. the force field to use: in our case something like #include "amber99sb-ildn.ff/forcefield.itp" ;
2. [ moleculetype ] : the name of the molecule and nrexcl. This number means that we have to exclude from nonbonded interactions all the atoms that are closer than nrexcl bonds from the considered atom. We will discuss later about this parameter ;
3. [ atoms ] : a complete description of the atoms that belong to each residue, with their mass and charge ;
4. [ bonds ]: the list of covalent bonds for this molecule. This has been inferred from the chemical structure and the force field information ;
5. [ pairs ] : list of nonbonded pairs: actually this list is useless, since the number of possible non-bonded pairs is $N*(N-1)/2$. It is used if one wants to modify the parameters of the 1-4 interactions. If you look at the atoms in [ pairs ] you will see only the atoms separated by three bonds;
6. [ angles ] : list of 3-body bonded interactions;
7. [ dihedrals ] : list of 4-body bonded interactions;
8. [ some-things-to-include ] : the position-restraints file, the force field details and the water model to use are included at the end of the topology;

We will analyse more in detail the topology files in the following lectures. You do not have to know in detail every aspect of it, just remember that whether the force field defines the parameters and the functional form for every possible interaction, the topology file tells us which interactions have to be included in our system.

### 1-CREATION OF THE SIMULATION BOX AND SOLVATION

Let’s create the computational “box” in which we are going to place the alanine dipeptide.

gmx editconf -f dip_processed.gro -o dip_newbox.gro -c -d 1.0 -bt cubic

With this command, we will create a new gro file, dip_newbox.gro. The box is cubic and centered (option -c) around the origin. The parameter -d is crucial and will be object of much discussion. It is the minimum distance between the protein and each edge of the box. It has to be chosen carefully in order to avoid artifacts in the simulation and to respect the Minimum Image Convention.

The explicit solvent dynamics uses Periodic Boundary Conditions: periodic images are created aroung the original simulation box (unit cell) in order to mimic the behaviour of an infinite system while dealing only with a limited number of atoms. The concept of Minimum Image Convention is crucial to this purpose: each atom interacts with the closest image of all the other particles. Hence, if we put a too low values of d, we would be in the situation in which the head of the protein could interact with its own tail (that is in the neighboring cell).

More detail in Gianluca’s lectures.

With the previous command we only defined the simulation box, now we have to fill it with water. For this we will use gmx solvate:

gmx solvate -cp dip_newbox.gro -cs spc216.gro -o dip_solv.gro -p topol.top

The options used here are:

• -cp dip_newbox.gro : where to take the input box from? Of course from the gro file created before;
• -cs spc216.gro : how to distribute the water molecules? gmx solvate automatically fetches the folder gromacs/share/top and looks for the specified pre-equilibrated model. spc216.gro is a good equilibrated 3-point model for SPCE and TIP3P solvent;

As usual you can save the output of the command by adding a > at the end of the line together with the name of the file.

Gromacs solvates the system and gives you the number of water molecules in the box. Would you be able to give a rough prediction of that number knowing only the shape of the simulation box?

### 2-IONS

It is crucial to add ions after the creation of the solvation box for two reasons:

• At this point of the process we could have a charged protein inside a box of water. The charged system should be neutralized before running any other step. If everything went right, this is not the case.
• It is important to mimic the physiological ions concentration in the cell. Of course it depends on the organism, the tissue, the environment we are considering. A value between 100 mM and 150 mM is employed in the vast majority of cases.

For this task we will use gmx grompp and gmx genion and we will employ for the first time a mdp (molecular dynamics parameter file) file. You find it in the tutorial3 folder. Here the mdp file is not useful, it is just needed by Gromacs to create the tpr.

gmx grompp -f minim.mdp -c dip_solv.gro -p topol.top -o ions.tpr

With this line we build a tpr file (molecular dynamics parameter file). This is a non-readable, atomic level file, which describes all the available information about each atom in the system.

Now we use the tpr to include 0.1M concentration of ions in our system.

gmx genion -s ions.tpr -o dip_solv_ions.gro -p topol.top -pname NA -nname CL -conc 0.1 -neutral

Genion takes in input the ion concentration we want to put in our system. We used NA for positive ions and CL for negative ones. In the following command line prompt you are asked to specify one group for replacement with ions. Elements of the chosen group will be selected randomly and replaced with ions. For instance if you choose System, every atom of our box has the possibility of being replaced by ions.

We don’t want protein atoms to be substituted by charged ions!

Always choose the SOL (or Water) keyword in this step, so that you can only replace water molecules with ions.

If we only wanted to neutralize the system, we could have used -neutral option insted of -conc.

Check the new file topol.top! Are there ions inside?

### 3-ENERGY MINIMISATION

Energy minimisation (EM) is a geometric optimisation technique that is inherently different from Molecular Dynamics (MD). While MD evolves the system through Newton’s equations of motion, EM is a (almost always) gradient-based method that finds the local minimum of a function, the molecular energy. It is used to give a reasonable starting structure to MD.

The aim of the minimisation is two-fold:

1. to remove the steric clashes that can be present when building a system adding elements;
2. to move your structure into the ideal one provided by the topology file, with equilibrium distances, angles…

The minimisation is a process that relaxes the overall structure towards a more energetically favourable configuration.

Exactly like before: let’s create the tpr file:

gmx grompp -f minim.mdp -c dip_solv_ions.gro -p topol.top -o em.tpr

Now the mdp file becomes important. It is the Configuration File used by Gromacs to determine all the parameters related to the simulation.

Open the minim.mdp file

You will find details about the algorithm employed to perform energy minimisation (steepest gradient descent in this case), the number of minisation steps and the threshold on force. The maximum force should be around $10^3$ kJ/mol/nm.

; minim.mdp - used as input into grompp to generate em.tpr
; Parameters describing what to do, when to stop and what to save
integrator  = steep         ; Algorithm (steep = steepest descent minimization)
emtol       = 500.0        ; Stop minimization when the maximum force < 500.0 kJ/mol/nm
emstep      = 0.01          ; Minimization step size
nsteps      = 50000         ; Maximum number of (minimization) steps to perform
energygrps      = system        ; Which energy group(s) to write to disk

; Parameters describing how to find the neighbors of each atom and how to calculate the interactions
nstlist         = 1         ; Frequency to update the neighbor list and long range forces
cutoff-scheme   = Verlet    ; Buffered neighbor searching
ns_type         = grid      ; Method to determine neighbor list (simple, grid)
coulombtype     = PME       ; Treatment of long range electrostatic interactions
rcoulomb        = 1.0       ; Short-range electrostatic cut-off
rvdw            = 1.0       ; Short-range Van der Waals cut-off
pbc             = xyz       ; Periodic Boundary Conditions in all 3 dimensions


The most important parameter is the integrator, that, in this peculiar case, is not even an integrator! The steepest descent algorithm calculates position at step $n+1$ as:

%

where $h_n$ is the maximum displacement and $F_{n}$ the force vector acting on all the atoms at step $n$ . On the other hand $max (\mid\textbf{F}_{n}\mid)$ is the maximum value of the force inside the force vector, i.e., the maximum force in the system.

The algorithm accepts/rejects the new positions with a simple rule: if the new potential energy $V_{n+1}$ is lower than $V_n$ the move is accepted and $h_{n+1} = 1.2h_n$. Instead, if $V_{n+1} > V_{n}$ the new positions are rejected and the displacement is reduced ($h_n = 0.2 h_n$).

The algorithm requires only $h_0$ (emstep) and two thresholds: emtol on the maximum force and nsteps that limits the number of moves.

gmx mdrun -v -deffnm em

This should take less than one minute. In order to check that everything went right we can analyse the behaviour of the potential energy:

gmx energy -f em.edr -o potential.xvg

Plot the potential written in potential.xvg

### 4-EQUILIBRATION AT CONSTANT VOLUME/PRESSURE

Energy minimisation ensures that our starting structures do not include steric clashes or inappropriate geometries. Nevertheless, the system is not ready for an unrestrained simulation: indeed, the solute is mainly equilibrated around itself, but not around the protein. NVT equilibration has the goal of finding the right orientation between solute and solvent (and ions of course). Next, the right density has to be found through NPT equilibration.

NVT: as usual, it is a twofold process. Here we will use for the first time the file posre.itp that we generated using pdb2gmx. With position restraints we impose a penalty to each movement of the heavy atoms during equilibration: applying these penalties, the solvent (unrestrained) can rearrange its orientation around the protein (which is almost fixed). Clearly we will not use position restraints in the unrestrained, production run.

gmx grompp -f nvt.mdp -c em.gro -r em.gro -p topol.top -o nvt.tpr
gmx mdrun -deffnm nvt

This can take a couple of minutes. Employ them to inspect nvt.mdp file: it is quite different from minim.mdp. The most important parameter is ref_t, which defines the reference temperature to be employed throughout the simulation. Here we report only some lines for the sake of brevity:

define = -DPOSRES  ; position restrain the protein
integrator              = md        ; leap-frog integrator
nsteps                  = 50000     ; 2 * 50000 = 100 ps
dt                      = 0.002     ; 2 fs
; Output control
nstxout   = 500       ; save coordinates every 1.0 ps
nstvout   = 500       ; save velocities every 1.0 ps
nstenergy = 500       ; save energies every 1.0 ps
nstlog    = 500       ; update log file every 1.0 ps
; Temperature coupling is on
tcoupl   = V-rescale             ; modified Berendsen thermostat
tc-grps  = Protein Non-Protein   ; two coupling groups - more accurate
tau_t    = 0.1     0.1           ; time constant, in ps
ref_t    = 300     300           ; reference temperature, one for each group, in K
; Pressure coupling is off
pcoupl   = no        ; no pressure coupling in NVT
; Velocity generation
gen_vel  = yes       ; assign velocities from Maxwell distribution
gen_temp = 300       ; temperature for Maxwell distribution
gen_seed = -1        ; generate a random seed

• define includes the position restraints on the heavy atoms;
• nsteps defines the number of steps for our equilibration;
• dt is the standard timestep, 2 femtoseconds, i.e. $\sim 1/10$ of the timescales of fastest motions;
• Output control section : how often to save coordinates, velocities, energies and how uften to update the log file;
• Temperature coupling section : all the details about the thermostat used, the reference temperature etc.;
• Pressure coupling section : pcoupl is set to no, we are in the NVT ensemble and there’s no need to couple our system to a barostat;
• Velocity generation section tells us the detail about how the initial velocities (at time 0) are generated.

gmx energy -f nvt.edr -o temperature.xvg

Choose 16 to analyse the temperature progression over the NVT equilibration.

NPT: constant pressure.

gmx grompp -f npt.mdp -c nvt.gro -r nvt.gro -t nvt.cpt -p topol.top -o npt.tpr
gmx mdrun -deffnm npt

What should differ from NVT and NPT equlibration?

1. NPT run should be a bit longer than NVT. Typically we run 300 ps of NPT and 100-200 ns in NVT.
2. pcoupl is on in NPT equilibration
3. here we used a checkpoint (-t nvt.cpt), i.e. we started from the last configuration retrieved with the NVT run.

gmx energy -f npt.edr -o pressure.xvg

Analyse the behaviour of the pressure. Do the values look meaningful?

The answer is No, MD engines calculate the instantaneuos pressure, which oscillates significantly. Oscillations of the order of the hundreds of bar are standard in small size systems. Time average give more reasonable values.

Can you do the same with the density? What do you observe?

Pressure coupling can be mainly attained in two ways in molecular dynamics engines: through Berendsen coupling or Parrinello-Rahman coupling. The first method is a weak coupling method that does not reproduce the correct NPT ensemble, so you should never use it for data production.

Parrinello-Rahman pressure coupling is an extended method: it adds variables to the phase space of the system, thus modifying the overall Hamiltonian.

More details on this in Gianluca’s lectures!

### 5-UNRESTRAINED MOLECULAR DYNAMICS

• no position restraints anymore;
• much longer than equilibration runs;
• starts from the last configuration of the NPT run;
• can be done at constant volume or at constant pressure, depending on what we want to simulate. Here we will simulate the system in NVT at 300 K;

As usual we proceed first with the creation of the tpr starting from the mdp file that is dedicated to the unrestrained simulation. We take as starting configuration and velocities the last frame of the NPT equilibration.

gmx grompp -f md.mdp -c npt.gro -t npt.cpt -p topol.top -o md_plain.tpr

Open the md.mdp file and change the number of steps to run to $2 \times 10^5$ (400 ps).

In this mdp file you can notice that we are saving the coordinates of the overall system (protein + ions + water). Usually one cannot afford this because it is very computationally demanding, so we always resort to saving only protein atoms. An exception is represented by those studies in which the effect of solvent-solute interactions are investigated in detail.

Let’s run the simulation!

gmx mdrun -deffnm md_plain

This can take some minutes on your laptop. Feel free to decrease the number of steps.

### 6-POST-PROCESSING

Once the simulation ends we need to clean a bit our data in order to:

1. remove the effects of the Periodic Boundary Conditions
2. remove the diffusive dynamics

During the simulation the protein may diffuse out of the box, so it can appear to be broken in two pieces. This is just an artifact that is corrected with trjconv:

gmx trjconv -s md_plain.tpr -f md_plain.xtc -o md_plain_noPBC.xtc -pbc mol -center

Choose the group corresponding to the Protein to center the trajectory around the dipeptide, then select System to print all system in output.

The option center centers the system in the box, while choosing -pbc mol means that the center of mass of molecules is put in the box.

Load the cleaned trajectory in VMD!

To do this just load the npt.gro file in VMD, then go to File/Load data into Molecule and select md_plain_noPBC.xtc.

Congratulations, you just run your first computational simulation of a biomolecule! We hope that you realised “by doing” how messy and prone to errors the process is. The importance of a well-defined protocol is self-evident together with the need of saving every detail/parameter of the setup for the sake of reproducibility.

# Further Notes

A not comprehensive yet interesting historical and technical report about the several models of water in computational simulations.