## Tutorial 9: Graphs, proteins, and pyinteraph.

In this tutorial we will see:

- A quick intro to graph theory;
- How Pyinteraph works and how to use it;
- An overview of the reproducibility problem in science.

## Why do we need graphs?

In several problems we need to find relations between objects which are independent of the geometrical properties of said objects. For example, one might want to know if a city is reachable by train from another, or how many metro stops there are between two stations, without bothering about the precise path taken by the train.

*Two different ways of showing the London tube map*

In the case of proteins, this is important as some secondary and tertiary structure elements are conserved even though there are different loops connecting them, or the elements themselves are made by different residues. Again, we need a way to measure the connection between different numbered residues along the backbone, measure their strenghts, and if needed identify patterns.

## A quick intro to graph theory.

Graph theory is a rich and interesting branch of mathematics with applications ranging from electrical circuits, to topology, to network optimizations and analysis of protein structures.

We will cover only some basic notions of graph theory, in order to better understand pyinteraph and other research papers (and softwares) applying graph and network concepts to proteins’ structures.

### Definition

A **graph** $ G=G(V;E) $ consists of a set of vertices V and a set of edges E.
Two vertices $v_i$ and $ v_j $ are said to be adjacent if there is an edge
$e_{if}$ connecting them. In the same way, two edges $e_{ij}$ $e_{jk}$ are
said to be adjacent if they have at least a vertex in common.

*Basic concepts from graph theory. a) Schematic representations of vertices
and edges. b) A digraph, or directed graph. Those are most often used to
represent electric circuits. c) Weigthed graphs have different numerical value
(cliques) each vertex is connected to all other vertices.*

There are several possible things one can consider in graphs. One can put
arrows on the edges to indicate a flow, as in electric networks or chemical
reactions, one can consider multiple edges between vertices (*multigraphs*), or
give different **weights** to different vertices and edges. This latter
consideration is particularly important in Physics (and Chemistry!), where we
want not only to know that there exists a connection between two entities, but
also its strength or probability. For example, graph c) could indicate a
molecule: the vertices are the atoms and the edges the bonds. Their respective
weigths could be set to indicate different chemical properties.

From our definitions of a graph, it follows immediately that any graph $G(V;E)$ has subgraphs, $\{ G’(V’;E’)\qquad |\qquad V’\subseteq V,\quad E’\subseteq E \}$.

*a) An example graph and, b) some of its subgraphs.*

Roughly, how many different subgraphs can be identified in a given graph?

For any two vertices $v_i$ and $v_j$ in a graph, we say that there is a **path** connecting them
if there exist a sequence adjacent edges $\{e_{il}e_{la}e_{ab}\ldots e_{mp}e_{pj}\}$.
Of course this is not necessary, and two vertices or even two portions of a graph
can be **disconnected**, i.e. not joined by any path. A (connected) component of $G$ is a
subgraph $G’$ of $G$ such that all of its vertices are connected by at least a path and there
are no paths between any of them and another component of $G$. Components are frequently
referred to as **clusters** in physics.

*Graphs with different connected components. In both cases the diameter of the
largest component is highlighted in yellow. In b) the removal of a single edge
(a “bridge”) causes the appearance of a third component.*

By removing edges it is possible to split a graph in a set of connected components. This is particularly relevant in the applications related to protein structures and analysis, where the graphs are weigthed since they represent some interactions. Introducing a threshold then splits up the graph into different connected components based on the value of the threshold.

Can you give an every-day life example of a weighted network which gets splitted in different components due to a threshold?

### Some simple properties

There are several measures that can be defined for graphs. We will limit
ourselves to the simplest ones:, **degree** of a vertex, **distance** between two vertices,
**eccentricity** of a vertex, **center** of a graph.

The **degree** of a vertex, $D(v)$ is simply the number of edges incident to
it. Vertices with a very high degree are often called ‘hubs’ and they are very
important when studying the statistical properties of networks.

The **distance** between two vertices $v_i$ and $v_j$ is the length (number of edges) of
the shortest path connecting $v_i$ and $v_j$, if it exists, or $\infty$ if $v_i$ and $v_j$
are disconnected.

Given a graph $G(V;E)$, the **Eccentricity** of a vertex $v_i$ is defined as

The maximum eccentricity is the **diameter** of the graph.

The vertex $v_c$ for with minimal eccentricity is defined to be the **center** of the graph.

*a) A path between two vertices. b) Distance between the same two vertices.
c) Eccentricity of the graph. d) Center of the graph.*

### Isomorphism

Graphs are **topological objects**. This means that the way we draw the diagram
does not change the graph, which is identified just by a set, $V$, of vertices
and a set, $E$, of edges connecting (some of) them.

Two graphs with the same diagram, but different labels are considered distinct.
Nonetheless, it is clear that there must be a relation between them. This relation
is an isomorphism. Two graphs $G_\alpha(V;E)$ and $G_\beta(W;F)$ are said to be
**isomorphic** is there exist a bijective function between $(V;E)$ and $(W;F)$
so that the structure of the graphs is maintained.

Consider the figure below. Are graphs c) and d) isomorphic? and a) and b)?

*different (?) graphs*

### Mapping graphs to matrices

Graphs can be mapped easily to matrices. One of the most used is the **Adjacency matrix**,
whose elements are defined as in the figure below. This matrix is often called contact matrix in protein physics.

*Two graphs and their corresponding adjacency matrices.*

Adjacency matrices are not the only one relevant in graph theory, but they highlight an important principle: for undirected graphs, the matrices are symmetric. This means that they can be diagonalized, and their spectra hold information about the graph which is independent of the labelling of the vertices. Among other things, spectral analysis is used to identify isomorphic graphs and subgraphs, something very useful in the analysis of protein structure and comparison of different protein folds.

## Protein Interaction Networks & Pyinteraph

Graphs, like other topological tools, allow us to map continuous data into a discrete number of classes. This is of fundamental importance to recognize common patterns between different proteins. The most obvious graph in a protein is of course its topology. If cysteine bridges are considered, this contain some non-trivial information. Another clear example is given by the secondary structure, which can be extracted by looking at the pattern of hydrogen bonds. More in general, one can study the **Interaction Network** of a protein from its MD trajectories, and extract a series of networks (i.e. graphs) based on the frequency of different kind of interactions. To do so, we use *Pyinteraph*.

Pyinteraph is a software written in python and cython that identified interaction networks in MD trajectories. The nodes are usually side-chains, although the backbone is also considered for hydrogen bonds. The graphs identified are weigthed, with each edge proportional to the persistence of a given interaction in a thermodynamic ensemble.

Pyinteraph considers three type of interactions: **Hydrophobic**, **Salt
bridges**, and **H-Bonds**. They are identified according to the following
rules.

### Hydrophobic contacts

For hydrophobic contacts, the interaction between two residues is included if the center of mass of the side chain of the two hydrophobic residues is found within 5 Å of distance as a default.

*(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) Hydrophobic interactions network.*

### Salt bridges

For salt bridges, all the distances between atom pairs belonging to two “charged groups” of two diﬀerent residues are calculated, and the charged groups are considered as interacting if at least one pair of atoms is found at a distance shorter than 4.5 Å

*(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) Salt bridges network. Edges thickness is proportional to their strength (or “persistence”), hubs are colored in yellow.*

### H-Bonds

A H-bond is identiﬁed when both the distance between the acceptor atom and the hydrogen atom is lower than 3.5 Å and the donor-hydrogen-acceptor atom angle is greater than 120°. These default parameters can be modiﬁed by the user. As a default, both side chain and main chain H-bonds are included

*(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) Hydrogen bonds network. Bond thickness is proportional to their weight (“persistence”); a set of very persistent ones are
reported in the right panel.*

The details below are taken from the original article; however, for the hands-on session, we are going to use its more recent version, namely pyinteraph2.

### Pyinteraph workflow

How does pyinteraph identify the various interaction networks? It works as schematically reported below:

*(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) How pyinteraph works: 0. Start from an MD trajectory. 1. Identify weigthed adjacency matrices, 2. filter them, 3. join them into an unweighted matrix*

Starting from an MD trajectory, pyinteraph computes for every frame the hydrophobic, salt bridges and H-bond based on the relative distance and orientation of the relevant atoms. The output from this first step consist of three weighted graphs (i.e. three matrices), one per interaction, in which **the weight of the edges correspond to the fraction of frames in which said bonds were active**. The program (or the user) then identifies a critical persistence length to separate clusters which are persistent from interactions which are rarer. There is usually a sharp transition, as shown in the image below. As a second step, the program outputs three filtered matrices. Finally, it puts them together as an unweigthed matrix representing the whole interaction network of the protein.

*(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) Largest cluster size as a function of the persistence cutoff. The jump is the ‘critical persistence’ at which it makes sense to place the cut-off.*

### Assumptions

Pyinteraph works on the following assumptions.

- No PBC (if present, they must be removed first).
- No large transitions (e.g. no rare events). The trajectory is assumed to be sampling a relatively local free energy basin.

## Installing pyinteraph2

git clone https://github.com/ELELAB/pyinteraph2.git

Enter the pyinteraph2 folder.

conda env create --file environment.yaml --prefix ./pyinteraph-env

conda activate ./pyinteraph-env

python setup.py install

## Using Pyinteraph on a sample trajectory

For this tutorial, download the following folder.

### Calculating salt bridges

For any oppositely-charged charged residue pair, a salt bridge between them will be identified if at least two atoms belonging to the two groups are closer than the chosen cut-off. This calculation is performed for every model in the ensemble and the final occurrence value is written in the output files.

pyinteraph -s sim_protA.tpr -t traj.xtc -r sim.prot.A.pdb --sb-co 5 -b --sb-graph sb-graph.dat --ff-masses charmm27 -v --sb-cg-file charged_groups.ini

In the csv files there is one line for each detected interaction between pairs of residues, and the following columns: first residue chain name, first residue residue number, first residue name, first residue interaction group, second residue chain name, second residue residue number, second residue name, second residue interaction group and occurrence percentage of identified interaction.

Can you write a simple Python script that can plot the occurrence of the resulting interactions as a heatmap?

### Calculating hydrogen bonds

Hydrogen bonds are calculated between pairs of atoms that can act as either acceptor, donor, or both. These lists are defined in the hydrogen bonds configuration file as lists of atom names and can be customized at will. PyInteraph identifies hydrogen bonds based on the distance between the donor and acceptor atoms as well as the donor-hydrogen-acceptor angle, using the hydrogen bonds calculation module from MDAnalysis.

pyinteraph -s sim_protA.tpr -t traj.xtc -r sim.prot.A.pdb -y --hb-graph hb-graph.dat --ff-masses charmm27 -v --hb-ad-file hydrogen_bonds.ini

### Calculating hydrophobic interactions

This analysis identifies the interactions between hydrophobic residues in the protein. It does so by calculating the distance between the center of mass of side-chains of pairs of residues.

pyinteraph -s sim_protA.tpr -t traj.xtc -r sim.prot.A.pdb --hc-co 5 -f --hc-graph hc-graph.dat --ff-masses charmm27 -v --hc-residues ALA,VAL,LEU,ILE,PHE,PRO,MET,TRP

### Estimating the cutoff for occurrence percentage

The biggest connected component size value calculated for each occurrence value is written in the filter_graph output file specified by –c flag, whereas a plot in PDF format of these data is available by passing the -p flag.

filter_graph -d sb-graph_all.dat -c cluster_size_sb.dat -p clusters_plot_sb.pdf

Can you apply the same analysis for the other interaction types?

In several application of PyInteraph, an occurrence percentage of 20% is used as a threshold; it seems a reasonable value also for our case.

### Filtering the graphs using the identified cutoff of occurrence percentage

The input adjacency matrices are filtered so that edges with weights below the specified threshold are discarded.

filter_graph -d hb-graph_all.dat -o hb-graph_filtered.dat -t 20.0

Apply the same filter on the other graphs.

### Generating a macroIIN (unweighted)

The calculation of the macro-IIN can be performed by merging the three single filtered IIN graphs.

filter_graph -d hb-graph_filtered.dat -d hc-graph_filtered.dat -d sb-graph_filtered.dat -o macro_IIN_unweighted.dat

The macro-IIN.dat is unweighted (i.e.,all weights are equal to 1) by default. It should be used as unweighted matrix of data just for qualitative purposes (i.e. to have a map visualization of all the more persistent interactions in the protein and their reciprocal location). Nevertheless, it is also possible to assign weights to edges of the obtained graphs by supplying a weighted adjacency matrix file, for instance by using the interaction pseudo-energy map.

### Generating a macroIIN (weighted)

Fist generate the energy interaction map, and use the output for the weighted graph.

pyinteraph -s sim_protA.tpr -t traj.xtc -r sim.prot.A.pdb -p --ff-masses charmm27 -v --kbp-graph kbp-graph.dat

filter_graph -d hb-graph_filtered.dat -d hc-graph_filtered.dat -d sb-graph_filtered.dat -o macro_IIN_weighted.dat -w kbp-graph.dat

### Identifying the connected components

Connected components are subsets of nodes such that a path exists between any arbitrary pair of nodes within that subset, while no path exists between any arbitrary pair of nodes between different of such subsets. They are, in fact, isolated portions of the network. These usually represent the more interconnected parts of the protein.

graph_analysis -a macro_IIN_weighted.dat -r sim.prot.A.pdb -c -cb ccs.pdb

In the output PDB file, the B-factor column is replaced by the ID of the connected component to which the residue belongs. For instance, all residues having “1” as B-factor belong to the largest connected component (that has ID “1”), all the residues having B-factor “2” belong to the second-largest connected component, and so on.

Visualize the resulting PDB file with VMD.

### Identifying hubs

Hubs are graph nodes (i.e. residues) connected by a relatively high number of edges in the graph (i.e. they have a large node degree). Hubs are generally considered in graph analysis of protein structures as residues characterized by more than 3 or 4 edges in the graph (Fanelli and Felline, 2011; Vishveshwara et al, 2009).

graph_analysis -a macro_IIN_weighted.dat -r sim.prot.A.pdb -u -ub hubs.pdb -k 3

The resulting PDB file contains in the B-factor field the number of edges for each identified hub node; in the case of nodes with less than k edges, this number is 0.

Visualize the resulting PDB file with VMD.

### Exercise

Would you be able to apply Pyinteraph2 on a different trajectory?

## Reproducible & replicable science

The past ten years have (finally!) seen a surge in the interest for how Science is done, and in particular whether published scientific results can be replicated.

The results of these studies opened the era of the **“replication crisis”**.
Several results, particularly in psychology and Medicine were found to be non-replicable,
or simply wrong. This problem is now recognized by scientist themselves,
as illustrated by a survey from Nature.

Before we continue

Can you guess the distinction between **replicability** and **reproducibility**?

suspense…

**Reproducibility**means the ability of reobtaing the same results using exactly the same instruments. For Computational Biophysics that means the same software.**Replicability**means the ability of**independently**obtain the same results (e.g. through simulation with a different software).

Can you think of a few reasons for the lack of replicability?

suspense…

There are several. Aside from fraud, the most common are:

- Wrong use of statistics (especially in Medicine, Psychology, now relevant for ML as well).
- Pressure to finish a project, lack of funding -> lack of testing.
**Alternative hypotheses are not tested**- For computational (bio)physics: lack of proper training, wrong use tools, lack of documentation, etc.

In general, Physicists believe that the problem is not as serious in their field:

**Is that really the case though?**

### TABLE 4-1Examples of Reproducibility-Related Studies

Author | Field | Scope of Study | Reported Concerns |
---|---|---|---|

Prinz et al. (2011) | Biology (oncology, women's health, cardiovascular health) | Data from 67 projects within Bayer HealthCare | Published data in line with in-house results: ~20%-25% of total projects. |

Iqbal et al. (2016) | Biomedical | An examination of 441 biomedical studies published between 2000 and 2014 | Of 268 papers with empirical data, 267 did not include a link to a full study protocol, and none provided access to all of the raw data used in the study. |

Stodden et al. (2018a) | Computational physics | An examination of the availability of artifacts for 307 articles published in the Journal of Computational Physics | More than one-half (50.9%) of the articles were impossible to reproduce. About 6% of the articles (17) made artifacts available in the publication itself, and about 36% discussed the artifacts (e.g., mentioned code) in the article. Of the 298 authors who were emailed with a request for artifacts, 37% did not reply, 48% replied but did not provide any artifacts, and 15% supplied some artifacts. |

Stodden et al. (2018b) | Cross-disciplinary, computation-based research | A randomly selected sample of 204 computation-based articles published in Science, with a data-sharing requirement for publication | Fewer than one-half of the articles provided data: 24 articles had data, and an additional 65 provided some data when requested. |

Chang and Li (2018) | Economics | An effort to reproduce 67 economics papers from 13 different journals | Of the 67 articles, 50% were reproduced. |

Dewald et al. (1986) | Economics | A 2-year study that collected programs and data from authors who had published empirical economic research articles | Data were available for 72%-78% of the nine articles, two were reproduced successfully, three “near” successfully, and four unsuccessfully. |

Duvendack et al. (2015) | Economics | A progress report on the number of economics journals with data-sharing requirements | In 27 of 333 economics journals, more than 50% of the articles included the authors' sharing of data and code (an increase from 4 journals in 2003). |

Jacoby (2017) | Political science | A review of the results of a standing contract between American Journal for Political Science and universities to reproduce all articles submitted to the journal | Of the first 116 articles, 8 were reproduced on the first attempt. |

Gunderson et al. (2018) | Artificial intelligence | A review of challenges and lack of reproducibility in artificial intelligence | In a survey of 400 algorithms presented in papers at two top artificial intelligence conferences in the past few years, 6% of the presenters shared the algorithm's code; 30% shared the data they tested their algorithms on; and 54% shared “pseudocode”—a limited summary of an algorithm. |

Setti (2018) | Imaging | A review of the published availability of data and code for articles in Transactions on Imaging for 2004 | For the year covered, 9% reported available code, and 33% reported available data. |

Moraila et al. (2013) | An empirical study of reproducibility in computer-systems research conferences | The software could be built for less than one-half of the studies for which artifacts were available (108 of 231). | |

Read et al. (2015) | Data work funded by the National Institutes of Health (NIH) | A preliminary estimate of the number and type of NIH-funded datasets; focused on those datasets that were “invisible” or not deposited in a known repository; studied published articles in 2011 cited in PubMed and deposited in PubMed Central | 12% explicitly mention deposition of datasets in recognized repositories, leaving 88% (200,000 of 235,000) with invisible datasets; of the invisible datasets, approximately 87% consisted of data newly collected for the research reported, and 13% reflected reuse of existing data. More than 50% of the datasets were derived from live human or nonhuman animal subjects. |

Byrne (2017) | An assessment of the open data policy of PLOS ONE as of 2016 (noting that rates of data and code availability are increasing) | 20% of the articles have data or code in a repository; 60% of the articles have data in main text or supplemental information; and 20% have restrictions on data access. |

**Table adapted from Reproducibility and replicability in Sciences. Natl. Acad. Sci. Press, Washingon, DC**

## Mitigating the issue: how a CBP project should be organized or, ‘the three laws of CBP’.

How shall we proceed to set up a research project??

Before proceeding further, let’s discuss a bit how to set up a project so
that it is **reproducible**, **efficient**, and **scalable**. Think of these
as if they where the three laws of ~~robotics~~ CBP.

- The results of a research project must be
**reproducible**. - The protocol leading to the results should be as
**efficient**, as long as this efficiency does not hinder reproducibility. - The results of a research project should produce an impact at
**scale**, as long as this does not conflict with the reproducibility and efficiency.

## Actually, what is it that we can automate & organize?

If we want to make our projects more reproducible, we first have to identify what steps constitute our project!

- What is our objective?
- What is our starting point?
- How do we get from one to the other?
- Are there steps that can be easily automated?

### Let’s try to identify a simple pipeline.

A pipeline is something like the steps of a recipe. In each one we need one or more ingredients, and some steps might have to be repeated in order to achieve the desired results.

*What are the missing ingredients and steps from this recipe?*

What ingredients should enter each box in the pipeline above?

What questions do you need to consider at each step?

What step should substitute the question mark?

The filled-in pipeline will be added after class.

Once the you grasp the content and organization of the pipeline, two possible strategies to make this kind of work more reproducible should become clear:

**Script and automate repetitive tasks.****Keep your data and scripts organized.**

Ideally, if you do both of these things you will end up with a nice directory tree containing both the data for each step, and the scripts necessary to produce and analyse it. Of course, real life is messy, and it will take you several iterations to implement the correct setup, analysis, etc. until you converge to the final result. To keep track of these various changes and undo them if you made a mistake, a further tool is needed: **GIT**.

## Scripting philosophy

### Script & Automate everything you can.

Computers and automation exist so that we can spend less time doing repetitive
tasks. Those vary from vacuuming the floor to solving difficult equations to
simulating entire systems (*efficiency*).

Another advantage of automating tasks is that -if done properly- you know
exactly how they are executed (*reproducibility*).

If you do not try to reinvent the wheel, build consistently on top of
existing software, and frequently discuss techniques and aims with peers your
research and protocol will spread more easily (*scalability*).

Think carefully about what is an efficient approach….

Producing plots very quickly by hand, saving things in random directories?

Spending a lot of time designing your scripts & protocols?

When you think about designing a system, keep in mind that you want to optimize the overall time
required to get a **reproducible** result…

…And perhaps, **do not overdo it!**