Tutorial 9: Git & Pyinteraph

In this tutorial we will cover three topics.

  1. What git is: a version control system. Git is essential to keep a project in good order.
    • What is git
    • git states
    • git branches
    • git cheatsheet
  2. How to set up your project on Gitlab.
    • connecting the project folder
    • uploading data
    • collaborating effectively
  3. How Pyinteraph works and how to use it.

Using git

What is git? Why should I care?

Git, like all serious version control systems, tries to solve two main problems.

  1. Keeping track of all changes to files.
  2. Letting people collaborate on complex projects effectively.

A Version Control System (VCS) allows you to:

  • revert selected files back to a previous state,
  • revert the entire project back to a previous state,
  • compare changes over time,
  • work on the same project with two or more people,
  • see who last modified something that might be causing a problem.

You can imagine git as a sort of time-travel machine, which allows you to also jump to alternate realities (the parallel branches on your project).

how git works

a successful git branching model

Using a VCS also generally means that if you screw things up or lose files, you can easily recover.

If you want to get a clear idea about how git works, particularly if you have never used a version control system before, we warmly suggest that you take the git-novice tutorial from Software Carpentry.

For those of you who are more experienced, the basic information about git follows here.

Version control

A basic version of a VCS would be to simply have a local database in which you store subsequent versions of the same file(s).

Versioning

The old idea: keep track of your changes

In practice, this often becomes just the same file saved with a growing list of suffixes: _v2, _v3, …, _final1, _final2, _final_final… Entropy definitely increases in a file system with time, as everywhere else.

The idea is to use a VCS like git to reduce the entropy as much as possible with the minimum energy input, i.e. the minimum effort on your side. Plus, you want to cooperate with people!

how it is done..

And how it is usually done…

One way to make this cooperative is to put the whole database (the “repository”) on a single server accessible by all contributors. This has some problems, though: i) if the server crashes, you cannot work; ii) if the server is lost, your work is lost as well; iii) lag times!

Distributed version control

To solve the issues above, git implements what is known as distributed version control. This means that a full copy of the repository, including the whole history of changes, is stored on the computer of each contributor as well as on the server.

how it is done..

  • Each contributor is on the same page.
  • Even if the server is done for, you can retrieve the data.
  • You can do a lot of work locally, and push things to the remote repository only at a later time.

How git saves your project

Git creates a mini file system, saving “snapshots” of your project directory each time you “commit” your changes. Each snapshot stores all the data in the directory. To save space, git replaces unchanged files with links to their latest stored version.

how git works

Git version control. Unchanged files (links) are represented with dashed contours.
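The snapshot idea can be sketched with a toy model in Python. This is not git's real storage format: the class ToyRepo and the function blob_id are invented here for illustration, and the only assumption is that a file version is identified by a hash of its content, so that an unchanged file is stored once and later snapshots simply link to it.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash of the file content, used as the key of a stored file version."""
    return hashlib.sha1(content).hexdigest()


class ToyRepo:
    """Hypothetical, heavily simplified model of a snapshot store."""

    def __init__(self):
        self.objects = {}    # blob id -> file content (each version stored once)
        self.snapshots = []  # each snapshot maps file name -> blob id

    def commit(self, working_dir: dict) -> dict:
        snapshot = {}
        for name, content in working_dir.items():
            key = blob_id(content)
            # An unchanged file hashes to an existing key: nothing new is stored,
            # the snapshot only records a link to the old version.
            self.objects.setdefault(key, content)
            snapshot[name] = key
        self.snapshots.append(snapshot)
        return snapshot


repo = ToyRepo()
repo.commit({"README": b"hello", "run.sh": b"echo 1"})
repo.commit({"README": b"hello", "run.sh": b"echo 2"})  # only run.sh changed
print(len(repo.snapshots), "snapshots,", len(repo.objects), "stored file versions")
```

Two snapshots of two files each end up using three stored file versions, because README is shared between them.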

Strong points of git

  1. All operations are local -> fast!
  2. It is possible to perform large changes without messing up everything.
  3. It is impossible to change something on the repository without git knowing it!
  4. Snapshots are additive: no data is lost (of course you can make it useless, but not lose it).

Git three states

To allow git to do its job, you need to tell it what changes you want to push to the repository!

You need to tell git what changes you want to save.

To give you flexibility in deciding what to save and what not to save, git implements three “areas”, which correspond to four possible file states.

| area | file state | description |
|---|---|---|
| Working directory | untracked | Files unknown to git. |
| Working directory | modified | Files known to git that have been modified on your pc, not yet staged to enter a snapshot. |
| Staging area | staged | Files whose changes will enter the next snapshot to be committed. |
| Local repository (.git directory) | committed | Latest local snapshot of your project. |

how git works

The staging area stores incremental changes before you commit them to a snapshot.

Note the arrow back from the repository. It means that you can retrieve changes from your local repository to correct an error, or go back to a previous version of a file and experiment with a different change.

The figure above only considers files that are already known to git. You can add untracked files to the staging area with git add. A typical situation looks like this:

how git works

The lifecycle of a file in git. commit adds the staged file to the local repository. After that, there is no difference between the copy in the snapshot and your local copy, which is therefore marked as unmodified.
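The lifecycle can also be read as a small state machine. The Python sketch below is a deliberate simplification: the names TRANSITIONS and next_state are made up for this example, and real git can, for instance, hold a file that is staged and modified at the same time, which is ignored here.

```python
# (current state, action) -> new state; any other combination leaves the state unchanged
TRANSITIONS = {
    ("untracked", "add"): "staged",       # git add on a new file
    ("modified", "add"): "staged",        # git add on a changed file
    ("staged", "commit"): "unmodified",   # git commit stores the snapshot
    ("unmodified", "edit"): "modified",   # you edit the file on disk
}


def next_state(state: str, action: str) -> str:
    """Return the state of a file after an action (simplified model)."""
    return TRANSITIONS.get((state, action), state)


state = "untracked"
history = [state]
for action in ["add", "commit", "edit", "add", "commit"]:
    state = next_state(state, action)
    history.append(state)
print(" -> ".join(history))
```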

You can see the status of various files whenever you wish, by writing git status. Here is an example from the git-project we use to prepare your lectures.

how git works

a practical example: a step during the preparation of this lecture

There is then the remote repository which is shared with others; to this you can push only snapshots that are in your local repository.

Using branches to collaborate effectively

One of the nice things about version control, and modern VCSs in particular, is that they can manage co-editing: when two people modify the same file, the VCS is usually able to merge the changes from both of them seamlessly. That is, until both try to modify the same line in a file. That results in a conflict.

how git works

A and B both modified the same region in the file. Which one should be picked?

Once you encounter a conflict (git will tell you, trust me) you will have to tell git yourself which version to pick. In the affected file(s) you will find regions marked as follows:

<<<<<<< HEAD:file.txt
Hello world
=======
Goodbye
>>>>>>> 77976da35.....:file.txt

Here, everything between <<<<<<< HEAD and ======= is your local copy, and everything between ======= and >>>>>>> 77976da... is their copy, the one you pulled from the remote repo. You have to pick one of the two by removing the other one AND the markers.
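To make the structure of the markers concrete, here is a small Python sketch that keeps one side of every conflict region and drops the markers. The function resolve_conflicts is a hypothetical helper written for this tutorial, not a git command; in practice you would normally edit the file by hand, because the right answer is often a mix of both versions.

```python
def resolve_conflicts(text: str, keep: str = "ours") -> str:
    """Keep one side ('ours' or 'theirs') of each conflict region, drop the markers."""
    kept_lines = []
    side = None  # None: outside a conflict, 'ours': after <<<<<<<, 'theirs': after =======
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            side = "ours"
        elif line.startswith("=======") and side == "ours":
            side = "theirs"
        elif line.startswith(">>>>>>>") and side == "theirs":
            side = None
        elif side is None or side == keep:
            kept_lines.append(line)
    return "\n".join(kept_lines)


conflicted = "\n".join([
    "intro",
    "<<<<<<< HEAD:file.txt",
    "Hello world",
    "=======",
    "Goodbye",
    ">>>>>>> 77976da35.....:file.txt",
    "outro",
])
print(resolve_conflicts(conflicted, keep="ours"))
print(resolve_conflicts(conflicted, keep="theirs"))
```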

How to avoid conflicts in the first place?

By using branches.

Git has the wonderful characteristic that it makes it extremely easy to create a new snapshot of your data, called a branch, which is independent from the others. Consider it as a second directory for your project, in which you can play without modifying what was stored in the first one, with the added benefit that git still keeps track of all changes between your “branched directory” and the “master directory”.

how git works

adapted from here

How should one structure branches?

Branches are effective if you use them in a structured way. A successful git workflow was introduced by Vincent Driessen in 2010 and has been widely adopted. You can find it explained in detail here.

how git works

A successful git branching model (https://nvie.com/posts/a-successful-git-branching-model/)

There is another trick to it then, but we will see it later..

Summary of git basic commands.

| command | effect | example |
|---|---|---|
| git help | Get help on git commands! | git help [command] |
| git clone | Clone a remote repository into the current directory | git clone <git@> |
| git init | Initialize the current directory as a repo | git init |
| git add | Add one or more files to the staging area | git add file1 file2 file3.. |
| git commit | Add the files in the staging area to the local repository | git commit [-m "commit message..."] |
| git checkout | Recover a file from the repo or switch branches | git checkout myfile  # get file "myfile" |
| | | git checkout mybranch  # switch to branch "mybranch" |
| | | git checkout -b newbranch  # create new branch "newbranch" |
| git rm | Remove files from the repository (same options as normal rm) | git rm file or git rm -r dir |
| git mv | Move files inside the repository (same as normal mv) | git mv fileA fileB |
| git pull | Integrate changes from a remote repository into the local branch | git pull |
| git push | Push local changes to a remote repository | git push |

Registering on gitlab + installing git on your machine.

We have set up a virtual machine to host gitlab at the physics department. You can organize yourselves to use other hosts, but we strongly recommend using this one instead.

1. Register on our internal gitlab host.

Go to the SBP gitlab host and register using your unitn email and name.surname as user account.

Gitlab welcome page

2. I will add you to the group cbp2020-2021.

3. Create a project on gitlab

We will see how to create a project from gitlab. Create only ONE project per group. It helps to give sensible names to projects, e.g. SBSD-<mutation>, where <mutation> is replaced by the code of the mutation you will investigate.

Linking the project

Link the project you created to the project folder on hpc2 (and on your laptop: you can create the folder there too!!). Normally you can follow the instructions on gitlab, but in your case you already have files written in the template folder, so we need to be sure to avoid conflicts. This is a bit of an issue, since we should have introduced git before the template folder. We will follow these steps:

  1. One person in your group will create the repo, from the empty template folder.
  2. Both members of the group will copy what is relevant into the newly created folder, pushing to different branches.
  3. We will try to merge your changes to a common branch, and see what happens..

1 linking and initializing the repo

First let git know who you are

git config --global user.name "name surname"
git config --global user.email "name.surname@studenti.unitn.it"

Initializing the remote repo. ONE person only (example for group1)

cd <some_directory_I_like>
cookiecutter https://gitlab.physics.unitn.it/sbp/reproducible-comp-phys-template.git

Insert all the data requested, and check that the repo address is correct: [https://gitlab.physics.unitn.it].

Then cd into the project directory and either do the following steps by hand

# create the master branch
git init
git remote add origin git@gitlab.physics.unitn.it:<your_name>/<project-slug>.git # note the ":" after the host; substitute the placeholders!!
git add .
git commit -m "Initial commit"
git push -u origin master

Or simply run git_initialize.sh from the bin subdirectory.

The second member of the group will have to choose an empty folder and clone the project. Then she/he will be able to add their data.

Note: you need to upload the public ssh key of your computer to gitlab in order to use it, and you need to add the other user to your repository!

2 Adding your data

First of all, only add text files to git. We do not want to store simulation results here!! I repeat: no simulation trajectories and no log files on git.

To add changes, you will want to avoid conflicts if possible. We will try to do that by using branches. After you copied your data, go to the directory in which the data is stored and run the following.

git checkout -b your_name # create a branch named after you (substitute your_name appropriately.. :)
git add .
git status # what is git status showing?
git commit -m "write a proper message here!" # do it!
git push -u origin your_name # again, substitute appropriately..

The above commands will switch git from the master branch to a branch named after you, where you can play without the fear of losing something important. You can manage branches with the following commands:

git branch  # show all local branches
git branch -a # show all branches, including remote ones
git checkout -b branch_name # create branch 'branch_name'
git checkout branch_name # switch to branch 'branch_name'
git branch -D branch_name # DELETE branch 'branch_name' locally

3 Collaborating effectively

Now go to gitlab and reload the project page; we will work on that.

Let’s take a look at our repository. On the left bar, click repository, branches. You should see three branches: two personal ones and master. Now, click on new branch to create a new one. Of course, just one of you should do it. Call the new branch develop and select create from master.

mutations

An example screen..

Now that you have a develop branch, use the interface to merge one of your branches into develop. Go to merge requests and choose develop as the target instead of master. Complete the whole procedure and describe your merge request properly.

Finally, we want to merge the second personal branch (let’s say pippo, if the first was ciccio) into develop. The contents of ciccio are now already in develop. Now, do you remember all our discussions about avoiding conflicts? This is the extra trick. Pippo will have to do the following before submitting a merge request to develop:

git fetch # download the latest remote changes, without merging them
git checkout develop
git pull # make sure your local develop is up to date
git checkout pippo # again, use the correct name..
git merge develop # !!! this will merge develop INTO pippo
# solve any conflict that should arise, then stage the fixed files with git add
git status
git commit -m "merged develop into pippo" # only needed if the merge stopped on a conflict
git push

Only after doing this can Pippo follow the same procedure as above to merge into develop from the web interface. Why? Because this way Pippo, who knows best what changes he introduced, can quickly solve any possible issues before opening a merge request, which might otherwise run into larger conflicts (e.g. if something changed on develop while he was working).

Remember the following golden rules:

  • Create a branch for any change you want to implement.
  • Do commit frequently.
  • Merge from the starting branch (should be develop) back to your branch before pushing a merge request.
  • Comment commits and merge requests exhaustively.
  • Link branch names to tickets in gitlab…

keep track of things with gitlab.

Gitlab supports among other things the creation of issues. These are wonderful to organize your work, particularly when you have to collaborate. Even better if you start calling the branches by referencing the issue number..

issues

An example screen from our course preparation

A quick intro to graph theory.

Graph theory is a rich and interesting branch of mathematics with applications ranging from electrical circuits, to topology, to network optimizations and analysis of protein structures.

We will cover only some basic notions of graph theory, in order to better understand pyinteraph and other research papers (and software) applying graph and network concepts to protein structures.

Definition

A graph $G=G(V;E)$ consists of a set of vertices $V$ and a set of edges $E$. Two vertices $v_i$ and $v_j$ are said to be adjacent if there is an edge $e_{ij}$ connecting them. In the same way, two edges $e_{ij}$ and $e_{jk}$ are said to be adjacent if they have a vertex in common.

graph definitions

Basic concepts from graph theory. a) Schematic representations of vertices and edges. b) A digraph, or directed graph. Those are most often used to represent electric circuits. c) Weighted graphs have different numerical values attached to their vertices and edges. d) In complete graphs (cliques), each vertex is connected to all other vertices.

There are several further features one can consider in graphs. One can put arrows on the edges to indicate a flow, as in electric networks or chemical reactions; one can consider multiple edges between vertices (multigraphs); or one can give different weights to different vertices and edges. This last option is particularly important in Physics (and Chemistry!), where we want to know not only that a connection exists between two entities, but also its strength or probability. For example, graph c) could indicate a molecule: the vertices are the atoms and the edges the bonds. Their respective weights could be set to indicate different chemical properties.

From our definition of a graph, it follows immediately that any graph $G(V;E)$ has subgraphs, $\{ G'(V';E') \mid V'\subseteq V,\quad E'\subseteq E \}$.

graph definitions

a) An example graph and, b) some of its subgraphs.

Roughly, how many different subgraphs can be identified in a given graph?

For any two vertices $v_i$ and $v_j$ in a graph, we say that there is a path connecting them if there exists a sequence of adjacent edges $\{e_{il}e_{la}e_{ab}\ldots e_{mp}e_{pj}\}$. Of course this is not always the case, and two vertices or even two portions of a graph can be disconnected, i.e. not joined by any path. A (connected) component of $G$ is a subgraph $G'$ of $G$ such that all of its vertices are connected by at least one path and there are no paths between any of them and another component of $G$. Components are frequently referred to as clusters in physics.

graph definitions

Graphs with different connected components. In both cases the diameter of the largest component is highlighted in yellow. In b) the removal of a single edge (a “bridge”) causes the appearance of a third component.
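Connected components can be found with a breadth-first search. The sketch below is a minimal Python version; the function name connected_components and the small example graph (two triangles joined by a bridge, plus one isolated vertex) are made up here and are not the graph shown in the figure.

```python
from collections import deque


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search: collect everything reachable from 'start'.
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        components.append(component)
    return components


vertices = [1, 2, 3, 4, 5, 6, 7]
bridge = (3, 4)
edges = [(1, 2), (2, 3), (3, 1), bridge, (4, 5), (5, 6), (6, 4)]

with_bridge = connected_components(vertices, edges)
without_bridge = connected_components(vertices, [e for e in edges if e != bridge])
print(len(with_bridge), "components with the bridge")
print(len(without_bridge), "components without the bridge")
```

Removing the single edge (3, 4) splits the largest component in two, which is the behaviour described for the bridge in the caption.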

By removing edges it is possible to split a graph into a set of connected components. This is particularly relevant in applications related to the analysis of protein structures, where the graphs are weighted since they represent interactions. Introducing a threshold on the weights then splits the graph into different connected components, depending on the value of the threshold.

Can you give an everyday-life example of a weighted network which gets split into different components due to a threshold?

Some simple properties

There are several measures that can be defined for graphs. We will limit ourselves to the simplest ones: degree of a vertex, distance between two vertices, eccentricity of a vertex, and center of a graph.

The degree of a vertex, $D(v)$ is simply the number of edges incident to it. Vertices with a very high degree are often called ‘hubs’ and they are very important when studying the statistical properties of networks.

The distance between two vertices $v_i$ and $v_j$ is the length (number of edges) of the shortest path connecting $v_i$ and $v_j$, if it exists, or $\infty$ if $v_i$ and $v_j$ are disconnected.

Given a graph $G(V;E)$, the eccentricity of a vertex $v_i$ is defined as \(E(v_i) = \max_{v_j \in V} d(v_i,v_j).\)

The maximum eccentricity is the diameter of the graph: \(Diam(G) = \max_{v_i, v_j \in V} d(v_i,v_j).\)

The vertex $v_c$ with minimal eccentricity is defined to be the center of the graph.

graph definitions

a) A path between two vertices. b) Distance between the same two vertices. c) Eccentricity of the graph. d) Center of the graph.
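The measures defined above can be computed with a few lines of Python. This is a minimal sketch on a made-up five-vertex tree (not the graph in the figure), and the helper names make_graph, distances_from and eccentricity are chosen here for illustration. Distances are obtained by breadth-first search, which gives shortest paths when every edge counts as one step. Note that more than one vertex can have minimal eccentricity, so the code returns a list of centers.

```python
from collections import deque


def make_graph(edges):
    """Undirected graph stored as {vertex: set of neighbours}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def distances_from(graph, source):
    """Number of edges on the shortest path from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricity(graph, vertex):
    """Largest distance from vertex to any other vertex (infinite if disconnected)."""
    dist = distances_from(graph, vertex)
    if len(dist) < len(graph):
        return float("inf")
    return max(dist.values())


graph = make_graph([("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")])

degree = {v: len(graph[v]) for v in graph}
ecc = {v: eccentricity(graph, v) for v in graph}
diameter = max(ecc.values())
center = sorted(v for v in ecc if ecc[v] == min(ecc.values()))

print("degrees:", degree)
print("eccentricities:", ecc)
print("diameter:", diameter, "center:", center)
```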

Isomorphism

Graphs are topological objects. This means that the way we draw the diagram does not change the graph, which is identified just by a set, $V$, of vertices and a set, $E$, of edges connecting (some of) them.

Two graphs with the same diagram but different labels are considered distinct. Nonetheless, it is clear that there must be a relation between them. This relation is an isomorphism. Two graphs $G_\alpha(V;E)$ and $G_\beta(W;F)$ are said to be isomorphic if there exists a bijective function between $V$ and $W$ that maps the edges in $E$ onto the edges in $F$, so that the structure of the graphs is maintained.

In the figure below, the graphs a) and b) are isomorphic, while the graphs c) and d) are not.

graph definitions

a) and b) are isomorphic. c) and d) are not.
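For very small graphs the definition can be checked directly by trying every possible relabelling. The Python sketch below does exactly that; the function are_isomorphic and the example graphs are invented for this illustration and are not the ones in the figure. The number of relabellings grows as n!, so this brute-force approach is only meant to make the definition concrete, not to be used on real networks.

```python
from itertools import permutations


def are_isomorphic(vertices_a, edges_a, vertices_b, edges_b):
    """Brute-force isomorphism test for small undirected graphs."""
    set_a = {frozenset(e) for e in edges_a}
    set_b = {frozenset(e) for e in edges_b}
    if len(vertices_a) != len(vertices_b) or len(set_a) != len(set_b):
        return False
    for perm in permutations(vertices_b):
        mapping = dict(zip(vertices_a, perm))  # candidate bijection V -> W
        mapped = {frozenset(mapping[v] for v in edge) for edge in set_a}
        if mapped == set_b:  # every edge is mapped onto an edge
            return True
    return False


# Same path graph with different labels: isomorphic.
path_1 = ([1, 2, 3], [(1, 2), (2, 3)])
path_2 = (["x", "y", "z"], [("x", "z"), ("z", "y")])

# Four vertices and four edges each, but different structure: not isomorphic.
triangle_with_tail = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)])
square = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])

print(are_isomorphic(*path_1, *path_2))
print(are_isomorphic(*triangle_with_tail, *square))
```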

Mapping graphs to matrices

Graphs can be mapped easily to matrices. One of the most used is the Adjacency matrix, whose elements are defined as in the figure below. This matrix is often called contact matrix in protein physics.

graph definitions

Two graphs and their corresponding adjacency matrices.

Adjacency matrices are not the only matrices relevant in graph theory, but they highlight an important principle: for undirected graphs, the matrices are symmetric. This means that they can be diagonalized, and their spectra hold information about the graph which is independent of the labelling of the vertices. Among other things, spectral analysis is used to help identify isomorphic graphs and subgraphs, something very useful in the analysis of protein structure and the comparison of different protein folds.
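As a small illustration, the Python sketch below builds the adjacency matrix of a made-up graph (entry 1 if two vertices share an edge, 0 otherwise, which we assume matches the definition in the figure), checks that it is symmetric, and shows a label-independent quantity. To stay within the standard library it does not compute eigenvalues directly: it uses the traces of the powers of $A$, which satisfy $\mathrm{tr}(A^k)=\sum_i \lambda_i^k$ and therefore depend only on the spectrum. Equal traces are a necessary condition for two graphs to be isomorphic, not a sufficient one. All function names are chosen here for illustration.

```python
def adjacency_matrix(vertices, edges):
    """Adjacency matrix of an undirected graph, rows/columns in the order of 'vertices'."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[index[u]][index[v]] = 1
        A[index[v]][index[u]] = 1  # undirected graph: symmetric matrix
    return A


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def power_traces(A, max_power):
    """[tr(A), tr(A^2), ..., tr(A^max_power)]: sums of powers of the eigenvalues."""
    traces, P = [], A
    for _ in range(max_power):
        traces.append(sum(P[i][i] for i in range(len(A))))
        P = matmul(P, A)
    return traces


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
A1 = adjacency_matrix(["a", "b", "c", "d"], edges)
A2 = adjacency_matrix(["d", "b", "a", "c"], edges)  # same graph, vertices relabelled

is_symmetric = all(A1[i][j] == A1[j][i] for i in range(4) for j in range(4))
print("symmetric:", is_symmetric)
print("matrices differ:", A1 != A2)
print("traces:", power_traces(A1, 4), power_traces(A2, 4))
```

The two matrices are different, but their traces coincide. As a sanity check, $\mathrm{tr}(A^2)$ equals twice the number of edges and $\mathrm{tr}(A^3)$ equals six times the number of triangles.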

Pyinteraph

Pyinteraph is a software package, written in python and cython, which identifies interaction networks, i.e. graphs in which the vertices are atoms or amino acids and the edges are different types of interactions. The graphs identified are weighted, with the weight of each edge proportional to the persistence of a given interaction in a thermodynamic ensemble.

Pyinteraph considers three types of interactions: hydrophobic contacts, salt bridges, and H-bonds. They are identified according to the following rules.

The details below are taken from the original article.

Hydrophobic contacts

For hydrophobic contacts, the interaction between two hydrophobic residues is included if the centers of mass of their side chains are found within 5 Å of each other (default value).

Hydrophobic interaction

(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) Hydrophobic interactions network.

Salt bridges

For salt bridges, all the distances between atom pairs belonging to two “charged groups” of two different residues are calculated, and the charged groups are considered as interacting if at least one pair of atoms is found at a distance shorter than 4.5 Å.

Salt bridges

(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) Salt bridges network. Edges thickness is proportional to their strength (or “persistence”), hubs are colored in yellow.

H-Bonds

An H-bond is identified when the distance between the acceptor atom and the hydrogen atom is lower than 3.5 Å and the donor-hydrogen-acceptor angle is greater than 120°. These default parameters can be modified by the user. As a default, both side chain and main chain H-bonds are included.

Hydrogen bonds

(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) Hydrogen bonds network. Bond thickness is proportional to their weight (“persistence”); a set of very persistent ones are reported in the right panel.
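The three geometric criteria can be written down as simple tests on atomic coordinates. The Python sketch below is only an illustration of the rules quoted above for a single frame: the function names are invented here and are not the pyinteraph (or pyfferaph) API, coordinates are plain (x, y, z) tuples in Å, and the example numbers are made up. It needs Python 3.8 or newer for math.dist.

```python
import math


def center_of_mass(coords, masses):
    total = sum(masses)
    return tuple(sum(m * c[i] for c, m in zip(coords, masses)) / total
                 for i in range(3))


def hydrophobic_contact(coords_a, masses_a, coords_b, masses_b, cutoff=5.0):
    """Side-chain centers of mass within 'cutoff' Å of each other."""
    com_a = center_of_mass(coords_a, masses_a)
    com_b = center_of_mass(coords_b, masses_b)
    return math.dist(com_a, com_b) <= cutoff


def salt_bridge(group_a, group_b, cutoff=4.5):
    """At least one atom pair of the two charged groups closer than 'cutoff' Å."""
    return any(math.dist(a, b) < cutoff for a in group_a for b in group_b)


def angle_deg(p1, vertex, p2):
    """Angle p1-vertex-p2 in degrees."""
    v1 = [a - b for a, b in zip(p1, vertex)]
    v2 = [a - b for a, b in zip(p2, vertex)]
    cos = sum(x * y for x, y in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def h_bond(donor, hydrogen, acceptor, max_dist=3.5, min_angle=120.0):
    """Hydrogen-acceptor distance below max_dist Å and D-H-A angle above min_angle degrees."""
    return (math.dist(hydrogen, acceptor) < max_dist
            and angle_deg(donor, hydrogen, acceptor) > min_angle)


# Made-up coordinates, for illustration only.
print(hydrophobic_contact([(0, 0, 0), (2, 0, 0)], [1, 1], [(5, 0, 0)], [12]))
print(salt_bridge([(0, 0, 0)], [(4, 0, 0), (10, 0, 0)]))
print(h_bond((0, 0, 0), (1, 0, 0), (3, 0, 0)))
```

Applying such tests to every frame of a trajectory is what produces the per-frame contact lists from which the persistence of each interaction is computed.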

Pyinteraph workflow

How does pyinteraph identify the various interaction networks? It works as schematically reported below:

pyinteraph scheme

(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) How pyinteraph works: 0. Start from an MD trajectory. 1. Identify weighted adjacency matrices. 2. Filter them. 3. Join them into an unweighted matrix.

Starting from an MD trajectory, pyinteraph identifies for every frame the hydrophobic contacts, salt bridges and H-bonds, based on the relative distance and orientation of the relevant atoms. The output of this first step consists of three weighted graphs (i.e. three matrices), one per interaction type, in which the weight of each edge corresponds to the fraction of frames in which that interaction was active. The program (or the user) then identifies a critical persistence threshold that separates persistent clusters from rarer interactions. There is usually a sharp transition, as shown in the image below. As a second step, the program outputs three filtered matrices. Finally, it puts them together into an unweighted matrix representing the whole interaction network of the protein.
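The three steps of the workflow can be sketched in plain Python. This is only a schematic illustration: the functions persistence, filter_edges and combine, the residue labels and the per-frame contact lists are all made up, and the persistence is expressed here as a fraction between 0 and 1 (an assumption of this sketch, chosen to match the "fraction of frames" wording above).

```python
from collections import Counter


def persistence(frames):
    """Step 1: weight of each edge = fraction of frames in which the contact is present."""
    counts = Counter()
    for contacts in frames:
        # Each contact counts at most once per frame; (A, B) and (B, A) are the same edge.
        counts.update({frozenset(pair) for pair in contacts})
    return {pair: n / len(frames) for pair, n in counts.items()}


def filter_edges(weights, cutoff):
    """Step 2: keep only the edges whose persistence reaches the cutoff."""
    return {pair for pair, w in weights.items() if w >= cutoff}


def combine(*edge_sets):
    """Step 3: merge the filtered networks into one unweighted network."""
    return set().union(*edge_sets)


# Hypothetical per-frame contact lists for two interaction types (4 frames).
hydrophobic_frames = [[("A", "B"), ("B", "C")], [("A", "B")], [("A", "B")], [("B", "A")]]
salt_bridge_frames = [[("C", "D")], [("C", "D")], [], [("C", "D")]]

hydrophobic = persistence(hydrophobic_frames)
salt_bridges = persistence(salt_bridge_frames)
network = combine(filter_edges(hydrophobic, 0.5), filter_edges(salt_bridges, 0.5))

print(hydrophobic)
print(salt_bridges)
print(sorted(tuple(sorted(edge)) for edge in network))
```

With a cutoff of 0.5 the rare B-C contact is discarded, while the persistent A-B and C-D interactions end up in the combined network.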

pyinteraph scheme

(Image adapted from: Tiberti, M., et al (2014). Jour. chem. inf. mod., 54(5), 1537-1551.) Largest cluster size as a function of the persistence cutoff. The jump is the ‘critical persistence’ at which it makes sense to place the cut-off.
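A curve like the one in the figure can be obtained by scanning the cutoff and recording the size of the largest connected component each time. The Python sketch below does this on a tiny made-up weighted graph, using a union-find structure to group vertices; the function largest_cluster_size and all the numbers are hypothetical and only meant to show the idea.

```python
def largest_cluster_size(vertices, weights, cutoff):
    """Size of the largest connected component using only edges with weight >= cutoff."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links up to the root of the cluster (with path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (u, v), w in weights.items():
        if w >= cutoff:
            parent[find(u)] = find(v)  # merge the two clusters

    sizes = {}
    for v in vertices:
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


# Hypothetical persistence values for a chain of five residues.
vertices = ["A", "B", "C", "D", "E"]
weights = {("A", "B"): 0.9, ("B", "C"): 0.8, ("C", "D"): 0.3, ("D", "E"): 0.85}

cutoffs = [0.2, 0.5, 0.95]
scan = [largest_cluster_size(vertices, weights, c) for c in cutoffs]
for cutoff, size in zip(cutoffs, scan):
    print(f"cutoff {cutoff:.2f}: largest cluster has {size} vertices")
```

In this toy example the largest cluster shrinks as soon as the cutoff exceeds the weight of the weakest edge; on real data one looks for the cutoff at which the size drops sharply.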

Assumptions

Pyinteraph works on the following assumptions.

  1. No PBC (if present, they must be removed first).
  2. No large transitions (e.g. no rare events). The trajectory is assumed to be sampling a relatively local free energy basin.

Installing pyinteraph

Pyinteraph is currently distributed as a python 2.7 package, although the authors are working on a python3 version. This raises a few problems, since python 2.7 will be phased out shortly and python3 is not backward compatible. As Python 2.7 packages are still an issue in Academia, we left the details on how to install the original package here, for those of you that are interested.

Installing the UniTN Python3 version

Gianfranco Abrusci and Alberto Borsatto from UniTN did however build a python-3 compatible version of it, pyfferaph, which we are going to use. You can find it on gitlab: https://gitlab.com/recoverin/final_pyfferaph.

The instructions on the Readme are quite clear. To download it, use git clone (top right on the page for the link).

In order to run pyfferaph, you will need to have a separate conda environment, as the packages are different from the ones we used in the QCB environment. In particular pyfferaph uses a newer MDAnalysis version. From the cloned folder run:

conda env create -f environment.yml

or, if you are on Windows, use Anaconda Navigator to create an environment based on that .yml file.

At least on Linux, you will also have to tell jupyter to use the correct kernel. First you need to install it with the following lines:

conda activate pyfferaph
python -m ipykernel install --user --name pyfferaph --display-name "Pyfferaph (py3.8)"

Then, when you run the notebook check the kernel on the top-right corner. If it is a different one, change it using the menu.

Using Pyfferaph

To use pyfferaph, simply copy and modify to your needs the example_pyff.ipynb file that you can find in the main folder.

Some important notes:

  1. As pyfferaph is not installed as a package (yet), you will always have to use it from this directory (OK, there are other ways, but let's keep it simple).
  2. You need to activate the pyfferaph environment.
  3. Do not push to this repo.

Exercise

Would you be able to apply pyfferaph to a different trajectory?

As an example take the well-known Adenylate Kinase trajectory that has already been used in tutorial2 and tutorial8.