## Tutorial 9: Statistical Analyses

In this tutorial we will see some more advanced techniques both to establish whether a trajectory is sampling at equilibrium and to extract information from it.

- Principal Component Analysis
- Clustering

First, we we will cover them in detail in the first part of the lecture, and apply them in a second time.

## Unsupervised learning

Machine Learning has become increasingly fashionable over the last years, so it
is interesting to observe how both *principal component analysis* and *clustering*
fit into this field. Hopefully, this will help you find more literature for your
own interest and your scientific career.

Machine Learning focuses on extracting patterns from data. In that sense, the
machine learns to recognize something in a dataset and *generalizes* its conclusion
to a different dataset. This two steps are called *fit* and *prediction*. Of course,
if the pattern can not be generalized, it is hard to say we learned something.

ML can be divided into two general subjects

**Supervised ML**-> you want to learn how to assign a label**Unsupervised ML**-> you want to recognize generic patterns, no labels involved.

*image adapted fron Abu-Moustafa, Learning from Data*

In the example above, on the left we have some labels (the colors) which can
help you learn how to separate the data, and will be hopefully generalible. On
the right, there are no labels, but our eyes tell us that there clearly are
four groups of data. Our brain identified a pattern and decided that group of
points which are close to each other must have something in common. That is an
example of **clustering**.

Clustering can be extremely important in Molecular Dynamics as well, as the system might e.g. jump between different states and you migth want to identify them in a clear way. Or perhaps, you might want to identify the rigid and mobile portions of a protein, subdividing its structure according to their dynamical properties. There are lots of examples one can think of, so, for starting, it is useful to get some basic understanding of how simple clustering algorithms work.

## Clustering

**Problem** we want to divide data in clusters of points which have something in
common. We do not know a priori how many clusters are there, nor what it is exactly
that the point have in common (no label!). How do we proceed?

Let us consider two intuitive approaches first. We will then make them a bit more formal.

# Close points have something in common

Consider the image below: we can simply identify 4 clusters of points. Our eyes
tell us that points which are closer to the center of one of the cluster than to
any other center must be part of that cluster. **Perhaps we can use a distance!**

### Dense regions highlight some common characteristic

Consider the image below: our eyes tell us that clusters are characterized by a
high density of points, with pretty empty regions in between. **Perhaps we can
use a density!**

### Where intuition fails..

The images above gave you the idea that the two methods might be equivalent.. Well that is the case for well behaved sets, in which the cluster are dense and you can draw a plane between each pair of them. In general, things are more complex, as you can see below.

Let us see in more details two common strategies based on distances and density.

# Clustering with K-means

K-means formalizes our first approach: a cluster is defined by the distance from a center. Of course, we do not know where the center are, or how many centers are there! one thing at a time though.

### Assigning a fixed number of clusters.

Suppose we know the number of clusters already, and we simply want to identify them.

- We assume we have clusters.
- We need to identify the centers of each one.
- We need to assign each point to a different center.

This assignment can be obtained iteratively, by proposing a center for each cluster, and assigning the points to different clusters based on their distance from the center. At the end of the procedure, we want all points to be assigned to the nearest cluster center. We can obtain this as follow.

First, we define a function

to minimize with respect to

and , center of the cluster.

To minimize it, we proceed iteratively.

- We propose centers in some random way.
- For each value of , we minimize , by setting it to “point” to the nearest center .
- We minimize J wrt to for each , by equalling it to the center of mass of the points assigned to cluster .

An example of how this proceeds for two clusters is reported below.
*image adapted from C. Bishop, Pattern Recognition and Machine Learning*

### Deciding the optimal number of clusters

There is nothing preventing you from diving points into clusters. The function will be exactly 0 at that point, but you will have gained no information from your clustering procedure.

In fact, decreases monotonically with the number of clusters to assign.
When do we stop then? A common strategy consists in identifying an elbow in the
function . After a certain number of cluster is introduced, any
increase in leads to only a small decrease in . We pick the value of
immediately the slowdown.
*image adapted from C. Bishop, Pattern Recognition and Machine Learning*

### Other similarity measures

There is nothing forcing you to use the euclidean distance to identify the clusters. In fact you could use a different measure of similarity, let call it . The algorithm works as well, provided you redefine :

Of course, must respect a couple of constraints

- ,
- if

# Agglomerative Clustering

We saw how Kmeans can be used as a top-down approach, even using different kinds
of distances. Now we introduce a complementary approach, based on the idea of
lumping together observations in a **bottom-up** fashion. At first, each
data point is a cluster on its own, but at the following step the two closest
points (based on a certain metric) are merged together. Then, we have to
*locate* the new cluster or, in other words, it is necessary to define the
distance between it and the other data points (**linkage criterion**).

Indeed, the linkage criterion is fundamental, since it defines how clusters are merged in the tree.

Prominent examples are:

*Single linkage*: the distance between two clusters is given by the minimum of the distances between its elements

*Average Linkage*: the distance is given by the average of all the inter-clusters distances. Each cluster element contributes to assess the position of the cluster with respect to the other:

Once the first two points are merged and the new cluster is defined (computing its distance with respect to the others), this new entity can be treated as another data point, so we can
find once again which are the two closest points among those that survided. By
doing this we build a **hierarchy of clusters**, a tree (**dendrogram**), that
goes from the bottom (each point is a cluster) to the root (a unique cluster
with all the points).

# DBScan

A different approach, which can capture curved clusters as well, relies on
estimating the density. Here the main problem is that we do not have the *real*
density of the system and we must find a way to define it locally. When we have
a local definition, we then need to construct a cluster from it somehow. There are
several algorithms based on density estimation, some of which designed by
biophysicists who
applied their work on biomolecules to image recognition. Here we will simply
describe at a high level how a standard one, DBSCAN, works.

#### Local density estimator

- For each point, in our set, we consider the set of points within
a certain small distance, from it. Those are the
*neighbours*of . This is called a “ball”. - We estimate the local density as , where is the volume of a ball of radius . If is the euclidean distance, then . indicates the number of element of a set.

#### Clustering with local density (DBSCAN)

**Basic idea:**Clusters have density .- established by fixing two hyperparameters: and the minimum number of points within a ball in a cluster, .
- points which have neighbours are called
*internal*. - points which are within of an
*internal point*are called*directly reachable*. - A point for which there exists a sequence of directly reachable points from an internal point , is called .

DBSCAN proceeds as follows.

- Identify all internal points
- Identify all sets of internal points which are directly reachable from another point, (connected component, see image).
- To each of this set, add the reachable points (non internal).

# Exercise

Now we will show you a practical application of a clustering technique applied to Molecular Dynamics trajectories. You can download the zipfile with the material and the python notebook here.

Unzip the folder and start the python notebook as usual. Remember to activate your QCB conda environment before starting the notebook.

## Dimensional reduction

Data from our simulations live in a very high dimensional space. You have atoms, and therefore trajectories. In the system you are studying, including water you have about 40 000 atoms. When you think about it, that means that your system dynamics happens in dimensions. Even considering only the equilibrium properties, those are described in principle by points in a dimensional space.

Luckily for us, the important properties of our system are dictated by a far lower number of degrees of freedom. We know we can ignore water in several cases, or at least the bulk of it. Furthermore, atoms in proteins are often rigidly bonded (e.g. the peptidic bond) so that, for example, we can describe the whole backbone conformation of a protein with residues with variables, the and angles.

What happens in the simulation of biomolecules is a particular case of a more general reality, i.e. that phenomena can often be described with a reduced number of variables once we identify what is the shape of the subspace they live in (for the physicists here: working on those subspaces, or manifolds, is the whole point of Lagrangian mechanics).

*An example of data “living” on a curved subspace, i.e. a manifold*

Dimensional reduction of the data relative to phenomena is at the basis of science, and we will see it again in the rest of the course when we talk about coarse-grained simulations. Before that, let us see how some reduction can be achieved in simple cases.

# Principal Component Analysis

PCA or Principal Component Analysis is a technique to identify those directions in a high dimensional space that capture most information about the data under investigation.

Let see an example. Consider the motion of a ball of mass *m* attached to a
spring in 3 dimensions. We record it with 3 cameras, which are not aligned to
the oscillation plane of the ball.

*Image adapted from J. Shlens, ArXiv 1404.1100*

where is a random noise. The three cameras record a point
, on their vision plane; we have a 6 dimensional
space. Nonetheless, we know in this case that most of the motion will be in 1
dimension, the rest is just noise.. **How can we proceed to identify the relevant 1D subspace?**

#### Signal to Noise Ratio

*Signal to Noise Ratio for direction .*

In simple cases, like the one in the example, we can use for the amount of “signal” in a given direction the variance of the data points projected on that direction. For the noise instead we consider the variance of the data points on the orthogonal space. Clearly, if we find a direction that maximizes this ratio, we can use it to capture most of diversity observed in our data.

#### Covariance matrix

The approach above can be aptly generalized by using the *Covariance matrix* .

Each element of the covariance matrix capture the covariance between two variables, i.e. two components of our vectors. For example, for 4 dimensions:

As you know, if the covariance of two random variable is zero, and are statistically independent, and , variance of the variable . The matrix therefore has the following properties

- The diagonal capture the variance of the components of our vector
- Clearly,
- Since the matrix is simmetric, we can diagonalize it.

Intuitively, when we diagonalize it we obtain an **orthnormal base** of vectors
which all have zero covariance among them:

The eigenvalues are now the **variances** captured by these directions.
If we sort them from highest to lowest, we obtain a scale from the direction with
highest SNR, to the one with lowest SNR.

In short:

**The eigenvectors of the Covariance matrix are our Principal Components**- They are sorted from higher to lowest SNR.
- One can consider only the first eigenvectors.

tells us how much information we captured.

**Some caveat**

- The main limitation of PCA is that it works with
**linear transformations**. If our data live on a curved manifold, PCA will not be able to describe this manifold. - Of course, the principal components
**depend on the data**if your sample is not representative, your components will not be representative.

For those of you more interested in the derivation of PCA, please check
*Bishop, Pattern Recognition and Machine Learning*.

# Exercise

We prepared a notebook for you to try your hands on PCA of protein trajectories. The notebook is based on MDAnalysis, as usual. We encourage you to try your hands on with scikit-learn as well. After playing with the notebook, try to adapt it to the trajectory used in the clustering tutorial.

The material can be downloaded here. The total size of the folder is of about 500 Mb.