Tutorial 9: Statistical Analyses
In this tutorial we will see some more advanced techniques both to establish whether a trajectory is sampling at equilibrium and to extract information from it.
- Principal Component Analysis
- Clustering
First, we will cover these techniques in detail in the first part of the lecture, and then apply them in practice in the second part.
Unsupervised learning
Machine Learning has become increasingly fashionable over the last few years, so it is interesting to see how both principal component analysis and clustering fit into this field. Hopefully, this will help you find more literature for your own interests and your scientific career.
Machine Learning focuses on extracting patterns from data. In that sense, the machine learns to recognize something in a dataset and generalizes its conclusions to a different dataset. These two steps are called fit and prediction. Of course, if the pattern cannot be generalized, it is hard to say we learned anything.
ML can be divided into two general categories
- Supervised ML -> you want to learn how to assign a label
- Unsupervised ML -> you want to recognize generic patterns, no labels involved.
supervised vs unsupervised learning, adapted from Y. Abu-Mostafa, Learning from data
In the example above, on the left we have labels (the colors) which can help us learn how to separate the data, and the separation will hopefully be generalizable. On the right, there are no labels, but our eyes tell us that there are clearly four groups of data. Our brain identified a pattern and decided that points which are close to each other must have something in common. That is an example of clustering.
Clustering can be extremely important in Molecular Dynamics as well: the system might, for example, jump between different states and you might want to identify them in a clear way. Or you might want to identify the rigid and mobile portions of a protein, subdividing its structure according to its dynamical properties. There are plenty of examples one can think of, so, to start, it is useful to get some basic understanding of how simple clustering algorithms work.
Clustering
Problem: we want to divide the data into clusters of points which have something in common. We do not know a priori how many clusters there are, nor what exactly the points have in common (no labels!). How do we proceed?
Let us consider two intuitive approaches first. We will then make them a bit more formal.
Close points have something in common
Consider the image below: we can easily identify 4 clusters of points. Our eyes tell us that points which are closer to the center of one of the clusters than to any other center must be part of that cluster. Perhaps we can use a distance!
clustering using a distance
Dense regions highlight some common characteristic
Consider the image below: our eyes tell us that clusters are characterized by a high density of points, with pretty empty regions in between. Perhaps we can use a density!
Clustering through density
Where intuition fails
The images above may have given you the idea that the two methods are equivalent. That is indeed the case for well-behaved sets, in which the clusters are dense and you can draw a plane between each pair of them. In general, things are more complex, as you can see below.
Comparing clustering algorithms, from scikit-learn
Let us see in more detail two common strategies, based on distances and on density respectively.
Clustering with K-means
K-means formalizes our first approach: a cluster is defined by the distance from a center. Of course, we do not know where the centers are, or how many there are! One thing at a time, though.
Assigning a fixed number of clusters.
Suppose we know the number of clusters already, and we simply want to identify them.
- We assume we have \(K\) clusters.
- We need to identify the centers of each one.
- We need to assign each point to one of the centers.
This assignment can be obtained iteratively, by proposing a center \(\vec{\mu}_k\) for each cluster and assigning the points to the clusters based on their distance from the centers. At the end of the procedure, we want every point to be assigned to the nearest cluster center. We can obtain this as follows.
First, we define a function
\[J = \sum_{k=1}^K\sum_{i=1}^n r_{ki} || \vec{x}_{i} - \vec{\mu}_k ||^2\]to minimize with respect to
\[r_{ki} = \left\{ \begin{array}{lr} 1 & : \vec{x}_i\qquad \textrm{is in cluster k}\\ 0 & : \qquad \textrm{otherwise} \end{array} \right.\]and \(\vec{\mu}_{k}\), center of the cluster.
To minimize it, we proceed iteratively.
- We propose \(K\) centers \(\vec{\mu}_k\) in some random way.
- For each value of \(i\), we minimize \(J\) with respect to \(r_{ki}\), by setting it to “point” to the nearest center \(\vec{\mu}_k\): \(r_{ki} = \left\{ \begin{array}{lr} 1 & : k = \textrm{argmin}_j ||\vec{x}_i-\vec{\mu}_j||^2\\ 0 & : \qquad \textrm{otherwise} \end{array} \right.\)
- We minimize \(J\) with respect to \(\vec{\mu}_k\) for each \(k\), by setting it equal to the center of mass (mean) of the points \(\vec{x}_i\) assigned to cluster \(k\).
An example of how this proceeds for two clusters is reported below.
image adapted from C. Bishop, Pattern recognition and machine learning
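To make the iteration concrete, here is a minimal sketch in Python using only numpy; the function name `kmeans` and the synthetic data are purely illustrative and not part of the tutorial material:

```python
# A minimal sketch of the K-means iteration described above (illustrative only).
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) propose K centers by picking K random data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # 2) assignment step: r_ki = 1 for the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, K) squared distances
        labels = d2.argmin(axis=1)
        # 3) update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    J = ((X - centers[labels]) ** 2).sum()  # value of the objective at convergence
    return labels, centers, J

# Example usage on synthetic data with 4 well-separated groups:
X = np.concatenate([np.random.default_rng(k).normal(loc=3 * k, scale=0.5, size=(50, 2))
                    for k in range(4)])
labels, centers, J = kmeans(X, K=4)
```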
Deciding the optimal number of clusters
There is nothing preventing you from dividing \(n\) points into \(n\) clusters. The \(J\) function will be exactly 0 in that case, but you will have gained no information from your clustering procedure.
In fact, \(J\) decreases monotonically with the number of clusters. When do we stop, then? A common strategy consists in identifying an elbow in the function \(J(K)\): after a certain number of clusters, any further increase in \(K\) leads to only a small decrease in \(J\). We pick the value of \(K\) at which this slowdown begins.
image adapted from C. Bishop, Pattern recognition and machine learning
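As an illustration of the elbow strategy, the sketch below runs scikit-learn's `KMeans` for increasing \(K\) and records the objective \(J\) (exposed as `inertia_`); the synthetic data set is again just an example:

```python
# Sketch of the elbow heuristic: monitor J as a function of K (illustrative data).
import numpy as np
from sklearn.cluster import KMeans

X = np.concatenate([np.random.default_rng(k).normal(loc=3 * k, scale=0.5, size=(50, 2))
                    for k in range(4)])  # synthetic data with 4 groups

J = []
for K in range(1, 10):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    J.append(km.inertia_)  # value of the K-means objective for this K

# J decreases monotonically with K; plotting J versus K and looking for the
# elbow (here around K = 4) suggests the number of clusters to keep.
```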
Other similarity measures
There is nothing forcing you to use the Euclidean distance to identify the clusters. In fact, you could use a different measure of similarity, let us call it \(\nu\). The algorithm works just as well, provided you redefine \(J\):
\[J = \sum_{k=1}^K\sum_{i=1}^n r_{ki} \nu(\vec{x}_{i},\vec{\mu}_k)\]Of course, \(\nu\) must respect a couple of constraints (an example is sketched after the list below)
- \(\nu(x,x) = 0\),
- \(\nu(x, y) > 0\) if \(y\neq x\)
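As an example, the sketch below defines a hypothetical Gaussian-kernel dissimilarity that satisfies both constraints and uses it in the assignment step. Note that for a generic \(\nu\) the update step can no longer simply take the cluster mean; a common workaround (k-medoids) picks the cluster member that minimizes the total dissimilarity.

```python
# Hedged sketch: the assignment step of K-means with a generic dissimilarity nu.
import numpy as np

def nu(x, y, sigma=1.0):
    # nu(x, x) = 0 and nu(x, y) > 0 for y != x, as required
    return 1.0 - np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def assign(X, centers):
    # r_ki = 1 for the center with the smallest dissimilarity
    D = np.array([[nu(x, mu) for mu in centers] for x in X])
    return D.argmin(axis=1)
```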
Agglomerative Clustering
We saw how K-means can be used as a top-down approach, even with different kinds of distances. We now introduce a complementary approach, based on the idea of lumping observations together in a bottom-up fashion. At first, each data point is a cluster of its own; at the following step, the two closest points (based on a certain metric) are merged. Then we have to locate the new cluster or, in other words, define the distance between it and the other data points (the linkage criterion).
Indeed, the linkage criterion is fundamental, since it defines how clusters are merged in the tree.
Comparing different linkage criteria
Prominent examples are:
- Single linkage: the distance between two clusters is given by the minimum of the distances between their elements.
- Average linkage: the distance is given by the average of all the inter-cluster distances. Each cluster element contributes to the position of the cluster with respect to the others.
Once the first two points are merged and the new cluster is defined (by computing its distance with respect to the others), this new entity can be treated as another data point, so we can once again find the two closest points among those that remain. By doing this we build a hierarchy of clusters, a tree (dendrogram), that goes from the bottom (each point is a cluster) to the root (a single cluster containing all the points).
A dendrogram represents the hierarchy of clusters as a tree
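A minimal sketch of agglomerative clustering with scipy is shown below; the synthetic data and the choice of average linkage are illustrative only:

```python
# Sketch: build the linkage tree, plot the dendrogram, and cut it into clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.concatenate([np.random.default_rng(k).normal(loc=3 * k, scale=0.5, size=(30, 2))
                    for k in range(4)])  # synthetic data with 4 groups

Z = linkage(X, method='average', metric='euclidean')  # the hierarchy (tree)

plt.figure()
dendrogram(Z)   # visualize the tree, from the leaves (single points) to the root
plt.show()

labels = fcluster(Z, t=4, criterion='maxclust')  # cut the tree into 4 clusters
```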
DBSCAN
A different approach, which can capture curved clusters as well, relies on estimating the density. Here the main problem is that we do not know the real density of the system, so we must find a way to define it locally. Once we have a local definition, we then need to construct clusters from it somehow. There are several algorithms based on density estimation, some of which were designed by biophysicists who applied their work on biomolecules to image recognition. Here we will simply describe at a high level how a standard one, DBSCAN, works.
Local density estimator
- For each point \(\vec{x_i}\) in our set, we consider the set of points within a certain small distance \(\epsilon\) from it. Those are the neighbours of \(\vec{x_i}\): \(B_{\vec{x}_i,\epsilon} = \left\{ \vec{x} \in X | d(\vec{x_i},\vec{x})< \epsilon \right\}\) This set is called a “ball”.
- We estimate the local density as \(\rho(\vec{x_i}) = \frac{Card(B_{\vec{x_i},\epsilon})}{Vol(B_{\epsilon})}\), where \(Vol(B_{\epsilon})\) is the volume of a ball of radius \(\epsilon\). If \(d\) is the Euclidean distance in three dimensions, then \(Vol(B_{\epsilon}) = \frac{4}{3}\pi\epsilon^3\). \(Card\) indicates the number of elements of a set. A small sketch of this estimator is shown below.
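Here is a possible implementation of this estimator, assuming three-dimensional data and a Euclidean distance; the use of `scipy.spatial.cKDTree` is just a convenient choice, not part of the algorithm:

```python
# Sketch of the local density estimator: count neighbours within eps, divide by the ball volume.
import numpy as np
from scipy.spatial import cKDTree

def local_density(X, eps):
    tree = cKDTree(X)
    # neighbours within the ball of radius eps (the point itself is included)
    neighbours = tree.query_ball_point(X, r=eps)
    card = np.array([len(n) for n in neighbours])
    vol = 4.0 / 3.0 * np.pi * eps ** 3   # volume of a 3D ball of radius eps
    return card / vol
```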
Clustering with local density (DBSCAN)
- Basic idea: Clusters have density \(\rho_c > \rho_{min}\).
- \(\rho_{min}\) established by fixing two hyperparameters: \(\epsilon\) and the minimum number of points within a ball in a cluster, \(n_{min}\).
- points \(\vec{x_i}\) which have \(n_i > n_{min}\) neighbours are called internal.
- points which are within \(\epsilon\) of an internal point are called directly reachable.
- A point \(\vec{x_j}\) for which there exists a sequence of directly reachable points from an internal point \(\vec{x_i}\), is called \(reachable\).
DBSCAN proceeds as follows.
- Identify all internal points.
- Group together the internal points which are directly reachable from one another (connected components, see image).
- To each of these groups, add the reachable points (which are not internal).
The DBSCAN algorithm scheme
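In practice, one rarely implements DBSCAN by hand. A minimal sketch with scikit-learn, where `eps` and `min_samples` play the role of \(\epsilon\) and \(n_{min}\), could look like this (the data set is again synthetic):

```python
# Sketch: DBSCAN with scikit-learn on illustrative synthetic data.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.concatenate([np.random.default_rng(k).normal(loc=3 * k, scale=0.3, size=(50, 2))
                    for k in range(4)])  # synthetic data with 4 dense groups

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point; -1 marks points that are not reachable (noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```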
Exercise
Now we will show you a practical application of a clustering technique to Molecular Dynamics trajectories. You can download the zip file with the material and the Python notebook here.
Unzip the folder and start the Python notebook as usual. Remember to activate your QCB conda environment before starting the notebook.
Dimensional reduction
Data from our simulations live in a very high dimensional space. You have \(N\) atoms, and therefore \(3N\) coordinate trajectories. In the system you are studying, including water, you have about 40 000 atoms. When you think about it, that means that your system dynamics (positions and momenta) happens in \(6N = 240 000\) dimensions. Even considering only the equilibrium properties, those are in principle described by points in a \(3N = 120 000\) dimensional space.
Luckily for us, the important properties of our system are dictated by a far smaller number of degrees of freedom. We know we can ignore water in several cases, or at least the bulk of it. Furthermore, atoms in proteins are often rigidly bonded (e.g. the peptide bond), so that, for example, we can describe the whole backbone conformation of a protein with \(M\) residues using \(2M\) variables, the \(\phi\) and \(\psi\) angles.
FE profile for an extended simulation of the alanine dipeptide
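As a hedged illustration of this reduced description, the sketch below extracts the \(\phi\) and \(\psi\) backbone angles of a protein with MDAnalysis; the file names are placeholders, and the exact attribute names (e.g. `results.angles`) may depend on your MDAnalysis version:

```python
# Sketch: describe a protein backbone by its phi/psi dihedrals with MDAnalysis.
import MDAnalysis as mda
from MDAnalysis.analysis.dihedrals import Ramachandran

u = mda.Universe("protein.tpr", "trajectory.xtc")   # placeholder file names
protein = u.select_atoms("protein")

rama = Ramachandran(protein).run()
angles = rama.results.angles   # per frame and per residue: (phi, psi) pairs
```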
What happens in the simulation of biomolecules is a particular case of a more general reality, i.e. that phenomena can often be described with a reduced number of variables once we identify what is the shape of the subspace they live in (for the physicists here: working on those subspaces, or manifolds, is the whole point of Lagrangian mechanics).
An example of data ‘living’ on a curved subspace (a manifold).
Dimensional reduction of the data relative to phenomena is at the basis of science, and we will see it again in the rest of the course when we talk about coarse-grained simulations. Before that, let us see how some reduction can be achieved in simple cases.
Principal Component Analysis
PCA or Principal Component Analysis is a technique to identify those directions in a high dimensional space that capture most information about the data under investigation.
Let us see an example. Consider the motion of a ball of mass \(m\) attached to a spring in 3 dimensions. We record it with 3 cameras, which are not aligned with the oscillation plane of the ball.
Image adapted from J. Shlens, ArXiv 1404.1100
The ideal motion is a one-dimensional oscillation along the spring axis, plus a random noise \(\vec{\epsilon}\). Each camera \(\alpha\) records a point \((x_\alpha(t),y_\alpha(t))\) on its image plane, so overall we work in a 6-dimensional space. Nonetheless, we know that in this case most of the motion happens along one direction; the rest is just noise. How can we identify the relevant 1D subspace?
Signal to Noise Ratio
Signal to noise ratio for direction \(\vec{u}\)
In simple cases, like the one in the example, we can quantify the amount of “signal” along a given direction \(\vec{u}_i\) by the variance of the data points projected onto that direction. For the noise, we instead consider the variance of the data points in the orthogonal space. Clearly, if we find a direction that maximizes this ratio, we can use it to capture most of the variability observed in our data.
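The sketch below illustrates this idea: it measures the variance of the (centered) data projected onto a unit vector \(\vec{u}\); the function name and the centering step are illustrative choices:

```python
# Sketch: variance of the data along a chosen direction u.
import numpy as np

def variance_along(X, u):
    u = u / np.linalg.norm(u)    # make sure u is a unit vector
    Xc = X - X.mean(axis=0)      # center the data
    proj = Xc @ u                # scalar projection of each point on u
    return proj.var()            # the direction maximizing this is the first principal component
```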
Covariance matrix
The approach above can be aptly generalized by using the Covariance matrix \(C_{ij}\).
\[C_{ij} := \sigma^2_{ij} = \langle(x_i - \langle x_i\rangle )(x_j - \langle x_j \rangle)\rangle\]Each element of the covariance matrix captures the covariance between two variables, i.e. two components of our vectors. For example, for 4 dimensions:
\[C = \begin{pmatrix} \sigma^2_{11} & \sigma^2_{12} & \sigma^2_{13} &\sigma^2_{14}\\ \sigma^2_{21} & \sigma^2_{22} & \sigma^2_{23} &\sigma^2_{24}\\ \sigma^2_{31} & \sigma^2_{32} & \sigma^2_{33} &\sigma^2_{34}\\ \sigma^2_{41} & \sigma^2_{42} & \sigma^2_{43} &\sigma^2_{44}\\ \end{pmatrix}\]As you know, if two random variables \(X,Y\) are statistically independent, their covariance \(\textrm{Cov}(X,Y)\) is zero (the converse does not hold in general), and \(\textrm{Cov}(X,X) = \textrm{Var}(X)\), the variance of the variable \(X\). The matrix \(C\) therefore has the following properties
- The diagonal elements are the variances of the components of our vectors
- Clearly, \(C_{ij} = C_{ji}\)
- Since the matrix is symmetric, we can diagonalize it.
Intuitively, when we diagonalize it we obtain an orthonormal basis of vectors whose mutual covariances are all zero:
\[C_{diag} = \begin{pmatrix} \lambda_1 & 0 & 0 &0\\ 0 & \lambda_2 & 0 &0\\ 0 & 0 & \lambda_3 &0\\ 0 & 0 & 0 &\lambda_4\\ \end{pmatrix}\]The eigenvalues \(\lambda_i\) are the variances captured by these directions. If we sort them from highest to lowest, we obtain a ranking from the direction with the highest SNR to the one with the lowest.
In short:
- The eigenvectors of the Covariance matrix are our Principal Components
- They are sorted from highest to lowest SNR.
- One can consider only the first \(m\) eigenvectors; the ratio \(\frac{\sum_{i=1}^m \lambda_i}{\sum_{i} \lambda_i}\) then tells us how much information (variance) we captured (see the sketch below).
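A minimal numpy sketch of the whole procedure, from the covariance matrix to the projection on the first \(m\) components, could look as follows; the function `pca` is illustrative and not the implementation used in the notebook:

```python
# Sketch of PCA: covariance matrix -> diagonalization -> sorted eigenvalues -> projection.
import numpy as np

def pca(X, m):
    Xc = X - X.mean(axis=0)                      # center the data
    C = np.cov(Xc, rowvar=False)                 # covariance matrix C_ij
    eigval, eigvec = np.linalg.eigh(C)           # C is symmetric -> eigh
    order = np.argsort(eigval)[::-1]             # sort from largest to smallest variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = eigval[:m].sum() / eigval.sum()  # fraction of variance captured by m components
    Y = Xc @ eigvec[:, :m]                       # project data onto the first m components
    return Y, eigvec[:, :m], explained
```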
Some caveats
- The main limitation of PCA is that it works with linear transformations. If our data live on a curved manifold, PCA will not be able to describe this manifold.
- Of course, the principal components depend on the data: if your sample is not representative, your components will not be representative either.
For those of you more interested in the derivation of PCA, please check Bishop, Pattern Recognition and Machine Learning.
Exercise
We prepared a notebook for you to try your hand at PCA of protein trajectories. The notebook is based on MDAnalysis, as usual. We encourage you to try it with scikit-learn as well. After playing with the notebook, try to adapt it to the trajectory used in the clustering tutorial.
The material can be downloaded here. The total size of the folder is about 500 MB.