Clustering methods are widely used to group similar conformational states from molecular simulations of biomolecules in solution. For applications such as the interaction of a protein with a surface, the orientation of the protein relative to the surface is also an important clustering parameter because of its potential effect on adsorbed-state bioactivity. Here, we walk through protein conformation cluster analysis in detail.
In the context of MD, cluster analysis generally means that, given a configuration containing many atoms, we assign the atoms to different clusters according to some rule, most commonly a distance criterion. The system is thereby divided into clusters, and each cluster can then be analyzed on its own. Dividing a set of atoms into molecules according to the bonds between them works the same way: the rule is whether two atoms are bonded, and the clusters are the molecules. If two atoms are bonded, they belong to the same molecule and therefore to the same cluster.
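The rule-based grouping described above can be sketched in a few lines of Python. This is only an illustration, not what any GROMACS tool does internally: the function name `cluster_by_distance`, the coordinates, and the 0.15 nm cutoff are all made up for the example. Two atoms end up in the same cluster if they are connected by a chain of neighbours, each pair closer than the cutoff, which mirrors the "bonded atoms form a molecule" picture.

```python
from itertools import combinations
from math import dist


def cluster_by_distance(coords, cutoff):
    """Group points into clusters: two points share a cluster if a chain of
    neighbours, each pair no farther apart than `cutoff`, connects them."""
    parent = list(range(len(coords)))  # union-find forest, one tree per cluster

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) <= cutoff:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(coords)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical coordinates (nm): two "bonded" pairs and one isolated atom.
atoms = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0),
    (1.0, 1.0, 1.0), (1.0, 1.1, 1.0),
    (3.0, 3.0, 3.0),
]
clusters = cluster_by_distance(atoms, cutoff=0.15)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

The all-pairs loop is quadratic in the number of atoms, which is fine for a toy system; a real analysis would use a neighbour search and periodic boundary conditions.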
Two GROMACS programs are involved in cluster analysis. gmx clustsize performs cluster analysis in the narrow sense described above, dividing a group of atoms into different clusters. gmx cluster performs cluster analysis in a broader sense, although what the program actually implements is limited to clustering different protein conformations by RMSD, thereby dividing a large number of conformations into different categories. This approach is typically used in polypeptide folding studies, where the resulting trajectories are analyzed to see which main types of conformation the peptide adopts.
In the cluster analysis performed by gmx cluster, the objects are protein conformations, the full set of objects is every conformation in the trajectory, and the membership criterion (the distance) is RMSD. Each resulting cluster is a set of conformations whose mutual RMSD is small, and the average conformation obtained by superimposing the members of a cluster can serve as its representative.
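To make the role of RMSD as a distance concrete, here is a minimal Python sketch that builds a pairwise RMSD matrix for a toy trajectory. The function names and the three-frame data are hypothetical, and the sketch assumes the conformations are already superimposed; it does no least-squares fitting, which a real analysis would need before comparing structures.

```python
from math import sqrt


def rmsd(conf_a, conf_b):
    """RMSD between two conformations given as equal-length lists of (x, y, z).

    Assumes the conformations are already superimposed; no fitting is done.
    """
    if len(conf_a) != len(conf_b):
        raise ValueError("conformations must have the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(conf_a, conf_b)
    )
    return sqrt(squared / len(conf_a))


def rmsd_matrix(frames):
    """Symmetric matrix of pairwise RMSDs with a zero diagonal."""
    n = len(frames)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(frames[i], frames[j])
    return matrix


# Hypothetical three-frame "trajectory" of a two-atom system (nm).
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # identical to frame 0
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # every atom displaced by 1 nm along z
]
pairwise = rmsd_matrix(frames)
print(pairwise[0][1], pairwise[0][2])  # 0.0 1.0
```

A matrix like this is the only input a distance-based clustering algorithm needs; the coordinates themselves are no longer required once it has been computed.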
There are two output options for gmx cluster. By default, the RMSD matrix is written to the file rmsd.xpm. Data in xpm format is convenient to view directly but awkward to use with other plotting software. We can convert it to a more common data format, but the precision of the data obtained this way is limited. gmx cluster also has an output option, -bin, which writes the RMSD matrix directly to the file rmsd.dat. This file is binary, so it cannot be read as text, but the tools that come with the shell can handle it. One of them is od, which can dump the contents of any file in a variety of formats, so binary data is no problem. If rmsd.dat was written in single precision, use the following command:
od -f -v rmsd.dat
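If the values are needed inside a script rather than on the terminal, the same file can be read with Python's standard library. The sketch below rests on an assumption that should be checked against the od output: that the file is a bare sequence of native-endian 4-byte floats holding a full N x N matrix, with no header. The function name `read_square_matrix` and the demo file are made up for illustration; the demo writes a small synthetic file so the example runs without a real rmsd.dat.

```python
import os
import tempfile
from array import array
from math import isqrt


def read_square_matrix(path):
    """Read a headerless file of native-endian 4-byte floats as an N x N matrix.

    The file layout is an assumption; verify it before trusting the numbers.
    """
    values = array("f")
    with open(path, "rb") as handle:
        values.frombytes(handle.read())
    n = isqrt(len(values))
    if n * n != len(values):
        raise ValueError(f"{len(values)} values do not form a square matrix")
    return [list(values[row * n:(row + 1) * n]) for row in range(n)]


# Synthetic 2 x 2 file standing in for rmsd.dat.
demo_path = os.path.join(tempfile.mkdtemp(), "rmsd_demo.dat")
with open(demo_path, "wb") as handle:
    array("f", [0.0, 0.25, 0.25, 0.0]).tofile(handle)

matrix = read_square_matrix(demo_path)
print(matrix)  # [[0.0, 0.25], [0.25, 0.0]]
```

For a real file, replace `demo_path` with the path to rmsd.dat; the size check will raise an error if the number of values is not a perfect square, which is a quick hint that the assumed layout is wrong.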
gmx cluster provides several clustering algorithms, but many more exist: cluster analysis is one of the most basic tasks in machine learning, and it is used in every field that machine learning touches.
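As one concrete example, the following Python sketch implements a simple neighbour-counting scheme in the spirit of the gromos method that gmx cluster offers: count each structure's neighbours within a cutoff, take the structure with the most neighbours together with those neighbours as a cluster, remove them from the pool, and repeat. It is a simplified illustration, not the GROMACS implementation; the function name, the tie-breaking rule, the RMSD values, and the 0.1 nm cutoff are all assumptions made for the example.

```python
def neighbour_count_clustering(rmsd, cutoff):
    """Repeatedly take the structure with the most neighbours within `cutoff`
    as a cluster centre, assign its neighbours to it, and remove them all."""
    pool = set(range(len(rmsd)))
    clusters = []
    while pool:
        # Neighbours of each remaining structure (itself included, as rmsd[i][i] == 0).
        neighbours = {i: [j for j in pool if rmsd[i][j] <= cutoff] for i in pool}
        # Most neighbours wins; ties go to the lowest index for reproducibility.
        centre = min(pool, key=lambda i: (-len(neighbours[i]), i))
        members = sorted(neighbours[centre])
        clusters.append((centre, members))
        pool -= set(members)
    return clusters


# Hypothetical RMSD matrix (nm) for five conformations.
rmsd = [
    [0.00, 0.05, 0.08, 0.40, 0.45],
    [0.05, 0.00, 0.12, 0.42, 0.43],
    [0.08, 0.12, 0.00, 0.41, 0.44],
    [0.40, 0.42, 0.41, 0.00, 0.06],
    [0.45, 0.43, 0.44, 0.06, 0.00],
]
clusters = neighbour_count_clustering(rmsd, cutoff=0.1)
print(clusters)  # [(0, [0, 1, 2]), (3, [3, 4])]
```

Each entry pairs a cluster centre with its members. Note that conformations 1 and 2 land in the same cluster even though their mutual RMSD (0.12) exceeds the cutoff, because both are within the cutoff of the centre; this is a known property of centre-based schemes.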