PCA is used for exploratory data analysis and building predictive models. It is usually used for dimensionality reduction. The method is to project each data point onto only the first few main components to obtain lower-dimensional data while retaining as many data changes as possible. It can be seen from any target that the principal component is the eigenvector of the data covariance matrix. Therefore, the principal components are usually calculated by the eigen decomposition of the data covariance matrix or the singular value decomposition of the data matrix. PCA is the simplest multivariate analysis based on real feature vectors and is closely related to factor analysis. Factor analysis usually includes more domain-specific assumptions about the underlying structure and solves the matrix with slightly different eigenvectors.
This tutorial will use the principal component analysis (PCA, principal component analysis) in CPPTRAJ to study the 36-mer double-stranded DNA with sequence d (GCACGAACGAACGAACGC). To check whether DNA simulation calculations can get similar conformational spaces under different hardware. This tutorial will simulate and calculate a trajectory on the GPU, and the remaining trajectories will be simulated on the CPU. The trajectories processed in the tutorial are for example purposes. The complete data can be found here (http://amber.utah.edu/DNA-dynamics) download.
▪ Step 1: Calculate the covariance matrix of the coordinates
The number of conformations of the calculated covariance matrix we input is the same as the number of rows or columns of the matrix (i.e. 3*number of selected atoms).
Next, we will use the script: pca-cpu-gpu.cpptraj for calculation. Since the number of atoms in different trajectories may be different, it is necessary to read in the topology file (prmtop) separately, and then use the [name] command to specify the topology file of the trajectory. Set tag name or file name: cpu/cpu.prmtop, gpu/gpu.prmtop<、p>
parm cpu/cpu.prmtop [cpu]parm gpu/gpu.prmtop [gpu]
The topology file cpu/cpu.prmtop can be referred to by [cpu]. To generate the average coordinate information, the first step is to fit the DNA other atoms except hydrogen atoms to a common structure through RMS, that is, the first structure of the trajectory.
rms first :1-36&!@H=
Next, by processing all the conformations, a file named cpu-gpu-average is generated and saved
average crdset cpu-gpu-average
All the above commands will produce the average structure of the trajectory as the reference structure. But to execute these commands, we need to enter the run command to run without exiting the program.
Run
Now, we have the average structure cpu-gpu-average and the saved coordinate information of the input trajectory. Next, we need to RMS fit the saved trajectory coordinate information with the reference structure to remove the rotation and translation changes of the molecules. These require crdaction instructions to complete.
crdaction cpu-gpu-trajectories rms ref cpu-gpu-average :1-36&!@H=
Next, we use the matrix command to get the coordinate covariance matrix, and name it cpu-gpu-covar.
crdaction cpu-gpu-trajectories matrix covar \ name cpu-gpu-covar :1-36&!@H=
▪ Step 2: Calculate the principal components and coordinate mapping
runanalysis diagmatrix cpu-gpu-covar out cpu-gpu-evecs.dat \ vecs 3 name myEvecs \ nmwiz nmwizvecs 3 nmwizfile dna.nmd nmwizmask :1-36&!@H=
crdaction cpu-gpu-trajectories projection CPU modes myEvecs \ beg 1 end 3 :1-36&!@H= crdframes 1,10001crdaction cpu-gpu-trajectories projection GPU modes myEvecs \ beg 1 end 3 :1-36&!@H= crdframes 10002,last
The results obtained after the mapping process from the cpu and gpu trajectories are named CPU and GPU respectively. As long as these mapping results are obtained, we can normalize the first three histograms of each trajectory mapping result. Finally, enter the run command to run these processing instructions.
hist CPU:1 bins 100 out cpu-gpu-hist.agr norm name CPU-1hist CPU:2 bins 100 out cpu-gpu-hist.agr norm name CPU-2hist CPU:3 bins 100 out cpu-gpu-hist.agr norm name CPU-3hist GPU:1 bins 100 out cpu-gpu-hist.agr norm name GPU-1hist GPU:2 bins 100 out cpu-gpu-hist.agr norm name GPU-2hist GPU:3 bins 100 out cpu-gpu-hist.agr norm name GPU-3run
▪ Step 3: Visualize the principal component movement pattern
Now that the analysis is complete, we can run the clear all command to delete all the data cached in the memory to ensure that the next analysis is completed without any other interference.
clear all
The next step is to visualize the fluctuation changes corresponding to the feature motion pattern. To achieve this, we use the readdata command to read the previously generated feature vector file.
readdata cpu-gpu-evecs.dat name Evecs
Read in the topology file and modify it to make it consistent with the topology of the calculated coordinate covariance matrix (and later eigenvectors).
parm cpu/cpu.prmtopparmstrip !(:1-36&!@H=)parmwrite out cpu-gpu-modes.prmtop
Create a pseudo-motion trajectory file in NetCDF format under the first dominant PC. The minimum and maximum values can be selected by the histogram of the PC mapping result.
runanalysis modes name Evecs trajout cpu-gpu-mode1.nc \ pcmin -100 pcmax 100 tmode 1 trajoutmask :1-36&!@H= trajoutfmt netcdf
Now you can open the file in Chimera/VMD and watch the video of the first motion mode that dominates the PC, cpu-gpu-modes.prmtop and cpu-gpu-modes.nc.