As the price of transcriptome sequencing drops, the number of samples sequenced per study is steadily increasing. WGCNA (weighted gene co-expression network analysis), a method suited to large sample sets, is increasingly used to study associations between genes and diseases or other traits. Its biggest advantage is that it can divide thousands of genes across multiple samples into several to dozens of modules according to their expression patterns, and then analyze them module by module, which reduces computational complexity and improves accuracy. The genes in each module share a specific expression pattern. We can run correlation and cluster analyses among the modules, or between samples and modules, to characterize each module. If phenotypic information such as traits is available, we can also analyze the relationship between modules and traits to find the modules most relevant to a given trait.
From a methodological point of view, WGCNA has two parts: expression cluster analysis and phenotypic correlation. It consists of four main steps: calculating the correlation coefficients between genes, identifying gene modules, constructing the co-expression network, and correlating modules with traits.
The first step is to calculate the correlation coefficient (Pearson coefficient) between every pair of genes. To decide whether two genes have similar expression patterns, the usual approach is to set a screening threshold and treat pairs above it as similar. However, if the threshold is set to 0.8, it is hard to justify treating 0.8 and 0.79 as significantly different. WGCNA therefore uses a weighted value of the correlation coefficient: the gene correlation coefficient is raised to the power N, with N chosen so that the connections between genes in the network follow a scale-free network distribution. This weighting is considered more biologically meaningful than a hard cutoff.
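The contrast between a hard threshold and the weighted value can be sketched in a few lines of Python. This is only an illustration, not the WGCNA implementation itself: the function names are hypothetical, the absolute value of the correlation is taken before raising it to the power (an unsigned network, which is an assumption), and the exponent 6 is an arbitrary example value for N rather than one chosen by a scale-free fit.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def hard_adjacency(r, cutoff=0.8):
    """Hard threshold: a gene pair is either connected (1) or not (0)."""
    return 1 if abs(r) >= cutoff else 0

def soft_adjacency(r, power=6):
    """Weighted value: |r| raised to a power (power=6 is an assumed example)."""
    return abs(r) ** power

def adjacency_matrix(expr, power=6):
    """expr maps gene name -> list of expression values across samples.
    Returns a nested dict of weighted connection strengths."""
    genes = list(expr)
    return {
        g: {h: 1.0 if g == h else soft_adjacency(pearson(expr[g], expr[h]), power)
            for h in genes}
        for g in genes
    }

# Toy expression profiles (made-up numbers, four samples per gene).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.2],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}
adj = adjacency_matrix(expr)
```

With the hard cutoff, correlations of 0.8 and 0.79 land on opposite sides of the threshold (1 versus 0), while the weighted values `soft_adjacency(0.8)` and `soft_adjacency(0.79)` differ only slightly, so nearly identical correlations receive nearly identical connection strengths.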
The second step is to build a hierarchical clustering tree from the correlation coefficients between genes. Each branch of the tree represents a gene module, and each module is marked with its own color. Using the weighted correlation coefficients, genes are classified by expression pattern, and genes with similar patterns are grouped into the same module. In this way, tens of thousands of genes can be reduced to dozens of modules based on their expression patterns, which is essentially a process of extracting summary information.
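The grouping step can be illustrated with a small, self-contained clustering sketch. It rests on several simplifying assumptions that are not taken from the text above: the dissimilarity between two genes is given directly as a number between 0 and 1 (for example, 1 minus the weighted correlation), clusters are merged by average linkage, and the tree is cut at a fixed height. The function name, the toy values, and the color names are all hypothetical.

```python
def average_linkage_modules(genes, dissim, cut_height):
    """Agglomerative clustering with average linkage.
    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than cut_height, which is equivalent to cutting
    the clustering tree at that height."""
    clusters = [[g] for g in genes]

    def avg_dist(a, b):
        # Mean pairwise dissimilarity between the members of two clusters.
        return sum(dissim[g][h] for g in a for h in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = avg_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cut_height:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Toy dissimilarities (made-up): g1/g2 are close, g3/g4 are close,
# every other pair is far apart.
genes = ["g1", "g2", "g3", "g4"]
dissim = {g: {h: 0.0 if g == h else 0.9 for h in genes} for g in genes}
for (g, h), d in {("g1", "g2"): 0.1, ("g3", "g4"): 0.2}.items():
    dissim[g][h] = dissim[h][g] = d

modules = average_linkage_modules(genes, dissim, cut_height=0.5)

# Label each module with a color, as is done on the clustering tree.
palette = ["turquoise", "blue", "brown", "yellow"]
gene_color = {g: palette[k] for k, module in enumerate(modules) for g in module}
```

On this toy input the four genes fall into two modules, one containing g1 and g2 and the other containing g3 and g4. The cut height controls how many modules come out: a lower value yields more, smaller modules, and a higher value merges them.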