Overview of the Gene Expression Analysis Script
This script analyzes multiple Gene Expression Omnibus (GEO) datasets end to end: it downloads the data, normalizes it, corrects batch effects, runs a meta-analysis, and visualizes differential gene expression. It consists of 18 custom functions, each handling one stage of the analysis.
Key Functions
1. DownloadGEO:
Downloads the data for a specified GEO accession number and stores it locally, making it ready for further analysis.
2. ReadGEO:
Reads and processes the downloaded GEO dataset, converting it into a matrix format for downstream analysis.
3. Makephenotype:
Generates phenotype data files from GEO datasets based on the project name and comparison factors (e.g., 'grade').
4. MakeBoxPlot:
Creates boxplots to visualize gene expression data distribution both before and after normalization, highlighting the variance within the dataset.
5. MakePCA:
Performs Principal Component Analysis (PCA) and generates visualizations to explore sample grouping based on experimental factors such as 'grade.'
6. VSNQuantilNorm:
Applies Variance Stabilizing Normalization (VSN) and quantile normalization to the gene expression data, providing normalized datasets for further analysis.
7. Mergestudies:
Merges gene expression matrices from multiple studies without batch effect correction, allowing a comparison across multiple datasets.
8. StudyBatchEffect:
Corrects for batch effects in merged datasets using statistical methods, ensuring that technical variation is minimized and biological variation is emphasized.
9. MakeDensityPlot:
Generates density plots to assess the distribution of gene expression levels in datasets before and after batch effect correction.
10. MakeMDSPlot:
Creates Multidimensional Scaling (MDS) plots to visualize similarities and differences between samples based on gene expression profiles.
11. CochranQTest:
Performs Cochran's Q test to evaluate the heterogeneity among studies, a critical step in meta-analysis when comparing multiple datasets.
12. FEMREMAnalysis:
Conducts meta-analysis using either the Fixed Effects Model (FEM) or Random Effects Model (REM) to identify significantly differentially expressed genes across multiple studies.
13. Makevolcano:
Generates a volcano plot for differential gene expression analysis. The plot visually represents the effect size versus statistical significance and highlights upregulated and downregulated genes.
14. MakeVenna:
Creates a Venn diagram to visualize the overlap of upregulated and downregulated genes across multiple datasets. It also identifies the genes common to all datasets and writes the shared differentially expressed genes (DEGs) to files.
15. Makenetworkanalyst:
Prepares the gene expression data matrix for use in the NetworkAnalyst platform, a tool used for network-based analysis.
16. MakeHeatmap:
Generates heatmaps for visualizing the expression patterns of top differentially expressed genes, grouping samples based on experimental conditions such as 'grade.'
17. GSEAAnalysis:
Performs Gene Set Enrichment Analysis (GSEA) to identify significantly enriched pathways, using gene sets from MSigDB and plotting the results by normalized enrichment score (NES).
18. MakeQQPlot:
Creates a QQ plot using the results from Cochran’s Q test to assess the distribution of p-values and identify potential discrepancies in the meta-analysis.
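To make the meta-analysis steps behind CochranQTest and FEMREMAnalysis more concrete, here is a minimal base-R sketch for a single gene. It is not the script's own code, and it does not reproduce the GeneMeta API. The function name meta_one_gene, the example effect sizes, and the choice of the DerSimonian-Laird estimator for the between-study variance are all assumptions made for illustration.

```r
# Hypothetical per-study effect sizes (e.g. standardized mean differences)
# and their sampling variances for one gene across four studies.
effect   <- c(0.80, 1.10, 0.35, 0.95)
variance <- c(0.04, 0.09, 0.05, 0.06)

# Illustrative helper (not part of the script): Cochran's Q plus
# fixed-effects (FEM) and random-effects (REM) pooled estimates.
meta_one_gene <- function(effect, variance) {
  k <- length(effect)
  w <- 1 / variance                          # inverse-variance weights

  # Fixed-effects model: weighted mean of the study effects
  mu_fem <- sum(w * effect) / sum(w)
  se_fem <- sqrt(1 / sum(w))

  # Cochran's Q: weighted squared deviations from the FEM estimate,
  # compared against a chi-squared distribution with k - 1 df
  q   <- sum(w * (effect - mu_fem)^2)
  q_p <- pchisq(q, df = k - 1, lower.tail = FALSE)

  # DerSimonian-Laird between-study variance, truncated at zero
  tau2 <- max(0, (q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))

  # Random-effects model: weights also account for tau2
  w_rem  <- 1 / (variance + tau2)
  mu_rem <- sum(w_rem * effect) / sum(w_rem)
  se_rem <- sqrt(1 / sum(w_rem))

  list(
    Q = q, Q_pvalue = q_p, tau2 = tau2,
    mu_FEM = mu_fem,
    p_FEM  = 2 * pnorm(abs(mu_fem / se_fem), lower.tail = FALSE),
    mu_REM = mu_rem,
    p_REM  = 2 * pnorm(abs(mu_rem / se_rem), lower.tail = FALSE)
  )
}

res <- meta_one_gene(effect, variance)
str(res)
```

A small Q p-value suggests the studies disagree more than sampling error alone would explain, which is the usual argument for preferring the REM estimate over the FEM estimate for that gene. Applying the helper to every gene and plotting the resulting Q p-values against their expected quantiles gives the kind of QQ plot that MakeQQPlot is described as producing.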
Usage Flow
- Download and process GEO datasets with DownloadGEO and ReadGEO.
- Preprocess and normalize data using VSNQuantilNorm.
- Perform exploratory analyses such as PCA and boxplots using MakePCA and MakeBoxPlot.
- Correct for batch effects and visualize data distribution before and after correction using StudyBatchEffect, MakeDensityPlot, and MakeMDSPlot.
- Conduct meta-analysis using FEMREMAnalysis, generate volcano plots (Makevolcano), and visualize common DEGs with Venn diagrams (MakeVenna).
- Perform network analysis with Makenetworkanalyst and Gene Set Enrichment Analysis with GSEAAnalysis.
- Generate heatmaps to visualize top DEGs (MakeHeatmap) and evaluate heterogeneity with CochranQTest and QQ plots (MakeQQPlot).
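The normalization step in the flow above can be illustrated with a small base-R sketch of quantile normalization, one of the two techniques VSNQuantilNorm is described as applying. This is not the script's implementation: the helper name quantile_normalize and the simulated matrix are hypothetical, ties are broken by order of appearance (packaged implementations may handle ties differently), and the input is assumed to be a numeric matrix with genes in rows, samples in columns, and no missing values.

```r
# Illustrative helper (not part of the script): force every sample
# (column) to share the same empirical distribution.
quantile_normalize <- function(x) {
  # Rank of each gene within its sample; ties broken by position
  ranks <- apply(x, 2, rank, ties.method = "first")

  # Sort each sample, then average across samples at each rank to get
  # the reference distribution
  sorted    <- apply(x, 2, sort)
  reference <- rowMeans(sorted)

  # Replace each value by the reference value at its rank
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated log-expression values for three samples whose location and
# spread differ, as they might between arrays or studies.
set.seed(1)
expr <- cbind(
  s1 = rnorm(50, mean = 8, sd = 1.0),
  s2 = rnorm(50, mean = 9, sd = 2.0),
  s3 = rnorm(50, mean = 7, sd = 0.5)
)

norm <- quantile_normalize(expr)

# Per-sample quantiles before and after: the columns differ before
# normalization and agree afterwards.
print(apply(expr, 2, quantile))
print(apply(norm, 2, quantile))
```

Comparing the two quantile tables is the numeric counterpart of the before-and-after boxplots produced by MakeBoxPlot. Within each sample the ordering of genes is unchanged; only the values they map to are shared across samples.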
R packages used in Phase 3:
- GEOquery: For downloading and processing datasets from the GEO database.
- limma: For linear modeling and differential expression analysis.
- sva: For batch effect correction using surrogate variable analysis (SVA) or ComBat.
- vsn: For variance stabilization and normalization of microarray data.
- GeneMeta: For performing meta-analysis across multiple gene expression datasets.
- ComplexHeatmap: For generating advanced and customizable heatmaps.
- clusterProfiler: For functional enrichment analysis (GO, KEGG, etc.).
- fgsea: For fast gene set enrichment analysis (GSEA).
- msigdbr: For accessing MSigDB gene sets used in enrichment analysis.
- ggpubr: For enhancing ggplot2 visualizations with publication-ready themes.
- reshape2: For reshaping data between wide and long formats.
- caret: For creating and evaluating machine learning models and pipelines.
- rgl: For 3D visualization and rendering in R.
- ggvenn: For generating Venn diagrams using ggplot2 style.
- plotly: For converting static plots into interactive visualizations.
- RColorBrewer: For color palette selection in visualizations.
- tidyverse: A collection of R packages (including dplyr, ggplot2, readr, etc.) for data manipulation and visualization.
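Before running the script, it can help to check which of the packages listed above are already installed. The sketch below does only that: it installs nothing, and the variable names are illustrative. The suggestion to use BiocManager::install() rests on the assumption that BiocManager is available; it can install packages from both Bioconductor and CRAN.

```r
# Packages named in the list above
required_pkgs <- c(
  "GEOquery", "limma", "sva", "vsn", "GeneMeta", "ComplexHeatmap",
  "clusterProfiler", "fgsea", "msigdbr", "ggpubr", "reshape2", "caret",
  "rgl", "ggvenn", "plotly", "RColorBrewer", "tidyverse"
)

# Compare against the installed library without loading any package
installed    <- rownames(installed.packages())
missing_pkgs <- setdiff(required_pkgs, installed)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  message("One option (assumes BiocManager is installed): ",
          "BiocManager::install(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All ", length(required_pkgs), " packages are installed.")
}
```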