Differential Gene Expression Analysis in Atopic Dermatitis Using DESeq2

Introduction to Differential Gene Expression in Atopic Dermatitis (AD) Patients

In this analysis, we perform Differential Gene Expression (DEG) analysis using the DESeq2 package on the GSE157194 dataset, which consists of RNA-seq data from patients with Atopic Dermatitis (AD). This dataset includes a total of 166 samples obtained from 57 patients with moderate to severe AD, prior to and after systemic treatment.

Dataset Overview:

GSE157194: RNA-seq data from AD patients.
111 samples before treatment: Includes lesional (AL) and non-lesional (AN) skin biopsies. Non-lesional samples were taken at least 5 cm from the active lesion, ensuring that the comparison between these conditions captures the underlying gene expression differences in the disease state before any intervention.
40 samples post-treatment with Dupilumab (3 months).
15 samples post-treatment with Cyclosporine (3 months).

Objective of This Analysis:

Our primary focus is to conduct a differential expression analysis between the lesional (AL) and non-lesional (AN) skin samples before treatment. This helps in identifying key differentially expressed genes (DEGs) and associated pathways that distinguish diseased skin from non-diseased skin. Insights gained from this comparison can:
Reveal potential biomarkers for AD diagnosis and monitoring.
Highlight novel therapeutic targets or pathways relevant to AD pathogenesis.
Provide groundwork for drug discovery and personalized medicine by identifying pathways that could respond to specific therapeutic interventions.

Comprehensive Analysis Approach:

This analysis incorporates various bioinformatics techniques and visualization tools (e.g., PCA, volcano plots, etc.) to ensure a holistic understanding of the gene expression profiles. While some aspects of the analysis may provide standalone insights, the goal is to create a comprehensive reference for DEG analysis using DESeq2. This can serve as a functional analysis resource for future studies.

In addition to the DEG analysis, we also perform pathway and gene enrichment analysis, including:

KEGG and GO term analyses.
Gene Set Enrichment Analysis (GSEA) and Overrepresentation Analysis (ORA) to explore the biological processes, molecular functions, and cellular components involved in AD.

1. Load Necessary Libraries

Before starting the analysis, we need to load the essential R libraries required for data manipulation, visualization, and performing differential gene expression (DEG) analysis.


	# Load necessary libraries for RNA-seq analysis and visualization
	library(GEOquery)        # Load the GEOquery package to access GEO datasets
	library(DESeq2)          # For differential expression analysis
	library(data.table)      # Efficient data manipulation
	library(umap)            # Dimensionality reduction for visualization
	library(ggplot2)         # General-purpose plotting
	library(clusterProfiler) # For functional enrichment analysis
	library(org.Hs.eg.db)    # Human genome annotations
	library(VennDiagram)     # Venn diagram visualization
	library(knitr)           # Report generation and table formatting (for kable)
	library(tinytex)         # LaTeX support for report generation

2. Load and Prepare Data

2.1. Retrieving Data from GEO

In this step, we download the raw gene expression data (count matrix) and the sample information (referred to as metadata or phenotype data) from the Gene Expression Omnibus (GEO) using GEO's accession ID GSE157194. The count matrix contains raw gene expression counts, while the phenotype data contains metadata for each sample (such as treatment groups and sample type).

In this code (Code Block 2.1), we begin by downloading and preparing the dataset for analysis. Step 1 involves defining the GEO accession ID (gse_id <- "GSE157194") and using getGEOSuppFiles() from the GEOquery library to download the supplementary count matrix, which is then loaded into R using read.table() after decompressing the .gz file. In Step 2, the phenotype data (metadata) is obtained using getGEO() from GEOquery, and stored in metadata. The metadata provides additional descriptive information about the samples in the count matrix. Step 3 prints summary information ( Figure 2.1), including the number of genes and samples in the count matrix and the number of samples and features in the metadata, after downloading the data from GEO, for a quick overview of the dataset.

Code Block 2.1. Retrieving Data from GEO


	# ===================================================================
	# Step 1. Download Supplementary Files for the Count Matrix
	# ===================================================================

	# Define the GEO accession ID
	gse_id <- "GSE157194"
	
	# Download the supplementary files for this GEO dataset
	getGEOSuppFiles(gse_id)
	
	# Check the list of files downloaded
	list.files(gse_id)
	
	# ================================
	# Path to the downloaded file
	supp_file_path <- paste0(gse_id, "/", "GSE157194_Raw_gene_counts_matrix.txt.gz")
	
	# Unzip and load the file into R as a data frame (counts matrix)
	counts_matrix <- read.table(gzfile(supp_file_path), header = TRUE, row.names = 1)
 
	# ===================================================================
	# Step 2. Download Phenotype Data
	# ===================================================================
	# Specify the GEO accession number
	gse_id <- "GSE157194"
	
	
	# Download the phenotype data (GEO series matrix)
	gse_data <- getGEO(gse_id, GSEMatrix = TRUE, AnnotGPL = TRUE)
	
	# If multiple series are available, choose the first one
	gse_data <- gse_data[[1]]
	
	# Extract metadata from the GEO dataset
	# We access the metadata (also known as phenotype data) stored in the 'phenoData' slot of the GEO object.
	metadata <- gse_data@phenoData@data
	
	# ===================================================================
	# Step 3. Print Summary Information about the Count Matrix and Metadata
	# ===================================================================
	cat("\n\n# Summary Information of the Count Matrix and Metadata (Post-GEO Download)\n")
	cat("=========================================================\n")
	
	# Count matrix summary
	cat("\nCount Matrix Summary:\n")
	cat("---------------------------------------------------------\n")
	cat("Number of Genes (Rows): ", nrow(counts_matrix), "\n")
	cat("Number of Samples (Columns): ", ncol(counts_matrix), "\n\n")
	
	
	# Metadata summary
	cat("\nMetadata Summary:\n")
	cat("---------------------------------------------------------\n")
	cat("Number of Samples (Rows): ", nrow(metadata), "\n")
	cat("Number of Features (Columns): ", ncol(metadata), "\n\n")
	
	cat("\n=========================================================\n")

Heatmap 1 — Figure 2.1. Summary Information of the Count Matrix and Metadata (Post-GEO Download). This figure displays the number of genes and samples in the count matrix, as well as the number of samples and features in the metadata after downloading from the GEO dataset.

2.2. Preparing Data

Effective data preparation is crucial for ensuring high-quality and accurate results in downstream analyses. This step includes identifying and removing potential outlier samples to avoid biases and pre-filtering low-count genes to enhance the overall data quality.

2.2.1. Identify and Remove Outlier Samples

Outlier samples, particularly those with significantly low or high library sizes, can introduce biases into the analysis and skew the results. Therefore, it is crucial to identify and remove such samples before conducting further data manipulations to ensure that the remaining data is representative of high-quality samples, allowing for reliable downstream analyses.
Traditionally, removing samples below the lower threshold (calculated as Q1 - 1.5 * IQR) is a common practice, as these samples often indicate technical issues, such as failed sequencing or low quality. Samples above the upper threshold (Q3 + 1.5 * IQR) are typically retained, as they are often biologically meaningful (e.g., having high sequencing depth), and removing them could risk discarding valuable information. However, if it can be demonstrated that such samples introduce bias or are clearly problematic, it might be necessary to remove them to maintain data integrity.
Based on the dot plot of library sizes, we identified a total of six outlier samples from 166 original samples: three with significantly low counts (Patient_12_AN_m3, Patient_24_AN_m3, Patient_31_AL_m0) and three with significantly high counts (Patient_41_AL_m3, Patient_42_AL_m0, Patient_48_AL_m3). Although the high-count samples might be biologically distinct, their presence could mask potential biomarkers within the majority of the population by over-influencing the analysis . Therefore, to ensure that we focus on identifying biomarkers that are representative of the majority, we have decided to remove both lower and upper threshold outliers.
The provided code is written to handle either scenario, allowing users to remove only the lower threshold outliers, or both lower and upper threshold outliers. Comments within the code explain which lines to uncomment or comment out based on the user's chosen approach. In this case, both thresholds are applied to remove the six outlier samples, resulting in a filtered dataset that better represents the core population.
Not all the plots presented in the code are necessary to generate; they are provided here to demonstrate different ways of visualizing the effect of removing outlier samples.
In this code (Code Block 2.2.1), we begin with Step 1, which involves calculating the library sizes for each sample using colSums(counts_matrix). This operation returns the total count of all reads for each sample. The summary statistics (summary(library_sizes)) is then printed to provide an overview of the data distribution (Min= 9,628,756, 1st Qu=20,647,659, Median=22,896,256, Mean=23,218,539, 3rd Qu= 25,719,325, Max= 43,076,805). Next, in Step 1.1, we set two thresholds for outlier identification, using quantiles (Q) and the interquartile range (IQR). The first quartile (quantile(library_sizes, 0.25)) and the third quartile (quantile(library_sizes, 0.75)) represent the 25th and 75th percentiles of the data, respectively. The IQR (IQR(library_sizes)) indicates the spread of the central 50% of the data and helps to assess data variability. We then calculate the lower threshold as Q1 - 1.5 * IQR, which targets low outliers that may indicate poor-quality or problematic samples. Optionally, the upper threshold is calculated as Q3 + 1.5 * IQR to identify samples with unusually high counts, which may also be removed if deemed necessary. In Step 1.2, depending on the user's choice, either or both thresholds can be used to define outlier samples. These outliers are identified using logical conditions, and their indices are used to remove them from both "counts_matrix" and "metadata", resulting in the creation of filtered versions ("counts_matrix_filtered" and "metadata_filtered") (Step 3). After identifying the outliers, in Step 2, we store the original dimensions of both "counts_matrix" and "metadata" before filtering. This allows us to compare them against the dimensions after outlier removal, which we store in Step 3.3. In Step 3, we remove the outlier samples from the "counts_matrix" (Step 3.1) and "metadata" (Step 3.2), ensuring that the metadata matches the "filtered counts_matrix". After filtering, we reassign the filtered versions back to "counts_matrix" and "metadata" to ensure consistency for subsequent analysis steps (Step 6). In Step 4, following the removal of outliers, we create various visualizations (dot plot (Figure 2.2.1.1), bar plot (Figure 2.2.1.2), boxplot (Figure 2.2.1.3), density plot (Figure 2.2.1.4), and violin plot (Figure 2.2.1.5)) to illustrate the changes in library sizes before and after filtering. These visualizations provide insights into the effects of outlier removal on the dataset. Specifically:

Boxplots and Violin plots are used to compare the distributions of library sizes before and after filtering.
Dot plots annotate specific outlier samples that were removed.
Bar plots allow for a direct comparison of individual sample library sizes before and after filtering.
Density plots provide an overview of the entire distribution of library sizes.

In Step 5, we generate visualizations to confirm the changes in the dimensions of the "counts_matrix" and "metadata" before and after outlier removal using dimensions comparison plot (Figure 2.2.1.6) as well as Summary Information (Figure 2.2.1.7). To achieve this, we create data frames representing the count matrix dimensions and metadata dimensions, both before and after filtering (Step 5.1 and Step 5.2). These data frames are then used to generate four bar plots, which illustrate the number of genes, samples, and metadata features before and after filtering. These four plots are combined into a single figure using the grid.arrange() function from the gridExtra library (Step 5.3). This combined visualization provides a clear overview of how filtering affected the dataset dimensions.

Code Block 2.2.1: Identifying and Removing Outlier Samples, and Visualizing the Impact


	# ========================================================
	# Load necessary libraries
	# ========================================================
	library(ggplot2)
	library(gridExtra) # For arranging multiple plots in one figure
	
	# ========================================================
	# Step 1: Calculate Library Size and Identify Outliers
	# ========================================================
	
	# Calculate library sizes
	library_sizes <- colSums(counts_matrix)
	
	# Optionally, print summary to check for anomalies
	summary(library_sizes)
	
	# Identify potential outlier samples based on library size
	# Step 1.1: Set thresholds for outlier identification
	# Set the lower threshold to identify samples with low counts
	lower_threshold <- quantile(library_sizes, 0.25) - 1.5 * IQR(library_sizes)
	
	# Set the upper threshold to identify samples with very high counts
	upper_threshold <- quantile(library_sizes, 0.75) + 1.5 * IQR(library_sizes)
	
	
	# Step 1.2: Find outlier samples
	# Uncomment the corresponding line depending on whether you want to use:
	# - Only the lower threshold
	# - Only the upper threshold
	# - Both lower and upper thresholds
	
	# Use only the lower threshold:
	# outlier_samples <- colnames(counts_matrix)[library_sizes < lower_threshold]
	
	# Use both lower and upper thresholds:
	outlier_samples <- colnames(counts_matrix)[library_sizes < lower_threshold | library_sizes > upper_threshold]
	
	# ========================================================
	# Step 2: Store Original Dimensions Before Filtering (First  part for Plotting Dimensions of Count Matrix and Metadata)
	# ========================================================
	
	# Store original dimensions before filtering
	counts_matrix_before <- dim(counts_matrix)  # [number of genes, number of samples]
	metadata_before <- dim(metadata)            # [number of samples, number of features]
	
	
	# ========================================================
	# Step 3: Remove Outlier Samples from Data
	# ========================================================
	
	# Step 3.1: Remove outlier samples from counts matrix
	counts_matrix_filtered <- counts_matrix[, !(colnames(counts_matrix) %in% outlier_samples)]
	
	# Step 3.2: Remove outlier samples from metadata
	metadata_filtered <- metadata[colnames(counts_matrix_filtered), ]
	
	# Step 3.3: Store dimensions after filtering (Second part for Plotting Dimensions of Count Matrix and Metadata)
	counts_matrix_after <- dim(counts_matrix_filtered)
	metadata_after <- dim(metadata_filtered)
	
	# Step 3.4: Calculate filtered library sizes
	filtered_library_sizes <- colSums(counts_matrix_filtered)
	
	# ========================================================
	# Step 4: Plot the Filtered Library Sizes (Box, Violin, Dot, Bar, Density plots)
	# ========================================================
	# Step 4.1: Prepare Data for Plotting:
	
	# Combine original and filtered library sizes into a data frame for comparison
	comparison_df <- data.frame(
	  Sample = c(names(library_sizes), names(filtered_library_sizes)),
	  Library_Size = c(library_sizes, filtered_library_sizes),
	  Group = c(rep("Original", length(library_sizes)), rep("Filtered", length(filtered_library_sizes)))
	)
	
	
	# Step 4.2: Boxplot of Library Sizes Before and After Filtering
	box_plot <- ggplot(comparison_df, aes(x = Group, y = Library_Size, fill = Group)) +
	  geom_boxplot() +
	  theme_minimal() +
	  labs(title = "Boxplot of Library Sizes Before and After Removing Outliers",
	       x = "Group", y = "Total Counts per Sample") +
	  theme(legend.position = "none")
	
	ggsave("Figures/2.2.1.Outlier_Samples/Figure2.2.1.3.Box_Plot_Library_Sizes.png", plot = box_plot, width = 8, height = 5, bg = "white")
	
	
	# Step 4.3: Violin Plot of Library Sizes Before and After Filtering
	violin_plot <- ggplot(comparison_df, aes(x = Group, y = Library_Size, fill = Group)) +
	  geom_violin(trim = FALSE, alpha = 0.6) +
	  theme_minimal() +
	  labs(title = "Violin Plot of Library Sizes Before and After Removing Outliers",
	       x = "Group", y = "Total Counts per Sample") +
	  theme(legend.position = "none")
	
	ggsave("Figures/2.2.1.Outlier_Samples/Figure2.2.1.5.Violin_Plot_Library_Sizes.png", plot = violin_plot, width = 8, height = 5, bg = "white")
	
	
	
	# Step 4.4: Dot Plot with Annotation of Outliers
	dot_plot <- ggplot(comparison_df, aes(x = Sample, y = Library_Size, color = Group)) +
	  geom_point() +
	  theme_minimal() +
	  labs(title = "Dot Plot of Library Sizes with Outlier Annotation", 
	       x = "Sample", y = "Total Counts per Sample") +
	  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
	  geom_text(data = subset(comparison_df, Group == "Original" & Sample %in% outlier_samples), 
		    aes(label = Sample), vjust = -1, size = 3, color = "red", angle = 5)
	
	ggsave("Figures/2.2.1.Outlier_Samples/Figure2.2.1.1.Dot_Plot_Library_Sizes.png", plot = dot_plot, width = 10, height = 5, bg = "white")
	
	
	
	# Step 4.5: Comparative Bar Plot Before and After Filtering
	bar_plot <- ggplot(comparison_df, aes(x = Sample, y = Library_Size, fill = Group)) +
	  geom_bar(stat = "identity", position = "dodge") +
	  theme_minimal() +
	  labs(title = "Comparative Bar Plot of Library Sizes Before and After Removing Outliers", 
	       x = "Sample", y = "Total Counts per Sample") +
	  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
	
	ggsave("Figures/2.2.1.Outlier_Samples/Figure2.2.1.2.Bar_Plot_Library_Sizes.png", plot = bar_plot, width = 10, height = 5, bg = "white")
	
	
	
	# Step 4.6: Side-by-Side Density Plot Before and After Filtering
	density_plot <- ggplot(comparison_df, aes(x = Library_Size, fill = Group)) +
	  geom_density(alpha = 0.5) +
	  theme_minimal() +
	  labs(title = "Density Plot of Library Sizes Before and After Removing Outliers", 
	       x = "Total Counts per Sample", y = "Density") +
	  scale_x_log10() +
	  theme(legend.position = "top")
	
	ggsave("Figures/2.2.1.Outlier_Samples/Figure2.2.1.4.Density_Plot_Library_Sizes.png", plot = density_plot, width = 8, height = 5, bg = "white")
	
	
	# ========================================================
	# Step 5: Plot Dimensions of Count Matrix and Metadata (Third  part for Plotting Dimensions of Count Matrix and Metadata)
	# ========================================================
	
	# Step 5.1: Create Data Frames for Count Matrix Dimensions
	counts_matrix_dims <- data.frame(
	  Measure = rep(c("Samples", "Genes"), each = 2),
	  Stage = rep(c("Before Filtering", "After Filtering"), 2),
	  Count = c(counts_matrix_before[2], counts_matrix_after[2],
		    counts_matrix_before[1], counts_matrix_after[1])
	)
	
	# Step 5.2: Create Data Frames for Count Matrix and Metadata Dimensions
	metadata_dims <- data.frame(
	  Measure = rep(c("Samples", "Features"), each = 2),
	  Stage = rep(c("Before Filtering", "After Filtering"), 2),
	  Count = c(metadata_before[1], metadata_after[1],
		    metadata_before[2], metadata_after[2])
	)
	
	
	# Step 5.3: Generate and Combine Plots for Dimensions
	
	# Plot for count matrix samples before and after filtering
	count_samples_plot <- ggplot(subset(counts_matrix_dims, Measure == "Samples"), aes(x = Stage, y = Count, fill = Stage)) +
	  geom_bar(stat = "identity") +
	  geom_text(aes(label = Count), vjust = -0.5) +  # Add count labels on top of bars
	  theme_minimal() +
	  labs(title = "Count Matrix: Number of Samples (Columns)", y = "Samples", x = "") +
	  theme(legend.position = "none") +
	  scale_fill_manual(values = c("Before Filtering" = "#4682B4", "After Filtering" = "#8A2BE2")) # Change colors
	
	# Plot for count matrix genes (transcripts) before and after filtering
	count_genes_plot <- ggplot(subset(counts_matrix_dims, Measure == "Genes"), aes(x = Stage, y = Count, fill = Stage)) +
	  geom_bar(stat = "identity") +
	  geom_text(aes(label = Count), vjust = -0.5) +  # Add count labels on top of bars
	  theme_minimal() +
	  labs(title = "Count Matrix: Number of Genes (Rows)", y = "Genes (Transcripts)", x = "") +
	  theme(legend.position = "none") +
	  scale_fill_manual(values = c("Before Filtering" = "#4682B4", "After Filtering" = "#8A2BE2")) # Change colors
	
	# Plot for metadata samples before and after filtering
	metadata_samples_plot <- ggplot(subset(metadata_dims, Measure == "Samples"), aes(x = Stage, y = Count, fill = Stage)) +
	  geom_bar(stat = "identity") +
	  geom_text(aes(label = Count), vjust = -0.5) +  # Add count labels on top of bars
	  theme_minimal() +
	  labs(title = "Metadata: Number of Samples (Rows)", y = "Count", x = "") +
	  theme(legend.position = "none") +
	  scale_fill_manual(values = c("Before Filtering" = "#32CD32", "After Filtering" = "#FFD700")) # Change colors
	
	# Plot for metadata features before and after filtering
	metadata_features_plot <- ggplot(subset(metadata_dims, Measure == "Features"), aes(x = Stage, y = Count, fill = Stage)) +
	  geom_bar(stat = "identity") +
	  geom_text(aes(label = Count), vjust = -0.5) +  # Add count labels on top of bars
	  theme_minimal() +
	  labs(title = "Metadata: Number of Features (Columns)", y = "Sample Attributes", x = "") +
	  theme(legend.position = "none") + 
	  scale_fill_manual(values = c("Before Filtering" =  "#32CD32", "After Filtering" = "#FFD700")) # Change colors
	
	# Arrange the four plots in a grid
	combined_plot <- grid.arrange(count_samples_plot, count_genes_plot,
				      metadata_samples_plot, metadata_features_plot,
				      ncol = 2)
	
	# Save the combined plot
	ggsave("Figures/2.2.1.Outlier_Samples/Figure2.2.1.6.Dimensions_Comparison.png", plot = combined_plot, width = 12, height = 10, bg = "white")
	
	
	
	# ========================================================
	# Step 6: Reassign Filtered Data
	# ========================================================
	# Reassign filtered versions back to counts_matrix and metadata
	counts_matrix <- counts_matrix_filtered
	metadata <- metadata_filtered
	
	# ======================================
	# Step 7. Print Summary Information about the Count Matrix and Metadata
	# ======================================
	cat("\n\n# Summary Information of the Count Matrix and Metadata (Post-Outlier Removal)\n")
	cat("=========================================================\n")
	
	# Count matrix summary
	cat("\nCount Matrix Summary:\n")
	cat("---------------------------------------------------------\n")
	cat("Number of Genes (Rows): ", nrow(counts_matrix), "\n")
	cat("Number of Samples (Columns): ", ncol(counts_matrix), "\n\n")
	
	
	# Metadata summary
	cat("\nMetadata Summary:\n")
	cat("---------------------------------------------------------\n")
	cat("Number of Samples (Rows): ", nrow(metadata), "\n")
	cat("Number of Features (Columns): ", ncol(metadata), "\n\n")
		
			cat("\n=========================================================\n")

**Figure 2.2.1.1.** Dot Plot of Library Sizes with Outlier Annotation. This dot plot visualizes the library sizes for all samples before and after outlier removal. The outlier samples are clearly labeled to show their positions, with the filtered samples marked in red and labeled. It illustrates the rationale behind removing outliers by highlighting samples that significantly deviate from the main distribution.

**Figure 2.2.1.2.** Comparative Bar Plot of Library Sizes Before and After Removing Outliers. This bar plot compares the total counts per sample before and after removing the identified outliers. It allows a direct comparison of individual library sizes, illustrating how outlier removal affects the data distribution.

**Figure 2.2.1.3.** Boxplot of Library Sizes Before and After Removing Outliers. This boxplot provides a summary of the library size distributions for the original and filtered datasets. By comparing the interquartile ranges and identifying extreme values, it highlights how removing outliers impacts the spread of the data.

**Figure 2.2.1.4.** Density Plot of Library Sizes Before and After Removing Outliers. The density plot shows the distribution of library sizes across all samples before and after outlier removal. The plot visualizes changes in the overall density, allowing us to observe shifts in data distribution and the effects of filtering extreme values.

**Figure 2.2.1.5.** Violin Plot of Library Sizes Before and After Removing Outliers. This violin plot combines elements of a boxplot and a density plot to illustrate both the distribution and summary statistics of the library sizes before and after removing outliers. It helps visualize the spread and density of values, showing the differences between the original and filtered groups.

Dimensions Comparison of Count Matrix and Metadata — **Figure 2.2.1.6.** Dimensions Comparison: Four bar plots showing changes in the count matrix and metadata dimensions before and after outlier removal. Only the number of samples changed, confirming that outliers were removed, while genes and features remained unchanged.

**Figure 2.2.1.7.** Summary Information of the Count Matrix and Metadata (Post-Outlier Removal), showing the updated number of genes, samples, and features after removing outlier samples.

2.2.2. Pre-filtering Low-Count Genes to Improve Data Quality

Low-count genes, which are expressed at insufficient levels, can contribute noise and increase the likelihood of false-positive results. Removing these lowly-expressed genes ensures that the analysis focuses on biologically meaningful signals and helps to reduce computational complexity and improve the robustness of downstream analyses.
In this code (Code Block 2.2.2), we first (Step 1) create a logical matrix to identify genes with sufficient expression (counts ≥ 10). Specifically, we retain only those genes that have counts of at least 10 in at least half of the samples (ncol(counts_matrix) / 2). This step ensures that the filtered dataset retains genes with meaningful levels of expression across a significant number of samples. After identifying the genes to keep, we filter the counts_matrix to retain only those genes, reassign the filtered matrix back to counts_matrix (Step 2), and then check the dimensions of the filtered dataset (Step 3) to confirm the number of genes and samples that remain (Figure 2.2.2).

Code Block 2.2.2: Pre-filtering Low-Count Genes to Improve Data Quality


			# ========================================================
			# Step 1: Pre-filter low-count genes
			# ========================================================
			
			# Step 1: Create a logical matrix to check if each value in counts_matrix is >= 10 (sufficient expression).
			# Step 2: Retain genes that have counts >= 10 in at least half of the samples (ncol(counts_matrix) / 2). 
			keep_genes <- rowSums(counts_matrix >= 10) >= (ncol(counts_matrix) / 2)
			
			# Filter the counts matrix to retain only the genes that meet the filtering criteria
			filtered_counts_matrix <- counts_matrix[keep_genes, ]
			
			# ========================================================
			# Step 2: Reassign filtered matrix to counts_matrix
			# ========================================================
			
			# Assign the filtered counts matrix back to counts_matrix for consistency in later steps
			counts_matrix <- filtered_counts_matrix
			
			# ======================================
			# Step 3. Check dimensions of the filtered matrix (Summary Information after Low-Count Filtering)
			# ======================================
			cat("\n\n# Summary Information of the Count Matrix and Metadata (Post-Low-Count Filtering)\n")
			cat("=========================================================\n")
			
			# Count matrix summary
			cat("\nCount Matrix Summary:\n")
			cat("---------------------------------------------------------\n")
			cat("Number of Genes (Rows): ", nrow(counts_matrix), "\n")
			cat("Number of Samples (Columns): ", ncol(counts_matrix), "\n\n")
			
			# Metadata summary
			cat("\nMetadata Summary:\n")
			cat("---------------------------------------------------------\n")
			cat("Number of Samples (Rows): ", nrow(metadata), "\n")
			cat("Number of Features (Columns): ", ncol(metadata), "\n\n")
			
			cat("\n=========================================================\n")

Summary Information After Removing Low-Count Genes — **Figure 2.2.2.**Summary Information of the Count Matrix and Metadata (Post-Low-Count Filtering), showing the number of genes, samples, and features after removing low-count genes.

3. Exploratory Data Analysis (EDA)- Before the "DESeqDataSet" object (dds) Creation

EDA is essential before performing DEA (Differential Expression Analysis) to explore the relationship between samples and to gain insights into the structure and distribution of the data. This step involves exploring both the metadata and the count matrix to ensure that the dataset is well-prepared for downstream analysis.

3.1. Metadata Exploration

3.1.1. Metadata Overview

We examine the metadata to identify key columns for analysis, such as sample ID, experimental conditions, tissue type, time points, and treatments. Selected columns are then viewed using View() function for a quick overview of the samples (Code Block 3.1.1.; Figure 3.1.1).

Code Block 3.1.1. Metadata Exploration


			
		# View all column names in the metadata to select relevant columns for analysis
		# This step helps us to explore the structure of the metadata and understand which columns are available.		
		colnames(metadata)
		
		# Select relevant columns for analysis based on the experiment
		selected_metadata <- metadata[, c("geo_accession", "title", "source_name_ch1", "characteristics_ch1.2", "characteristics_ch1.3", "timepoint:ch1", "therapy:ch1", "patient id:ch1", "platform_id", "data_processing")]
		
		# Open the selected metadata in a viewer for further exploration
		View(selected_metadata)

Cumulative Variance Plot — Figure 3.1.1. Metadata Exploration as Part of Exploratory Data Analysis (EDA) (After filtering outlier samples).

3.1.2. Check Distribution of Samples Across Key Metadata Variables (using table() and kable())

3.1.2.1. Summary of Sample Distribution by Patient Group and Timepoints

By exploring key metadata variables such as sample type (e.g., lesional vs non-lesional or treated vs untreated), timepoints (e.g., baseline vs post-treatment), and therapy groups (e.g., Cyclosporine and Dupilumab), we gain insights into the distribution of the samples across different conditions. This exploration also helps to ensure that the dataset is well-balanced and appropriately structured for the upcoming differential expression analysis.
For example, after filtering out sample outliers, the source_name_ch1 column in metadata shows the distribution of samples across various experimental conditions (Code Block 3.1.2; Figure 3.1.2.1 a), such as baseline samples (m0) for both lesional (AL) (m0_AL: 55 samples) and non-lesional (AN) (m0_AN: 54 samples) skin, which are key to our current analysis, as well as samples taken post-treatment with Cyclosporine (8 lesional & 6 non-lesional) and Dupilumab (19 lesional & 18 non-lesional samples). This data can be useful for exploring treatment effects in future experiments.
In addition, characteristics_ch1.2 (Code Block 3.1.2.1; Figure 3.1.2.1 b) provides information on the timepoints (e.g., baseline or post-treatment). This is crucial for analyzing how gene expression changes over time, particularly in response to therapies like Cyclosporine and Dupilumab. In this dataset, 51 samples were collected 3 months post-treatment, and 109 samples were collected at baseline.

Code Block 3.1.2.1: Sample Distribution Table Generation


# Table of samples by patient group (e.g., lesional vs non-lesional)
group_table <- table(metadata$source_name_ch1)
kable(group_table, format = "html", col.names = c("Patient Group", "Count"))

# Table of samples by timepoints (e.g., baseline vs post-treatment)
timepoint_table <- table(metadata$characteristics_ch1.2)
kable(timepoint_table, format = "html", col.names = c("Timepoint", "Count"))

Heatmap 2 — **Figure 3.1.2.2.** Sample Distribution by Timepoint and Patient ID (After Filtering Outlier Samples).

3.1.2.2. Summary of Sample Distribution Across Multiple Metadata Variables

Other metadata columns can also be explored (Code Block 3.1.2.2):
The characteristics_ch1.3 column (Figure 3.1.2.2 a) provides information about the treatment each patient received. It shows that 14 samples were treated with Cyclosporine and 37 samples with Dupilumab, while the remaining 109 samples represent baseline conditions (m0_AL and m0_AN combined). This is important for distinguishing between treatment groups when analyzing differential expression across conditions.
The patient id:ch1 column (Figure 3.1.2.2 b) helps track individual patients across different conditions. Some patients may have samples from multiple conditions (e.g., lesional, non-lesional, and post-treatment with both Cyclosporine and Dupilumab), allowing for longitudinal analyses within the same individuals. For researchers, this provides an opportunity to focus on patients who have undergone multiple treatments, reducing variability and enhancing the accuracy of comparative analyses.
In addition to these, other metadata columns can provide insights into the experimental setup and data processing. For example, platform_id shows all utilized platforms (GPL21290 in this experiment), extract_protocol_ch1 and extract_protocol_ch1.1 describe how the RNA was extracted and processed in the lab, and how the RNA libraries were prepared for sequencing. Furthermore, data_processing and its related fields provide detailed information on how the sequencing data was processed, from basecalling using Illumina software to trimming primers with cutadapt and mapping reads to the human reference genome with Tophat2 and Bowtie2. The use of Samtools to clean and sort the reads, and HTSeq to count reads per gene, ensures the data is accurately prepared for downstream analysis. The genome build used for mapping is GRCh38, and the raw count data is stored in a tab-delimited format.

Code Block 3.1.2.2. Assessing Sample Characteristics for Experimental Design



# ======================
# List of all relevant columns
# These columns are selected based on the characteristics of the experiment, 
# including metadata features like source names, patient IDs, and data processing details.
columns <- c("source_name_ch1",              # Source of the sample
             "characteristics_ch1.2",        # Experimental characteristic 1
             "characteristics_ch1.3",        # Experimental characteristic 2
             "therapy:ch1",                  # Therapy information
             "patient id:ch1",               # Patient ID
             "platform_id",                  # Platform used for data generation
             "extract_protocol_ch1",         # Sample extraction protocol 1
             "extract_protocol_ch1.1",       # Sample extraction protocol 2
             "data_processing",              # Data processing detail 1
             "data_processing.1",            # Data processing detail 2
             "data_processing.2",            # Data processing detail 3
             "data_processing.3",            # Data processing detail 4
             "data_processing.4",            # Data processing detail 5
             "data_processing.5",            # Data processing detail 6
             "data_processing.6"             # Data processing detail 7
)

# ==========
# Loop through each column and generate a summary table for each
# The loop checks if each column exists in the 'metadata' dataframe and prints a table of its unique values and frequencies.
for (col in columns) {
  if (col %in% colnames(metadata)) {  # Ensure the column exists in the metadata dataframe
    group_table <- table(metadata[[col]])  # Access and count the occurrences of each unique value in the column
    print(kable(group_table))              # Display the table using kable for a nicely formatted output
  } else {
    print(paste("Column", col, "does not exist in the metadata"))  # Handle cases where the column doesn't exist
  }
}

3.1.3. Check Distribution of Samples Across Key Metadata Variables (using bar charts)

Visualizing the distribution of samples with bar charts provides a more intuitive and visual approach compared to table-based summaries. It helps to quickly identify patterns, imbalances, or trends in the data, making it easier to understand the distribution across different experimental conditions or metadata variables.

3.1.3.1. Visualizing Sample Distribution for a Single Metadata Variable Using Bar Charts

This code (Code Block 3.1.3.1) focuses on visualizing a single variable, such as source_name_ch1, to display the sample distribution by experimental condition (Figure 3.1.3.1). However, the variable can be easily substituted with others if needed. (Step 1): The ggplot2 library is loaded for creating the plot. (Step 2): A bar plot is constructed using geom_bar() to create the bars and geom_text() to add count labels, where ..count.. refers to the computed count statistic. The plot is styled using theme_minimal() and color is applied with scale_fill_brewer(). (Step 3): Finally, the ggsave() function saves the plot to a specified location.

Code Block 3.1.3.1. Visualizing Sample Distribution by Experimental Condition


# ====================================
# Step 1: Load the necessary library
# ====================================
library(ggplot2)

# ============================================
# Step 2: Create a bar plot for sample counts
# ============================================
p <- ggplot(metadata, aes(x = `source_name_ch1`, fill = `source_name_ch1`)) +
  geom_bar() +  # Add bars representing sample counts
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 3) +  # Add count labels above bars
  theme_minimal() +  # Apply a minimal theme for better readability
  labs(title = "Sample Distribution by Experimental Condition", 
       x = "Source Name", 
       y = "Number of Samples") +
  theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust = 0.6, size = 8)) +  # Adjust axis label size and angle
  scale_fill_brewer(palette = "Set3")  # Apply a color palette for the bars

# The variable "source_name_ch1" in this code can be manually replaced by other variables 
# such as "characteristics_ch1.2", "characteristics_ch1.3", or "patient id:ch1" to visualize their distributions.

# ==================================
# Step 3: Save the plot to a file
# ==================================
ggsave("Figures/3.1.3.1.Bar_Chart_Metadata_Variables/3.1.3.1.Sample_Distribution_by_Condition.png", 
       plot = p, width = 10, height = 6, dpi = 300, bg = "white")

3.1.3.2. Automated Bar Chart Generation for Multiple Metadata Variables

This approach extends the visualization to multiple metadata variables (such as "source_name_ch1" (Figure 3.1.3.1), "characteristics_ch1.2" (Figure 3.1.3.2 a), "characteristics_ch1.3" (Figure 3.1.3.2 b), "patient id:ch1" (Figure 3.1.3.2 c)), allowing for the exploration of different aspects of the dataset. It automates the creation of bar charts, making it easy to visualize sample distributions across various metadata fields. (Step 1): The ggplot2 library is loaded for plotting and viridis library for using a color palette that supports many unique colors. (Step 2): A list of columns to visualize is defined. (Step 3): The code loops through each column, converting column names into symbols using sym() and sanitizing the names with gsub() for valid file naming. (Step 4): The plots are created using ggplot(), styled with theme_minimal(), and colored using scale_fill_viridis_d(option = "C"). We used scale_fill_viridis_d(option = "C") instead of scale_fill_brewer() because scale_fill_brewer() has a limited number of distinct colors to handle all categories, whereas scale_fill_viridis_d() provides more colors and ensures better contrast, making it suitable for larger datasets. Each plot is displayed and saved to a file using ggsave().

Code Block 3.1.3.2. Visualizing Sample Distribution for Multiple Metadata Variables Using Automated Bar Charts


# =========================== Step 1: Install and Load Necessary Packages =========================== #
# Load the required packages for plotting
library(ggplot2)  # For creating the bar plots
library(viridis)  # For using a color palette that supports many unique colors

# =========================== Step 2: Define Columns to Visualize =========================== #
# List of columns to visualize (these correspond to the metadata columns you want to plot)
columns_to_plot <- c("source_name_ch1", "characteristics_ch1.2", "characteristics_ch1.3", "patient id:ch1")

# =========================== Step 3: Loop Through Each Column and Create Plots =========================== #
# Loop through each column and create a bar plot for sample distribution
for (col in columns_to_plot) {
  
  # --------------------- Convert Column Name for Tidy Evaluation --------------------- #
  # Convert the column name (a string) to a symbol that ggplot can interpret within aes()
  col_sym <- sym(col)
  
  # --------------------- Sanitize Column Name for Saving Files --------------------- #
  # Replace ":" with "_" to create a valid filename (since ":" is not allowed in filenames)
  sanitized_col <- gsub("[:]", "_", col)
  
  # =========================== Step 4: Create Bar Plot =========================== #
  # Create the bar plot for the current column
  p <- ggplot(metadata, aes(x = !!col_sym, fill = !!col_sym)) +  # Fill colors based on the column
    geom_bar() +  # Create the bar chart
    geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 3) +  # Add text labels above bars to show the count
    theme_minimal() +  # Use a minimal theme for a clean appearance
    labs(title = paste("Sample Distribution by", col), x = col, y = "Number of Samples") +  # Add title and axis labels
    theme(axis.text.x = element_text(angle = 60, vjust = 0.5, hjust = 0.6, size = 8)) +  # Rotate x-axis labels and adjust their size
    scale_fill_viridis_d(option = "C")  # Use the "viridis" color palette to ensure enough distinct colors for each category
  
  # --------------------- Display the Plot --------------------- #
  # Print the plot so it displays in the R console or notebook
  print(p)
  
  # =========================== Step 5: Save the Plot as PNG =========================== #
  # Save the plot to a file in the specified directory with a relevant name
  ggsave(filename = paste0("Figures/3.1.3.2.Bar_Chart_Metadata_Variables/Figure3.1.3.2_", sanitized_col, "_Sample_Distribution.png"),
         plot = p, width = 10, height = 6, dpi = 300, bg = "white")
}

Sample Distribution by source_name_ch1 — **Figure 3.1.3.1.** Sample Distribution by **source_name_ch1**.

Sample Distribution by characteristics_ch1.2 — **Figure 3.1.3.2 a.** Sample Distribution by **characteristics_ch1.2**.

Sample Distribution by characteristics_ch1.3 — **Figure 3.1.3.2 b.** Sample Distribution by **characteristics_ch1.3**.

Sample Distribution by patient id:ch1 — **Figure 3.1.3.2 c.** Sample Distribution by **patient id:ch1**.

3.2. Count Matrix: Exploration and Visualization

3.2.1. Count Matrix Overview (Inspecting the Count Matrix using View())

We examine the count matrix, which contains raw gene expression counts for each sample. This step helps us understand the structure of the data, including whether it is normalized or not, and allows us to check the column names (samples) and row names (Ensembl Gene Identifiers) to ensure the data is formatted correctly for the next steps of differential expression analysis (Code Block 3.2.1; Figure 3.2.1).

Code Block 3.2.1. Viewing the count matrix to inspect the gene expression data for each sample.


	# View the first few rows of the count matrix
	# This allows us to inspect the raw gene expression data for each sample.
	View(counts_matrix)

3.2.2. Boxplot of Raw Counts

In RNA-seq analysis, boxplots of raw counts provide a crucial view of the distribution of gene expression across samples. By plotting the log-transformed counts, we can detect outliers, assess sequencing depth variations, and verify whether the data is ready for downstream analysis. Boxplots help identify potential issues, such as batch effects or technical biases, that may affect the accuracy of differential expression analysis (DEA) and the identification of biomarkers.

3.2.2.1. Boxplot of Raw Counts (Uncolored)

This simple boxplot Figure 3.2.2.1 (after the removal of low-count genes and outlier samples) provides an overview of the log2-transformed gene expression counts across all samples without any group-based color distinction. It allows us to quickly assess the range of counts for each sample and identify potential outliers. This uncolored version gives a broad view of data distribution and helps to highlight sample-level variations after the removal of low-count genes and outlier samples, ensuring the data is cleaned and ready for downstream analyses.

In this code (Code Block 3.2.2.1), we create a boxplot for the log2-transformed counts of the count matrix. (Step 1): We load the ggplot2 library to generate the plot. (Step 2): The count matrix is log2-transformed and converted into a long format for visualization. (Step 3): A basic boxplot is created with minimal styling using geom_boxplot() to show the count distribution across samples. (Step 4): The plot is saved to a specific location using the ggsave() function.

Code Block 3.2.2.1. Boxplot of Raw Counts (Uncolored)


# ============================
# Step 1: Load necessary library
# ============================
library(ggplot2)

# ============================
# Step 2: Prepare Data for Plotting
# ============================
# Convert counts_matrix to a data frame and log-transform it
df <- as.data.frame(log2(counts_matrix + 1))  # Log2 transform the counts matrix

# Add a sample column (rownames will become a new column for plotting)
df$Sample <- rownames(df)

# Reshape the data for ggplot2 (long format)
df_long <- reshape2::melt(df, id.vars = "Sample")

# ============================
# Step 3: Create Boxplot
# ============================
p <- ggplot(df_long, aes(x = variable, y = value)) +
  geom_boxplot() +  # Create boxplot
  theme_minimal() +  # Apply a minimal theme for better readability
  labs(title = "Log2 Raw Counts Across Samples", x = "Samples", y = "Log2 Counts") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 4),
        plot.background = element_rect(fill = "white", colour = NA),  # Set white background
        panel.background = element_rect(fill = "white", colour = NA))  # Set white background for the panel

# ============================
# Step 4: Save the Plot 3.2.2.1.Uncolored_Boxplot_Raw_Counts
# ============================
ggsave("Figures/3.2.2.Boxplot_Raw_Counts/3.2.2.1.Uncolored_Boxplot_Raw_Counts.png"", 
       plot = p, width = 13, height = 4)

3.2.2.2. Boxplot of Raw Counts Colored by Experimental Condition

This enhanced version of the boxplot introduces color coding based on experimental conditions. By grouping the samples according to metadata (e.g., different treatments), we can more easily visualize how the distribution of counts changes across different conditions. This plot is particularly useful for identifying condition-specific biases or batch effects that could affect downstream analyses, such as DEA for biomarker discovery. Figure 3.2.2.2 a & Figure 3.2.2.2 b show the distribution of raw counts before and after removing low-count genes and outlier samples, respectively, allowing us to assess how preprocessing affects the consistency of counts across experimental conditions.

In this code (Code Block 3.2.2.2), we create a boxplot colored by experimental condition. (Step 1): The metadata and count matrix are aligned based on sample IDs. (Step 2): The log2-transformed count matrix is prepared for plotting and merged with metadata information. (Step 3): A boxplot is created with colors representing the experimental conditions using geom_boxplot() and a color palette. (Step 4): The total sample count is added to the legend, and (Step 5): The final plot is saved to the desired directory using the ggsave() function.

Code Block 3.2.2.2. Boxplot of Raw Counts Colored by Experimental Condition


# ============================
# Load necessary libraries
# ============================
library(ggplot2)
library(reshape2)

# ============================
# Step 1: Align Sample IDs and Metadata
# ============================

# Set the row names of metadata to the sample names (stored in metadata$title)
rownames(metadata) <- metadata$title

# Extract the sample IDs from counts_matrix (columns represent samples)
sample_ids <- colnames(counts_matrix)

# Filter metadata to contain only rows that match the sample IDs in the count matrix
metadata_filtered <- metadata[metadata$title %in% sample_ids, ]

# Reorder metadata to match the order of sample IDs in the count matrix
metadata_filtered <- metadata_filtered[match(sample_ids, rownames(metadata_filtered)), ]


# ============================
# Step 2: Prepare Data for Plotting
# ============================

# Transpose the counts matrix to make samples the rows
counts_matrix_t <- t(counts_matrix)

# Convert the transposed counts matrix to a data frame and log-transform the counts
# Add 1 to avoid log(0), which is undefined
df <- as.data.frame(log2(counts_matrix_t + 1))

# Add a sample column (rownames will become a new column for plotting)
df$Sample <- rownames(df)

# Add the group information from metadata (e.g., source_name_ch1)
df$group <- metadata_filtered$source_name_ch1[match(df$Sample, rownames(metadata_filtered))]

# Ensure `group` is a factor for proper grouping in the plot
df$group <- as.factor(df$group)


# ============================
# Step 3: Reshape the Data for ggplot2 (Long Format)
# ============================

# Melt the data frame to create a long format for ggplot2
df_long <- melt(df, id.vars = c("Sample", "group"))


# ============================
# Step 4: Define Colors for Each Unique Group and Add Sample Counts to the Labels
# ============================

# Calculate the number of samples for each group
group_counts <- table(df$group)

# Calculate the total number of samples
total_samples <- sum(group_counts)

# Create the color palette for the unique groups
unique_groups <- unique(df$group)
color_palette <- c("#FF6347", "#4682B4", "#32CD32", "#FFD700", "#8A2BE2", "#00CED1")

# Ensure the color palette has enough colors for all groups
group_colors <- setNames(color_palette[1:length(unique_groups)], unique_groups)

# ============================
# Step 5: Create the Plot with Updated Group Labels
# ============================

p <- ggplot(df_long, aes(x = factor(Sample, levels = sample_ids), y = value, fill = group)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Log2 Raw Counts Across Samples by Source Name", x = "Samples", y = "Log2 Counts", 
       fill = paste0("group (Total n=", total_samples, ")")) +  # Add total number to the legend title
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 4), # Rotate and adjust x-axis labels
        plot.background = element_rect(fill = "white", colour = NA),  # Set white background
        panel.background = element_rect(fill = "white", colour = NA)) + # Set white background for the panel
  scale_fill_manual(values = group_colors, labels = paste0(levels(df$group), " (n=", group_counts[levels(df$group)], ")"))  # Use the custom color vector and add counts in the legend

# ============================
# Step 6: Save the Plot
# ============================

ggsave("Figures/3.2.2.Boxplot_Raw_Counts/3.2.2.2.Colored_Boxplot_Raw_Counts.png", 
       plot = p, width = 20, height = 6)

Boxplot Before Removing Low-Count Genes and Outliers — **Figure 3.2.2.2 a.** Boxplot of Log2 Raw Counts across Samples Colored by Experimental Condition, before removing low-count genes and outlier samples. This figure displays the distribution of raw counts across all experimental conditions, highlighting potential outliers and inconsistencies.

Boxplot After Removing Low-Count Genes and Outliers — **Figure 3.2.2.2 b.** Boxplot of Log2 Raw Counts across Samples Colored by Experimental Condition, after removing low-count genes and outlier samples. This figure demonstrates how filtering out low-count genes and outliers improves the consistency of count distributions across conditions.

Interpretation: In Figure 3.2.2.2 a, before removing low-count genes and outlier samples, the log2-transformed raw counts display a broad range of variability. There are several extreme outliers (black dots) far above the whiskers of the boxplots, indicating that some samples have unusually high counts. This suggests the presence of technical or biological variability that could skew downstream analysis. The boxplots themselves are wider and less consistent across different groups, reflecting the noise introduced by these low-count genes and outliers.
After filtering, in Figure 3.2.2.2 b, the data shows a much tighter distribution. While there are still outliers (black dots), these are less extreme, and they appear closer to the whiskers, representing smaller deviations from the group median. The overall range of counts has become more consistent across groups, with fewer extreme variations, allowing for a clearer comparison between the experimental conditions. This filtering process has effectively removed noise from the data, enhancing its quality and reliability for subsequent analysis by reducing the impact of low-count genes and extreme outliers.

3.2.3. Histogram of Total Gene Counts Per Sample

The histogram of total gene counts per sample is essential for assessing the overall sequencing depth and library size across samples in RNA-seq data. By visualizing the distribution of total gene counts (log-transformed), we can identify under-sequenced or over-sequenced samples that may need further normalization or removal during preprocessing. This visualization helps to ensure that sequencing depth is consistent across samples before differential expression analysis (DEA) and biomarker discovery. Figure 3.2.3 a & Figure 3.2.3 b display the distribution of gene counts before and after removing low-count genes and outlier samples, respectively, which helps to assess the impact of filtering on sequencing depth.

In this code (Code Block 3.2.3), we create a histogram that visualizes the distribution of log10-transformed gene counts per sample. (Step 1): We define the output file path and specify where to save the generated plot. (Step 2): The PNG device is opened to capture the plot. (Step 3): We calculate the log10-transformed gene counts and create a histogram using the hist() function. (Step 4): To enhance the visualization, the gene count values are added as labels above each bar. Finally, (Step 5): The PNG device is closed, and the plot is saved.

Code Block 3.2.3. Histogram of Total Gene Counts Per Sample


# ============================
# Step 1: Define Output File Path
# ============================

# Define the file path where you want to save the image
output_file <- "C:/Users/omidm/OneDrive/Desktop/RNA_Seq_Task1/3/3_Count_Matrix_Exploration/3.2.3.a.histogram_Of_total_GeneCounts_Per_Sample_Before_Removing_LowCountGenes_AND_Outlier_Samples.png"

# ============================
# Step 2: Open PNG Device
# ============================

# Open the PNG device to save the plot
png(filename = output_file, width = 1200, height = 1000)

# ============================
# Step 3: Create Histogram of Log-Transformed Gene Counts
# ============================

# Calculate log10 of total counts per gene
log_gene_counts <- log10(rowSums(counts_matrix))

# Create the histogram and get the counts for each bin
hist_data <- hist(log_gene_counts,  
                  breaks = 50,  # Specify number of bins in the histogram
                  main = "Gene Count Distribution",  # Title of the plot
                  xlab = "Log10 Counts",  # Label for the x-axis
                  col = "steelblue")  # Add color to the bars

# Update the y-axis limit to make room for labels
ylim <- c(0, max(hist_data$counts) * 1.1)
plot(hist_data, col = "steelblue", main = "Gene Count Distribution", xlab = "Log10 Counts", ylim = ylim) # cex: to change the font size of the labels on the bars

# ============================
# Step 4: Add Numbers Above Bars
# ============================

# Add the text labels above the histogram bars
text(hist_data$mids, hist_data$counts, labels = hist_data$counts, pos = 3, cex = 1, srt = 45)

# ============================
# Step 5: Close the PNG Device
# ============================

# Close the PNG device
dev.off()

Histogram Before Removing Low-Count Genes and Outliers — **Figure 3.2.3 a.** Histogram of Total Gene Counts Per Sample before removing low-count genes and outlier samples. This figure highlights the initial distribution of total gene counts per sample, which can help identify under-sequenced samples or sequencing depth variability.

Histogram After Removing Low-Count Genes and Outliers — **Figure 3.2.3 b.** Histogram of Total Gene Counts Per Sample after removing low-count genes and outlier samples. This figure shows the distribution after filtering, providing a cleaner dataset with more consistent sequencing depth across samples.

Interpretation: The histograms visualize the distribution of gene counts before and after filtering low-count genes and outlier samples. Before filtering (Figure 3.2.3 a), the data exhibits a large proportion of genes with low expression levels, as shown by the sharp peak near the lower end of the Log10 scale. This suggests a significant amount of noise in the dataset, which can obscure meaningful biological signals. After filtering (Figure 3.2.3 b), the distribution shifts, with the removal of low-count genes leading to a more balanced representation of moderately to highly expressed genes. The refined dataset, with fewer extreme values and noise, is now more suitable for downstream analyses, offering a clearer and more reliable view of gene expression patterns. This cleaning process enhances data quality, improving the robustness and accuracy of subsequent biological interpretations.

3.2.4. Determining the Optimal Gene Count for Heatmap (Cumulative Variance Explained Plot)

Heatmaps are primarily exploratory tools to visually assess patterns in the data. The goal is to select enough genes to reveal meaningful clusters without overcomplicating the visualization. If too many genes are included, the heatmap can become cluttered and less informative. On the other hand, if too few are used, important patterns may be missed. Typically, 100 to 500 of the most variable genes are used for exploratory heatmaps in RNA-seq studies. To ensure clarity, we often start by testing heatmaps with the top 100, 200, 300, and 500 most variable genes, assessing how well they cluster and separate groups.
However, a more data-driven approach can be taken by calculating the variance for all genes and selecting the number of genes that explains a specific proportion of the total variance. This method ensures that we capture the majority of variability in our data without including too many redundant or uninformative genes. By examining the curve of cumulative variance, we can choose a number of genes that explains a high proportion of variance, focusing on the point where the curve starts to flatten ((Code Block 3.2.4; Figure 3.2.4) ).

This code (Code Block 3.2.4) focuses on analyzing gene expression variability. (Step 1): Required libraries like matrixStats and ggplot2 are loaded for variance calculations and plotting. (Step 2): The gene expression data in counts_matrix is log2-transformed to reduce its dynamic range, making it easier to interpret. (Step 3): Using the rowVars() function, the variance of each gene across all samples is calculated, identifying highly variable genes for further analysis. (Step 4): The cumulative variance is computed by sorting these variances and calculating their cumulative sum, which helps assess how much variability is captured by the top genes. (Step 5): Finally, a line plot is generated (Figure 3.2.4) using ggplot2 to visualize the cumulative variance explained by an increasing number of genes, aiding in the selection of the optimal number of genes for further analysis.

Code Block 3.2.4. Cumulative Variance Plot to Determine Optimal Gene Count for Heatmap


# =========== Step 1: Load Required Libraries =============
library(matrixStats)  # Or DelayedMatrixStats for variance calculations
library(ggplot2)      # For enhanced plotting capabilities

# =========== Step 2: Log2 Transformation =============
# Convert counts_matrix to a matrix and apply log2 transformation
# Log transformation reduces the dynamic range of the data, making it easier to interpret
counts_matrix_log <- log2(as.matrix(counts_matrix) + 1) 

# =========== Step 3: Calculate Gene Variances =============
# Use the rowVars() function to calculate the variance for each gene across samples
# Gene variance is used to identify the most variable genes for further analysis
gene_variances <- rowVars(counts_matrix_log)

# =========== Step 4: Calculate Cumulative Variance =============
# Sort the variances in descending order and calculate the cumulative sum
cum_variance <- cumsum(sort(gene_variances, decreasing = TRUE)) / sum(gene_variances)

# =========== Step 5: Plot Cumulative Variance using ggplot2 for enhanced plotting =============
# Plot the cumulative proportion of variance explained as the number of genes increases
# This plot will help visualize how many top genes are needed to explain a high proportion of variance
ggplot(data.frame(Genes = 1:length(cum_variance), CumulativeVariance = cum_variance), aes(x = Genes, y = CumulativeVariance)) +
  geom_line() +
  labs(title = "Cumulative Variance Explained by Top Genes", 
       x = "Number of Genes", 
       y = "Cumulative Proportion of Variance Explained") +
  theme_minimal()

The plot above (Figure 3.2.4) shows that the curve, with a strict focus, starts to flatten around 300 genes, where additional genes contribute diminishing returns in terms of explaining the total variance. This suggests that selecting approximately 300 genes for the heatmap would balance between capturing significant variability and avoiding unnecessary noise.

3.2.5. Exploratory Heatmaps for Gene Expression Data

In this section, we present a series of heatmaps to explore and visualize the count data before proceeding with differential expression analysis (DEA) using DESeq2. These heatmaps allow us to understand "the structure and relationships within the data", helping identify patterns such as "clustering of samples", "variations across experimental conditions", or potential "batch effects".
We employ two types of heatmaps for this exploratory phase: "Unsupervised Gene Expression Heatmaps" and "Grouped Gene Expression Heatmaps".
1. Unsupervised Gene Expression Heatmaps: The data is clustered purely based on "gene expression patterns", without taking the predefined "experimental conditions" into account. This provides an unbiased "view of the data", revealing "natural groupings" and potential "outliers".
2. Grouped Gene Expression Heatmaps: These heatmaps focus on "specific groups" of interest, allowing for a "more targeted comparison between predefined conditions" (e.g., lesional vs. non-lesional samples, or treatment vs. baseline). This can provide insight into how specific groups of samples behave in relation to one another.
For all heatmaps, we focus on the "top 300 most variable genes", as this strikes a balance between capturing "sufficient variation" and maintaining "clarity in the visualizations".

Unsupervised Heatmaps

3.2.5.1. Unsupervised Heatmap for All Samples (160 samples)

As is shown in Figure 3.2.5.1, this heatmap clusters all 160 "samples" based on the expression of the top 300 most variable "genes". The counts matrix is "log-transformed", and both rows ("genes") and columns ("samples") are clustered.
In this code (Code Block 3.2.5.1), we begin by loading the "matrixStats library" to work with matrix data and the "pheatmap library" for generating heatmaps. First (Step 1), the "as.matrix() function" converts the counts_matrix into a matrix format if it was originally a data frame, ensuring compatibility with downstream operations. Then (Step 2), we apply a "log2 transformation" to the matrix to "reduce the dynamic range" and "make the data more suitable for visualization". Next (Step 3), the top 300 most variable genes are selected based on row variance calculated from the log-transformed counts. Finally (Step 4), a heatmap is generated (Figure 3.2.5.1) using pheatmap(), with options to cluster rows and columns, adjust axis label angles, and font size, and the output is saved as an image file.

Code Block 3.2.5.1. Unsupervised Heatmap for All Samples (160 samples)


# ============ Load Required Libraries ============
# Load the required libraries for matrix operations and heatmap generation
library(matrixStats)  # Ensure matrixStats is loaded for row variance calculation
library(pheatmap)     # For generating heatmaps

# ============ Step 1: Convert counts_matrix to Matrix ============
# Convert counts_matrix to a matrix if it is currently a data frame
counts_matrix <- as.matrix(counts_matrix)  # Ensures compatibility with downstream functions

# ============ Step 2: Log2 Transformation of Counts ============
# Apply a log2 transformation to reduce the dynamic range of the count data
counts_matrix_log <- log2(counts_matrix + 1)  # Adding 1 to avoid log(0) issues

# ============ Step 3: Identify Top Variable Genes ============
# Recalculate the most variable genes based on the log-transformed counts
# We select the top 300 most variable genes using row variance
top_var_genes_log <- head(order(rowVars(counts_matrix_log), decreasing = TRUE), 300)  

# ============ Step 4: Generate Heatmap and Save ============
# Create the heatmap for all 166 samples based on the top 300 most variable genes
# Adjust clustering and visual parameters such as label angles and font sizes
pheatmap(counts_matrix_log[top_var_genes_log, ], 
         cluster_rows = TRUE, cluster_cols = TRUE, 
         show_rownames = FALSE, show_colnames = TRUE,  
         angle_col = 90,  # Rotate x-axis labels for better readability
         fontsize_col = 4,  # Reduce font size for the column labels
         filename = "Figures/3.2.5.Exploratory_Heatmaps/3.2.5.1.Unsupervised_Heatmap_160_Samples.png")  # Save the heatmap as an image

Interpretation: This heatmap (Figure 3.2.5.1) provides an overview of how the "gene expression patterns" vary across all "samples." It shows "distinct clusters of samples" (at the top of the heatmap) based solely on their expression data (genes on the left of the heatmap), without considering any experimental conditions. This type of unsupervised clustering can reveal "natural groupings" within the data or highlight potential "outliers."

3.2.5.2. Unsupervised Heatmap for Baseline Lesional and Non-lesional Samples (109 samples)

Code Overview: This heatmap (Figure 3.2.5.2) focuses on a subset of 109 samples (lesional and non-lesional) to compare expression patterns between baseline lesional and non-lesional samples.
In this code (Code Block 3.2.5.2), we focus on comparing baseline lesional and non-lesional samples. Load Required Libraries: First, the matrixStats and pheatmap libraries are loaded for matrix operations and heatmap generation. (Step 1): The metadata is subset to include only the baseline lesional (AL_m0) and non-lesional (AN_m0) samples by filtering based on metadata$title. These subsets are combined into one dataset for comparison. (Step 2): The counts_matrix is filtered to include only the lesional and non-lesional samples based on the sample names in metadata$title. (Step 3): A log2 transformation is applied to the filtered count matrix using log2() to reduce the dynamic range. (Step 4): The top 300 most variable genes are identified based on variance using rowVars(). (Step 5): A heatmap is generated (Figure 3.2.5.2) using pheatmap, with rows and columns clustered to visualize the relationship between samples and gene expression. Adjustments are made for label angles and font size, and the heatmap is saved as an image file.

Code Block 3.2.5.2. Unsupervised Heatmap for Baseline Lesional and Non-lesional Samples (109 samples)


# ============ Load Required Libraries ============
# Load the required libraries for matrix operations and heatmap generation
library(matrixStats)  # Ensure matrixStats is loaded for row variance calculation
library(pheatmap)     # For generating heatmaps

# ============ Step 1: Subset Metadata for Lesional and Non-lesional Samples ============
# Subset the metadata to include only the baseline lesional (AL_m0) and non-lesional (AN_m0) samples
lesional_samples <- metadata[grepl("AL_m0", metadata$title), ]  # Filter metadata for lesional samples
non_lesional_samples <- metadata[grepl("AN_m0", metadata$title), ]  # Filter metadata for non-lesional samples

# Combine lesional and non-lesional samples into one dataset for comparison
selected_samples <- rbind(lesional_samples, non_lesional_samples)

# ============ Step 2: Subset the Count Matrix ============
# Subset the count matrix to include only the lesional and non-lesional samples
# Assuming the sample names (columns) in counts_matrix match those in metadata$title
selected_sample_names <- selected_samples$title
counts_matrix_filtered <- counts_matrix[, selected_sample_names]  # Filter the count matrix for the selected samples

# ============ Step 3: Log2 Transformation of Filtered Counts ============
# Apply a log2 transformation to the filtered count matrix to reduce the dynamic range
counts_matrix_log <- log2(as.matrix(counts_matrix_filtered) + 1)  # Adding 1 to avoid log(0) issues

# ============ Step 4: Identify Top Variable Genes ============
# Recalculate the most variable genes based on the log-transformed counts
# Select the top 300 most variable genes
top_var_genes_log <- head(order(rowVars(counts_matrix_log), decreasing = TRUE), 300)  

# ============ Step 5: Generate Heatmap and Save ============
# Create a heatmap for the lesional and non-lesional samples based on the top 300 most variable genes
# Adjust clustering and visual parameters such as label angles and font sizes
pheatmap(counts_matrix_log[top_var_genes_log, ], 
         cluster_rows = TRUE, cluster_cols = TRUE, 
         show_rownames = FALSE, show_colnames = TRUE,  
         angle_col = 90,  # Rotate x-axis labels for better readability
         fontsize_col = 4,  # Reduce font size for the column labels
         filename = "Figures/3.2.5.Exploratory_Heatmaps/3.2.5.2.Unsupervised_Heatmap_Baseline_Samples.png")  # Save the heatmap as an image

Interpretation: This unsupervised heatmap (see Figure 3.2.5.2 above) specifically examines how gene expression varies between baseline lesional and non-lesional samples. The clustering shows whether the lesional and non-lesional samples form distinct groups based on their gene expression, helping to assess the strength of separation between these two conditions.

Grouped Heatmaps

3.2.5.3. Grouped Heatmap for All Groups

This heatmap (Figure 3.2.5.3) compares six different groups (baseline lesional, baseline non-lesional, cyclosporine lesional, cyclosporine non-lesional, dupilumab lesional, and dupilumab non-lesional).
Code Overview: In this code, we begin by loading the matrixStats and pheatmap libraries to handle matrix operations and generate heatmaps. First (Step 1), we apply a log2 transformation to the counts_matrix to reduce the dynamic range of the data. Next (Step 2), the sample_groups are extracted from metadata$source_name_ch1, representing the original group labels. We then (Step 3) define a new_labels vector, which maps the original group names (e.g., 'm0_AL') to more descriptive labels (e.g., 'Baseline Lesional').
In Step 3.1, the selected_groups vector allows for different comparisons that provide insights into the data. For example, Lesional vs. Non-lesional comparisons (e.g., baseline (Figure 3.2.5.4), cyclosporine (Figure 3.2.5.5), dupilumab (Figure 3.2.5.6) explore how gene expression differs between affected and unaffected skin; Therapy Comparisons (e.g., "Baseline Non-Lesional" vs. "Baseline Lesional", "Cyclosporine Lesional", and "Dupilumab Lesional") investigate how lesional samples respond to different treatments (Figure 3.2.5.7); and Pre-treatment vs. Post-treatment comparisons assess cyclosporine or dupilumab effectiveness by examining gene expression changes before and after treatment (Figure 3.2.5.9 & Figure 3.2.5.10) . These comparisons are crucial for understanding both disease pathology and treatment effects.
Afterward (Step 4), the columns of counts_matrix_log are reordered based on the selected groups, and the column names are updated using the new descriptive labels. In (Step 5), then the top 300 most variable genes are selected based on the variance of the log-transformed counts. Finally (Step 6), a heatmap is generated with the updated column labels, and the output is saved as an image file.

Code Block 3.2.5.3. Grouped Heatmap for All Groups


# =========== Load Required Libraries ===========
library(matrixStats)
library(pheatmap)

# =========== Step 1: Apply Log2 Transformation ===========
counts_matrix_log <- log2(as.matrix(counts_matrix) + 1)

# =========== Step 2: Extract Sample Grouping Information ===========
sample_groups <- metadata$source_name_ch1

# =========== Step 3: Define Descriptive Labels ===========
new_labels <- c(
  "m0_AN" = "Baseline Non-lesional",
  "cyclosporine_m3_AN" = "Cyclosporine Non-lesional",
  "dupilumab_m3_AN" = "Dupilumab Non-lesional",
  "m0_AL" = "Baseline Lesional",
  "cyclosporine_m3_AL" = "Cyclosporine Lesional",
  "dupilumab_m3_AL" = "Dupilumab Lesional"
)
	
# =================================================================================
# =========== Step 3.1: Select Groups for Analysis ===========
selected_groups <- c("m0_AN", "cyclosporine_m3_AN", "dupilumab_m3_AN", "m0_AL", "cyclosporine_m3_AL", "dupilumab_m3_AL")

# ===== we can replace the above selected_groups with one of the bellow  ===== 
# 1. Heatmap for all groups:
# c("m0_AN", "cyclosporine_m3_AN", "dupilumab_m3_AN", "m0_AL", "cyclosporine_m3_AL", "dupilumab_m3_AL")

# 2. (Lesional vs. Non-lesional) Compare "Baseline Lesional" with "Baseline Non-lesional": 
# c("m0_AN", "m0_AL")

# 3. (Lesional vs. Non-lesional) Compare Cyclosporine Lesional with Cyclosporine Non-lesional:
# c("cyclosporine_m3_AN", "cyclosporine_m3_AL") 

# 4. (Lesional vs. Non-lesional) Compare Dupilumab Lesional with Dupilumab Non-lesional:
# c("dupilumab_m3_AN", "dupilumab_m3_AL") 

# 5. (Therapy Comparisons) Compare Baseline Lesional, Cyclosporine Lesional, and Dupilumab Lesional (compared to Baseline Non-Lesional).
# c("m0_AN", "m0_AL", "cyclosporine_m3_AL", "dupilumab_m3_AL")  

# 6. (Therapy Comparisons) Compare Dupilumab Lesional and Cyclosporine Lesional.
# c("dupilumab_m3_AL", "cyclosporine_m3_AL")

# 7. (Pre-treatment vs. Post-treatment) Compare Baseline Lesional & Cyclosporine Lesional.
# c("m0_AL", "cyclosporine_m3_AL")

# 8. (Pre-treatment vs. Post-treatment) Compare Baseline Lesional & Dupilumab Lesional.
# c("m0_AL", "dupilumab_m3_AL")
# =================================================================================

	
# =========== Step 4: Reorder Columns Based on Group Selection ===========
desired_order <- selected_groups
reordered_columns <- match(desired_order, sample_groups)
counts_matrix_log <- counts_matrix_log[, reordered_columns]
colnames(counts_matrix_log) <- new_labels[sample_groups[reordered_columns]]

# =========== Step 5: Calculate Top Variable Genes ===========
top_var_genes_log <- head(order(rowVars(counts_matrix_log), decreasing = TRUE), 300)

# =========== Step 6: Generate the Heatmap ===========
pheatmap(counts_matrix_log[top_var_genes_log, ], 
         cluster_rows = TRUE, cluster_cols = TRUE,
         show_rownames = FALSE, show_colnames = TRUE,
         angle = 45, vjust = 0.5, hjust = 0.6,
         fontsize_col = 8, 
         filename = "Figures/3.2.5.Exploratory_Heatmaps/3.2.5.3.Grouped_Heatmap_All_Groups.png")

Interpretation: As is shown in Figure 3.2.5.3, this heatmap allows a broad comparison between all six groups. It helps assess how different treatments (cyclosporine and dupilumab) affect lesional and non-lesional samples, as well as providing insights into baseline differences between lesional and non-lesional samples.

3.2.5.4. Grouped Heatmap to Compare "Baseline Lesional" with "Baseline Non-lesional" (Lesional vs. Non-lesional)

This heatmap (Code Block 3.2.5.4; Figure 3.2.5.4) focuses on comparing "baseline lesional" and "baseline non-lesional" samples.
This comparison between Baseline Lesional and Baseline Non-lesional is crucial for identifying differentially expressed genes (DEGs) that distinguish diseased (lesional) skin from healthy (non-lesional) skin.
Firstly, these DEGs can serve diagnostic purposes, helping to identify biomarkers that differentiate Atopic Dermatitis (AD) from unaffected skin.
Secondly, understanding the baseline differences in gene expression allows for monitoring how treatments, such as Cyclosporine or Dupilumab, bring lesional skin back towards the non-lesional state, thereby evaluating treatment effectiveness. By comparing post-treatment samples with non-lesional baseline samples, we can assess which therapy restores gene expression closer to the non-lesional (healthy) state, which is important for evaluating the therapeutic impact and determining which treatment works best for normalizing skin condition.

Code Block 3.2.5.4. Grouped Heatmap: Compare Baseline Lesional with Baseline Non-lesional


# ============================
selected_groups <- c("m0_AN", "m0_AL")
# ============================

3.2.5.5. Grouped Heatmap to Compare Cyclosporine Lesional vs. Cyclosporine Non-lesional (Lesional vs. Non-lesional)

This heatmap (Code Block 3.2.5.5. & Figure 3.2.5.5) compares lesional and non-lesional samples treated with cyclosporine.
By comparing Cyclosporine Lesional to Cyclosporine Non-lesional, we are assessing the Therapeutic Effectiveness and Monitoring Treatment Impact. The focus is on observing how the treatment affects gene expression in both lesional and non-lesional skin, looking for any therapeutic improvements or unintended side effects, which falls under Monitoring Treatment Impact.
Firstly, this comparison is crucial for determining whether Cyclosporine modulates gene expression in lesional skin towards a more normalized state (Therapeutic Effectiveness).
Secondly, it examines the effects on non-lesional skin, potentially revealing any unintended gene expression changes in unaffected tissue (Monitoring Treatment Impact).
Therefore, understanding these differences provides valuable insights into the treatment’s localized effectiveness and highlights ongoing gene expression disparities between treated lesional and treated non-lesional skin (Therapeutic Effectiveness and Monitoring).

Code Block 3.2.5.5. Grouped Heatmap: Compare Cyclosporine Lesional vs. Non-lesional


# ============================
selected_groups <- c("cyclosporine_m3_AN", "cyclosporine_m3_AL")
# ============================

3.2.5.6. Grouped Heatmap to Compare Dupilumab Lesional vs. Dupilumab Non-lesional (Lesional vs. Non-lesional)

This heatmap (Code Block 3.2.5.6 & Figure 3.2.5.6) reveals how dupilumab treatment impacts gene expression across lesional and non-lesional samples.
By comparing Dupilumab Lesional to Dupilumab Non-lesional, we are also assessing the Therapeutic Effectiveness and Monitoring Treatment Impact, similar to what was done for Cyclosporine. The focus remains on observing whether Dupilumab modulates gene expression in lesional skin towards a more normalized state and if there are any unintended changes in non-lesional skin, which highlights the treatment’s overall impact.
Firstly, this comparison is crucial for determining whether Dupilumab effectively restores the gene expression in lesional skin (Therapeutic Effectiveness).
Secondly, it reveals any unintended gene expression changes in the non-lesional skin, assessing the safety of the treatment on unaffected tissue (Monitoring Treatment Impact).
Therefore, understanding these gene expression differences offers insights into the localized therapeutic effects of Dupilumab, providing a similar perspective to the Cyclosporine comparison but focused on a different treatment (Therapeutic Effectiveness and Monitoring).

Code Block 3.2.5.6. Grouped Heatmap: Compare Dupilumab Lesional vs. Dupilumab Non-lesional


# ============================
selected_groups <- c("dupilumab_m3_AN", "dupilumab_m3_AL")
# ============================

Heatmap 3 — **Figure 3.2.5.5.** Grouped Gene Expression Heatmap comparing Cyclosporine Lesional and Cyclosporine Non-lesional samples. This focuses on how cyclosporine treatment impacts gene expression in affected vs. unaffected skin (After filtering out outlier samples & Low-Count Genes).

Heatmap 4 — **Figure 3.2.5.6.** Grouped Gene Expression Heatmap comparing Dupilumab Lesional and Dupilumab Non-lesional samples. This visualizes the effect of dupilumab treatment on lesional and non-lesional samples (After filtering out outlier samples & Low-Count Genes).

3.2.5.7. Grouped Heatmap to Compare Baseline Lesional, Cyclosporine Lesional, and Dupilumab Lesional (compared to Baseline Non-Lesional ) (Therapy Comparisons)

This heatmap (Code Block 3.2.5.7. & Figure 3.2.5.7.) compares lesional samples across three conditions: baseline, cyclosporine, and dupilumab treatment (in comparison with Baseline Non-Lesional). This comparison helps us to understand how lesional samples respond to different therapies and Which treatment, Cyclosporine or Dupilumab, is more effective in moving lesional skin towards a healthier or more normalized gene expression state, as compared to Baseline Non-Lesional?
The comparison between Baseline Lesional, Cyclosporine Lesional, Dupilumab Lesional, and Baseline Non-Lesional helps assess the Therapeutic Effectiveness of two treatments—Cyclosporine and Dupilumab—on lesional skin, using Baseline Non-Lesional (healthy) as a reference. This comparison provides insights into how each treatment modulates the gene expression profile in lesional skin after therapy.
Firstly, this comparison highlights the extent to which each treatment modulates lesional skin's gene expression, showing how well Dupilumab or Cyclosporine restores the lesional skin towards a healthier or more normalized state when compared to Baseline Non-Lesional (Therapeutic Effectiveness).
Secondly, by comparing both treatments against the Baseline Non-Lesional and Baseline Lesional states, this analysis reveals that Dupilumab appears more successful in restoring gene expression, as it shows high similarity to Baseline Non-Lesional and they cluster closely, suggesting a stronger positive effect (Therapeutic Effectiveness). In contrast, Cyclosporine shows less similarity to Baseline Non-Lesional and the least similarity to the other conditions, suggesting a less effective outcome.
Therefore, this comparison provides a clear assessment of the relative efficacy of Cyclosporine and Dupilumab, offering critical insights into which treatment is more effective in modulating the gene expression of lesional skin back towards a healthy state (Therapeutic Effectiveness).

Code Block 3.2.5.7. Grouped Heatmap: Compare Baseline Non-Lesional vs. Baseline Lesional, Cyclosporine Lesional, and Dupilumab Lesional


# ============================
selected_groups <- c("m0_AN", "m0_AL", "cyclosporine_m3_AL", "dupilumab_m3_AL") 
# ============================

3.2.5.8. Grouped Heatmap to Compare Dupilumab Lesional vs. Cyclosporine Lesional (Therapy Comparisons)

This heatmap (Code Block 3.2.5.8 & Figure 3.2.5.8) compares lesional samples treated with dupilumab versus cyclosporine. This comparison allows you to directly assess how gene expression varies between dupilumab and cyclosporine treatments in lesional samples.
The comparison between Dupilumab Lesional and Cyclosporine Lesional focuses on identifying how these two treatments differentially influence gene expression in lesional skin. This analysis allows us to observe the unique gene expression changes induced by each therapy and to identify treatment-specific patterns.
By examining the differentially expressed genes (DEGs) between these two treatment groups, we can uncover distinct molecular pathways targeted by Dupilumab and Cyclosporine. This comparison is essential for understanding how each treatment modulates the biological processes in lesional skin and for identifying any therapeutic pathways that may be more prominent in one treatment compared to the other.
Although this analysis does not directly reveal which treatment restores skin to a healthier state (since it lacks a non-lesional reference), it offers valuable insights into the mechanisms of action of both therapies and how they diverge in their effects on lesional skin.

Code Block 3.2.5.8. Grouped Heatmap: Compare Dupilumab Lesional vs. Cyclosporine Lesional


# ============================
selected_groups <- c("dupilumab_m3_AL", "cyclosporine_m3_AL")
# ============================

3.2.5.9. Grouped Heatmap to Compare Baseline Lesional vs. Cyclosporine Lesional (Pre-treatment vs. Post-treatment)

This heatmap (Code Block 3.2.5.9 & Figure 3.2.5.9) compares pre-treatment (baseline lesional) to post-treatment (cyclosporine lesional) samples.
The comparison between Baseline Lesional and Cyclosporine Lesional focuses on evaluating the effect of Cyclosporine on lesional skin by observing how the treatment alters gene expression post-treatment compared to the untreated baseline state. This comparison provides insights into the Therapeutic Effectiveness of Cyclosporine in modulating the gene expression profile of lesional skin.
Firstly, this comparison helps determine the extent to which Cyclosporine shifts the gene expression in lesional skin towards a modified or improved state after treatment, relative to its pre-treatment condition (Baseline Lesional).
Secondly, the comparison highlights the specific differentially expressed genes (DEGs) that are altered post-treatment, offering insights into the molecular changes and pathways modulated by Cyclosporine therapy.
Thus, this analysis provides valuable data on how Cyclosporine influences lesional skin, reflecting its impact on the underlying disease processes at the gene expression level and suggesting its therapeutic potential in treating the condition.

Code Block 3.2.5.9. Grouped Heatmap: Compare Pre-treatment vs. Post-treatment (Baseline Lesional & Cyclosporine Lesional)


# ============================
selected_groups <- c("m0_AL", "cyclosporine_m3_AL")
# ============================

3.2.5.10. Grouped Heatmap to Compare Baseline Lesional vs. dupilumab Lesional (Pre-treatment vs. Post-treatment)

This heatmap (Code Block 3.2.5.10 & Figure 3.2.5.10) compares pre-treatment (baseline lesional) to post-treatment (dupilumab lesional) samples.
The comparison between Baseline Lesional and Dupilumab Lesional allows us to evaluate the Therapeutic Effectiveness of Dupilumab by analyzing how the treatment alters the gene expression profile in lesional skin after therapy, compared to its pre-treatment condition.
Firstly, this comparison helps determine the extent to which Dupilumab modulates the gene expression of lesional skin, moving it towards an improved state, compared to its baseline (untreated) condition.
Secondly, by identifying differentially expressed genes (DEGs) between the pre-treatment and post-treatment states, we gain insights into the molecular pathways and biological processes that are influenced by Dupilumab, offering a deeper understanding of how the treatment affects lesional skin at the genetic level.
Thus, this comparison provides a critical assessment of how Dupilumab impacts lesional skin, shedding light on its potential to restore a more normal gene expression profile in affected tissue.

Code Block 3.2.5.10. Grouped Heatmap: Compare Pre-treatment vs. Post-treatment (Baseline Lesional & Dupilumab Lesional)


	# ============================
	selected_groups <- c("m0_AL", "dupilumab_m3_AL")
	# ============================

)from here onward, everything will be replaced)
1. Load Necessary Libraries

We begin by loading the essential R libraries needed for data manipulation, DEG analysis, and visualization.


# Load Required Libraries
library(DESeq2)
library(data.table)
library(umap)
library(ggplot2)
library(clusterProfiler)
library(org.Hs.eg.db)
library(VennDiagram)
library(knitr)
library(tinytex)

2. Load and Prepare Data

This step involves downloading the count data and gene annotations from GEO. We convert these into suitable formats for further analysis.


# Load counts table from GEO
urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path <- paste(urld, "acc=GSE157194", "file=GSE157194_raw_counts_GRCh38.p13_NCBI.tsv.gz", sep="&")
tbl <- as.matrix(data.table::fread(path, header=T, colClasses="integer"), rownames="GeneID")

# Load gene annotations 
apath <- paste(urld, "type=rnaseq_counts", "file=Human.GRCh38.p13.annot.tsv.gz", sep="&")
annot <- data.table::fread(apath, header=T, quote="", stringsAsFactors=F, data.table=F)
rownames(annot) <- annot$GeneID

3. Sample Selection and Group Assignment

This step filters the samples and assigns them to the Lesional or Non-Lesional groups based on predefined group labels.


# Sample selection
gsms <- paste0("01101001010101011010011001011001011001101001011010",
               "01100101011010101001011001010101010010101010101001",
               "01100100110XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
               "XXXXXXXXXXXXXXXX")
sml <- strsplit(gsms, split="")[[1]]

# Filter out excluded samples (marked as "X")
sel <- which(sml != "X")
sml <- sml[sel]
tbl <- tbl[ ,sel]

# Group membership for samples
gs <- factor(sml)
groups <- make.names(c("Lesional","Non-Lesional"))
levels(gs) <- groups
sample_info <- data.frame(Group = gs, row.names = colnames(tbl))

4. Data Pre-Filtering

To improve analysis accuracy, we filter out genes with low counts across samples.


# Pre-filter low count genes
keep <- rowSums(tbl >= 10) >= min(table(gs))
tbl <- tbl[keep, ]

5. Exploratory Data Analysis

Before proceeding with differential expression analysis, we perform a PCA to explore the variance in the dataset and ensure that the samples group as expected.


# PCA Plot
vsd <- vst(ds, blind=FALSE)
pcaData <- plotPCA(vsd, intgroup="Group", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))

ggplot(pcaData, aes(PC1, PC2, color=Group)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed() +
  theme_minimal()

6. Create DESeqDataSet and Run Differential Expression Analysis

We create a DESeqDataSet object, normalize the data, and run the differential expression analysis using DESeq2. This model tests for differences in gene expression between the Lesional and Non-Lesional groups.


# Create DESeqDataSet and run DESeq2
ds <- DESeqDataSetFromMatrix(countData=tbl, colData=sample_info, design= ~Group)
ds <- DESeq(ds, test="Wald", sfType="poscount")

7. Extract and Process Results

We extract the results from DESeq2, separating the differentially expressed genes (DEGs) into upregulated and downregulated categories, and prepare the top genes for further analysis.


# Extract results for top genes
r <- results(ds, contrast = c("Group", groups[1], groups[2]), alpha = 0.01, pAdjustMethod = "fdr")

# Separate downregulated and upregulated DEGs
downregulated <- r[r$log2FoldChange < 0 & r$padj < 0.01, ]
upregulated <- r[r$log2FoldChange > 0 & r$padj < 0.01, ]

# Take the top 10 genes for downregulated and upregulated
top_downregulated <- downregulated[order(downregulated$log2FoldChange)[1:10], ]
top_upregulated <- upregulated[order(upregulated$log2FoldChange, decreasing = TRUE)[1:10], ]

# Merge with gene annotations
top_downregulated <- merge(as.data.frame(top_downregulated), annot, by = 0, sort = FALSE)
top_upregulated <- merge(as.data.frame(top_upregulated), annot, by = 0, sort = FALSE)

# Round "padj" and "log2FoldChange" columns for readability
top_downregulated <- within(top_downregulated, {
  padj <- sprintf("%.4f", round(padj, 2))
  log2FoldChange <- sprintf("%.4f", round(log2FoldChange, 2))
})

top_upregulated <- within(top_upregulated, {
  padj <- sprintf("%.4f", round(padj, 2))
  log2FoldChange <- sprintf("%.4f", round(log2FoldChange, 2))
})

# Print the top_downregulated table using kable
kable(top_downregulated, format = "markdown")
kable(top_upregulated, format = "markdown")

8. General Data Visualization

We generate various plots to visualize the data and the results of the differential expression analysis.

8.1 Box Plot


# Box-and-whisker plot
dat <- log10(counts(ds, normalized = T) + 1)
ord <- order(gs)
palette(c("#1B9E77", "#7570B3"))
par(mar=c(7,4,2,1))
boxplot(dat[,ord], boxwex=0.6, notch=T, main="GSE157194", ylab="lg(norm.counts)", outline=F, las=2, col=gs[ord])
legend("topleft", groups, fill=palette(), bty="n")

8.2 Dispersion Estimates


# Dispersion estimates plot
plotDispEsts(ds, main="GSE157194 Dispersion Estimates")

8.3 UMAP Plot


# UMAP plot
dat <- dat[!duplicated(dat), ]
par(mar=c(3,3,2,6), xpd=TRUE, cex.main=1.5)
ump <- umap(t(dat), n_neighbors = 15, random_state = 123)
plot(ump$layout, main="UMAP plot, nbrs=15", xlab="", ylab="", col=gs, pch=20, cex=1.5)
legend("topright", inset=c(-0.15,0), legend=groups, pch=20, col=1:length(groups), title="Group", pt.cex=1.5)

8.4 Histogram of P-values


# Histogram of adjusted p-values
hist(r$padj, breaks=seq(0, 1, length = 21), col = "grey", border = "white", 
     xlab = "", ylab = "", main = "GSE157194 Frequencies of padj-values")

8.5 Volcano Plot


# Volcano plot
old.pal <- palette(c("#00BFFF", "#FF3030"))
par(mar=c(4,4,2,1), cex.main=1.5)
plot(r$log2FoldChange, -log10(r$padj), main=paste(groups[1], "vs", groups[2]),
     xlab="log2FC", ylab="-log10(Padj)", pch=20, cex=0.5)
with(subset(r, padj<0.01 & abs(log2FoldChange) >= 1.5),
     points(log2FoldChange, -log10(padj), pch=20, col=(sign(log2FoldChange) + 3)/2, cex=1))
legend("bottomleft", title=paste("Padj<", 0.01, sep=""), legend=c("down", "up"), pch=20,col=1:2)

8.6 MA Plot


# MA plot
par(mar=c(4,4,2,1), cex.main=1.5)
plot(log10(r$baseMean), r$log2FoldChange, main=paste(groups[1], "vs", groups[2]),
     xlab="log10(mean of normalized counts)", ylab="log2FoldChange", pch=20, cex=0.5)
with(subset(r, padj<0.01 & abs(log2FoldChange) >= 1.5),
     points(log10(baseMean), log2FoldChange, pch=20, col=(sign(log2FoldChange) + 3)/2, cex=1))
legend("bottomleft", title=paste("Padj<", 0.01, sep=""), legend=c("down", "up"), pch=20,col=1:2)
abline(h=0)
palette(old.pal)

9. Functional Annotation and Pathway Analysis

After identifying DEGs, we perform GO and KEGG pathway enrichment analyses to understand the biological significance of the results.

9.1 Gene Ontology (GO) Enrichment Analysis


# GO enrichment analysis for upregulated and downregulated genes
up_genes <- top_upregulated$GeneID
down_genes <- top_downregulated$GeneID

# GO enrichment for upregulated genes
ego_up <- enrichGO(gene = up_genes, OrgDb = org.Hs.eg.db, keyType = "ENSEMBL",
                   ont = "BP", pAdjustMethod = "fdr", qvalueCutoff = 0.05)

# GO enrichment for downregulated genes
ego_down <- enrichGO(gene = down_genes, OrgDb = org.Hs.eg.db, keyType = "ENSEMBL",
                     ont = "BP", pAdjustMethod = "fdr", qvalueCutoff = 0.05)

# Plotting the top GO terms
barplot(ego_up, showCategory=10, title="Top GO terms for Upregulated Genes", drop=TRUE)
barplot(ego_down, showCategory=10, title="Top GO terms for Downregulated Genes", drop=TRUE)

9.2 KEGG Pathway Enrichment Analysis


# KEGG pathway enrichment analysis for upregulated and downregulated genes
kegg_up <- enrichKEGG(gene = up_genes, organism = "hsa", pAdjustMethod = "fdr", qvalueCutoff = 0.05)
kegg_down <- enrichKEGG(gene = down_genes, organism = "hsa", pAdjustMethod = "fdr", qvalueCutoff = 0.05)

# Plotting the top KEGG pathways
barplot(kegg_up, showCategory=10, title="Top KEGG Pathways for Upregulated Genes")
barplot(kegg_down, showCategory=10, title="Top KEGG Pathways for Downregulated Genes")

10. Venn Diagram for Gene Sets

We visualize the overlap between different gene sets (total, upregulated, downregulated) using a Venn diagram.


# Venn diagram for gene sets
total_genes <- rownames(tbl)
upregulated_genes <- rownames(top_upregulated)
downregulated_genes <- rownames(top_downregulated)

venn_data <- list(
  Total = total_genes,
  Upregulated = upregulated_genes,
  Downregulated = downregulated_genes
)

venn_plot <- venn.diagram(
  x = venn_data,
  category.names = c("Total", "Upregulated", "Downregulated"),
  filename = NULL,
  margin = 0,
  cex.prop = 0.8,
  offset.prop = 0.2
)

grid.draw(venn_plot)

11. Extract Top 20 Genes and Print Results

Finally, we extract the top 20 differentially expressed genes from both categories and print the results.


# Extract results for top genes table (rounded)
r <- results(ds, contrast = c("Group", groups[1], groups[2]), alpha = 0.01, pAdjustMethod = "fdr")

# Separate downregulated and upregulated DEGs
downregulated <- r[r$log2FoldChange < 0 & r$padj < 0.01, ]
upregulated <- r[r$log2FoldChange > 0 & r$padj < 0.01, ]

# Take the top 20 genes for downregulated and upregulated
top_downregulated <- downregulated[order(downregulated$log2FoldChange)[1:20], ]
top_upregulated <- upregulated[order(upregulated$log2FoldChange, decreasing = TRUE)[1:20], ]

# Merge with gene annotations
top_downregulated <- merge(as.data.frame(top_downregulated), annot, by = 0, sort = FALSE)
top_upregulated <- merge(as.data.frame(top_upregulated), annot, by = 0, sort = FALSE)

# Round "padj" and "log2FoldChange" columns
top_downregulated <- within(top_downregulated, {
  padj <- sprintf("%.4f", round(padj, 4))
  log2FoldChange <- sprintf("%.4f", round(log2FoldChange, 4))
 })

top_upregulated <- within(top_upregulated, {
  padj <- sprintf("%.4f", round(padj, 4))
  log2FoldChange <- sprintf("%.4f", round(log2FoldChange, 4))
})

# Print the top_downregulated table using kable
kable(top_downregulated, format = "markdown") 
kable(top_upregulated, format = "markdown")

Conclusion

This complete pipeline guides you through performing differential gene expression analysis on the GSE157194 dataset using DESeq2. The analysis identifies significant gene expression changes between lesional and non-lesional skin samples from Atopic Dermatitis patients, and includes essential data visualizations and enrichment analyses to interpret the results biologically.

/* ============================================================================================================= */ /* ====================== To highlight the figure or caption when a user clicks on the link: ================ */ /* ====================== Step 2: Add JavaScript to apply the class ================ */ /* ============================================================================================================= */

Differential Gene Expression Analysis in Atopic Dermatitis Using DESeq2

Introduction to Differential Gene Expression in Atopic Dermatitis (AD) Patients

Dataset Overview:

Objective of This Analysis:

Comprehensive Analysis Approach:

In addition to the DEG analysis, we also perform pathway and gene enrichment analysis, including:

1. Load Necessary Libraries

2. Load and Prepare Data

2.1. Retrieving Data from GEO

2.2. Preparing Data

2.2.1. Identify and Remove Outlier Samples

2.2.2. Pre-filtering Low-Count Genes to Improve Data Quality

3. Exploratory Data Analysis (EDA)- Before the "DESeqDataSet" object (dds) Creation

3.1. Metadata Exploration

3.1.1. Metadata Overview

3.1.2. Check Distribution of Samples Across Key Metadata Variables (using table() and kable())

3.1.2.1. Summary of Sample Distribution by Patient Group and Timepoints

3.1.2.2. Summary of Sample Distribution Across Multiple Metadata Variables

3.1.3. Check Distribution of Samples Across Key Metadata Variables (using bar charts)

3.1.3.1. Visualizing Sample Distribution for a Single Metadata Variable Using Bar Charts

3.1.3.2. Automated Bar Chart Generation for Multiple Metadata Variables

3.2. Count Matrix: Exploration and Visualization

3.2.1. Count Matrix Overview (Inspecting the Count Matrix using View())

3.2.2. Boxplot of Raw Counts

3.2.2.1. Boxplot of Raw Counts (Uncolored)

3.2.2.2. Boxplot of Raw Counts Colored by Experimental Condition

3.2.3. Histogram of Total Gene Counts Per Sample

3.2.4. Determining the Optimal Gene Count for Heatmap (Cumulative Variance Explained Plot)

3.2.5. Exploratory Heatmaps for Gene Expression Data

Unsupervised Heatmaps

3.2.5.1. Unsupervised Heatmap for All Samples (160 samples)

3.2.5.2. Unsupervised Heatmap for Baseline Lesional and Non-lesional Samples (109 samples)

Grouped Heatmaps

3.2.5.3. Grouped Heatmap for All Groups

3.2.5.4. Grouped Heatmap to Compare "Baseline Lesional" with "Baseline Non-lesional" (Lesional vs. Non-lesional)

3.2.5.5. Grouped Heatmap to Compare Cyclosporine Lesional vs. Cyclosporine Non-lesional (Lesional vs. Non-lesional)

3.2.5.6. Grouped Heatmap to Compare Dupilumab Lesional vs. Dupilumab Non-lesional (Lesional vs. Non-lesional)

3.2.5.7. Grouped Heatmap to Compare Baseline Lesional, Cyclosporine Lesional, and Dupilumab Lesional (compared to Baseline Non-Lesional ) (Therapy Comparisons)

3.2.5.8. Grouped Heatmap to Compare Dupilumab Lesional vs. Cyclosporine Lesional (Therapy Comparisons)

3.2.5.9. Grouped Heatmap to Compare Baseline Lesional vs. Cyclosporine Lesional (Pre-treatment vs. Post-treatment)

3.2.5.10. Grouped Heatmap to Compare Baseline Lesional vs. dupilumab Lesional (Pre-treatment vs. Post-treatment)

)from here onward, everything will be replaced) 1. Load Necessary Libraries

2. Load and Prepare Data

3. Sample Selection and Group Assignment

4. Data Pre-Filtering

5. Exploratory Data Analysis

6. Create DESeqDataSet and Run Differential Expression Analysis

7. Extract and Process Results

8. General Data Visualization

8.1 Box Plot

8.2 Dispersion Estimates

8.3 UMAP Plot

8.4 Histogram of P-values

8.5 Volcano Plot

8.6 MA Plot

9. Functional Annotation and Pathway Analysis

9.1 Gene Ontology (GO) Enrichment Analysis

9.2 KEGG Pathway Enrichment Analysis

10. Venn Diagram for Gene Sets

11. Extract Top 20 Genes and Print Results

Conclusion

Differential Gene Expression Analysis
in Atopic Dermatitis
Using DESeq2

)from here onward, everything will be replaced)
1. Load Necessary Libraries