A Bioinformatics Pipeline for Biologically Relevant HVG Selection in scRNA-seq Data for Training Predictive DL Models

A flexible and biologically informed pipeline that refines highly variable genes from single-cell RNA-seq data to optimize deep learning model training.

1. Introduction

2. Pipeline Overview

Figure 2.1. Schematic diagram of the bioinformatics pipeline for selecting biologically relevant HVGs in scRNA-seq data to train predictive DL models. The infographic illustrates a streamlined pipeline that extracts highly variable, biologically relevant genes from single-cell RNA-seq data for machine learning (ML) and deep learning (DL) model training. The pipeline supports biologically informed feature selection, even for users with minimal coding or biology expertise.
  • Step 1: Extract HVGs: Users upload standard scRNA-seq files (10X format) and use metadata to select the cell types of interest, for example only immune checkpoint blockade (ICB)-exposed cells, or responders and non-responders. This tailored inclusion keeps the pipeline flexible across research questions. Multiple HVG sets are generated (e.g., 100 to 9,000 genes) so that the optimal set size can be evaluated later.
  • Step 2: Evaluate Biological Relevance of HVG Lists: Using the biomaRt and msigdbr packages in R, users compile a custom list of biologically relevant genes based on selected GO terms and immune-related pathways. This list serves as a reference for biologically assessing the generated HVG sets. The pipeline visualizes how the overlap changes as the HVG count increases and helps pinpoint an HVG threshold that maximizes biological relevance while minimizing redundancy.
  • Step 3: Final Gene List Extraction: The optimal HVG set (e.g., 5,000 genes) is intersected with the reference gene list to produce the final output: genes that are both highly variable and biologically meaningful, and therefore well suited to predictive modeling.
  • Key Innovations:
    ✔ Integrates statistical variability with biological meaning
    ✔ Enables metadata-based filtering (e.g., ICB-exposed cells only)
    ✔ Supports user-defined GO terms and pathways
    ✔ Identifies an optimal HVG cutoff
    ✔ Generates gene lists for DL/ML model training
    ✔ Designed for users with minimal coding or biology background

3. Step-by-Step Breakdown

3.1 Step One: Generate HVG Sets

  • Figure 3.1. Step One: Generate HVG Sets. Preprocessing and HVG identification workflow using the Seurat package.
  • The figure gives a stepwise outline of the initial preprocessing phase for single-cell RNA sequencing (scRNA-seq) data using the Seurat package in R. This step forms the foundation of the biologically guided HVG selection pipeline.
  • Starting from the standard 10X format (matrix, barcodes, genes), the workflow involves integrating metadata to annotate cells by phenotypes of interest (e.g., treated, untreated, responders, non-responders, malignant, non-malignant). These annotations guide the selection of relevant cell types to be included in the analysis.
  • Following cell selection, quality control (QC) filtering is applied during the creation of the Seurat object, including steps such as filtering out low-quality cells based on mitochondrial gene content or gene/cell count thresholds.
  • The next step includes data normalization and the identification of highly variable genes (HVGs). Multiple HVG sets are generated by varying the number of top variable genes selected (e.g., 100, 200, … up to 9,000), which are saved for downstream biological relevance evaluation and optimal threshold selection.
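
The steps above can be illustrated with a small, self-contained sketch. It is written in plain Python rather than R and does not use Seurat; it only mirrors the logic described (metadata-based cell selection, QC, normalization, nested HVG sets). Everything in it is an assumption made for illustration: the toy counts, the phenotype labels, the QC thresholds (`max_mito_fraction`, `min_genes`), and the use of plain variance on log-normalized values as the variability score, which is a stand-in and not Seurat's variable-feature method.

```python
import math
from statistics import pvariance

# Toy raw counts (cell -> gene -> count); a real run would read a 10X matrix.
counts = {
    "c1": {"CD8A": 9, "GZMB": 12, "ACTB": 50, "MT-CO1": 4},
    "c2": {"CD8A": 0, "GZMB": 1, "ACTB": 48, "MT-CO1": 5},
    "c3": {"CD8A": 7, "GZMB": 0, "ACTB": 52, "MT-CO1": 3},
    "c4": {"CD8A": 1, "GZMB": 1, "ACTB": 5, "MT-CO1": 60},  # mostly mitochondrial
    "c5": {"CD8A": 3, "GZMB": 8, "ACTB": 47, "MT-CO1": 2},  # not a phenotype of interest
}
# Toy metadata: one hypothetical phenotype label per cell.
phenotype = {"c1": "responder", "c2": "non-responder", "c3": "responder",
             "c4": "responder", "c5": "untreated"}


def select_cells(phenotype, keep):
    """Keep cells whose metadata label is in `keep`."""
    return [cell for cell, label in phenotype.items() if label in keep]


def passes_qc(cell_counts, max_mito_fraction=0.2, min_genes=3):
    """Assumed QC rule: cap the mitochondrial fraction, require enough detected genes."""
    total = sum(cell_counts.values())
    if total == 0:
        return False
    mito = sum(n for gene, n in cell_counts.items() if gene.startswith("MT-"))
    detected = sum(1 for n in cell_counts.values() if n > 0)
    return mito / total <= max_mito_fraction and detected >= min_genes


def log_normalize(cell_counts, scale=10_000):
    """Scale each cell to a common total, then apply log(1 + x)."""
    total = sum(cell_counts.values())
    return {gene: math.log1p(n / total * scale) for gene, n in cell_counts.items()}


def rank_by_variance(normalized):
    """Rank genes by variance across cells, most variable first (stand-in score)."""
    genes = sorted({g for cell in normalized.values() for g in cell})
    variance = {g: pvariance([cell.get(g, 0.0) for cell in normalized.values()])
                for g in genes}
    return sorted(genes, key=lambda g: variance[g], reverse=True)


def hvg_sets(ranked, sizes):
    """Build one HVG list per requested size by taking the top-n ranked genes."""
    return {n: ranked[:n] for n in sizes}


kept = [c for c in select_cells(phenotype, {"responder", "non-responder"})
        if passes_qc(counts[c])]
normalized = {c: log_normalize(counts[c]) for c in kept}
ranked = rank_by_variance(normalized)
sets = hvg_sets(ranked, sizes=[1, 2, 4])
print(kept)
print(sets[2])
```

Because every set is a prefix of one ranking, the sets are nested, so the overlap with a reference list can only stay flat or grow as the set size increases. That property is what makes the later threshold search meaningful.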

# Code will be added here soon

3.2 Step Two: Assess Biological Relevance of HVG Sets

  • Figure 3.2. Step Two: This diagram outlines how the biological relevance of each HVG set is assessed by benchmarking it against a curated reference gene list. The workflow starts by compiling reference gene sets with the biomaRt and msigdbr packages in R, based on Gene Ontology terms and immunotherapy-related pathways derived from the literature. Each HVG set is then evaluated for overlap with this reference list, and the number of overlapping genes is quantified and plotted. The plot helps identify an optimal threshold, typically an inflection point, beyond which increasing the number of HVGs no longer yields substantial gains in biological interpretability.
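
As a rough illustration of the overlap analysis described above, the sketch below counts reference genes in each HVG set and then picks a threshold. It is plain Python, not the pipeline's R code. The ranked gene list and the reference set are toy stand-ins (in the pipeline the reference would come from biomaRt and msigdbr), and the threshold rule is an assumed heuristic: stop at the first set size after which the gain in overlapping genes per added HVG falls below `min_gain_rate`, a hypothetical parameter.

```python
# Toy ranking of genes from most to least variable (illustrative only).
ranked = ["CD8A", "GZMB", "PDCD1", "ACTB", "IFNG", "GAPDH", "RPLP0", "MALAT1"]
# Toy reference list standing in for the curated GO/pathway gene list.
reference = {"CD8A", "GZMB", "PDCD1", "IFNG", "CTLA4"}

# Nested HVG sets of increasing size, taken from the top of the ranking.
hvg_sets = {n: ranked[:n] for n in (2, 4, 6, 8)}


def overlap_counts(hvg_sets, reference):
    """Number of reference genes found in each HVG set, keyed by set size."""
    return {n: len(set(genes) & reference) for n, genes in sorted(hvg_sets.items())}


def pick_threshold(counts, min_gain_rate):
    """Return the first size after which overlap grows slower than min_gain_rate.

    The rate is (new overlapping genes) / (added HVGs) between consecutive sizes.
    If the rate never drops below the cutoff, the largest size is returned.
    """
    sizes = sorted(counts)
    for small, large in zip(sizes, sizes[1:]):
        rate = (counts[large] - counts[small]) / (large - small)
        if rate < min_gain_rate:
            return small
    return sizes[-1]


counts = overlap_counts(hvg_sets, reference)
best_size = pick_threshold(counts, min_gain_rate=0.25)
for size, n_overlap in counts.items():
    print(size, n_overlap)
print("chosen HVG set size:", best_size)
```

The cutoff value changes the answer, so in practice it is worth plotting the overlap curve and checking that the chosen size sits where the curve visibly flattens.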

# Code will be added here soon

3.3 Step Three: Extract Biologically Relevant HVGs for Model Training

  • Figure 3.3. Step Three: This diagram illustrates the final step of the biologically guided HVG selection pipeline. The HVG set identified as optimal in the overlap analysis is intersected with the curated list of immune-related or otherwise biologically meaningful genes. This step narrows the HVG set down to genes that are both highly variable and functionally relevant. The resulting gene list serves as the final input for downstream machine learning or deep learning model training.
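
The intersection step can be sketched as follows, again in plain Python rather than the pipeline's R code. The gene names, the expression values, and the helper names `final_gene_list` and `to_feature_matrix` are hypothetical and exist only for this example. The second helper shows one possible way to turn the final gene list into a fixed-width feature matrix for a model; the source does not specify the model input format.

```python
# Toy "optimal" HVG set, ordered from most to least variable (illustrative only).
optimal_hvgs = ["CD8A", "GZMB", "PDCD1", "ACTB", "IFNG", "GAPDH"]
# Toy reference list standing in for the curated GO/pathway gene list.
reference = {"CD8A", "GZMB", "PDCD1", "IFNG", "CTLA4"}

# Toy normalized expression (cell -> gene -> value); missing genes mean zero.
expression = {
    "c1": {"CD8A": 7.1, "GZMB": 7.4, "ACTB": 8.8},
    "c2": {"GZMB": 5.2, "IFNG": 3.0},
}


def final_gene_list(hvg_ranked, reference):
    """Keep HVGs that are also in the reference list, preserving HVG rank order."""
    return [gene for gene in hvg_ranked if gene in reference]


def to_feature_matrix(expression, cells, genes):
    """Build a cells x genes matrix with a fixed column order for model training."""
    return [[expression[cell].get(gene, 0.0) for gene in genes] for cell in cells]


genes = final_gene_list(optimal_hvgs, reference)
matrix = to_feature_matrix(expression, cells=["c1", "c2"], genes=genes)
print(genes)
for row in matrix:
    print(row)
```

Filtering the ranked list, instead of intersecting two unordered sets, keeps the column order deterministic, which matters when the same feature layout must be reused at training and prediction time.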

# Code will be added here soon

4. Final Output and Applicability