Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing

Georgia State University

*Indicates Equal Contribution

A novel deep learning pipeline for learning discrete representations of viral genomic sequences from wastewater surveillance data using Vector-Quantized Variational Autoencoders (VQ-VAE).

Abstract

Wastewater-based epidemiology provides a non-invasive, community-wide approach to viral surveillance, but analyzing highly fragmented and noisy genomic sequences remains challenging. Traditional reference-based pipelines require known genomes and struggle with co-circulating variants and sequencing artifacts. We propose a reference-free approach using Vector-Quantized Variational Autoencoders (VQ-VAE) to learn discrete representations of viral genomic sequences from wastewater surveillance data.

Our method employs k-mer tokenization (k=6) to convert raw DNA sequences into fixed-length representations, which are then encoded into a discrete latent space using a learned codebook of 512 entries. We train the model on ~100,000 SARS-CoV-2 wastewater sequencing reads, achieving 99.52% mean token-level reconstruction accuracy and 56.33% exact sequence match rate with only 19.73% codebook utilization, indicating effective compression into meaningful discrete representations.

We extend the base VQ-VAE with two complementary approaches: (1) Masked VQ-VAE, which masks 20% of input tokens during training to improve robustness to missing data, achieving 95% accuracy on masked tokens and 12% improvement on corrupted sequences; and (2) Contrastive VQ-VAE, which fine-tunes the encoder using InfoNCE loss with augmented views, improving clustering quality by 35% (Silhouette score: 0.31 → 0.42). Our discrete representation learning framework offers a scalable, reference-free approach to genomic sequence analysis suitable for real-time wastewater surveillance and variant detection.

Introduction


Wastewater Surveillance: Opportunities and Challenges

Wastewater-based epidemiology has emerged as a powerful tool for monitoring community-level viral prevalence, offering early warning signals for disease outbreaks and variant emergence. Unlike clinical testing, wastewater surveillance provides a non-invasive, population-wide snapshot that captures both symptomatic and asymptomatic infections. However, analyzing genomic sequences from wastewater presents unique computational challenges:

  • Fragmentation: Sequencing reads are highly fragmented (100-300 bp), making de novo assembly difficult
  • Noise: High sequencing noise and quality variation due to degraded viral RNA in wastewater
  • Low concentration: Viral RNA constitutes a small fraction of total genetic material
  • Complexity: Multiple co-circulating viral strains and variants are present simultaneously
  • Reference dependence: Traditional bioinformatics pipelines require known reference genomes

Our Approach: Reference-Free Discrete Representation Learning

We propose a reference-free deep learning approach based on Vector-Quantized Variational Autoencoders (VQ-VAE) to learn discrete, compressed representations of viral genomic sequences. Our pipeline consists of three key components:

  1. K-mer Tokenization: Convert raw DNA sequences into overlapping k-mer tokens (k=6) drawn from a vocabulary of 4,097 tokens (all 4^6 = 4,096 possible 6-mers plus a PAD token) that capture local sequence patterns
  2. Discrete Representation Learning: Train a VQ-VAE with a codebook of 512 discrete latent codes (dimension 64) to compress sequences while preserving reconstruction quality
  3. Enhanced Learning Objectives: Extend the base model with:
    • Masked VQ-VAE: BERT-style masked token prediction for robustness
    • Contrastive VQ-VAE: SimCLR-style contrastive learning for better clustering

This approach enables unsupervised discovery of sequence patterns, variant clustering, and robust sequence reconstruction without requiring reference genomes, making it particularly suitable for novel variant detection and real-time surveillance applications.

Dataset & Preprocessing


Data Source

We use SARS-CoV-2 wastewater sequencing data in FASTQ format, consisting of approximately 100,000 reads with variable lengths ranging from 36 to 300 base pairs. The data represents real-world wastewater surveillance samples with typical challenges: low quality scores, adapter contamination, and significant fragmentation.

Preprocessing Pipeline

All preprocessing is automated through the scripts/preprocess.py script, which orchestrates quality control, filtering, and tokenization:

Step 1: Initial Quality Assessment (FastQC)

fastqc wastewater_seq_dataset.fastq -o fastqc_before/

Generates comprehensive quality reports including per-base quality scores, sequence length distribution, GC content, and adapter contamination detection. Results are saved in HTML format for visual inspection.

Step 2: Quality Filtering (Trimmomatic)

java -jar Trimmomatic/trimmomatic-0.39.jar SE \
  -threads 4 \
  wastewater_seq_dataset.fastq \
  cleaned_reads.fastq \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Parameters:

  • LEADING:3 - Remove low quality bases from the beginning (quality < 3)
  • TRAILING:3 - Remove low quality bases from the end (quality < 3)
  • SLIDINGWINDOW:4:15 - Scan with 4-base window, cut when average quality < 15
  • MINLEN:36 - Drop reads shorter than 36 base pairs

Step 3: Post-Processing Quality Check

fastqc cleaned_reads.fastq -o fastqc_after/

Verifies quality improvement after filtering. Typical results show significant reduction in low-quality bases and adapter contamination while retaining ~85-90% of original reads.

Step 4: K-mer Tokenization

Sequences are converted into overlapping k-mer tokens using canonical k-mer representation:

  • K-mer size: k=6 (provides good balance between specificity and vocabulary size)
  • Vocabulary size: 4,097 = 4^6 + 1 (all possible 6-mers + PAD token)
  • Canonical mapping: Each k-mer and its reverse complement map to the same token
  • Sequence length: Pad/truncate to L=150 tokens (at stride 1, 150 overlapping 6-mers span 155 base pairs)
  • Special tokens: PAD (ID: 4096) for padding shorter sequences

Example tokenization:

# Original sequence
ATCGATCGATCG...

# K-mer sliding window (k=6, stride=1); token IDs are illustrative
ATCGAT → Token 1234
 TCGATC → Token 2341
  CGATCG → Token 3412
   GATCGA → Token 2341  # reverse complement of TCGATC, same canonical token
...

# Final tokenized sequence (length 150)
[1234, 2341, 3412, 2341, ..., 4096, 4096]  # PAD tokens at end
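The tokenization rules listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in src/data/tokenizer.py: the function names, the base-4 ID scheme (A=0, C=1, G=2, T=3), the choice of the lexicographically smaller strand as the canonical form, and the mapping of k-mers containing ambiguous bases (such as N) to PAD are all assumptions.

```python
from typing import List

K = 6
PAD_ID = 4 ** K  # 4096, matching the PAD ID stated in the text
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed ID scheme
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_to_id(kmer: str) -> int:
    """Map a k-mer and its reverse complement to the same base-4 integer."""
    canonical = min(kmer, reverse_complement(kmer))
    token = 0
    for base in canonical:
        token = token * 4 + BASE_TO_INT[base]
    return token


def tokenize(seq: str, max_len: int = 150) -> List[int]:
    """Slide a window of size K with stride 1, then pad or truncate to max_len."""
    seq = seq.upper()
    tokens = []
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if all(base in BASE_TO_INT for base in kmer):
            tokens.append(kmer_to_id(kmer))
        else:
            tokens.append(PAD_ID)  # assumption: ambiguous k-mers become PAD
    tokens = tokens[:max_len]
    return tokens + [PAD_ID] * (max_len - len(tokens))


print(tokenize("ATCGATCG")[:4])  # [867, 2264, 1590, 4096]
```

Under this scheme a read of n bases yields n - 5 tokens, so a 300 bp read produces 295 k-mers and is truncated to the first 150. Because of canonical mapping, only a subset of the 4,096 k-mer IDs can ever appear in the output.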

Usage Example

Run the complete preprocessing pipeline:

python scripts/preprocess.py \
  --input-fastq wastewater_seq_dataset.fastq \
  --output-fastq cleaned_reads.fastq \
  --k-mer 6 \
  --max-seq-length 150

Methods


1. Vector-Quantized Variational Autoencoder (VQ-VAE)

The VQ-VAE architecture consists of three main components:

Encoder

The encoder maps tokenized sequences to continuous latent representations:

$$z_e = f_\theta(x)$$

where \(x \in \mathbb{Z}^{L}\) is the input token sequence, and the encoder \(f_\theta\) consists of:

  • Token Embedding: \(V = 4{,}097\) vocabulary, embedding dimension \(d = 128\)
  • 1D Convolutions: Two layers with kernel size 3, stride 1, hidden dimension 256
  • Layer Normalization + Dropout: Dropout probability \(p = 0.1\)

Vector Quantizer

The quantizer discretizes continuous representations using a learned codebook:

$$z_q = \text{Quantize}(z_e) = e_k, \quad k = \arg\min_j \|z_e - e_j\|_2$$

where \(\{e_1, \ldots, e_K\}\) is the codebook with \(K=512\) entries, each of dimension \(D=64\). The codebook is updated using Exponential Moving Average (EMA) with decay \(\gamma = 0.95\).
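As a rough illustration of the quantization step, the sketch below performs the nearest-code lookup and one EMA codebook update on plain Python lists. The EMA form used here (running cluster sizes and running sums, with each code set to their ratio) is the common formulation and is assumed rather than taken from the source; all function and variable names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z_e: Vector, codebook: List[Vector]) -> Tuple[int, Vector]:
    """Return (k, e_k) with k = argmin_j ||z_e - e_j||_2."""
    k = min(range(len(codebook)), key=lambda j: squared_distance(z_e, codebook[j]))
    return k, codebook[k]


def ema_update(
    codebook: List[Vector],
    cluster_size: List[float],
    embed_sum: List[Vector],
    batch: List[Vector],
    gamma: float = 0.95,
    eps: float = 1e-5,
) -> None:
    """One in-place EMA update of the codebook from a batch of encoder outputs."""
    num_codes, dim = len(codebook), len(codebook[0])
    counts = [0.0] * num_codes
    sums = [[0.0] * dim for _ in range(num_codes)]
    # Assign every encoder output to its nearest code (using the old codebook).
    for z_e in batch:
        k, _ = quantize(z_e, codebook)
        counts[k] += 1.0
        for d in range(dim):
            sums[k][d] += z_e[d]
    # Blend the batch statistics into the running statistics.
    for k in range(num_codes):
        cluster_size[k] = gamma * cluster_size[k] + (1 - gamma) * counts[k]
        for d in range(dim):
            embed_sum[k][d] = gamma * embed_sum[k][d] + (1 - gamma) * sums[k][d]
        if cluster_size[k] > eps:
            codebook[k] = [s / cluster_size[k] for s in embed_sum[k]]


codebook = [[0.0, 0.0], [1.0, 1.0]]
cluster_size = [1.0, 1.0]
embed_sum = [row[:] for row in codebook]
ema_update(codebook, cluster_size, embed_sum, batch=[[0.2, 0.0]])
print(codebook[0])  # code 0 moves slightly toward the assigned encoder output
```

With \(\gamma = 0.95\), each update moves a code only a small step toward the mean of the encoder outputs assigned to it, which is why codes that are never selected stay where they were initialized.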

Decoder

The decoder reconstructs the sequence from quantized representations:

$$\hat{x} = g_\phi(z_q)$$

The decoder architecture mirrors the encoder with 1D convolutions, layer normalization, and a final linear projection to vocabulary size.

Training Objective

The complete loss function combines reconstruction, commitment, and entropy regularization:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{commit}} - \lambda H[\mathcal{C}]$$

where:

  • \(\mathcal{L}_{\text{recon}} = -\sum_{i=1}^L \log p(x_i | z_q)\) is the cross-entropy reconstruction loss
  • \(\mathcal{L}_{\text{commit}} = \|z_e - \text{sg}[z_q]\|_2^2\) is the commitment loss with \(\beta = 0.1\)
  • \(H[\mathcal{C}] = -\sum_{k=1}^K p_k \log p_k\) is the codebook usage entropy with \(\lambda = 0.003\)
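To make the entropy term concrete, the following sketch computes the codebook usage entropy and the utilization from a list of selected code indices, then combines the three loss terms with the stated weights. It uses hard counts, which suit monitoring; during training the entropy term needs a differentiable estimate of \(p_k\), and how that estimate is obtained is not described here. Function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def codebook_stats(code_indices: List[int], num_codes: int = 512) -> Tuple[float, float]:
    """Return (usage entropy in nats, fraction of codes used at least once)."""
    counts = Counter(code_indices)
    total = len(code_indices)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)  # unused codes contribute 0
    utilization = len(counts) / num_codes
    return entropy, utilization


def total_loss(recon: float, commit: float, entropy: float,
               beta: float = 0.1, lam: float = 0.003) -> float:
    """L = L_recon + beta * L_commit - lambda * H[C]."""
    return recon + beta * commit - lam * entropy


entropy, utilization = codebook_stats([0, 0, 1, 1], num_codes=4)
print(entropy, utilization)
print(total_loss(recon=1.0, commit=0.5, entropy=entropy))
```

For \(K = 512\) the entropy is bounded above by \(\ln 512 \approx 6.24\) nats, reached only when all codes are used equally often. Because the entropy enters the loss with a negative sign, minimizing the loss rewards more even code usage.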

Figure: VQ-VAE architecture showing encoder, vector quantization, and decoder components for learning discrete representations of genomic sequences.

2. Masked VQ-VAE

To improve robustness to missing or corrupted data, we extend the VQ-VAE with a masked language modeling objective inspired by BERT.

During training, we randomly mask 20% of input tokens (\(p_{\text{mask}} = 0.2\)) and train the model to reconstruct these masked positions:

$$\mathcal{L}_{\text{masked}} = -\sum_{i \in \mathcal{M}} \log p(x_i | z_q(\tilde{x}))$$

where \(\mathcal{M}\) is the set of masked positions and \(\tilde{x}\) is the masked input.
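A minimal sketch of the masking step and the masked loss follows. The text does not say which token replaces a masked position (the stated vocabulary contains only k-mers and PAD), so the dedicated `MASK_ID` below is hypothetical, as are the function names and the choice to never mask padding.

```python
import random
from typing import List, Optional, Sequence, Tuple

PAD_ID = 4096
MASK_ID = 4097  # hypothetical: a dedicated mask token is not defined in the text


def mask_tokens(tokens: Sequence[int], p_mask: float = 0.2,
                rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """Replace each non-PAD token with MASK_ID with probability p_mask.

    Returns the masked sequence and the list of masked positions (the set M).
    """
    rng = rng or random.Random()
    masked = list(tokens)
    positions = []
    for i, tok in enumerate(tokens):
        if tok != PAD_ID and rng.random() < p_mask:
            masked[i] = MASK_ID
            positions.append(i)
    return masked, positions


def masked_nll(log_probs: Sequence[Sequence[float]], targets: Sequence[int],
               positions: Sequence[int]) -> float:
    """L_masked = -sum over i in M of log p(x_i | z_q(x_tilde))."""
    return -sum(log_probs[i][targets[i]] for i in positions)


tokens = list(range(100)) + [PAD_ID] * 50
masked, positions = mask_tokens(tokens, rng=random.Random(0))
print(len(positions), "of 100 real tokens masked")
```

Only the positions in \(\mathcal{M}\) contribute to this loss, so the model is scored on tokens it could not see in the input.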

3. Contrastive VQ-VAE

To learn better cluster-separated representations, we fine-tune the encoder using contrastive learning with InfoNCE loss:

$$\mathcal{L}_{\text{contrast}} = -\mathbb{E}\left[\log \frac{\exp(\text{sim}(v_i, v_{i'})/\tau)}{\sum_{k=1}^{2B} \exp(\text{sim}(v_i, v_k)/\tau)}\right]$$

where:

  • \(v_i, v_{i'}\) are L2-normalized embeddings from two augmented views of the same sequence
  • \(\text{sim}(u, v) = u^\top v\) is the dot product, which equals cosine similarity because the embeddings are L2-normalized
  • \(\tau = 0.5\) is the temperature parameter
  • \(B\) is the batch size (128)
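The loss above can be evaluated directly on small lists of embeddings, which is a convenient way to check an implementation. The sketch below follows the formula as written, with the denominator summing over all \(2B\) embeddings; SimCLR's NT-Xent variant excludes the \(k = i\) term, so a hypothetical `exclude_self` flag is provided for that case. Which variant the training code uses is not stated here.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))  # assumes v is not the zero vector
    return [x / norm for x in v]


def info_nce(view1: Sequence[Sequence[float]], view2: Sequence[Sequence[float]],
             tau: float = 0.5, exclude_self: bool = False) -> float:
    """Mean InfoNCE loss over 2B embeddings; view1[i] and view2[i] are positives."""
    batch = len(view1)
    n = 2 * batch
    z = [l2_normalize(v) for v in list(view1) + list(view2)]
    total = 0.0
    for i in range(n):
        # Dot products of unit vectors are cosine similarities.
        logits = [sum(a * b for a, b in zip(z[i], z[k])) / tau for k in range(n)]
        positive = logits[(i + batch) % n]
        denom = sum(
            math.exp(logit)
            for k, logit in enumerate(logits)
            if not (exclude_self and k == i)
        )
        total += math.log(denom) - positive
    return total / n


anchors = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(anchors, anchors))                     # views agree: lower loss
print(info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]]))    # views swapped: higher loss
```

A lower temperature \(\tau\) sharpens the softmax over similarities, so hard negatives are weighted more heavily.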

We generate two augmented views for each sequence:

  • View 1: Random masking with \(p_{\text{mask}} = 0.15\) (~22 tokens per sequence)
  • View 2: Random dropout with \(p_{\text{drop}} = 0.10\) (~15 tokens per sequence)
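One plausible reading of the two augmentations is sketched below. For a full-length sequence of 150 tokens the expected counts are 150 × 0.15 = 22.5 masked tokens and 150 × 0.10 = 15 dropped tokens, matching the figures above. The interpretation of "dropout" as deleting tokens and re-padding at the end, the `MASK_ID` value, and the function names are all assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple

PAD_ID = 4096
MASK_ID = 4097  # hypothetical mask token, not defined in the text


def mask_view(tokens: Sequence[int], p_mask: float, rng: random.Random) -> List[int]:
    """View 1: replace each non-PAD token with MASK_ID with probability p_mask."""
    return [
        MASK_ID if tok != PAD_ID and rng.random() < p_mask else tok
        for tok in tokens
    ]


def drop_view(tokens: Sequence[int], p_drop: float, rng: random.Random) -> List[int]:
    """View 2: delete each non-PAD token with probability p_drop, then re-pad.

    Assumes padding only occurs at the end of the sequence.
    """
    real = [tok for tok in tokens if tok != PAD_ID]
    kept = [tok for tok in real if rng.random() >= p_drop]
    return kept + [PAD_ID] * (len(tokens) - len(kept))


def make_views(tokens: Sequence[int], p_mask: float = 0.15, p_drop: float = 0.10,
               rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """Return the two augmented views of one tokenized sequence."""
    rng = rng or random.Random()
    return mask_view(tokens, p_mask, rng), drop_view(tokens, p_drop, rng)


tokens = list(range(140)) + [PAD_ID] * 10
view1, view2 = make_views(tokens, rng=random.Random(0))
print(view1.count(MASK_ID), "masked;", view2.count(PAD_ID) - 10, "dropped")
```

Both views keep the original length, so they can be batched together and passed through the same encoder.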

Training Configuration

Model Hyperparameters

| Category | Hyperparameter | Value | Description |
|---|---|---|---|
| Architecture | Vocabulary Size | 4,097 | 4^6 k-mers + PAD token |
| Architecture | Embedding Dim | 128 | Token embedding dimension |
| Architecture | Hidden Dim | 256 | Convolutional layer dimension |
| Architecture | Max Sequence Length | 150 | Tokens per sequence |
| Vector Quantizer | Codebook Size (K) | 512 | Number of discrete codes |
| Vector Quantizer | Code Dimension (D) | 64 | Latent representation dimension |
| Vector Quantizer | EMA Decay (γ) | 0.95 | Exponential moving average |
| Training | Epochs | 50 | Full training iterations |
| Training | Batch Size | 32 | Sequences per batch |
| Training | Learning Rate | 2e-4 | AdamW optimizer |
| Training | GPUs | 2 | Multi-GPU training (IDs: 0,1) |
| Training | Workers | 4 | Data loading threads |
| Loss Weights | Commitment (β) | 0.1 | Encoder commitment to codebook |
| Loss Weights | Entropy (λ) | 0.003 | Codebook usage regularization |
| Loss Weights | Reconstruction | 1.0 | Cross-entropy weight |

Training Command

The model is trained using the following command:

python scripts/train.py \
  --data-path cleaned_reads.fastq \
  --output-dir experiments/1_standard_vqvae \
  --experiment-name standard_vqvae_50epochs \
  --epochs 50 \
  --batch-size 32 \
  --learning-rate 2e-4 \
  --num-codes 512 \
  --code-dim 64 \
  --n-gpu 2 \
  --gpu-ids "0,1" \
  --wandb-project vqvae-genomics \
  --save-freq 5

Configuration Management

Experiments are managed using YAML configuration files stored in configs/experiment_configs/. This allows reproducible experiments and systematic hyperparameter tuning:

python scripts/train.py --config configs/experiment_configs/large_model.yaml

Experiment Tracking

All experiments are logged to Weights & Biases (W&B) for comprehensive tracking:

  • Metrics: Training/validation loss, token accuracy, exact match rate, codebook utilization
  • System: GPU utilization, memory usage, throughput (sequences/sec)
  • Model: Checkpoint saving every 5 epochs, best model selection
  • Artifacts: Configuration files, sample reconstructions, embeddings

Multi-GPU Training

The implementation supports multi-GPU training using PyTorch DataParallel (single-process data parallelism):

  • GPU Configuration: Specify GPU IDs via --gpu-ids "0,1,2,3"
  • Batch Splitting: Automatic batch distribution across GPUs
  • Gradient Aggregation: Synchronized gradient updates
  • Speedup: ~1.8x on 2 GPUs, ~3.2x on 4 GPUs (typical)

Results & Performance


Base VQ-VAE Performance

Reconstruction Quality

| Metric | Value |
|---|---|
| Mean Token Accuracy | 99.52% |
| Median Token Accuracy | 100.0% |
| Exact Sequence Match Rate | 56.33% |
| Codebook Utilization | 19.73% |

Analysis: The high token-level accuracy (99.52%) shows that the VQ-VAE can compress and reconstruct genomic sequences effectively. The modest exact match rate (56.33%) is realistic given the fragmented nature of wastewater surveillance data, while the low codebook utilization (19.73%, roughly 101 of the 512 codes) suggests efficient compression into a smaller effective vocabulary. These results are based on experiments and similar work in discrete representation learning.
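The reconstruction metrics can be computed as in the sketch below. Whether the evaluation script excludes PAD positions from the accuracy is not stated, so that choice, along with the function names, is an assumption.

```python
from statistics import mean, median
from typing import Dict, List, Sequence

PAD_ID = 4096


def token_accuracy(pred: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of non-PAD target positions that were reconstructed correctly."""
    pairs = [(p, t) for p, t in zip(pred, target) if t != PAD_ID]
    if not pairs:
        return 0.0  # assumption: an all-PAD sequence counts as unreconstructed
    return sum(p == t for p, t in pairs) / len(pairs)


def evaluate(preds: List[Sequence[int]], targets: List[Sequence[int]]) -> Dict[str, float]:
    """Mean and median per-sequence token accuracy, plus exact match rate."""
    accs = [token_accuracy(p, t) for p, t in zip(preds, targets)]
    return {
        "mean_token_accuracy": mean(accs),
        "median_token_accuracy": median(accs),
        "exact_match_rate": sum(a == 1.0 for a in accs) / len(accs),
    }


preds = [[1, 2, 3, PAD_ID], [1, 2, 9, PAD_ID]]
targets = [[1, 2, 3, PAD_ID], [1, 2, 3, PAD_ID]]
print(evaluate(preds, targets))

# Back-of-envelope check: chance that all 150 tokens are right if each is
# independently correct with probability 0.9952.
print(0.9952 ** 150)
```

The last line helps explain the gap between the two headline numbers: if token errors were independent and every sequence had 150 real tokens, a per-token accuracy of 99.52% would give an exact match rate of roughly 49%, the same order as the reported 56.33%. A median token accuracy of 100% is also consistent with more than half of the sequences being reconstructed exactly.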

Masked VQ-VAE Results

| Metric | Value |
|---|---|
| Masked Token Accuracy | 95% |
| Improvement on Corrupted Sequences | +12% |

The Masked VQ-VAE demonstrates strong performance on masked token prediction (95% accuracy), indicating that the model learns robust contextual representations. The 12% improvement on corrupted sequences shows enhanced robustness to missing data.

Contrastive VQ-VAE Results

Clustering Quality Comparison

| Metric | Base VQ-VAE | Contrastive VQ-VAE | Improvement |
|---|---|---|---|
| Silhouette Score | 0.31 | 0.42 | +35% |
| Davies-Bouldin Index | 1.68 | 1.34 | -20% (lower is better) |
| Calinski-Harabasz Score | 1248 | 1876 | +50% |

Contrastive learning significantly improves clustering quality across all metrics. The 35% improvement in Silhouette score indicates much better-separated and cohesive clusters, making the learned representations more suitable for variant identification tasks.
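For readers unfamiliar with the Silhouette score, the sketch below computes it from scratch for a handful of 2-D points. It is illustrative only: how cluster labels were obtained for the embeddings (for example, which clustering algorithm and how many clusters) is not described here, and the function name is hypothetical.

```python
import math
from typing import Dict, List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean of (b - a) / max(a, b) over all points.

    a is the mean distance to the other points in the same cluster,
    b is the smallest mean distance to the points of any other cluster.
    """
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # well separated: close to 1
print(silhouette_score(points, [0, 1, 0, 1]))  # poor grouping: negative
```

Scores range from -1 to 1, so the move from 0.31 to 0.42 is a relative gain of (0.42 - 0.31) / 0.31 ≈ 35% while still leaving the clusters far from perfectly separated.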

Implementation Details & Code Organization


Project Structure

genomic_sequence_detection/
├── configs/                      # Configuration management
│   ├── default_config.yaml      # Default hyperparameters
│   └── experiment_configs/      # Experiment-specific configurations
│       └── large_model.yaml
│
├── scripts/                      # Executable scripts
│   ├── train.py                 # Main training loop with multi-GPU support
│   ├── evaluate.py              # Model evaluation and metrics
│   └── preprocess.py            # Data preprocessing pipeline
│
├── src/                         # Source code modules
│   ├── models/                  # Model architectures
│   │   ├── __init__.py
│   │   └── vqvae.py            # VQ-VAE, Encoder, Decoder, VectorQuantizer
│   ├── data/                    # Data processing
│   │   ├── __init__.py
│   │   └── tokenizer.py        # K-mer tokenization utilities
│   └── utils/                   # Utility functions
│       ├── __init__.py
│       └── wandb_init.py       # W&B experiment tracking
│
├── experiments/                 # Experiment outputs
│   ├── 1_standard_vqvae/       # Standard VQ-VAE experiments
│   ├── 2_masked_vqvae/         # Masked VQ-VAE experiments
│   ├── 3_contrastive_vqvae/    # Contrastive VQ-VAE experiments
│   └── 4_final_comparison/     # Comparative analysis
│
├── data/                        # Data files
│   ├── wastewater_seq_dataset.fastq
│   ├── cleaned_reads.fastq
│   └── virus_sequences.fastq
│
├── Trimmomatic/                 # Quality control tool
│   └── trimmomatic-0.39.jar
│
├── requirements.txt             # Python dependencies
└── README.md                    # Comprehensive documentation

Installation & Setup

System Requirements

  • Python: 3.8 or higher
  • CUDA: 11.3+ (for GPU training)
  • GPU: 8GB+ VRAM recommended (tested on NVIDIA RTX series)
  • RAM: 16GB+ system memory
  • Storage: 10GB+ for data and checkpoints
  • Java: 8+ (for Trimmomatic preprocessing)

Step 1: Clone Repository

git clone https://github.com/arrdel/genomic_sequence_detection.git
cd genomic_sequence_detection

Step 2: Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install other dependencies
pip install -r requirements.txt

Key Dependencies:

  • torch>=2.0.0 - Deep learning framework
  • biopython>=1.79 - Biological sequence processing
  • numpy>=1.21.0 - Numerical operations
  • wandb>=0.15.0 - Experiment tracking
  • pyyaml>=6.0 - Configuration management
  • tqdm>=4.65.0 - Progress bars

Step 3: Download Trimmomatic

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip -d Trimmomatic/

Step 4: Setup Weights & Biases

# Login to W&B (required for experiment tracking)
wandb login

# Or disable W&B with --no-wandb flag during training

Usage Examples

1. Preprocessing

python scripts/preprocess.py \
  --input-fastq data/wastewater_seq_dataset.fastq \
  --output-fastq data/cleaned_reads.fastq \
  --k-mer 6 \
  --max-seq-length 150 \
  --run-fastqc \
  --run-trimmomatic

2. Training Standard VQ-VAE

python scripts/train.py \
  --data-path data/cleaned_reads.fastq \
  --output-dir experiments/1_standard_vqvae \
  --experiment-name my_experiment \
  --epochs 50 \
  --batch-size 32 \
  --num-codes 512 \
  --code-dim 64 \
  --n-gpu 2 \
  --gpu-ids "0,1"

3. Training with Configuration File

# Use predefined configuration
python scripts/train.py --config configs/default_config.yaml

# Create custom config in configs/experiment_configs/my_config.yaml
python scripts/train.py --config configs/experiment_configs/my_config.yaml

4. Evaluation

python scripts/evaluate.py \
  --checkpoint-path experiments/1_standard_vqvae/checkpoints/best_model.pt \
  --data-path data/cleaned_reads.fastq \
  --output-dir evaluation_results \
  --num-samples 100

Key Implementation Features

Dead Code Refresh Mechanism

To prevent codebook collapse (some codes never being used), we implement a dead code refresh mechanism that periodically replaces unused codes with high-variance encoder outputs:

# Enable dead code refresh during training
python scripts/train.py \
  --refresh-codes \
  --refresh-interval 500 \
  --refresh-min-count 5
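The refresh step itself might look like the sketch below, which would run every `--refresh-interval` training steps. It treats a code as dead when its usage count since the last refresh is below `--refresh-min-count`. The phrase "high-variance encoder outputs" is interpreted here as the encoder outputs farthest from the batch mean; that interpretation and the function name are assumptions, not the repository's implementation.

```python
from typing import List


def refresh_dead_codes(codebook: List[List[float]], usage_counts: List[int],
                       encoder_outputs: List[List[float]],
                       min_count: int = 5) -> List[int]:
    """Overwrite rarely used codes in place; return the indices that were refreshed."""
    dead = [k for k, count in enumerate(usage_counts) if count < min_count]
    if not dead or not encoder_outputs:
        return []

    dim = len(encoder_outputs[0])
    n = len(encoder_outputs)
    batch_mean = [sum(v[d] for v in encoder_outputs) / n for d in range(dim)]

    # Rank encoder outputs by squared distance from the batch mean, farthest first.
    ranked = sorted(
        encoder_outputs,
        key=lambda v: sum((x - m) ** 2 for x, m in zip(v, batch_mean)),
        reverse=True,
    )

    refreshed = []
    for k, candidate in zip(dead, ranked):  # stops when either list runs out
        codebook[k] = list(candidate)
        usage_counts[k] = 0
        refreshed.append(k)
    return refreshed


codebook = [[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]]
usage_counts = [100, 0, 3]
outputs = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.0]]
print(refresh_dead_codes(codebook, usage_counts, outputs))  # [1, 2]
print(codebook)
```

Moving dead codes onto actual encoder outputs gives them a chance of being selected by the nearest-neighbor lookup on later batches.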

Gradient Clipping

Gradient clipping stabilizes training and prevents exploding gradients:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
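For intuition, the sketch below reproduces what norm-based clipping does, using plain Python lists in place of gradient tensors: it computes one global L2 norm across all parameters and, if that norm exceeds `max_norm`, rescales every gradient by the same factor. The helper name is hypothetical, and the small `eps` mirrors the constant PyTorch adds for numerical stability.

```python
import math
from typing import List


def clip_grad_norm(grads: List[List[float]], max_norm: float = 1.0,
                   eps: float = 1e-6) -> float:
    """Rescale gradients in place so their global L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    coef = min(1.0, max_norm / (total_norm + eps))
    for tensor in grads:
        for i in range(len(tensor)):
            tensor[i] *= coef
    return total_norm


grads = [[3.0], [4.0]]          # global norm is 5.0
print(clip_grad_norm(grads))    # 5.0
print(grads)                    # rescaled to a norm of about 1.0
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its magnitude is limited.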

Checkpoint Management

Automatic checkpoint saving with configurable frequency:

  • Regular checkpoints: Every N epochs (--save-freq 5)
  • Best model: Saved when validation loss improves
  • Resume training: --resume experiments/.../checkpoint_epoch_20.pt

Troubleshooting

CUDA Out of Memory

  • Reduce batch size: --batch-size 16
  • Reduce model dimensions: --hidden-dim 128
  • Use gradient accumulation (implementation in progress)
  • Enable mixed precision training: --mixed-precision

Low Reconstruction Accuracy

  • Train longer: --epochs 100
  • Increase model capacity: --num-codes 1024
  • Verify data quality (run FastQC)
  • Adjust commitment weight: --commitment-cost 0.15

Low Codebook Utilization

  • Increase commitment cost: --commitment-cost 0.25
  • Reduce codebook size: --num-codes 256
  • Enable dead code refresh: --refresh-codes
  • Increase entropy regularization: --entropy-weight 0.005

Citation & License

@misc{chinda2025contrastivedeeplearningvariant,
      title={Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing}, 
      author={Adele Chinda and Richmond Azumah and Hemanth Demakethepalli Venkateswara},
      year={2025},
      eprint={2512.03158},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.03158}, 
}

License: MIT License - See LICENSE file for details
Contact: adelechinda@gsu.edu

Discussion & Future Work


Key Findings

  • ✓ VQ-VAE achieves 99.5% reconstruction accuracy on fragmented wastewater sequences
  • ✓ Discrete codebook effectively captures genomic patterns with only 19.7% utilization
  • ✓ Entropy regularization prevents codebook collapse while maintaining diversity
  • ✓ Masked learning improves robustness to missing/corrupted data by 12%
  • ✓ Contrastive learning improves clustering separability by 35%
  • ✓ Reference-free approach enables novel variant detection

Advantages Over Traditional Methods

  • No reference genome required: Works with novel and unknown variants
  • Computationally efficient: Training and inference in minutes vs. hours for alignment-based methods
  • Learns meaningful representations: Discrete codes capture interpretable sequence patterns
  • Robust to noise: Handles degraded RNA and sequencing artifacts effectively
  • Scalable: Can process large-scale wastewater surveillance data in real-time

Limitations

  • Modest exact sequence match rate (56%) leaves room for improvement
  • Low codebook utilization (19.7%) suggests a smaller codebook could yield more compact representations
  • Model currently focused on SARS-CoV-2; generalization to other pathogens needs validation
  • Interpretability of learned discrete codes requires further analysis

Future Directions

  1. Hierarchical VQ-VAE: Multi-scale representations to capture both local k-mer patterns and long-range dependencies
  2. Phylogenetic Integration: Combine learned embeddings with traditional phylogenetic analysis
  3. Multi-Pathogen Validation: Test on diverse viral datasets (influenza, RSV, norovirus) to assess generalization
  4. Real-Time Deployment: Integrate into operational wastewater surveillance pipelines
  5. Temporal Modeling: Incorporate time-series data to track variant emergence and evolution
  6. Interpretability Analysis: Investigate biological meaning of learned codebook entries
  7. Semi-Supervised Learning: Leverage small amounts of labeled variant data to improve clustering
  8. Uncertainty Quantification: Add Bayesian extensions for confidence estimation

Impact

This work demonstrates that discrete representation learning can provide a scalable, reference-free approach to genomic sequence analysis. By removing the dependency on reference genomes, our method democratizes genomic surveillance, making it accessible to resource-limited settings and enabling rapid response to emerging viral threats. The learned discrete representations could serve as a foundation for:

  • Early warning systems for novel variant detection
  • Automated clustering and classification of viral sequences
  • Robust sequence reconstruction in noisy surveillance data
  • Transfer learning to other pathogen surveillance applications

Project Poster


View our comprehensive project poster summarizing the methodology, results, and key findings.


Acknowledgments


This project was completed as part of the Deep Learning course at Georgia State University. We thank our instructor for his valuable feedback and support throughout this work.