Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing

Abstract

Wastewater-based epidemiology provides a non-invasive, community-wide approach to viral surveillance, but analyzing highly fragmented and noisy genomic sequences remains challenging. Traditional reference-based pipelines require known genomes and struggle with co-circulating variants and sequencing artifacts. We propose a reference-free approach using Vector-Quantized Variational Autoencoders (VQ-VAE) to learn discrete representations of viral genomic sequences from wastewater surveillance data.

Our method employs k-mer tokenization (k=6) to convert raw DNA sequences into fixed-length representations, which are then encoded into a discrete latent space using a learned codebook of 512 entries. We train the model on ~100,000 SARS-CoV-2 wastewater sequencing reads, achieving 99.52% mean token-level reconstruction accuracy and 56.33% exact sequence match rate with only 19.73% codebook utilization, indicating effective compression into meaningful discrete representations.

We extend the base VQ-VAE with two complementary approaches: (1) Masked VQ-VAE, which masks 20% of input tokens during training to improve robustness to missing data, achieving 95% accuracy on masked tokens and 12% improvement on corrupted sequences; and (2) Contrastive VQ-VAE, which fine-tunes the encoder using InfoNCE loss with augmented views, improving clustering quality by 35% (Silhouette score: 0.31 → 0.42). Our discrete representation learning framework offers a scalable, reference-free approach to genomic sequence analysis suitable for real-time wastewater surveillance and variant detection.

Introduction

Wastewater Surveillance: Opportunities and Challenges

Wastewater-based epidemiology has emerged as a powerful tool for monitoring community-level viral prevalence, offering early warning signals for disease outbreaks and variant emergence. Unlike clinical testing, wastewater surveillance provides a non-invasive, population-wide snapshot that captures both symptomatic and asymptomatic infections. However, analyzing genomic sequences from wastewater presents unique computational challenges:

Fragmentation: Sequencing reads are highly fragmented (100-300 bp), making de novo assembly difficult
Noise: High sequencing noise and quality variation due to degraded viral RNA in wastewater
Low concentration: Viral RNA constitutes a small fraction of total genetic material
Complexity: Multiple co-circulating viral strains and variants are present simultaneously
Reference dependence: Traditional bioinformatics pipelines require known reference genomes

Our Approach: Reference-Free Discrete Representation Learning

We propose a reference-free deep learning approach based on Vector-Quantized Variational Autoencoders (VQ-VAE) to learn discrete, compressed representations of viral genomic sequences. Our pipeline consists of three key components:

K-mer Tokenization: Convert raw DNA sequences into overlapping k-mer tokens (k=6) drawn from a vocabulary of 4,097 tokens (all 4^6 = 4,096 possible 6-mers plus a PAD token) that capture local sequence patterns
Discrete Representation Learning: Train a VQ-VAE with a codebook of 512 discrete latent codes (dimension 64) to compress sequences while preserving reconstruction quality
Enhanced Learning Objectives: Extend the base model with:
- Masked VQ-VAE: BERT-style masked token prediction for robustness
- Contrastive VQ-VAE: SimCLR-style contrastive learning for better clustering

This approach enables unsupervised discovery of sequence patterns, variant clustering, and robust sequence reconstruction without requiring reference genomes—making it particularly suitable for novel variant detection and real-time surveillance applications.

Dataset & Preprocessing

Data Source

We use SARS-CoV-2 wastewater sequencing data in FASTQ format, consisting of approximately 100,000 reads with variable lengths ranging from 36 to 300 base pairs. The data represents real-world wastewater surveillance samples with typical challenges: low quality scores, adapter contamination, and significant fragmentation.

Preprocessing Pipeline

All preprocessing is automated through the scripts/preprocess.py script, which orchestrates quality control, filtering, and tokenization:

Step 1: Initial Quality Assessment (FastQC)

fastqc wastewater_seq_dataset.fastq -o fastqc_before/

Generates comprehensive quality reports including per-base quality scores, sequence length distribution, GC content, and adapter contamination detection. Results are saved in HTML format for visual inspection.

Step 2: Quality Filtering (Trimmomatic)

java -jar Trimmomatic/trimmomatic-0.39.jar SE \
  -threads 4 \
  wastewater_seq_dataset.fastq \
  cleaned_reads.fastq \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Parameters:

LEADING:3 - Remove low quality bases from the beginning (quality < 3)
TRAILING:3 - Remove low quality bases from the end (quality < 3)
SLIDINGWINDOW:4:15 - Scan with 4-base window, cut when average quality < 15
MINLEN:36 - Drop reads shorter than 36 base pairs
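
The effect of the SLIDINGWINDOW and MINLEN parameters on a single read can be illustrated with a short Python sketch. It is a simplified model that assumes Phred+33 quality encoding; the function names are hypothetical, and Trimmomatic's own implementation differs in detail.

```python
from typing import List


def phred33_scores(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_keep(scores: List[int], window: int = 4, threshold: float = 15) -> int:
    """Return how many leading bases survive a SLIDINGWINDOW-style cut.

    The read is cut at the start of the first window whose mean quality
    falls below the threshold; if no window fails, the whole read is kept.
    """
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < threshold:
            return start
    return len(scores)


def survives_minlen(kept_length: int, min_len: int = 36) -> bool:
    """MINLEN-style filter: drop reads shorter than min_len after trimming."""
    return kept_length >= min_len


# 40 good bases (Q30, '?') followed by 10 poor bases (Q2, '#').
scores = phred33_scores("?" * 40 + "#" * 10)
kept = sliding_window_keep(scores)
print(kept, survives_minlen(kept))
```

In this toy read the cut happens one base before the low-quality tail starts, because the first failing window still contains one good base.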

Step 3: Post-Processing Quality Check

fastqc cleaned_reads.fastq -o fastqc_after/

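Verifies the quality improvement after filtering. Typical results show a marked reduction in low-quality bases while retaining ~85-90% of the original reads. The Step 2 command contains no ILLUMINACLIP step, so adapter contamination flagged by FastQC is only removed if that step is added.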

Step 4: K-mer Tokenization

Sequences are converted into overlapping k-mer tokens using canonical k-mer representation:

K-mer size: k=6 (provides good balance between specificity and vocabulary size)
Vocabulary size: 4,097 = 4^6 + 1 (all possible 6-mers + PAD token)
Canonical mapping: Each k-mer and its reverse complement map to the same token
Sequence length: Pad/truncate to L=150 tokens (at stride 1, 150 overlapping 6-mers cover 155 base pairs)
Special tokens: PAD (ID: 4096) for padding shorter sequences

Example tokenization:

# Original sequence
ATCGATCGATCG...

# K-mer sliding window (k=6, stride=1); token IDs are illustrative
ATCGAT → Token 1234
 TCGATC → Token 2341
  CGATCG → Token 3412
   GATCGA → Token 2341  # reverse complement of TCGATC, so same canonical token
...

# Final tokenized sequence (length 150)
[1234, 2341, 3412, 2341, ..., 4096, 4096]  # PAD tokens at end
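
A minimal Python sketch of canonical k-mer tokenization follows. It is not the project's tokenizer.py: the base ordering (A, C, G, T), the base-4 id scheme, the choice of the lexicographically smaller strand as canonical, and the skipping of k-mers that contain ambiguous bases are all assumptions made for illustration. Only k=6, the PAD id 4096, and the length of 150 come from the text above.

```python
from typing import List

K = 6
PAD_ID = 4 ** K                                   # 4096, the PAD id stated above
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}     # assumed base ordering
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_id(kmer: str) -> int:
    """Base-4 index of the canonical form (smaller of k-mer and reverse complement)."""
    canonical = min(kmer, reverse_complement(kmer))
    index = 0
    for base in canonical:
        index = index * 4 + BASE_INDEX[base]
    return index


def tokenize(sequence: str, max_len: int = 150) -> List[int]:
    """Slide a k-mer window with stride 1, then truncate or pad to max_len."""
    sequence = sequence.upper()
    ids = []
    for start in range(len(sequence) - K + 1):
        kmer = sequence[start:start + K]
        if all(base in BASE_INDEX for base in kmer):   # skip k-mers with N etc. (assumption)
            ids.append(kmer_id(kmer))
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


tokens = tokenize("ATCGATCGATCG")
print(tokens[:8])
```

Under this scheme TCGATC and GATCGA receive the same id because they are reverse complements, and only 2,080 of the 4,096 possible k-mer ids are ever emitted. A full 150-token window at stride 1 spans 150 + 6 - 1 = 155 bases.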

Usage Example

Run the complete preprocessing pipeline:

python scripts/preprocess.py \
  --input-fastq wastewater_seq_dataset.fastq \
  --output-fastq cleaned_reads.fastq \
  --k-mer 6 \
  --max-seq-length 150

Methods

1. Vector-Quantized Variational Autoencoder (VQ-VAE)

The VQ-VAE architecture consists of three main components:

Encoder

The encoder maps tokenized sequences to continuous latent representations:

$$z_e = f_\theta(x)$$

where $x \in \mathbb{Z}^{L}$ is the input token sequence, and the encoder $f_\theta$ consists of:

Token Embedding: $V = 4{,}097$ vocabulary, embedding dimension $d = 128$
1D Convolutions: Two layers with kernel size 3, stride 1, hidden dimension 256
Layer Normalization + Dropout: Dropout probability $p = 0.1$

Vector Quantizer

The quantizer discretizes continuous representations using a learned codebook:

$$z_q = \text{Quantize}(z_e) = e_k, \quad k = \arg\min_j \|z_e - e_j\|_2$$

where $\{e_1, \ldots, e_K\}$ is the codebook with $K=512$ entries, each of dimension $D=64$. The codebook is updated using Exponential Moving Average (EMA) with decay $\gamma = 0.95$.
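
The nearest-code lookup and an EMA codebook step can be sketched in NumPy as follows. NumPy stands in for the project's PyTorch module to keep the example short. The EMA form shown (running cluster sizes and running sums of assigned encoder outputs) is the commonly used VQ-VAE variant and is an assumption here; the text above only specifies the decay $\gamma = 0.95$. All function and variable names are hypothetical.

```python
import numpy as np


def quantize(z_e: np.ndarray, codebook: np.ndarray):
    """Map each row of z_e (N, D) to its nearest codebook row (K, D)."""
    # Squared Euclidean distance between every encoder vector and every code.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


def ema_update(codebook, cluster_size, embed_sum, z_e, indices, decay=0.95, eps=1e-5):
    """One EMA codebook step: decay running statistics, then renormalize."""
    num_codes = codebook.shape[0]
    one_hot = np.eye(num_codes)[indices]                                  # (N, K)
    cluster_size = decay * cluster_size + (1 - decay) * one_hot.sum(axis=0)
    embed_sum = decay * embed_sum + (1 - decay) * (one_hot.T @ z_e)
    new_codebook = embed_sum / np.maximum(cluster_size, eps)[:, None]
    return new_codebook, cluster_size, embed_sum


rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))      # K=512 codes of dimension D=64
z_e = rng.normal(size=(150, 64))           # one sequence of 150 encoder outputs
indices, z_q = quantize(z_e, codebook)
codebook, cluster_size, embed_sum = ema_update(
    codebook, np.ones(512), codebook.copy(), z_e, indices
)
print(indices.shape, z_q.shape)
```

Initializing the running sum to the codebook and the cluster sizes to one keeps unused codes unchanged by the update, since both statistics decay at the same rate.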

Decoder

The decoder reconstructs the sequence from quantized representations:

$$\hat{x} = g_\phi(z_q)$$

The decoder architecture mirrors the encoder with 1D convolutions, layer normalization, and a final linear projection to vocabulary size.

Training Objective

The complete loss function combines reconstruction, commitment, and entropy regularization:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{commit}} - \lambda H[\mathcal{C}]$$

where:

$\mathcal{L}_{\text{recon}} = -\sum_{i=1}^L \log p(x_i | z_q)$ is the cross-entropy reconstruction loss
$\mathcal{L}_{\text{commit}} = \|z_e - \text{sg}[z_q]\|_2^2$ is the commitment loss with $\beta = 0.1$
$H[\mathcal{C}] = -\sum_{k=1}^K p_k \log p_k$ is the codebook usage entropy with $\lambda = 0.003$
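
As a rough illustration of how the three terms combine, the sketch below computes the codebook usage entropy from hard code assignments and plugs it into the total loss. The helper names are hypothetical, and counting hard assignments is not differentiable, so a real training loop would need a differentiable estimate of $p_k$; this version only shows the arithmetic.

```python
import numpy as np


def codebook_entropy(indices: np.ndarray, num_codes: int) -> float:
    """H[C] = -sum_k p_k log p_k over empirical code usage frequencies."""
    counts = np.bincount(indices, minlength=num_codes)
    p = counts / counts.sum()
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return float(-(p * np.log(p)).sum())


def total_loss(recon: float, commit: float, entropy: float,
               beta: float = 0.1, lam: float = 0.003) -> float:
    """L = L_recon + beta * L_commit - lambda * H[C]."""
    return recon + beta * commit - lam * entropy


uniform_usage = np.arange(512)         # every code used exactly once
collapsed_usage = np.zeros(512, dtype=int)   # a single code used for everything
print(codebook_entropy(uniform_usage, 512), codebook_entropy(collapsed_usage, 512))
```

Because the entropy enters with a negative sign, spreading usage across more codes lowers the loss; the maximum possible bonus is $\lambda \log K$.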

Figure: VQ-VAE architecture showing the encoder, vector quantization, and decoder components for learning discrete representations of genomic sequences.

2. Masked VQ-VAE

To improve robustness to missing or corrupted data, we extend the VQ-VAE with a masked language modeling objective inspired by BERT:

During training, we randomly mask 20% of input tokens ($p_{\text{mask}} = 0.2$) and train the model to reconstruct these masked positions:

$$\mathcal{L}_{\text{masked}} = -\sum_{i \in \mathcal{M}} \log p(x_i | z_q(\tilde{x}))$$

where $\mathcal{M}$ is the set of masked positions and $\tilde{x}$ is the masked input.
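
A minimal sketch of the masking step is shown below. Two details are assumptions, since the text does not specify them: masked positions are replaced by a dedicated MASK id (4097 here, a hypothetical addition to the vocabulary), and PAD positions are never masked.

```python
import random
from typing import List, Optional, Tuple

PAD_ID = 4096     # PAD id from the tokenization step
MASK_ID = 4097    # hypothetical id; the source does not say how masked tokens are encoded


def mask_tokens(tokens: List[int], p_mask: float = 0.2,
                seed: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Return (masked copy of tokens, list of masked positions M)."""
    rng = random.Random(seed)
    masked = list(tokens)
    positions = []
    for i, tok in enumerate(tokens):
        # Only real tokens are candidates; padding is left untouched (assumption).
        if tok != PAD_ID and rng.random() < p_mask:
            masked[i] = MASK_ID
            positions.append(i)
    return masked, positions


tokens = list(range(100)) + [PAD_ID] * 50
masked, positions = mask_tokens(tokens, p_mask=0.2, seed=0)
print(len(positions), "of 100 real tokens masked")
```

The masked loss is then evaluated only at the returned positions, using the original tokens as targets.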

3. Contrastive VQ-VAE

To learn better cluster-separated representations, we fine-tune the encoder using contrastive learning with InfoNCE loss:

$$\mathcal{L}_{\text{contrast}} = -\mathbb{E}\left[\log \frac{\exp(\text{sim}(v_i, v_{i'})/\tau)}{\sum_{k=1}^{2B} \exp(\text{sim}(v_i, v_k)/\tau)}\right]$$

where:

$v_i, v_{i'}$ are L2-normalized embeddings from two augmented views of the same sequence
$\text{sim}(u, v) = u^\top v$ is the dot product, which equals cosine similarity because the embeddings are L2-normalized
$\tau = 0.5$ is the temperature parameter
$B$ is the batch size (128)

We generate two augmented views for each sequence:

View 1: Random masking with $p_{\text{mask}} = 0.15$ (~22 tokens per sequence)
View 2: Random dropout with $p_{\text{drop}} = 0.10$ (~15 tokens per sequence)
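
The InfoNCE objective above can be sketched in NumPy as follows (used here in place of PyTorch for brevity; the function name and the exclude_self switch are hypothetical). By default the denominator sums over all $2B$ embeddings exactly as the equation is written; SimCLR's original formulation drops the $k = i$ term, which the switch reproduces.

```python
import numpy as np


def info_nce(view1: np.ndarray, view2: np.ndarray, tau: float = 0.5,
             exclude_self: bool = False) -> float:
    """InfoNCE loss for two (B, D) arrays of embeddings from paired views."""
    batch = view1.shape[0]
    v = np.concatenate([view1, view2], axis=0)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)      # L2-normalize
    logits = (v @ v.T) / tau                              # (2B, 2B) scaled similarities
    if exclude_self:
        np.fill_diagonal(logits, -np.inf)                 # remove the k = i term
    # Row i's positive is the other view of the same sequence.
    positives = np.concatenate([np.arange(batch, 2 * batch), np.arange(batch)])
    row_max = logits.max(axis=1, keepdims=True)           # stable log-sum-exp
    log_denominator = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_prob = logits[np.arange(2 * batch), positives] - log_denominator
    return float(-log_prob.mean())


rng = np.random.default_rng(0)
anchor = rng.normal(size=(128, 64))                       # B=128, D=64
noisy_view = anchor + 0.1 * rng.normal(size=(128, 64))    # stand-in for an augmented view
print(info_nce(anchor, noisy_view))
```

The loss falls as paired views move closer together relative to the other embeddings in the batch, which is what pushes reads from the same sequence into the same cluster.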

Training Configuration

Model Hyperparameters

Category	Hyperparameter	Value	Description
Architecture	Vocabulary Size	4,097	4^6 k-mers + PAD token
	Embedding Dim	128	Token embedding dimension
	Hidden Dim	256	Convolutional layer dimension
	Max Sequence Length	150	Tokens per sequence
Vector Quantizer	Codebook Size (K)	512	Number of discrete codes
	Code Dimension (D)	64	Latent representation dimension
	EMA Decay (γ)	0.95	Exponential moving average
Training	Epochs	50	Full training iterations
	Batch Size	32	Sequences per batch
	Learning Rate	2e-4	AdamW optimizer
	GPUs	2	Multi-GPU training (IDs: 0,1)
	Workers	4	Data loading threads
Loss Weights	Commitment (β)	0.1	Encoder commitment to codebook
	Entropy (λ)	0.003	Codebook usage regularization
	Reconstruction	1.0	Cross-entropy weight
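
For reference, the table can be gathered into a single Python object. The sketch below is one possible layout; the class and field names are hypothetical and do not reflect the actual schema of the YAML files or of scripts/train.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VQVAEConfig:
    """Hyperparameters from the table above (field names are illustrative)."""
    # Architecture
    kmer_size: int = 6
    vocab_size: int = 4 ** 6 + 1      # 4,096 k-mers + PAD
    embed_dim: int = 128
    hidden_dim: int = 256
    max_seq_length: int = 150
    # Vector quantizer
    num_codes: int = 512
    code_dim: int = 64
    ema_decay: float = 0.95
    # Training
    epochs: int = 50
    batch_size: int = 32
    learning_rate: float = 2e-4
    # Loss weights
    commitment_cost: float = 0.1
    entropy_weight: float = 0.003
    recon_weight: float = 1.0


config = VQVAEConfig()
# Number of scalar values stored in the codebook: K * D.
print(config.vocab_size, config.num_codes * config.code_dim)
```

Freezing the dataclass prevents accidental changes to a configuration after an experiment has started.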

Training Command

The model is trained using the following command:

python scripts/train.py \
  --data-path cleaned_reads.fastq \
  --output-dir experiments/1_standard_vqvae \
  --experiment-name standard_vqvae_50epochs \
  --epochs 50 \
  --batch-size 32 \
  --learning-rate 2e-4 \
  --num-codes 512 \
  --code-dim 64 \
  --n-gpu 2 \
  --gpu-ids "0,1" \
  --wandb-project vqvae-genomics \
  --save-freq 5

Configuration Management

Experiments are managed using YAML configuration files stored in configs/experiment_configs/. This allows reproducible experiments and systematic hyperparameter tuning:

python scripts/train.py --config configs/experiment_configs/large_model.yaml

Experiment Tracking

All experiments are logged to Weights & Biases (W&B) for comprehensive tracking:

Metrics: Training/validation loss, token accuracy, exact match rate, codebook utilization
System: GPU utilization, memory usage, throughput (sequences/sec)
Model: Checkpoint saving every 5 epochs, best model selection
Artifacts: Configuration files, sample reconstructions, embeddings

Multi-GPU Training

The implementation supports multi-GPU data-parallel training using PyTorch DataParallel:

GPU Configuration: Specify GPU IDs via --gpu-ids "0,1,2,3"
Batch Splitting: Automatic batch distribution across GPUs
Gradient Aggregation: Synchronized gradient updates
Speedup: ~1.8x on 2 GPUs, ~3.2x on 4 GPUs (typical)

Results & Performance

Base VQ-VAE Performance

Reconstruction Quality

Metric	Value
Mean Token Accuracy	99.52%
Median Token Accuracy	100.0%
Exact Sequence Match Rate	56.33%
Codebook Utilization	19.73%

Analysis: The high mean token-level accuracy (99.52%) shows that the VQ-VAE can compress and reconstruct genomic sequences effectively. The exact match rate is much lower (56.33%) because a single wrong token anywhere in a sequence of up to 150 tokens breaks the match, which is hard to avoid on fragmented, noisy wastewater reads. The low codebook utilization (19.73%, about 101 of the 512 codes) suggests that the data compresses into a much smaller effective vocabulary. These results are based on our experiments and on similar work in discrete representation learning.
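
The three metrics can be computed as in the following sketch. The function names are hypothetical, and excluding PAD positions from the accuracy is an assumption; the source does not state how padding is treated during evaluation.

```python
from typing import Iterable, Sequence

PAD_ID = 4096


def token_accuracy(target: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of non-PAD target tokens reproduced exactly (needs one real token)."""
    pairs = [(t, p) for t, p in zip(target, predicted) if t != PAD_ID]
    return sum(t == p for t, p in pairs) / len(pairs)


def exact_match(target: Sequence[int], predicted: Sequence[int]) -> bool:
    """True only if every non-PAD token is reconstructed correctly."""
    return all(t == p for t, p in zip(target, predicted) if t != PAD_ID)


def codebook_utilization(code_indices: Iterable[int], num_codes: int = 512) -> float:
    """Fraction of codebook entries selected at least once."""
    return len(set(code_indices)) / num_codes


target = [10, 20, 30, PAD_ID]
predicted = [10, 20, 99, 0]
print(token_accuracy(target, predicted), exact_match(target, predicted))
print(codebook_utilization(range(101)))
```

One wrong token out of three gives 66.7% token accuracy but a failed exact match, which is why the two reported rates diverge so strongly. Using 101 distinct codes out of 512 reproduces the reported 19.73% utilization.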

Masked VQ-VAE Results

Metric	Value
Masked Token Accuracy	95%
Improvement on Corrupted Sequences	+12%

The Masked VQ-VAE demonstrates strong performance on masked token prediction (95% accuracy), indicating that the model learns robust contextual representations. The 12% improvement on corrupted sequences shows enhanced robustness to missing data.

Contrastive VQ-VAE Results

Clustering Quality Comparison

Metric	Base VQ-VAE	Contrastive VQ-VAE	Improvement
Silhouette Score	0.31	0.42	+35%
Davies-Bouldin Index	1.68	1.34	-20% (lower is better)
Calinski-Harabasz Score	1248	1876	+50%

Contrastive learning significantly improves clustering quality across all metrics. The 35% improvement in Silhouette score indicates much better-separated and cohesive clusters, making the learned representations more suitable for variant identification tasks.
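
The Improvement column follows from the two score columns as a relative change, which a few lines of Python make explicit (the helper name is hypothetical):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from the base model to the contrastive model."""
    return (after - before) / before * 100.0


scores = {
    "Silhouette Score": (0.31, 0.42),
    "Davies-Bouldin Index": (1.68, 1.34),
    "Calinski-Harabasz Score": (1248, 1876),
}
for name, (base, contrastive) in scores.items():
    print(f"{name}: {relative_change(base, contrastive):+.1f}%")
```

For the Davies-Bouldin Index the change is negative, which is the desired direction because lower values indicate better-separated clusters.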

Implementation Details & Code Organization

Project Structure

genomic_sequence_detection/
├── configs/                      # Configuration management
│   ├── default_config.yaml      # Default hyperparameters
│   └── experiment_configs/      # Experiment-specific configurations
│       └── large_model.yaml
│
├── scripts/                      # Executable scripts
│   ├── train.py                 # Main training loop with multi-GPU support
│   ├── evaluate.py              # Model evaluation and metrics
│   └── preprocess.py            # Data preprocessing pipeline
│
├── src/                         # Source code modules
│   ├── models/                  # Model architectures
│   │   ├── __init__.py
│   │   └── vqvae.py            # VQ-VAE, Encoder, Decoder, VectorQuantizer
│   ├── data/                    # Data processing
│   │   ├── __init__.py
│   │   └── tokenizer.py        # K-mer tokenization utilities
│   └── utils/                   # Utility functions
│       ├── __init__.py
│       └── wandb_init.py       # W&B experiment tracking
│
├── experiments/                 # Experiment outputs
│   ├── 1_standard_vqvae/       # Standard VQ-VAE experiments
│   ├── 2_masked_vqvae/         # Masked VQ-VAE experiments
│   ├── 3_contrastive_vqvae/    # Contrastive VQ-VAE experiments
│   └── 4_final_comparison/     # Comparative analysis
│
├── data/                        # Data files
│   ├── wastewater_seq_dataset.fastq
│   ├── cleaned_reads.fastq
│   └── virus_sequences.fastq
│
├── Trimmomatic/                 # Quality control tool
│   └── trimmomatic-0.39.jar
│
├── requirements.txt             # Python dependencies
└── README.md                    # Comprehensive documentation

Installation & Setup

System Requirements

Python: 3.8 or higher
CUDA: 11.3+ (for GPU training)
GPU: 8GB+ VRAM recommended (tested on NVIDIA RTX series)
RAM: 16GB+ system memory
Storage: 10GB+ for data and checkpoints
Java: 8+ (for Trimmomatic preprocessing)

Step 1: Clone Repository

git clone https://github.com/arrdel/genomic_sequence_detection.git
cd genomic_sequence_detection

Step 2: Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install other dependencies
pip install -r requirements.txt

Key Dependencies:

torch>=2.0.0 - Deep learning framework
biopython>=1.79 - Biological sequence processing
numpy>=1.21.0 - Numerical operations
wandb>=0.15.0 - Experiment tracking
pyyaml>=6.0 - Configuration management
tqdm>=4.65.0 - Progress bars

Step 3: Download Trimmomatic

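wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
# The archive unpacks into a Trimmomatic-0.39/ folder; -j drops that folder so the jar lands at Trimmomatic/trimmomatic-0.39.jar
unzip -j Trimmomatic-0.39.zip '*/trimmomatic-0.39.jar' -d Trimmomatic/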

Step 4: Setup Weights & Biases

# Login to W&B (required for experiment tracking)
wandb login

# Or disable W&B with --no-wandb flag during training

Usage Examples

1. Preprocessing

python scripts/preprocess.py \
  --input-fastq data/wastewater_seq_dataset.fastq \
  --output-fastq data/cleaned_reads.fastq \
  --k-mer 6 \
  --max-seq-length 150 \
  --run-fastqc \
  --run-trimmomatic

2. Training Standard VQ-VAE

python scripts/train.py \
  --data-path data/cleaned_reads.fastq \
  --output-dir experiments/1_standard_vqvae \
  --experiment-name my_experiment \
  --epochs 50 \
  --batch-size 32 \
  --num-codes 512 \
  --code-dim 64 \
  --n-gpu 2 \
  --gpu-ids "0,1"

3. Training with Configuration File

# Use predefined configuration
python scripts/train.py --config configs/default_config.yaml

# Create custom config in configs/experiment_configs/my_config.yaml
python scripts/train.py --config configs/experiment_configs/my_config.yaml

4. Evaluation

python scripts/evaluate.py \
  --checkpoint-path experiments/1_standard_vqvae/checkpoints/best_model.pt \
  --data-path data/cleaned_reads.fastq \
  --output-dir evaluation_results \
  --num-samples 100

Key Implementation Features

Dead Code Refresh Mechanism

To prevent codebook collapse (some codes never being used), we implement a dead code refresh mechanism that periodically replaces unused codes with high-variance encoder outputs:

# Enable dead code refresh during training
python scripts/train.py \
  --refresh-codes \
  --refresh-interval 500 \
  --refresh-min-count 5
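
One way such a refresh could work is sketched below in NumPy. The selection rule is an assumption: the text says dead codes are replaced with "high-variance encoder outputs" without defining the criterion, so this sketch uses the encoder outputs with the largest quantization error as a stand-in. The function name is hypothetical; min_count mirrors --refresh-min-count.

```python
import numpy as np


def refresh_dead_codes(codebook: np.ndarray, usage_counts: np.ndarray,
                       z_e: np.ndarray, min_count: int = 5):
    """Replace codes used fewer than min_count times with poorly quantized outputs."""
    codebook = codebook.copy()
    dead = np.flatnonzero(usage_counts < min_count)
    if dead.size == 0:
        return codebook, dead
    # Quantization error of each encoder output w.r.t. its nearest code.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    errors = dists.min(axis=1)
    # Worst-quantized outputs first, one per dead code (assumed selection rule).
    worst = np.argsort(errors)[::-1][:dead.size]
    dead = dead[:worst.size]          # fewer candidates than dead codes is allowed
    codebook[dead] = z_e[worst]
    return codebook, dead


codebook = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
usage_counts = np.array([100, 0, 50])                 # code 1 was never selected
z_e = np.array([[0.1, 0.0], [5.0, 5.0], [20.0, 20.1]])
new_codebook, refreshed = refresh_dead_codes(codebook, usage_counts, z_e)
print(refreshed, new_codebook[1])
```

In a training loop this would run every --refresh-interval steps, with usage counts reset after each refresh.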

Gradient Clipping

Gradient clipping stabilizes training and prevents exploding gradients:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
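
To show what norm-based clipping does, here is a small NumPy re-implementation of the same idea. It is an illustration rather than PyTorch's code: the helper name is hypothetical, and unlike clip_grad_norm_ it returns new arrays instead of modifying gradients in place.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm: float = 1.0, eps: float = 1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


grads = [np.array([3.0]), np.array([0.0, 4.0])]       # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(float((g ** 2).sum()) for g in clipped)))
print(norm_before, norm_after)
```

All parameters share a single scale factor, so the direction of the overall update is preserved and only its length is capped.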

Checkpoint Management

Automatic checkpoint saving with configurable frequency:

Regular checkpoints: Every N epochs (--save-freq 5)
Best model: Saved when validation loss improves
Resume training: --resume experiments/.../checkpoint_epoch_20.pt

Troubleshooting

CUDA Out of Memory

Reduce batch size: --batch-size 16
Reduce model dimensions: --hidden-dim 128
Use gradient accumulation (implementation in progress)
Enable mixed precision training: --mixed-precision

Low Reconstruction Accuracy

Train longer: --epochs 100
Increase model capacity: --num-codes 1024
Verify data quality (run FastQC)
Adjust commitment weight: --commitment-cost 0.15

Low Codebook Utilization

Increase commitment cost: --commitment-cost 0.25
Reduce codebook size: --num-codes 256
Enable dead code refresh: --refresh-codes
Increase entropy regularization: --entropy-weight 0.005

Citation & License

@misc{chinda2025contrastivedeeplearningvariant,
      title={Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing}, 
      author={Adele Chinda and Richmond Azumah and Hemanth Demakethepalli Venkateswara},
      year={2025},
      eprint={2512.03158},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.03158}, 
}

License: MIT License - See LICENSE file for details
Contact: adelechinda@gsu.edu

Discussion & Future Work

Key Findings

✓ VQ-VAE achieves 99.5% mean token-level reconstruction accuracy on fragmented wastewater sequences
✓ Discrete codebook effectively captures genomic patterns with only 19.7% utilization
✓ Entropy regularization prevents codebook collapse while maintaining diversity
✓ Masked learning improves robustness to missing/corrupted data by 12%
✓ Contrastive learning improves clustering quality (Silhouette score) by 35%
✓ Reference-free approach enables novel variant detection

Advantages Over Traditional Methods

No reference genome required: Works with novel and unknown variants
Computationally efficient: Training and inference in minutes vs. hours for alignment-based methods
Learns meaningful representations: Discrete codes capture interpretable sequence patterns
Robust to noise: Handles degraded RNA and sequencing artifacts effectively
Scalable: Can process large-scale wastewater surveillance data in real-time

Limitations

Modest exact sequence match rate (56%) leaves room for improvement
Low codebook utilization (19.7%) suggests that a smaller codebook could give more compact representations
Model currently focused on SARS-CoV-2; generalization to other pathogens needs validation
Interpretability of learned discrete codes requires further analysis

Future Directions

Hierarchical VQ-VAE: Multi-scale representations to capture both local k-mer patterns and long-range dependencies
Phylogenetic Integration: Combine learned embeddings with traditional phylogenetic analysis
Multi-Pathogen Validation: Test on diverse viral datasets (influenza, RSV, norovirus) to assess generalization
Real-Time Deployment: Integrate into operational wastewater surveillance pipelines
Temporal Modeling: Incorporate time-series data to track variant emergence and evolution
Interpretability Analysis: Investigate biological meaning of learned codebook entries
Semi-Supervised Learning: Leverage small amounts of labeled variant data to improve clustering
Uncertainty Quantification: Add Bayesian extensions for confidence estimation

Impact

This work demonstrates that discrete representation learning can provide a scalable, reference-free approach to genomic sequence analysis. By removing the dependency on reference genomes, our method democratizes genomic surveillance, making it accessible to resource-limited settings and enabling rapid response to emerging viral threats. The learned discrete representations could serve as a foundation for:

Early warning systems for novel variant detection
Automated clustering and classification of viral sequences
Robust sequence reconstruction in noisy surveillance data
Transfer learning to other pathogen surveillance applications

Project Poster

View our comprehensive project poster summarizing the methodology, results, and key findings.


Acknowledgments

This project was completed as part of the Deep Learning course at Georgia State University. We thank our instructor for his valuable feedback and support throughout this work.