Wastewater-based epidemiology provides a non-invasive, community-wide approach to viral surveillance, but analyzing highly fragmented and noisy genomic sequences remains challenging. Traditional reference-based pipelines require known genomes and struggle with co-circulating variants and sequencing artifacts. We propose a reference-free approach using Vector-Quantized Variational Autoencoders (VQ-VAE) to learn discrete representations of viral genomic sequences from wastewater surveillance data.
Our method employs k-mer tokenization (k=6) to convert raw DNA sequences into fixed-length representations, which are then encoded into a discrete latent space using a learned codebook of 512 entries. We train the model on ~100,000 SARS-CoV-2 wastewater sequencing reads, achieving 99.52% mean token-level reconstruction accuracy and 56.33% exact sequence match rate with only 19.73% codebook utilization, indicating effective compression into meaningful discrete representations.
We extend the base VQ-VAE with two complementary approaches: (1) Masked VQ-VAE, which masks 20% of input tokens during training to improve robustness to missing data, achieving 95% accuracy on masked tokens and 12% improvement on corrupted sequences; and (2) Contrastive VQ-VAE, which fine-tunes the encoder using InfoNCE loss with augmented views, improving clustering quality by 35% (Silhouette score: 0.31 → 0.42). Our discrete representation learning framework offers a scalable, reference-free approach to genomic sequence analysis suitable for real-time wastewater surveillance and variant detection.
Wastewater-based epidemiology has emerged as a powerful tool for monitoring community-level viral prevalence, offering early warning signals for disease outbreaks and variant emergence. Unlike clinical testing, wastewater surveillance provides a non-invasive, population-wide snapshot that captures both symptomatic and asymptomatic infections. However, analyzing genomic sequences from wastewater presents unique computational challenges:

- Reads are short (36-300 bp), highly fragmented, and degraded by low quality scores and adapter contamination
- Multiple variants co-circulate within a single sample, confounding reference-based assignment
- Reference-based pipelines require known genomes and struggle to characterize novel variants
We propose a reference-free deep learning approach based on Vector-Quantized Variational Autoencoders (VQ-VAE) to learn discrete, compressed representations of viral genomic sequences. Our pipeline consists of three key components:

1. A preprocessing pipeline that performs quality control (FastQC, Trimmomatic) and converts reads into canonical k-mer tokens
2. A VQ-VAE that encodes tokenized reads into a discrete latent space through a learned codebook
3. Two extensions, Masked VQ-VAE and Contrastive VQ-VAE, that add robustness to missing data and improve cluster separation
This approach enables unsupervised discovery of sequence patterns, variant clustering, and robust sequence reconstruction without requiring reference genomes, making it particularly suitable for novel variant detection and real-time surveillance applications.
We use SARS-CoV-2 wastewater sequencing data in FASTQ format, consisting of approximately 100,000 reads with variable lengths ranging from 36 to 300 base pairs. The data represents real-world wastewater surveillance samples with typical challenges: low quality scores, adapter contamination, and significant fragmentation.
All preprocessing is automated through the scripts/preprocess.py script, which orchestrates quality control, filtering, and tokenization:
fastqc wastewater_seq_dataset.fastq -o fastqc_before/
Generates comprehensive quality reports including per-base quality scores, sequence length distribution, GC content, and adapter contamination detection. Results are saved in HTML format for visual inspection.
java -jar Trimmomatic/trimmomatic-0.39.jar SE \
-threads 4 \
wastewater_seq_dataset.fastq \
cleaned_reads.fastq \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Parameters:
- LEADING:3 - Remove low-quality bases from the beginning (quality < 3)
- TRAILING:3 - Remove low-quality bases from the end (quality < 3)
- SLIDINGWINDOW:4:15 - Scan with a 4-base window, cut when average quality < 15
- MINLEN:36 - Drop reads shorter than 36 base pairs

fastqc cleaned_reads.fastq -o fastqc_after/
Verifies quality improvement after filtering. Typical results show significant reduction in low-quality bases and adapter contamination while retaining ~85-90% of original reads.
Sequences are converted into overlapping k-mer tokens using canonical k-mer representation:
Example tokenization:
# Original sequence
ATCGATCGATCG...
# K-mer sliding window (k=6, stride=1)
ATCGAT → Token 1234
TCGATC → Token 2341
CGATCG → Token 3412
GATCGA → Token 4123
...
# Final tokenized sequence (length 150)
[1234, 2341, 3412, 4123, ..., 4096, 4096] # PAD tokens at end
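A minimal sketch of canonical k-mer tokenization with right-padding follows; PAD_ID = 4096 matches the example above, but the repository's src/data/tokenizer.py may differ in details:

```python
# Canonical k-mer tokenization sketch (k=6, stride=1), padded to max_len.
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}
PAD_ID = 4096  # the 4**6 = 4096 k-mer ids occupy 0..4095

def canonical(kmer):
    # Canonical form: lexicographic min of the k-mer and its reverse complement
    rc = "".join(COMP[b] for b in reversed(kmer))
    return min(kmer, rc)

def tokenize(seq, k=6, max_len=150):
    ids = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(b not in BASE for b in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        code = 0
        for b in canonical(kmer):
            code = code * 4 + BASE[b]  # encode as a base-4 integer in [0, 4**k)
        ids.append(code)
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

print(tokenize("ATCGATCGATCG")[:8])  # 7 k-mer ids followed by one PAD
```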
Run the complete preprocessing pipeline:
python scripts/preprocess.py \
--input-fastq wastewater_seq_dataset.fastq \
--output-fastq cleaned_reads.fastq \
--k-mer 6 \
--max-seq-length 150
The VQ-VAE architecture consists of three main components:
The encoder maps tokenized sequences to continuous latent representations:
$$z_e = f_\theta(x)$$
where \(x \in \mathbb{Z}^{L}\) is the input token sequence, and the encoder \(f_\theta\) consists of:

- a token embedding layer (embedding dimension 128)
- a stack of 1D convolutional layers with layer normalization (hidden dimension 256)
- a linear projection to the code dimension \(D = 64\)
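The sketch below illustrates an encoder with these dimensions; the layer count, kernel sizes, and activation choice are assumptions, not the exact contents of src/models/vqvae.py:

```python
# Illustrative encoder matching the hyperparameter table
# (embed 128, hidden 256, code dim 64).
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, vocab=4097, embed=128, hidden=256, code_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.conv1 = nn.Conv1d(embed, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(hidden)
        self.proj = nn.Linear(hidden, code_dim)

    def forward(self, x):                              # x: (B, L) token ids
        h = self.embed(x).transpose(1, 2)              # (B, embed, L) for Conv1d
        h = torch.relu(self.conv1(h))
        h = torch.relu(self.conv2(h)).transpose(1, 2)  # (B, L, hidden)
        return self.proj(self.norm(h))                 # z_e: (B, L, 64)
```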
The quantizer discretizes continuous representations using a learned codebook:
$$z_q = \text{Quantize}(z_e) = e_k, \quad k = \arg\min_j \|z_e - e_j\|_2$$
where \(\{e_1, \ldots, e_K\}\) is the codebook with \(K=512\) entries, each of dimension \(D=64\). The codebook is updated using Exponential Moving Average (EMA) with decay \(\gamma = 0.95\).
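A minimal sketch of the nearest-neighbor lookup and straight-through gradient estimator follows; the EMA codebook update is omitted for brevity, and the class name and shapes are illustrative rather than the repository's actual VectorQuantizer API:

```python
# Vector-quantization sketch: nearest codebook entry + straight-through gradients.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizerSketch(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e):
        # z_e: (B, L, D) continuous encoder outputs
        flat = z_e.reshape(-1, z_e.size(-1))                 # (B*L, D)
        # Squared L2 distance to every codebook entry e_j
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))       # (B*L, K)
        codes = dists.argmin(dim=1)                          # k = argmin_j ||z_e - e_j||
        z_q = self.codebook(codes).view_as(z_e)
        # Commitment loss pulls encoder outputs toward their assigned codes
        commit = F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes.view(z_e.shape[:-1]), commit
```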
The decoder reconstructs the sequence from quantized representations:
$$\hat{x} = g_\phi(z_q)$$
The decoder architecture mirrors the encoder with 1D convolutions, layer normalization, and a final linear projection to vocabulary size.
The complete loss function combines reconstruction, commitment, and entropy regularization:
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{commit}} - \lambda H[\mathcal{C}]$$
where:

- \(\mathcal{L}_{\text{recon}}\) is the token-level cross-entropy between \(\hat{x}\) and \(x\) (weight 1.0)
- \(\mathcal{L}_{\text{commit}} = \|z_e - \text{sg}(z_q)\|_2^2\) is the commitment loss, weighted by \(\beta = 0.1\)
- \(H[\mathcal{C}]\) is the entropy of codebook usage, weighted by \(\lambda = 0.003\) to encourage broader utilization
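The sketch below combines these terms with the weights from the hyperparameter table; `logits`, `targets`, `commit`, and `code_counts` are assumed intermediate values, not names from the repository:

```python
# Combined VQ-VAE training loss: recon + beta * commit - lambda * H[C].
import torch
import torch.nn.functional as F

def vqvae_loss(logits, targets, commit, code_counts,
               beta=0.1, lam=0.003, pad_id=4096):
    # Reconstruction: cross-entropy over the 4,097-token vocabulary, ignoring PAD
    # logits: (B, L, V) -> (B, V, L) as expected by F.cross_entropy
    recon = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)
    # Codebook usage entropy H[C] from empirical code frequencies
    probs = code_counts.float() / code_counts.sum().clamp(min=1)
    entropy = -(probs * (probs + 1e-10).log()).sum()
    # Entropy enters with a minus sign: maximizing it discourages collapse
    return recon + beta * commit - lam * entropy
```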
VQ-VAE architecture showing encoder, vector quantization, and decoder components for learning discrete representations of genomic sequences.
To improve robustness to missing or corrupted data, we extend the VQ-VAE with a masked language modeling objective inspired by BERT.
During training, we randomly mask 20% of input tokens (\(p_{\text{mask}} = 0.2\)) and train the model to reconstruct these masked positions:
$$\mathcal{L}_{\text{masked}} = -\sum_{i \in \mathcal{M}} \log p(x_i | z_q(\tilde{x}))$$
where \(\mathcal{M}\) is the set of masked positions and \(\tilde{x}\) is the masked input.
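A minimal sketch of the masking step is shown below; `mask_id` is an assumed extra vocabulary entry for the mask token (on top of the 4,097 base tokens), and the repository may implement masking differently:

```python
# Randomly mask 20% of non-PAD tokens; the loss is computed only on masked positions.
import torch

def mask_tokens(x, mask_id, p_mask=0.2, pad_id=4096):
    # x: (B, L) token ids; choose masked positions uniformly among non-PAD tokens
    candidates = (x != pad_id)
    mask = (torch.rand_like(x, dtype=torch.float) < p_mask) & candidates
    x_tilde = x.clone()
    x_tilde[mask] = mask_id
    return x_tilde, mask

# After a forward pass on x_tilde:
#   loss_masked = F.cross_entropy(logits[mask], x[mask])
```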
To learn better cluster-separated representations, we fine-tune the encoder using contrastive learning with InfoNCE loss:
$$\mathcal{L}_{\text{contrast}} = -\mathbb{E}\left[\log \frac{\exp(\text{sim}(v_i, v_{i'})/\tau)}{\sum_{k=1}^{2B} \exp(\text{sim}(v_i, v_k)/\tau)}\right]$$
where:

- \(v_i\) and \(v_{i'}\) are projected embeddings of two augmented views of the same sequence
- \(\text{sim}(\cdot, \cdot)\) is cosine similarity
- \(\tau\) is the temperature
- \(B\) is the batch size, giving \(2B\) views per batch
We generate two augmented views for each sequence and treat them as a positive pair; all other views in the batch serve as negatives.
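A minimal InfoNCE sketch under these definitions follows; the temperature value (τ = 0.1) and the batching layout (views i and i+B paired) are assumptions:

```python
# InfoNCE over 2B projected views with cosine similarity.
import torch
import torch.nn.functional as F

def info_nce(v, tau=0.1):
    # v: (2B, D) projections; rows i and i+B are the two views of one sequence
    v = F.normalize(v, dim=1)
    logits = v @ v.t() / tau                 # pairwise cosine similarities / tau
    logits.fill_diagonal_(float('-inf'))     # exclude self-similarity terms
    b = v.size(0) // 2
    # Each view's positive is its counterpart in the other half of the batch
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(logits, targets)
```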
| Category | Hyperparameter | Value | Description |
|---|---|---|---|
| Architecture | Vocabulary Size | 4,097 | 4^6 k-mers + PAD token |
| | Embedding Dim | 128 | Token embedding dimension |
| | Hidden Dim | 256 | Convolutional layer dimension |
| | Max Sequence Length | 150 | Tokens per sequence |
| Vector Quantizer | Codebook Size (K) | 512 | Number of discrete codes |
| | Code Dimension (D) | 64 | Latent representation dimension |
| | EMA Decay (γ) | 0.95 | Exponential moving average decay |
| Training | Epochs | 50 | Full training iterations |
| | Batch Size | 32 | Sequences per batch |
| | Learning Rate | 2e-4 | AdamW optimizer |
| | GPUs | 2 | Multi-GPU training (IDs: 0,1) |
| | Workers | 4 | Data loading threads |
| Loss Weights | Commitment (β) | 0.1 | Encoder commitment to codebook |
| | Entropy (λ) | 0.003 | Codebook usage regularization |
| | Reconstruction | 1.0 | Cross-entropy weight |
The model is trained using the following command:
python scripts/train.py \
--data-path cleaned_reads.fastq \
--output-dir experiments/1_standard_vqvae \
--experiment-name standard_vqvae_50epochs \
--epochs 50 \
--batch-size 32 \
--learning-rate 2e-4 \
--num-codes 512 \
--code-dim 64 \
--n-gpu 2 \
--gpu-ids "0,1" \
--wandb-project vqvae-genomics \
--save-freq 5
Experiments are managed using YAML configuration files stored in configs/experiment_configs/.
This allows reproducible experiments and systematic hyperparameter tuning:
python scripts/train.py --config configs/experiment_configs/large_model.yaml
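For illustration, such a config might mirror the CLI flags above; the keys below are hypothetical, inferred from the command-line interface rather than copied from the repository:

```yaml
# Hypothetical configs/experiment_configs/large_model.yaml
# (keys inferred from the CLI flags; the actual file may differ)
experiment_name: large_model
data_path: data/cleaned_reads.fastq
epochs: 50
batch_size: 32
learning_rate: 2.0e-4
num_codes: 1024      # larger codebook than the 512-entry default
code_dim: 64
n_gpu: 2
gpu_ids: "0,1"
save_freq: 5
```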
All experiments are logged to Weights & Biases (W&B) for comprehensive tracking of training losses, reconstruction accuracy, and codebook utilization.
The implementation supports multi-GPU training using PyTorch's DataParallel:
--gpu-ids "0,1,2,3"
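A minimal sketch of the wrapping step, with a stand-in module in place of the VQ-VAE:

```python
# DataParallel wrapping sketch; the stand-in module and device ids are illustrative.
import torch.nn as nn

model = nn.Linear(64, 64)            # stand-in for the VQ-VAE module
device_ids = [0, 1, 2, 3]            # parsed from --gpu-ids "0,1,2,3"
model = nn.DataParallel(model.cuda(device_ids[0]), device_ids=device_ids)
```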
| Metric | Value |
|---|---|
| Mean Token Accuracy | 99.52% |
| Median Token Accuracy | 100.0% |
| Exact Sequence Match Rate | 56.33% |
| Codebook Utilization | 19.73% |
Analysis: The high token-level accuracy (99.52%) shows that the VQ-VAE effectively compresses and reconstructs genomic sequences. The modest exact match rate (56.33%) is realistic given the fragmented, noisy nature of wastewater surveillance data, while the low codebook utilization (19.73%) suggests the model compresses reads into a smaller effective vocabulary. These trends are consistent with prior results in discrete representation learning.
| Metric | Value |
|---|---|
| Masked Token Accuracy | 95% |
| Improvement on Corrupted Sequences | +12% |
The Masked VQ-VAE demonstrates strong performance on masked token prediction (95% accuracy), indicating that the model learns robust contextual representations. The 12% improvement on corrupted sequences shows enhanced robustness to missing data.
| Metric | Base VQ-VAE | Contrastive VQ-VAE | Improvement |
|---|---|---|---|
| Silhouette Score | 0.31 | 0.42 | +35% |
| Davies-Bouldin Index | 1.68 | 1.34 | -20% (lower is better) |
| Calinski-Harabasz Score | 1248 | 1876 | +50% |
Contrastive learning significantly improves clustering quality across all metrics. The 35% improvement in Silhouette score indicates much better-separated and cohesive clusters, making the learned representations more suitable for variant identification tasks.
genomic_sequence_detection/
├── configs/                      # Configuration management
│   ├── default_config.yaml       # Default hyperparameters
│   └── experiment_configs/       # Experiment-specific configurations
│       └── large_model.yaml
│
├── scripts/                      # Executable scripts
│   ├── train.py                  # Main training loop with multi-GPU support
│   ├── evaluate.py               # Model evaluation and metrics
│   └── preprocess.py             # Data preprocessing pipeline
│
├── src/                          # Source code modules
│   ├── models/                   # Model architectures
│   │   ├── __init__.py
│   │   └── vqvae.py              # VQ-VAE, Encoder, Decoder, VectorQuantizer
│   ├── data/                     # Data processing
│   │   ├── __init__.py
│   │   └── tokenizer.py          # K-mer tokenization utilities
│   └── utils/                    # Utility functions
│       ├── __init__.py
│       └── wandb_init.py         # W&B experiment tracking
│
├── experiments/                  # Experiment outputs
│   ├── 1_standard_vqvae/         # Standard VQ-VAE experiments
│   ├── 2_masked_vqvae/           # Masked VQ-VAE experiments
│   ├── 3_contrastive_vqvae/      # Contrastive VQ-VAE experiments
│   └── 4_final_comparison/       # Comparative analysis
│
├── data/                         # Data files
│   ├── wastewater_seq_dataset.fastq
│   ├── cleaned_reads.fastq
│   └── virus_sequences.fastq
│
├── Trimmomatic/                  # Quality control tool
│   └── trimmomatic-0.39.jar
│
├── requirements.txt              # Python dependencies
└── README.md                     # Comprehensive documentation
git clone https://github.com/arrdel/genomic_sequence_detection.git
cd genomic_sequence_detection
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install other dependencies
pip install -r requirements.txt
Key Dependencies:
- torch>=2.0.0 - Deep learning framework
- biopython>=1.79 - Biological sequence processing
- numpy>=1.21.0 - Numerical operations
- wandb>=0.15.0 - Experiment tracking
- pyyaml>=6.0 - Configuration management
- tqdm>=4.65.0 - Progress bars

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip -d Trimmomatic/
# Login to W&B (required for experiment tracking)
wandb login
# Or disable W&B with --no-wandb flag during training
python scripts/preprocess.py \
--input-fastq data/wastewater_seq_dataset.fastq \
--output-fastq data/cleaned_reads.fastq \
--k-mer 6 \
--max-seq-length 150 \
--run-fastqc \
--run-trimmomatic
python scripts/train.py \
--data-path data/cleaned_reads.fastq \
--output-dir experiments/1_standard_vqvae \
--experiment-name my_experiment \
--epochs 50 \
--batch-size 32 \
--num-codes 512 \
--code-dim 64 \
--n-gpu 2 \
--gpu-ids "0,1"
# Use predefined configuration
python scripts/train.py --config configs/default_config.yaml
# Create custom config in configs/experiment_configs/my_config.yaml
python scripts/train.py --config configs/experiment_configs/my_config.yaml
python scripts/evaluate.py \
--checkpoint-path experiments/1_standard_vqvae/checkpoints/best_model.pt \
--data-path data/cleaned_reads.fastq \
--output-dir evaluation_results \
--num-samples 100
To prevent codebook collapse (some codes never being used), we implement a dead code refresh mechanism that periodically replaces unused codes with high-variance encoder outputs:
# Enable dead code refresh during training
python scripts/train.py \
--refresh-codes \
--refresh-interval 500 \
--refresh-min-count 5
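A hedged sketch of one possible refresh step follows; the repository reseeds dead codes from high-variance encoder outputs, which is simplified here to randomly sampled recent encoder outputs, and all names are illustrative:

```python
# Dead-code refresh sketch: codes used fewer than `min_count` times since the
# last refresh are reseeded from recent encoder outputs.
import torch

@torch.no_grad()
def refresh_dead_codes(codebook, usage_counts, z_e_batch, min_count=5):
    # codebook: (K, D) parameter tensor; usage_counts: (K,) integer tensor
    dead = (usage_counts < min_count).nonzero(as_tuple=True)[0]
    if dead.numel() == 0:
        return
    flat = z_e_batch.reshape(-1, codebook.size(1))   # (N, D) encoder outputs
    idx = torch.randint(0, flat.size(0), (dead.numel(),))
    codebook[dead] = flat[idx]                       # reseed each dead code
    usage_counts[dead] = 0                           # reset its usage counter
```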
Gradient clipping stabilizes training and prevents exploding gradients:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Automatic checkpoint saving with configurable frequency:
- --save-freq 5 - Save a checkpoint every 5 epochs
- --resume experiments/.../checkpoint_epoch_20.pt - Resume training from a saved checkpoint

Tuning suggestions:

- Out of memory: lower --batch-size 16, shrink --hidden-dim 128, or enable --mixed-precision
- Poor reconstruction: train longer with --epochs 100, enlarge the codebook with --num-codes 1024, or raise --commitment-cost 0.15
- Codebook collapse: increase --commitment-cost 0.25, reduce --num-codes 256, enable --refresh-codes, or raise --entropy-weight 0.005

Citation:

@misc{chinda2025contrastivedeeplearningvariant,
title={Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing},
author={Adele Chinda and Richmond Azumah and Hemanth Demakethepalli Venkateswara},
year={2025},
eprint={2512.03158},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2512.03158},
}
License: MIT License - See LICENSE file for details
Contact: adelechinda@gsu.edu
This work demonstrates that discrete representation learning can provide a scalable, reference-free approach to genomic sequence analysis. By removing the dependency on reference genomes, our method democratizes genomic surveillance, making it accessible to resource-limited settings and enabling rapid response to emerging viral threats. The learned discrete representations could serve as a foundation for downstream surveillance tasks such as variant clustering, anomaly detection, and real-time monitoring of emerging lineages.
This project was completed as part of the Deep Learning course at Georgia State University. We thank our instructor for his valuable feedback and support throughout this work.