User Guide
This guide provides a comprehensive overview of the genome_entropy pipeline, explaining concepts, data flow, and best practices.
Pipeline Overview
The genome_entropy pipeline transforms DNA sequences through multiple representation levels, computing Shannon entropy at each stage:
DNA (FASTA)
↓
[ORF Finding]
↓
ORFs (nucleotides) → Entropy₁
↓
[Translation]
↓
Proteins (amino acids) → Entropy₂
↓
[3Di Encoding via ProstT5]
↓
3Di Tokens (structural) → Entropy₃
↓
[Entropy Analysis]
↓
Complete Entropy Report
This multi-level analysis enables comparison of information content across different biological sequence representations.
Understanding ORFs
What is an ORF?
An Open Reading Frame (ORF) is a sequence of DNA between a start codon (typically ATG) and a stop codon (TAA, TAG, or TGA), representing a potential protein-coding region.
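The definition above can be illustrated with a minimal sketch that scans the three forward frames for ATG...stop stretches. This is not how the pipeline finds ORFs (it relies on the external get_orfs binary); the function name `find_orfs_forward` and the half-open coordinate convention are assumptions made for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs_forward(dna, start_codon="ATG"):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    Coordinates are 0-based; `end` is exclusive and includes the stop codon
    (an assumption of this sketch, not necessarily the pipeline's convention).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        # Walk the frame codon by codon, ignoring a trailing partial codon.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if orf_start is None and codon == start_codon:
                orf_start = pos
            elif orf_start is not None and codon in STOP_CODONS:
                orfs.append((orf_start, pos + 3, frame))
                orf_start = None
    return orfs


print(find_orfs_forward("ATGGCATAGCTAA"))  # [(0, 9, 0)]
```

Real ORF finders also handle the reverse strand, alternative start codons, and ORFs that run off the end of a contig, none of which this sketch attempts.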
Reading Frames
DNA has six possible reading frames:
Three forward frames (starting at positions 0, 1, 2)
Three reverse frames (reverse complement, starting at positions 0, 1, 2)
Example:
DNA: ATGGCATAGCTAA
Frame 0: ATG GCA TAG CTA A
Frame 1: A TGG CAT AGC TAA
Frame 2: AT GGC ATA GCT AA
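A short sketch of how all six frames could be enumerated for the example sequence above. The helper names `reverse_complement` and `six_frame_codons` are hypothetical and not part of genome_entropy; incomplete codons at the ends are dropped.

```python
def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_codons(dna):
    """Map (strand, frame) to the list of complete codons in that frame."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = [
                seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)
            ]
    return frames


frames = six_frame_codons("ATGGCATAGCTAA")
for (strand, offset), codons in frames.items():
    print(strand, offset, " ".join(codons))
```

For the reverse strand, frame offsets are counted from the start of the reverse-complemented sequence, matching the description above.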
ORF Properties
Each ORF has the following properties:
Position: Start and end coordinates (0-based)
Strand: Forward (+) or reverse (-)
Frame: Which reading frame (0, 1, or 2)
Codons: Presence of start/stop codons
Sequences: Both nucleotide and amino acid sequences
Genetic Code Tables
The pipeline uses NCBI genetic code tables for translation. Different organisms use different genetic codes:
Common Tables
| Table | Description | Typical Use |
|---|---|---|
| 1 | Standard genetic code | Eukaryotes |
| 11 | Bacterial, archaeal, plant plastid (default) | Bacteria, Archaea |
| 4 | Mold, protozoan, coelenterate mitochondrial | Some protozoans |
| 2 | Vertebrate mitochondrial | Mitochondria |
| 5 | Invertebrate mitochondrial | Mitochondria |
Key Differences
The main differences between genetic codes involve stop codons and rare amino acids. Codons below are written as RNA; UGA corresponds to TGA in a DNA sequence:
Table 1: UGA = Stop
Table 11: UGA = Stop (same as standard)
Table 4: UGA = Trp (not a stop!)
Important: Always use the correct genetic code table for your organism.
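To see why the table choice matters, the sketch below translates the same DNA under tables 1, 11, and 4. It builds each codon table from the compact 64-letter amino acid strings that NCBI publishes (codons ordered over the bases T, C, A, G). The `translate` function is a hypothetical stand-in and not the pipeline's own translation code; unknown or ambiguous codons are rendered as `X` by assumption.

```python
from itertools import product

# Codons in NCBI order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino acid strings for a few NCBI tables ("*" marks a stop codon).
AMINO_ACIDS = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    11: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def translate(dna, table=11):
    """Translate complete codons of `dna` using the given NCBI table number."""
    lookup = dict(zip(CODONS, AMINO_ACIDS[table]))
    dna = dna.upper().replace("U", "T")
    return "".join(
        lookup.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


for table in (1, 11, 4):
    print(table, translate("ATGTGATAA", table))
```

With table 4 the internal TGA is read as tryptophan, so an ORF that appears truncated under table 1 or 11 continues to the next stop codon.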
Understanding 3Di
What is 3Di?
3Di (3D interactions) is a structural alphabet with 20 discrete states. For each residue, it describes the geometry of the tertiary interaction between that residue and its spatially closest neighbor in the folded protein. It was developed for the Foldseek structural search tool.
Why 3Di?
Traditional approaches require:
1. Amino acid sequence
2. Protein structure prediction (e.g., AlphaFold)
3. Structure → 3Di conversion
ProstT5 enables direct sequence → 3Di prediction, skipping the expensive structure prediction step:
Traditional: AA → AlphaFold → PDB → Foldseek → 3Di
ProstT5: AA → ProstT5 → 3Di
Benefits:
Much faster (no structure prediction)
Lower computational requirements
Enables large-scale structural analysis
3Di Alphabet
The 3Di alphabet has 20 symbols (the same number as the standard amino acid alphabet), each representing a distinct structural state. A 3Di sequence contains one symbol per residue, so it has the same length as the protein it describes.
Shannon Entropy
What is Entropy?
Shannon entropy measures the information content or complexity of a sequence:
H = -Σ(p_i × log₂(p_i))
where p_i is the relative frequency of symbol i in the sequence (its count divided by the sequence length). Entropy is measured in bits.
Interpretation
High entropy: More complex, diverse, unpredictable
Low entropy: More repetitive, simple, predictable
Examples:
# Maximum entropy (all symbols equally likely)
"ACGTACGT" → H = 2.0 bits
# Minimum entropy (one symbol only)
"AAAAAAAA" → H = 0.0 bits
# Intermediate
"AAAACCCC" → H = 1.0 bits
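The formula and the three examples above can be reproduced with a few lines of Python. This is a minimal sketch of per-sequence Shannon entropy; the function name `shannon_entropy` is an assumption and may not match the name used in the package's API.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence):
    """Shannon entropy in bits, using observed symbol frequencies."""
    if not sequence:
        return 0.0
    total = len(sequence)
    # sum(p * log2(1/p)) is equivalent to -sum(p * log2(p)).
    return sum(
        (n / total) * log2(total / n) for n in Counter(sequence).values()
    )


for seq in ("ACGTACGT", "AAAAAAAA", "AAAACCCC"):
    print(seq, shannon_entropy(seq))
```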
Normalized Entropy
Normalized entropy scales values to [0, 1] by dividing by the maximum possible entropy:
H_norm = H / log₂(|alphabet|)
This allows fair comparison across different alphabets:
DNA: 4 symbols (max entropy = 2.0)
Protein: 20 symbols (max entropy ≈ 4.32)
3Di: 20 symbols (max entropy ≈ 4.32)
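As a sketch of the normalization, the snippet below divides the raw entropy by log₂ of the alphabet size. The function names are illustrative assumptions; the alphabet sizes (4 and 20) are the ones listed above.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence):
    """Shannon entropy in bits, using observed symbol frequencies."""
    total = len(sequence)
    return sum(
        (n / total) * log2(total / n) for n in Counter(sequence).values()
    ) if total else 0.0


def normalized_entropy(sequence, alphabet_size):
    """Entropy scaled to [0, 1] by the maximum for the given alphabet size."""
    return shannon_entropy(sequence) / log2(alphabet_size)


print(normalized_entropy("ACGTACGT", 4))   # DNA: 2.0 / 2.0
print(normalized_entropy("MASKMASK", 20))  # protein: 2.0 / log2(20)
```

Both example sequences have a raw entropy of 2.0 bits, but only the DNA sequence reaches the maximum for its alphabet, which is the distinction normalization makes visible.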
Entropy in Biology
Biological applications:
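Low-complexity regions: Entropy well below the alphabet's maximum indicates repetitive sequence (a cutoff of 2.0 bits is only meaningful for protein or 3Di sequences, because 2.0 bits is the maximum for DNA)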
Sequence quality: High entropy suggests good diversity
Structural complexity: Compare protein vs. 3Di entropy
Functional sites: Often have distinct entropy patterns
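One common way to locate low-complexity regions is to compute entropy in a sliding window rather than over the whole sequence. The sketch below assumes a hypothetical helper `windowed_entropy`; the window size is arbitrary, and nothing here implies that genome_entropy exposes such a function.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence):
    """Shannon entropy in bits, using observed symbol frequencies."""
    total = len(sequence)
    return sum(
        (n / total) * log2(total / n) for n in Counter(sequence).values()
    ) if total else 0.0


def windowed_entropy(sequence, window=12, step=1):
    """Return (start, entropy) for each full window along the sequence."""
    return [
        (i, shannon_entropy(sequence[i:i + window]))
        for i in range(0, len(sequence) - window + 1, step)
    ]


profile = windowed_entropy("ACGTACGTACGT" + "A" * 12, window=12)
low = [start for start, h in profile if h < 1.0]
print(low)  # window starts whose entropy falls below 1.0 bit
```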
Data Flow
Step 1: Input (FASTA)
>sequence1
ATGGCTAGCTAGCTAGCTAG...
>sequence2
ATGGGCCCTTTTAAA...
Step 2: ORF Finding
Extract all potential coding regions:
{
"parent_id": "sequence1",
"orf_id": "sequence1_orf_1",
"start": 0,
"end": 300,
"strand": "+",
"frame": 0,
"nt_sequence": "ATGGCTAGC...",
"aa_sequence": "MAS...",
"has_start_codon": true,
"has_stop_codon": true
}
Step 3: Translation
Convert nucleotides to amino acids:
Nucleotides: ATGGCTAGC → ATG GCT AGC
Amino acids: → M A S
Step 4: 3Di Encoding
Predict structural tokens using ProstT5:
{
"orf_id": "sequence1_orf_1",
"three_di": "AAABBBCCCDDD...",
"method": "prostt5_aa2fold",
"model_name": "Rostlab/ProstT5_fp16",
"inference_device": "cuda"
}
Step 5: Entropy Calculation
Compute entropy at all levels:
{
"dna_entropy_global": 1.95,
"orf_nt_entropy": {
"sequence1_orf_1": 1.85,
"sequence1_orf_2": 1.90
},
"protein_aa_entropy": {
"sequence1_orf_1": 3.12,
"sequence1_orf_2": 3.25
},
"three_di_entropy": {
"sequence1_orf_1": 2.89,
"sequence1_orf_2": 2.95
},
"alphabet_sizes": {
"dna": 4,
"protein": 20,
"three_di": 20
}
}
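A report shaped like the example above can be post-processed to compare levels on a common scale. The sketch below assumes the key names shown in the example output and normalizes each per-ORF value by log₂ of the matching alphabet size; `normalize_report` is a hypothetical helper, not part of the package.

```python
from math import log2

report = {
    "dna_entropy_global": 1.95,
    "orf_nt_entropy": {"sequence1_orf_1": 1.85, "sequence1_orf_2": 1.90},
    "protein_aa_entropy": {"sequence1_orf_1": 3.12, "sequence1_orf_2": 3.25},
    "three_di_entropy": {"sequence1_orf_1": 2.89, "sequence1_orf_2": 2.95},
    "alphabet_sizes": {"dna": 4, "protein": 20, "three_di": 20},
}

# Which alphabet each per-ORF entropy field belongs to.
LEVEL_ALPHABET = {
    "orf_nt_entropy": "dna",
    "protein_aa_entropy": "protein",
    "three_di_entropy": "three_di",
}


def normalize_report(report):
    """Return per-ORF entropies scaled to [0, 1] for each level."""
    sizes = report["alphabet_sizes"]
    return {
        level: {
            orf_id: h / log2(sizes[alphabet])
            for orf_id, h in report[level].items()
        }
        for level, alphabet in LEVEL_ALPHABET.items()
    }


normalized = normalize_report(report)
for level, values in normalized.items():
    print(level, {k: round(v, 3) for k, v in values.items()})
```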
Performance Considerations
GPU vs CPU
ProstT5 encoding is the bottleneck:
CPU: Slow but works everywhere
CUDA: 10-50× faster with NVIDIA GPU
MPS: 5-20× faster on Apple Silicon
Memory Management
GPU memory is limited. Key parameters:
batch_size: Number of sequences processed simultaneously
encoding_size: Total amino acids per batch
If you get “CUDA out of memory”:
Reduce batch_size
Reduce encoding_size
Use --device cpu
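To make the interaction between the two parameters concrete, here is one plausible way a batcher could honor both limits. This is a sketch under assumptions: it treats batch_size as a cap on sequences per batch and encoding_size as a cap on total residues per batch, and `make_batches` is a hypothetical function. The pipeline's actual batching logic may differ.

```python
def make_batches(sequences, batch_size=4, encoding_size=2000):
    """Group sequences so each batch respects both limits.

    A single sequence longer than `encoding_size` gets a batch of its own.
    """
    batches, current, current_residues = [], [], 0
    for seq in sequences:
        too_many = len(current) >= batch_size
        too_long = bool(current) and current_residues + len(seq) > encoding_size
        if too_many or too_long:
            batches.append(current)
            current, current_residues = [], 0
        current.append(seq)
        current_residues += len(seq)
    if current:
        batches.append(current)
    return batches


proteins = ["M" * 1500, "M" * 600, "M" * 300, "M" * 100]
batches = make_batches(proteins, batch_size=2, encoding_size=2000)
print([[len(s) for s in batch] for batch in batches])
```

Lowering either limit produces more, smaller batches, which reduces peak GPU memory at the cost of throughput.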
Token Size Estimation
Use estimate-tokens to find optimal settings:
genome_entropy estimate-tokens --device cuda
This tests different encoding sizes and recommends the best value for your GPU.
Best Practices
Choosing Parameters
Genetic code table:
Use table 11 for bacteria and archaea (default)
Use table 1 for eukaryotes
Check NCBI documentation for unusual organisms
Minimum length:
The default of 30 AA filters out very short ORFs
Increase to 50-100 AA for higher confidence
Decrease to 10-20 AA for viral genomes
Device selection:
Use auto to automatically detect the best device (recommended)
Use cuda to force GPU (fails if not available)
Use cpu for maximum compatibility
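Automatic device detection of this kind is typically a short chain of availability checks. The sketch below shows one way to do it with PyTorch, falling back to the CPU when PyTorch is not installed; `pick_device` is a hypothetical name, and the order of preference (CUDA, then MPS, then CPU) is an assumption about what auto does.

```python
def pick_device(requested="auto"):
    """Resolve 'auto' to 'cuda', 'mps', or 'cpu'; pass other values through."""
    if requested != "auto":
        return requested
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```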
Logging
Enable debug logging for troubleshooting:
genome_entropy --log-level DEBUG --log-file debug.log run --input data.fasta --output results.json
Log files help diagnose:
Model loading issues
Memory problems
Processing bottlenecks
Unexpected results
Large Datasets
For processing many sequences:
Estimate tokens first: Find optimal batch size
Use GPU: Essential for large datasets
Filter short ORFs: Use --min-aa 50 or higher
Monitor memory: Watch for OOM errors
Log to file: Track progress
Split input: Process in chunks if too large
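Splitting the input can be done before invoking the pipeline. The following sketch reads FASTA records and groups them into fixed-size chunks that could each be written to a separate file; `read_fasta` and `chunked` are illustrative helpers, not genome_entropy functions.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunked(records, size):
    """Yield lists of up to `size` records without loading everything."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


fasta = ">s1\nATG\nGCT\n>s2\nATGGGC\n>s3\nTTT\n".splitlines()
chunks = list(chunked(read_fasta(fasta), 2))
print(chunks)
```

In practice the `lines` argument can be an open file handle, so large inputs are streamed rather than read into memory at once.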
Quality Control
Check your results:
ORF count: Too many or too few might indicate issues
Entropy values: Should be within expected ranges
3Di output: Should be same length as protein input
Log messages: Look for warnings or errors
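The length check in particular is easy to automate. The sketch below assumes you have already collected protein and 3Di strings into dictionaries keyed by ORF ID, and that the protein strings carry no trailing stop symbol; `check_three_di_lengths` is a hypothetical helper.

```python
def check_three_di_lengths(proteins, three_di):
    """Return ORF IDs whose 3Di string is missing or has the wrong length."""
    problems = []
    for orf_id, aa_sequence in proteins.items():
        tokens = three_di.get(orf_id)
        if tokens is None or len(tokens) != len(aa_sequence):
            problems.append(orf_id)
    return problems


proteins = {"orf_1": "MASKL", "orf_2": "MGPF", "orf_3": "MKT"}
three_di = {"orf_1": "dvvlc", "orf_2": "dpv"}
print(check_three_di_lengths(proteins, three_di))  # ['orf_2', 'orf_3']
```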
Common Patterns
Entropy Comparisons
Typical entropy patterns:
DNA entropy: ~1.8-2.0 (max 2.0 for 4 symbols)
Protein entropy: ~3.0-4.0 (max 4.32 for 20 symbols)
3Di entropy: ~2.5-3.5 (varies by structure)
Observations:
Proteins usually have higher entropy than DNA (more symbols)
3Di entropy reflects structural complexity
Low-complexity protein or 3Di regions have entropy < 2.0 (for DNA, 2.0 is the maximum, so this cutoff does not apply)
Structural Predictions
3Di tokens enable:
Fast structural searches (via Foldseek)
Structural alignment
Structure-based clustering
Fold recognition
Troubleshooting
Common Issues
ORF finding fails:
Check that the get_orfs binary is installed and on your PATH
Verify that the input is valid FASTA format
Try a different genetic code table
Translation errors:
Ensure the correct genetic code table is selected
Check for ambiguous bases (N) in sequences
Encoding fails:
Verify the model is downloaded: genome_entropy download
Check GPU memory: Use --device cpu or reduce batch size
Update PyTorch/Transformers: pip install --upgrade torch transformers
Out of memory:
Reduce batch size: --batch-size 1
Reduce encoding size: --encoding-size 2000
Use CPU: --device cpu
Process fewer sequences at once
Performance Issues
Slow encoding:
Use GPU if available
Increase batch size (if memory allows)
Use fp16 model: Rostlab/ProstT5_fp16
Slow ORF finding:
This is usually fast; check input file size
Consider filtering input sequences
Next Steps
Try the Quick Start Guide examples
Read the CLI Commands Reference
Explore the API Reference for Python integration
Learn about Token Size Estimation for 3Di Encoding optimization