User Guide

This guide provides a comprehensive overview of the genome_entropy pipeline, explaining concepts, data flow, and best practices.

Pipeline Overview

The genome_entropy pipeline transforms DNA sequences through multiple representation levels, computing Shannon entropy at each stage:

DNA (FASTA)
    ↓
[ORF Finding]
    ↓
ORFs (nucleotides) → Entropy₁
    ↓
[Translation]
    ↓
Proteins (amino acids) → Entropy₂
    ↓
[3Di Encoding via ProstT5]
    ↓
3Di Tokens (structural) → Entropy₃
    ↓
[Entropy Analysis]
    ↓
Complete Entropy Report

This multi-level analysis enables comparison of information content across different biological sequence representations.

Understanding ORFs

What is an ORF?

An Open Reading Frame (ORF) is a sequence of DNA between a start codon (typically ATG) and a stop codon (TAA, TAG, or TGA), representing a potential protein-coding region.

Reading Frames

DNA has six possible reading frames:

Three forward frames (starting at positions 0, 1, 2)
Three reverse frames (reverse complement, starting at positions 0, 1, 2)

Example:

DNA:     ATGGCATAGCTAA
Frame 0: ATG GCA TAG CTA A
Frame 1: A TGG CAT AGC TAA
Frame 2: AT GGC ATA GCT AA
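
The frame layout above can be reproduced with a few lines of Python. This is a minimal sketch for illustration only: the names reverse_complement, codons, and six_frames are hypothetical and are not part of the genome_entropy API. Unlike the example above, it keeps only complete codons and drops leftover bases at the ends.

```python
# Translation table that maps each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codons(seq: str, offset: int) -> list[str]:
    """Split seq into complete codons, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq: str) -> dict[tuple[str, int], list[str]]:
    """Map (strand, frame) to the list of complete codons in that frame."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = codons(strand_seq, offset)
    return frames


if __name__ == "__main__":
    for (strand, frame), frame_codons in six_frames("ATGGCATAGCTAA").items():
        print(strand, frame, " ".join(frame_codons))
```

Here the reverse frames are numbered by their offset within the reverse complement, which is one common convention; other tools number them differently.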

ORF Properties

Each ORF has the following properties:

Position: Start and end coordinates (0-based)
Strand: Forward (+) or reverse (-)
Frame: Which reading frame (0, 1, or 2)
Codons: Presence of start/stop codons
Sequences: Both nucleotide and amino acid sequences
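
To show how these properties fit together, here is a naive ORF scanner. It is a sketch under stated assumptions, not the method the pipeline uses (the pipeline relies on an external get_orfs binary): it only accepts ATG as a start codon, uses the three standard stop codons, reports coordinates as 0-based with an exclusive end on the forward strand, and numbers reverse frames by offset within the reverse complement. The Orf class and find_orfs function are hypothetical names.

```python
from dataclasses import dataclass

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class Orf:
    start: int        # 0-based start on the forward strand
    end: int          # exclusive end on the forward strand
    strand: str       # "+" or "-"
    frame: int        # 0, 1, or 2
    nt_sequence: str  # ORF nucleotides, read 5'->3' on its own strand


def find_orfs(seq: str, min_codons: int = 2) -> list[Orf]:
    """Find ATG...stop ORFs in all six frames (stop codon included)."""
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand in "+-":
        s = seq if strand == "+" else seq.translate(COMPLEMENT)[::-1]
        for frame in range(3):
            orf_start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = i  # first start codon opens the ORF
                elif orf_start is not None and codon in STOPS:
                    stop_end = i + 3
                    if (stop_end - orf_start) // 3 >= min_codons:
                        if strand == "+":
                            start, end = orf_start, stop_end
                        else:
                            # Map reverse-strand offsets back to forward coordinates.
                            start, end = n - stop_end, n - orf_start
                        orfs.append(Orf(start, end, strand, frame, s[orf_start:stop_end]))
                    orf_start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("ATGGCATAGCTAA"):
        print(orf)
```

ORFs that run off the end of the sequence without a stop codon are ignored in this sketch; a real ORF finder may report them with a flag instead.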

Genetic Code Tables

The pipeline uses NCBI genetic code tables for translation. Different organisms use different genetic codes:

Common Tables

Table	Description	Typical Use
1	Standard genetic code	Eukaryotic nuclear genes
11	Bacterial, archaeal, plant plastid (default)	Bacteria, Archaea
4	Mold, protozoan, coelenterate mitochondrial; Mycoplasma/Spiroplasma	Mycoplasma, Spiroplasma, some mitochondria
2	Vertebrate mitochondrial	Mitochondria
5	Invertebrate mitochondrial	Mitochondria

Key Differences

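The main differences between genetic codes are codon reassignments (most often of stop codons) and the set of permitted start codons:

Table 1: TGA (UGA in mRNA) = Stop
Table 11: TGA = Stop (same amino acid assignments as table 1; additional alternative start codons are allowed)
Table 4: TGA = Trp (not a stop!)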

Important: Always use the correct genetic code table for your organism.
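
The effect of choosing the wrong table can be seen in a small, self-contained sketch. The codon tables below are built from the compact NCBI amino acid strings for tables 1 and 4 (codons ordered with bases T, C, A, G); table 11 shares the amino acid assignments of table 1. The translate function is a hypothetical helper, not the pipeline's translation code.

```python
from itertools import product

# All 64 codons in NCBI order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino acid strings in the same order ("*" marks a stop codon).
TABLE_1 = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE_4 = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

GENETIC_CODES = {
    1: dict(zip(CODONS, TABLE_1)),
    4: dict(zip(CODONS, TABLE_4)),
    11: dict(zip(CODONS, TABLE_1)),  # differs from table 1 only in start codons
}


def translate(nt_sequence: str, table: int = 11) -> str:
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    code = GENETIC_CODES[table]
    seq = nt_sequence.upper()
    return "".join(code.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


if __name__ == "__main__":
    print(translate("ATGTGATAA", table=1))  # TGA read as a stop
    print(translate("ATGTGATAA", table=4))  # TGA read as tryptophan
```

With table 1 or 11 the TGA codon terminates the protein, while with table 4 it is translated as W, so an ORF called under the wrong table can be truncated or wrongly extended.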

Understanding 3Di

What is 3Di?

3Di (3D interactions) is a structural alphabet with 20 discrete states. Each state describes the geometry of a residue's tertiary interaction with its spatially closest neighboring residue. It was developed for the Foldseek structural search tool.

Why 3Di?

Traditional approaches require three steps:

1. Amino acid sequence
2. Protein structure prediction (e.g., AlphaFold)
3. Structure → 3Di conversion

ProstT5 enables direct sequence → 3Di prediction, skipping the expensive structure prediction step:

Traditional:  AA → AlphaFold → PDB → Foldseek → 3Di
ProstT5:      AA → ProstT5 → 3Di

Benefits:

Much faster (no structure prediction)
Lower computational requirements
Enables large-scale structural analysis

3Di Alphabet

The 3Di alphabet has 20 symbols, the same number as the standard amino acid alphabet. Each symbol encodes the structural state of one residue, so a 3Di string contains one symbol per amino acid.

Shannon Entropy

What is Entropy?

Shannon entropy measures the information content or complexity of a sequence:

H = -Σ(p_i × log₂(p_i))

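where p_i is the relative frequency of symbol i (its count divided by the sequence length). With a base-2 logarithm, H is measured in bits per symbol.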

Interpretation

High entropy: More complex, diverse, unpredictable
Low entropy: More repetitive, simple, predictable

Examples:

# Maximum entropy (all symbols equally likely)
"ACGTACGT" → H = 2.0 bits

# Minimum entropy (one symbol only)
"AAAAAAAA" → H = 0.0 bits

# Intermediate
"AAAACCCC" → H = 1.0 bits
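
The formula and the three examples above can be checked with a short function. This is a minimal sketch; shannon_entropy is a hypothetical name, and the pipeline's own implementation may treat case or ambiguous symbols differently.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of a sequence in bits per symbol."""
    if not seq:
        return 0.0
    n = len(seq)
    # p * log2(1/p) summed over observed symbols equals -sum(p * log2(p)).
    return sum((count / n) * math.log2(n / count) for count in Counter(seq).values())


if __name__ == "__main__":
    for example in ("ACGTACGT", "AAAAAAAA", "AAAACCCC"):
        print(example, shannon_entropy(example))
```

Only symbols that actually occur contribute to the sum, so there is no need to handle the undefined log₂(0) case.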

Normalized Entropy

Normalized entropy scales values to [0, 1] by dividing by the maximum possible entropy:

H_norm = H / log₂(|alphabet|)

This allows fair comparison across different alphabets:

DNA: 4 symbols (max entropy = 2.0)
Protein: 20 symbols (max entropy ≈ 4.32)
3Di: 20 symbols (max entropy ≈ 4.32)
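
A sketch of the normalization, assuming the alphabet sizes listed above. The function name normalized_entropy and the ALPHABET_SIZES dictionary are illustrative and not taken from the genome_entropy API.

```python
import math
from collections import Counter

# Alphabet sizes from the list above.
ALPHABET_SIZES = {"dna": 4, "protein": 20, "three_di": 20}


def normalized_entropy(seq: str, alphabet_size: int) -> float:
    """Shannon entropy divided by log2(alphabet_size)."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    if not seq:
        return 0.0
    n = len(seq)
    h = sum((c / n) * math.log2(n / c) for c in Counter(seq).values())
    return h / math.log2(alphabet_size)


if __name__ == "__main__":
    print(normalized_entropy("ACGTACGT", ALPHABET_SIZES["dna"]))
    print(normalized_entropy("MASMASMAS", ALPHABET_SIZES["protein"]))
```

Note that a sequence containing symbols outside the nominal alphabet (for example N in DNA or X in protein) can yield a value slightly above 1, so such symbols are best filtered or counted separately before normalizing.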

Entropy in Biology

Biological applications:

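Low-complexity regions: Entropy well below the alphabet maximum indicates repetitive sequence. A fixed cutoff such as 2.0 bits is only meaningful for the 20-symbol alphabets (protein, 3Di), because 2.0 bits is already the maximum for DNA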
Sequence quality: High entropy suggests good diversity
Structural complexity: Compare protein vs. 3Di entropy
Functional sites: Often have distinct entropy patterns
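
One common way to locate low-complexity regions and position-specific entropy patterns is to compute entropy in a sliding window. The sketch below is an illustration of that general technique, not a feature documented for the pipeline; the function names, window size, and threshold are all assumptions to be tuned per alphabet.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


def windowed_entropy(seq: str, window: int = 12, step: int = 1) -> list[tuple[int, float]]:
    """Return (start, entropy) for each full window along the sequence."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    return [
        (start, shannon_entropy(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def low_complexity_starts(seq: str, window: int, step: int, threshold: float) -> list[int]:
    """Start positions of windows whose entropy falls below the threshold."""
    return [start for start, h in windowed_entropy(seq, window, step) if h < threshold]


if __name__ == "__main__":
    dna = "ACGT" * 3 + "A" * 12
    print(windowed_entropy(dna, window=12, step=12))
    print(low_complexity_starts(dna, window=12, step=12, threshold=1.0))
```

Short windows give noisy estimates because a window of length w can contain at most w distinct symbols, so the window should be comfortably larger than the alphabet when possible.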

Data Flow

Step 1: Input (FASTA)

>sequence1
ATGGCTAGCTAGCTAGCTAG...
>sequence2
ATGGGCCCTTTTAAA...
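
For readers who want to inspect their input before running the pipeline, a FASTA file like the one above can be read with a few lines of standard-library Python. This is a minimal sketch: parse_fasta is a hypothetical helper, it takes the first whitespace-delimited token of each header as the sequence ID, and it uppercases the sequence, which may differ from how the pipeline reads its input.

```python
from typing import Iterable, Iterator


def parse_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line.upper())  # sequences may wrap over several lines
    if seq_id is not None:
        yield seq_id, "".join(chunks)


if __name__ == "__main__":
    example = [">sequence1 demo", "ATGGCTAGC", "TAGCTAG", ">sequence2", "atgggccc"]
    for seq_id, seq in parse_fasta(example):
        print(seq_id, len(seq), seq)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example parse_fasta(open("data.fasta")).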

Step 2: ORF Finding

Extract all potential coding regions:

{
  "parent_id": "sequence1",
  "orf_id": "sequence1_orf_1",
  "start": 0,
  "end": 300,
  "strand": "+",
  "frame": 0,
  "nt_sequence": "ATGGCTAGC...",
  "aa_sequence": "MAS...",
  "has_start_codon": true,
  "has_stop_codon": true
}

Step 3: Translation

Convert nucleotides to amino acids:

Nucleotides: ATGGCTAGC → ATG GCT AGC
Amino acids:             → M   A   S

Step 4: 3Di Encoding

Predict structural tokens using ProstT5:

{
  "orf_id": "sequence1_orf_1",
  "three_di": "AAABBBCCCDDD...",
  "method": "prostt5_aa2fold",
  "model_name": "Rostlab/ProstT5_fp16",
  "inference_device": "cuda"
}

Step 5: Entropy Calculation

Compute entropy at all levels:

{
  "dna_entropy_global": 1.95,
  "orf_nt_entropy": {
    "sequence1_orf_1": 1.85,
    "sequence1_orf_2": 1.90
  },
  "protein_aa_entropy": {
    "sequence1_orf_1": 3.12,
    "sequence1_orf_2": 3.25
  },
  "three_di_entropy": {
    "sequence1_orf_1": 2.89,
    "sequence1_orf_2": 2.95
  },
  "alphabet_sizes": {
    "dna": 4,
    "protein": 20,
    "three_di": 20
  }
}
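
Because the three per-ORF entropies are reported in bits over alphabets of different sizes, it can help to normalize them before comparing levels. The sketch below assumes a report with exactly the keys shown above; normalize_report is a hypothetical helper and is not part of the pipeline.

```python
import math

# Which alphabet each per-ORF entropy section is measured over.
LEVELS = {
    "orf_nt_entropy": "dna",
    "protein_aa_entropy": "protein",
    "three_di_entropy": "three_di",
}


def normalize_report(report: dict) -> dict[str, dict[str, float]]:
    """Return {orf_id: {alphabet: normalized entropy}} from an entropy report."""
    sizes = report["alphabet_sizes"]
    normalized: dict[str, dict[str, float]] = {}
    for section, alphabet in LEVELS.items():
        max_entropy = math.log2(sizes[alphabet])
        for orf_id, entropy in report.get(section, {}).items():
            normalized.setdefault(orf_id, {})[alphabet] = entropy / max_entropy
    return normalized


if __name__ == "__main__":
    report = {
        "orf_nt_entropy": {"sequence1_orf_1": 1.85},
        "protein_aa_entropy": {"sequence1_orf_1": 3.12},
        "three_di_entropy": {"sequence1_orf_1": 2.89},
        "alphabet_sizes": {"dna": 4, "protein": 20, "three_di": 20},
    }
    for orf_id, values in normalize_report(report).items():
        print(orf_id, {k: round(v, 3) for k, v in values.items()})
```

A report saved to disk can be loaded with json.load from the standard library and passed to the function unchanged.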

Performance Considerations

GPU vs CPU

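ProstT5 encoding is usually the slowest stage of the pipeline, so the choice of device matters most here:

CPU: Slow but works everywhere
CUDA: Typically 10-50× faster than CPU on an NVIDIA GPU
MPS: Typically 5-20× faster than CPU on Apple Silicon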

Memory Management

GPU memory is limited. Key parameters:

batch_size: Maximum number of sequences processed simultaneously
encoding_size: Maximum total number of amino acids per batch

If you get “CUDA out of memory”:

Reduce batch_size
Reduce encoding_size
Use --device cpu
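
To see how the two limits interact, the sketch below groups sequences greedily so that each batch respects both a sequence-count cap and a total-residue cap. How genome_entropy actually forms batches is not described here, so treat this purely as an illustration; make_batches is a hypothetical function.

```python
def make_batches(seqs: list[str], batch_size: int, encoding_size: int) -> list[list[str]]:
    """Group sequences so each batch holds at most batch_size sequences
    and at most encoding_size residues in total.

    A single sequence longer than encoding_size is placed in a batch of its own.
    """
    if batch_size < 1 or encoding_size < 1:
        raise ValueError("batch_size and encoding_size must be positive")
    batches: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for seq in seqs:
        too_many = len(current) >= batch_size
        too_long = bool(current) and current_len + len(seq) > encoding_size
        if too_many or too_long:
            batches.append(current)  # close the current batch
            current, current_len = [], 0
        current.append(seq)
        current_len += len(seq)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    proteins = ["M" * 5, "M" * 5, "M" * 5, "M" * 20, "M" * 3]
    for batch in make_batches(proteins, batch_size=2, encoding_size=12):
        print([len(p) for p in batch])
```

Lowering either limit produces more, smaller batches, which reduces peak memory at the cost of throughput.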

Token Size Estimation

Use estimate-tokens to find optimal settings:

genome_entropy estimate-tokens --device cuda

This tests different encoding sizes and recommends the best value for your GPU.

Best Practices

Choosing Parameters

Genetic code table:

Use table 11 for bacteria and archaea (default)
Use table 1 for eukaryotes
Check NCBI documentation for unusual organisms

Minimum length:

The default of 30 AA filters out very short ORFs
Increase to 50-100 AA for higher confidence
Decrease to 10-20 AA for viral genomes, which can encode very short proteins

Device selection:

Use auto to detect the best available device automatically (recommended)
Use cuda to force GPU (fails if not available)
Use cpu for maximum compatibility
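
The behavior of these three options can be sketched with PyTorch's availability checks. This is an assumption about how such a selection could work, not the pipeline's actual code: pick_device is a hypothetical function, and the preference order CUDA, then MPS, then CPU is a guess.

```python
def pick_device(requested: str = "auto") -> str:
    """Resolve a device name; 'auto' prefers CUDA, then MPS, then CPU."""
    try:
        import torch
    except ImportError:
        torch = None  # without PyTorch only the CPU can be reported

    cuda_ok = torch is not None and torch.cuda.is_available()
    mps_backend = getattr(getattr(torch, "backends", None), "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()

    if requested == "auto":
        if cuda_ok:
            return "cuda"
        return "mps" if mps_ok else "cpu"
    if requested not in {"cpu", "cuda", "mps"}:
        raise ValueError(f"unknown device: {requested!r}")
    if requested == "cuda" and not cuda_ok:
        raise RuntimeError("CUDA was requested but is not available")
    if requested == "mps" and not mps_ok:
        raise RuntimeError("MPS was requested but is not available")
    return requested


if __name__ == "__main__":
    print(pick_device("auto"))
```

Failing loudly when an explicitly requested GPU is missing, rather than silently falling back to the CPU, avoids unexpectedly slow runs.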

Logging

Enable debug logging for troubleshooting:

genome_entropy --log-level DEBUG --log-file debug.log run --input data.fasta --output results.json

Log files help diagnose:

Model loading issues
Memory problems
Processing bottlenecks
Unexpected results

Large Datasets

For processing many sequences:

Estimate tokens first: Find the optimal encoding size for your GPU
Use GPU: Essential for large datasets
Filter short ORFs: Use --min-aa 50 or higher
Monitor memory: Watch for OOM errors
Log to file: Track progress
Split input: Process in chunks if too large

Quality Control

Check your results:

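ORF count: An unusually high or low count might indicate a wrong genetic code table or minimum length
Entropy values: Should lie between 0 and the maximum for the alphabet (2.0 for DNA, about 4.32 for protein and 3Di)
3Di output: Should be the same length as the protein input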
Log messages: Look for warnings or errors
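
Two of these checks are easy to automate. The sketch below is illustrative only: qc_orf is a hypothetical helper, and it assumes that any trailing stop symbol (*) in the amino acid sequence is not encoded into 3Di, which may not match the pipeline's behavior.

```python
import math


def qc_orf(
    aa_sequence: str,
    three_di: str,
    aa_entropy: float,
    three_di_entropy: float,
    alphabet_size: int = 20,
) -> list[str]:
    """Return a list of human-readable problems found for one ORF."""
    problems = []

    # Assumption: a trailing stop symbol has no 3Di counterpart.
    protein = aa_sequence.rstrip("*")
    if len(three_di) != len(protein):
        problems.append(
            f"3Di length {len(three_di)} differs from protein length {len(protein)}"
        )

    # Entropy must lie between 0 and log2(alphabet size), with a small tolerance.
    max_entropy = math.log2(alphabet_size)
    for name, value in (("protein", aa_entropy), ("3Di", three_di_entropy)):
        if not 0.0 <= value <= max_entropy + 1e-9:
            problems.append(f"{name} entropy {value} outside [0, {max_entropy:.2f}]")
    return problems


if __name__ == "__main__":
    print(qc_orf("MAS*", "dvl", aa_entropy=1.58, three_di_entropy=1.58))
    print(qc_orf("MAS", "dv", aa_entropy=5.0, three_di_entropy=1.0))
```

An empty list means the ORF passed both checks; anything else is worth cross-referencing with the log file.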

Common Patterns

Entropy Comparisons

Typical entropy patterns:

DNA entropy:    ~1.8-2.0 (max 2.0 for 4 symbols)
Protein entropy: ~3.0-4.0 (max 4.32 for 20 symbols)
3Di entropy:     ~2.5-3.5 (varies by structure)

Observations:

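Proteins usually have higher raw entropy than DNA simply because the alphabet is larger (20 vs. 4 symbols); use normalized entropy for a fair comparison
3Di entropy reflects structural complexity
Low-complexity protein and 3Di regions tend to have entropy < 2.0; for DNA, 2.0 is the maximum, so judge low complexity by normalized entropy instead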

Structural Predictions

3Di tokens enable:

Fast structural searches (via Foldseek)
Structural alignment
Structure-based clustering
Fold recognition

Troubleshooting

Common Issues

ORF finding fails:

Check that the get_orfs binary is installed and on your PATH
Verify that the input is valid FASTA format
Try a different genetic code table

Translation errors:

Ensure the correct genetic code table is selected
Check for ambiguous bases (N) in sequences

Encoding fails:

Verify that the model is downloaded: genome_entropy download
Check GPU memory: Use --device cpu or reduce batch size
Update PyTorch/Transformers: pip install --upgrade torch transformers

Out of memory:

Reduce batch size: --batch-size 1
Reduce encoding size: --encoding-size 2000
Use CPU: --device cpu
Process fewer sequences at once

Performance Issues

Slow encoding:

Use GPU if available
Increase batch size (if memory allows)
Use the fp16 model: Rostlab/ProstT5_fp16

Slow ORF finding:

This is usually fast; check input file size
Consider filtering input sequences

Next Steps

Try the Quick Start Guide examples
Read the CLI Commands Reference
Explore the API Reference for Python integration
Learn about Token Size Estimation for 3Di Encoding optimization