Quick Start Guide

This guide will help you get started with genome_entropy in minutes.

Prerequisites

genome_entropy installed (see Installation)
get_orfs binary available on your PATH
Sample FASTA file with DNA sequences

Basic Usage

Complete Pipeline

Run the entire pipeline from DNA to 3Di with a single command:

genome_entropy run --input examples/example_small.fasta --output results.json

This command will:

Find all ORFs in the input DNA sequences
Translate ORFs to protein sequences
Encode proteins to 3Di structural tokens using ProstT5
Calculate Shannon entropy at each level (whole DNA sequence, ORF nucleotides, protein, and 3Di)
Save results to JSON
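
The entropy step can be illustrated with a few lines of Python. This is a minimal sketch, not genome_entropy's implementation: the function name shannon_entropy is hypothetical, and it assumes entropy is measured in bits (log base 2) over the symbols observed in the sequence.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Return the Shannon entropy of a sequence in bits per symbol.

    Illustrative helper (not part of genome_entropy); assumes log base 2.
    """
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    total = len(sequence)
    # H = sum over symbols of -p * log2(p), where p is the symbol frequency
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Four equally frequent symbols reach the maximum for a 4-letter alphabet.
print(shannon_entropy("ACGT"))
# A single repeated symbol carries no information.
print(shannon_entropy("AAAA"))
```

Because the function only counts symbols, it applies unchanged to nucleotide, amino-acid, or 3Di strings.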

Step-by-Step Pipeline

Alternatively, run each step individually:

# Step 1: Find ORFs
genome_entropy orf --input input.fasta --output orfs.json

# Step 2: Translate ORFs to proteins
genome_entropy translate --input orfs.json --output proteins.json

# Step 3: Encode proteins to 3Di
genome_entropy encode3di --input proteins.json --output 3di.json

# Step 4: Calculate entropy
genome_entropy entropy --input 3di.json --output entropy.json

Example Output

Results are saved as a JSON array with one object per input sequence (sequences and the proteins list are abbreviated here):

[
  {
    "input_id": "seq1",
    "input_dna_length": 1500,
    "orfs": [
      {
        "parent_id": "seq1",
        "orf_id": "seq1_orf_1",
        "start": 0,
        "end": 300,
        "strand": "+",
        "frame": 0,
        "nt_sequence": "ATGGCA...",
        "aa_sequence": "MA...",
        "table_id": 11,
        "has_start_codon": true,
        "has_stop_codon": true
      }
    ],
    "proteins": [...],
    "three_dis": [
      {
        "orf_id": "seq1_orf_1",
        "three_di": "AAABBBCCC...",
        "method": "prostt5_aa2fold",
        "model_name": "Rostlab/ProstT5_fp16"
      }
    ],
    "entropy": {
      "dna_entropy_global": 1.95,
      "orf_nt_entropy": {"seq1_orf_1": 1.85},
      "protein_aa_entropy": {"seq1_orf_1": 3.12},
      "three_di_entropy": {"seq1_orf_1": 2.89},
      "alphabet_sizes": {
        "dna": 4,
        "protein": 20,
        "three_di": 20
      }
    }
  }
]
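
To compare entropy across alphabets of different sizes, one option is to divide each value by its maximum, which is log2 of the alphabet size. The sketch below does this for a record shaped like the example above. It assumes the reported values are in bits; the normalized_entropies helper and the inline sample are illustrative and not part of genome_entropy.

```python
import json
import math

# Hypothetical sample that mirrors the "entropy" section of the example output.
sample = """
[{"input_id": "seq1",
  "entropy": {
    "dna_entropy_global": 1.95,
    "orf_nt_entropy": {"seq1_orf_1": 1.85},
    "protein_aa_entropy": {"seq1_orf_1": 3.12},
    "three_di_entropy": {"seq1_orf_1": 2.89},
    "alphabet_sizes": {"dna": 4, "protein": 20, "three_di": 20}}}]
"""


def normalized_entropies(record: dict) -> dict:
    """Map each ORF ID to entropy / log2(alphabet size) for every level."""
    entropy = record["entropy"]
    sizes = entropy["alphabet_sizes"]
    # Pair each per-ORF entropy field with the alphabet it is computed over.
    levels = {
        "orf_nt_entropy": sizes["dna"],
        "protein_aa_entropy": sizes["protein"],
        "three_di_entropy": sizes["three_di"],
    }
    result: dict = {}
    for key, size in levels.items():
        max_bits = math.log2(size)  # assumes entropies are reported in bits
        for orf_id, value in entropy[key].items():
            result.setdefault(orf_id, {})[key] = value / max_bits
    return result


for record in json.loads(sample):
    for orf_id, values in normalized_entropies(record).items():
        rounded = {name: round(v, 3) for name, v in values.items()}
        print(record["input_id"], orf_id, rounded)
```

To run this against a real results file, replace json.loads(sample) with json.load on the opened file.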

Common Use Cases

Use GPU for Faster Processing

genome_entropy run --input data.fasta --output results.json --device cuda

Use a Different Genetic Code

# Standard genetic code (Table 1)
genome_entropy run --input data.fasta --output results.json --table 1

# Bacterial code (Table 11, default)
genome_entropy run --input data.fasta --output results.json --table 11

Filter Short ORFs

# Only keep proteins >= 50 amino acids
genome_entropy run --input data.fasta --output results.json --min-aa 50

Enable Debug Logging

genome_entropy --log-level DEBUG run --input data.fasta --output results.json

Log to File

genome_entropy --log-file pipeline.log run --input data.fasta --output results.json

Pre-download Models

Download models before running the pipeline:

genome_entropy download --model Rostlab/ProstT5_fp16

Estimate Optimal Token Size

Find the best encoding size for your GPU:

genome_entropy estimate-tokens --device cuda

Input File Format

DNA sequences should be in FASTA format:

>sequence1 Description of sequence 1
ATGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
TAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
>sequence2 Description of sequence 2
ATGGGGCCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAAGGG
CCCTTTAAAGGGCCCTTTAAAGGGCCCTTTAAA
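
For checking an input file before a long run, a small FASTA reader is often enough. The following is a sketch under stated assumptions: read_fasta is a hypothetical helper, it takes the first word after ">" as the sequence ID, and it joins wrapped lines into one uppercase string. genome_entropy's own parser may behave differently.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into a {sequence_id: sequence} mapping."""
    parts: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without an ID")
            # Assumption: the ID is the first whitespace-delimited word.
            current_id = header[0]
            parts[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped; collect them for joining.
            parts[current_id].append(line.upper())
    return {seq_id: "".join(chunks) for seq_id, chunks in parts.items()}


example = """>sequence1 Description of sequence 1
ATGGCTAGC
TAGCTAGCT
>sequence2 Description of sequence 2
atgggg
"""
for seq_id, seq in read_fasta(example.splitlines()).items():
    print(seq_id, len(seq), seq)
```

To read from disk, pass an open file handle instead of example.splitlines().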

Tips for Large Datasets

Use a GPU: 3Di encoding is much faster on a GPU (--device cuda)
Adjust batch size: Increase it for faster processing; decrease it if you hit out-of-memory (OOM) errors
Filter short ORFs: Use --min-aa to exclude short proteins
Log to file: Use --log-file to track progress
Estimate tokens first: Run estimate-tokens to find the optimal encoding size

Example Workflow

A complete workflow for analyzing a bacterial genome:

# 1. Pre-download the model
genome_entropy download --model Rostlab/ProstT5_fp16

# 2. Estimate optimal token size for your GPU
genome_entropy estimate-tokens --device cuda

# 3. Run the pipeline with bacterial genetic code
genome_entropy --log-file analysis.log run \
    --input bacterial_genome.fasta \
    --output results.json \
    --table 11 \
    --min-aa 30 \
    --device cuda

# 4. Check the log file for any issues
cat analysis.log
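
When the same workflow has to run over many genomes, it can help to assemble the command in a script. The sketch below builds the argument list for step 3 using only the flags shown in this guide. The build_run_command helper is hypothetical and not part of genome_entropy; the snippet only prints the command and does not execute it.

```python
import shlex
from typing import List, Optional


def build_run_command(
    input_path: str,
    output_path: str,
    table: int = 11,
    min_aa: Optional[int] = None,
    device: Optional[str] = None,
    log_file: Optional[str] = None,
) -> List[str]:
    """Return the argv list for a `genome_entropy run` invocation."""
    cmd = ["genome_entropy"]
    if log_file:
        # --log-file is given before the subcommand, as in the examples above
        cmd += ["--log-file", log_file]
    cmd += ["run", "--input", input_path, "--output", output_path]
    cmd += ["--table", str(table)]
    if min_aa is not None:
        cmd += ["--min-aa", str(min_aa)]
    if device:
        cmd += ["--device", device]
    return cmd


command = build_run_command(
    "bacterial_genome.fasta",
    "results.json",
    min_aa=30,
    device="cuda",
    log_file="analysis.log",
)
# Show the shell-quoted command that would be executed.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the printed command looks right.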

Performance Benchmarks

Approximate processing times on different hardware:

Hardware	100 sequences	1000 sequences
CPU (8 cores)	~5 minutes	~50 minutes
NVIDIA RTX 3090	~1 minute	~10 minutes
Apple M1 Max (MPS)	~2 minutes	~20 minutes

Note: Times are approximate and depend on sequence length and system load.

Next Steps

Learn about all CLI commands: CLI Commands Reference
Understand the pipeline in detail: User Guide
Use the Python API: API Reference
Optimize token estimation: Token Size Estimation for 3Di Encoding