CLI Commands Reference
The genome_entropy command-line interface provides modular commands for each step of the pipeline, plus a unified run command to execute the entire workflow.
Global Options
All commands support these global options:
genome_entropy [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]
Global Options:
--version, -v: Show version and exit
--log-level, -l LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
Default: INFO
--log-file PATH: Write logs to file instead of STDOUT
Example:
genome_entropy --log-level DEBUG --log-file debug.log run --input data.fasta --output results.json
Commands
run
Run the complete pipeline from DNA to 3Di with entropy analysis.
Usage:
genome_entropy run [OPTIONS]
Required Options:
--input, -i PATH: Input FASTA file with DNA sequences
--output, -o PATH: Output JSON file for results
Optional Options:
--table, -t INTEGER: NCBI genetic code table ID
Default: 11 (bacterial/archaeal)
--min-aa INTEGER: Minimum protein length in amino acids
Default: 30
--model, -m TEXT: ProstT5 model name from HuggingFace
Default: Rostlab/ProstT5_fp16
--device, -d TEXT: Device for inference (auto, cuda, mps, cpu)
Default: auto
--batch-size INTEGER: Batch size for encoding
Default: 4
--encoding-size INTEGER: Total sequence length per encoding batch (in amino acids)
Default: 5000
--skip-entropy: Skip entropy calculation
Examples:
# Basic usage with defaults
genome_entropy run --input genome.fasta --output results.json
# Use GPU and custom parameters
genome_entropy run \
--input genome.fasta \
--output results.json \
--table 1 \
--min-aa 50 \
--device cuda \
--batch-size 8
# Skip entropy for faster processing
genome_entropy run --input genome.fasta --output results.json --skip-entropy
orf
Extract Open Reading Frames from DNA sequences.
Usage:
genome_entropy orf [OPTIONS]
Required Options:
--input, -i PATH: Input FASTA file with DNA sequences
--output, -o PATH: Output JSON file with ORF records
Optional Options:
--table, -t INTEGER: NCBI genetic code table ID
Default: 11
--min-nt INTEGER: Minimum ORF length in nucleotides
Default: 90 (30 amino acids)
Examples:
# Find ORFs with default settings
genome_entropy orf --input genome.fasta --output orfs.json
# Use standard genetic code and longer minimum length
genome_entropy orf \
--input genome.fasta \
--output orfs.json \
--table 1 \
--min-nt 150
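The default of 90 nucleotides corresponds to 30 amino acids at three nucleotides per codon. When a threshold is known in amino acids, the matching `--min-nt` value can be computed in the shell. This is a small sketch: the variable names are illustrative, and whether the tool counts the stop codon toward the ORF length is not stated here, so treat the result as approximate.

```shell
# Convert an amino-acid threshold to nucleotides (3 nt per codon).
# min_aa and min_nt are illustrative variable names, not CLI settings.
min_aa=50
min_nt=$((min_aa * 3))
echo "$min_nt"   # prints 150
```

The result can then be passed as `--min-nt "$min_nt"`.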
translate
Translate ORFs to protein sequences.
Usage:
genome_entropy translate [OPTIONS]
Required Options:
--input, -i PATH: Input JSON file with ORF records
--output, -o PATH: Output JSON file with protein records
Optional Options:
--table, -t INTEGER: NCBI genetic code table ID
Default: 11
Examples:
# Translate ORFs
genome_entropy translate --input orfs.json --output proteins.json
# Use different genetic code
genome_entropy translate \
--input orfs.json \
--output proteins.json \
--table 4
encode3di
Encode protein sequences to 3Di structural tokens using ProstT5.
Usage:
genome_entropy encode3di [OPTIONS]
Required Options:
--input, -i PATH: Input JSON file with protein records
--output, -o PATH: Output JSON file with 3Di records
Optional Options:
--model, -m TEXT: ProstT5 model name
Default: Rostlab/ProstT5_fp16
--device, -d TEXT: Device for inference (auto, cuda, mps, cpu)
Default: auto
--batch-size INTEGER: Number of sequences per batch
Default: 4
--encoding-size INTEGER: Total amino acids per encoding batch
Default: 5000
Examples:
# Basic encoding
genome_entropy encode3di --input proteins.json --output 3di.json
# Use GPU with larger batches
genome_entropy encode3di \
--input proteins.json \
--output 3di.json \
--device cuda \
--batch-size 8 \
--encoding-size 10000
# Force CPU usage
genome_entropy encode3di \
--input proteins.json \
--output 3di.json \
--device cpu
entropy
Calculate Shannon entropy at all representation levels.
Usage:
genome_entropy entropy [OPTIONS]
Required Options:
--input, -i PATH: Input JSON file with 3Di records
--output, -o PATH: Output JSON file with entropy report
Optional Options:
--normalize: Normalize entropy by alphabet size (scale to [0, 1])
Examples:
# Calculate entropy
genome_entropy entropy --input 3di.json --output entropy.json
# Calculate normalized entropy
genome_entropy entropy \
--input 3di.json \
--output entropy.json \
--normalize
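This reference does not spell out the formula behind `--normalize`. A common convention, assumed here, is to divide the Shannon entropy in bits by log2 of the alphabet size, so that a uniform distribution over the full alphabet scores 1. The sketch below computes both values for a single string with standard Unix tools. The `shannon` function and its arguments are hypothetical and are not part of genome_entropy.

```shell
# shannon SEQ [ALPHABET_SIZE]
# Prints the Shannon entropy of SEQ in bits per symbol. If ALPHABET_SIZE
# is given and greater than 1, the result is divided by log2(ALPHABET_SIZE).
# Hypothetical helper for illustration; assumes single-byte characters.
shannon() {
    printf '%s' "$1" | fold -w 1 | sort | uniq -c |
    awk -v k="${2:-0}" '
        { count[NR] = $1; total += $1 }
        END {
            h = 0
            for (i in count) {
                p = count[i] / total
                h -= p * log(p) / log(2)
            }
            if (k > 1) h /= log(k) / log(2)
            printf "%.4f\n", h
        }'
}

shannon ACGT      # raw entropy, four equally frequent symbols: 2.0000
shannon ACGT 4    # normalized for a 4-letter alphabet: 1.0000
shannon AACC 4    # only two of four symbols used: 0.5000
```

Under this assumption, normalized values are comparable across alphabets of different sizes, such as DNA (4 letters) and proteins or 3Di tokens (20 letters each).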
download
Pre-download ProstT5 models to cache.
Usage:
genome_entropy download [OPTIONS]
Optional Options:
--model, -m TEXT: Model name to download
Default: Rostlab/ProstT5_fp16
Examples:
# Download default model
genome_entropy download
# Download specific model
genome_entropy download --model Rostlab/ProstT5
estimate-tokens
Estimate the optimal encoding size for your GPU or other inference device.
Usage:
genome_entropy estimate-tokens [OPTIONS]
Optional Options:
--device, -d TEXT: Device to test (auto, cuda, mps, cpu)
Default: auto
--model, -m TEXT: ProstT5 model name
Default: Rostlab/ProstT5_fp16
--start INTEGER: Starting encoding size to test
Default: 3000
--end INTEGER: Ending encoding size to test
Default: 10000
--step INTEGER: Step size for testing
Default: 1000
--trials INTEGER: Number of trials per size
Default: 3
Examples:
# Basic estimation
genome_entropy estimate-tokens
# Custom range for powerful GPU
genome_entropy estimate-tokens \
--device cuda \
--start 5000 \
--end 20000 \
--step 2000
# Test CPU limits
genome_entropy estimate-tokens --device cpu
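To get a feel for how long a sweep will take, the sizes implied by `--start`, `--end`, and `--step` can be listed with `seq`. The sketch below uses the default values and assumes that `--end` is inclusive, which this reference does not state. The variable names are illustrative.

```shell
# List the encoding sizes a sweep would cover with the default options,
# assuming --end is inclusive (an assumption; the reference does not say).
start=3000
end=10000
step=1000
trials=3

sizes=$(seq "$start" "$step" "$end")
n_sizes=$(echo "$sizes" | wc -l | tr -d ' ')
echo "$sizes"
echo "sizes: $n_sizes, total runs: $((n_sizes * trials))"
```

With the defaults this gives 8 sizes and 24 runs in total, so widening the range or shrinking the step increases the run time proportionally.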
Common Workflows
Standard Analysis
# Complete pipeline with logging
genome_entropy --log-file analysis.log run \
--input genome.fasta \
--output results.json \
--table 11 \
--min-aa 30 \
--device auto
Step-by-Step Analysis
# Step 1: Find ORFs
genome_entropy orf --input genome.fasta --output orfs.json --table 11
# Step 2: Translate
genome_entropy translate --input orfs.json --output proteins.json --table 11
# Step 3: Encode to 3Di
genome_entropy encode3di \
--input proteins.json \
--output 3di.json \
--device cuda \
--batch-size 8
# Step 4: Calculate entropy
genome_entropy entropy --input 3di.json --output entropy.json --normalize
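When the four steps are scripted, chaining them with `&&` stops at the first failing step, so later steps never read a missing or partial file. The wrapper below is a sketch under that idea. The function name `run_steps` and the file layout inside the output directory are illustrative choices; the subcommands and flags are the ones documented above.

```shell
# run_steps FASTA OUT_DIR
# Runs the four documented steps in order, stopping at the first failure.
# run_steps and the file names inside OUT_DIR are illustrative choices.
run_steps() {
    fasta=$1
    out_dir=$2
    mkdir -p "$out_dir" &&
    genome_entropy orf --input "$fasta" --output "$out_dir/orfs.json" &&
    genome_entropy translate --input "$out_dir/orfs.json" --output "$out_dir/proteins.json" &&
    genome_entropy encode3di --input "$out_dir/proteins.json" --output "$out_dir/3di.json" &&
    genome_entropy entropy --input "$out_dir/3di.json" --output "$out_dir/entropy.json"
}
```

For example, `run_steps genome.fasta results` writes all intermediate files under `results/` and returns the exit status of the first step that failed, or 0 if every step succeeded.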
Optimizing Performance
# First, find optimal encoding size
genome_entropy estimate-tokens --device cuda
# Then use it in the pipeline
genome_entropy run \
--input genome.fasta \
--output results.json \
--device cuda \
--encoding-size 15000 # Use recommended value from estimate-tokens
Exit Codes
The CLI uses standard exit codes:
0: Success
1: General error
2: User error (bad arguments, missing file)
3: Runtime error (model failure, GPU error)
Examples:
# Check exit code
genome_entropy run --input genome.fasta --output results.json
echo $? # Should print 0 on success
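In a script, the exit code can drive different handling for each class of failure. The sketch below maps the codes listed above to short messages. `describe_exit` is a hypothetical helper and is not part of genome_entropy.

```shell
# describe_exit CODE
# Maps the exit codes listed above to a short message.
# describe_exit is a hypothetical helper, not part of genome_entropy.
describe_exit() {
    case "$1" in
        0) echo "success" ;;
        1) echo "general error" ;;
        2) echo "user error: check arguments and input paths" ;;
        3) echo "runtime error: check model and GPU" ;;
        *) echo "unexpected exit code: $1" ;;
    esac
}

describe_exit 2   # prints: user error: check arguments and input paths
```

After a run, call `describe_exit $?` on the very next line, before any other command overwrites `$?`.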
Genetic Code Tables
Common NCBI genetic code tables:
| Table | Description |
|---|---|
| 1 | Standard genetic code |
| 11 | Bacterial, archaeal, plant plastid (default) |
| 4 | Mold, protozoan, coelenterate mitochondrial |
| 2 | Vertebrate mitochondrial |
| 5 | Invertebrate mitochondrial |
See complete list: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Environment Variables
GET_ORFS_PATH: Path to get_orfs binary if not in PATH
Example:
export GET_ORFS_PATH=/usr/local/bin/get_orfs
TRANSFORMERS_CACHE: HuggingFace cache directory for models
Default: ~/.cache/huggingface/
CUDA_VISIBLE_DEVICES: Select specific GPU(s)
Example:
export CUDA_VISIBLE_DEVICES=0
Next Steps
Read the User Guide for detailed pipeline documentation
See API Reference for Python API usage
Learn about Token Size Estimation for 3Di Encoding for performance optimization