CLI Commands Reference

The genome_entropy command-line interface provides modular commands for each step of the pipeline, plus a unified run command to execute the entire workflow.

Global Options

All commands support these global options, which are given before the command name:

genome_entropy [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]

Global Options:

--version, -v

Show version and exit

--log-level, -l LEVEL

Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

Default: INFO

--log-file PATH

Write logs to file instead of STDOUT

Example:

genome_entropy --log-level DEBUG --log-file debug.log run --input data.fasta --output results.json
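
When the CLI is driven from a script, the ordering in the synopsis above matters: global options come before the command name, and command options come after it. The following is a minimal Python sketch of assembling such an argument list. `build_command` is a hypothetical helper written for this illustration, not part of genome_entropy, and it only builds the list without running anything.

```python
from typing import List, Sequence


def build_command(global_opts: Sequence[str], command: str,
                  command_opts: Sequence[str]) -> List[str]:
    """Assemble argv as: genome_entropy [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS]."""
    return ["genome_entropy", *global_opts, command, *command_opts]


# Same invocation as the example above, expressed as an argument list.
argv = build_command(
    ["--log-level", "DEBUG", "--log-file", "debug.log"],
    "run",
    ["--input", "data.fasta", "--output", "results.json"],
)
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once genome_entropy is installed and on PATH.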

Commands

run

Run the complete pipeline from DNA to 3Di with entropy analysis.

Usage:

genome_entropy run [OPTIONS]

Required Options:

--input, -i PATH: Input FASTA file with DNA sequences
--output, -o PATH: Output JSON file for results

Optional Options:

--table, -t INTEGER

NCBI genetic code table ID

Default: 11 (bacterial/archaeal)

--min-aa INTEGER

Minimum protein length in amino acids

Default: 30

--model, -m TEXT

ProstT5 model name from HuggingFace

Default: Rostlab/ProstT5_fp16

--device, -d TEXT

Device for inference (auto, cuda, mps, cpu)

Default: auto

--batch-size INTEGER

Number of sequences per encoding batch

Default: 4

--encoding-size INTEGER

Total sequence length per encoding batch (in amino acids)

Default: 5000

--skip-entropy

Skip entropy calculation

Examples:

# Basic usage with defaults
genome_entropy run --input genome.fasta --output results.json

# Use GPU and custom parameters
genome_entropy run \
    --input genome.fasta \
    --output results.json \
    --table 1 \
    --min-aa 50 \
    --device cuda \
    --batch-size 8

# Skip entropy for faster processing
genome_entropy run --input genome.fasta --output results.json --skip-entropy

orf

Extract Open Reading Frames from DNA sequences.

Usage:

genome_entropy orf [OPTIONS]

Required Options:

--input, -i PATH: Input FASTA file with DNA sequences
--output, -o PATH: Output JSON file with ORF records

Optional Options:

--table, -t INTEGER

NCBI genetic code table ID

Default: 11

--min-nt INTEGER

Minimum ORF length in nucleotides

Default: 90 (30 amino acids)

Examples:

# Find ORFs with default settings
genome_entropy orf --input genome.fasta --output orfs.json

# Use standard genetic code and longer minimum length
genome_entropy orf \
    --input genome.fasta \
    --output orfs.json \
    --table 1 \
    --min-nt 150
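
The 90-nucleotide default matches the 30 amino acid default of `run --min-aa` because each codon is three nucleotides long. A small sketch of the conversion follows. It assumes the threshold counts coding codons only; whether the tool includes the stop codon in the ORF length is not stated here, and `min_nt_for` is a hypothetical helper.

```python
def min_nt_for(min_aa: int) -> int:
    """Convert a minimum protein length (amino acids) to nucleotides, 3 nt per codon."""
    return 3 * min_aa


print(min_nt_for(30))  # 90, the default
print(min_nt_for(50))  # 150, as in the example above
```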

translate

Translate ORFs to protein sequences.

Usage:

genome_entropy translate [OPTIONS]

Required Options:

--input, -i PATH: Input JSON file with ORF records
--output, -o PATH: Output JSON file with protein records

Optional Options:

--table, -t INTEGER

NCBI genetic code table ID

Default: 11

Examples:

# Translate ORFs
genome_entropy translate --input orfs.json --output proteins.json

# Use different genetic code
genome_entropy translate \
    --input orfs.json \
    --output proteins.json \
    --table 4

encode3di

Encode protein sequences to 3Di structural tokens using ProstT5.

Usage:

genome_entropy encode3di [OPTIONS]

Required Options:

--input, -i PATH: Input JSON file with protein records
--output, -o PATH: Output JSON file with 3Di records

Optional Options:

--model, -m TEXT

ProstT5 model name

Default: Rostlab/ProstT5_fp16

--device, -d TEXT

Device for inference (auto, cuda, mps, cpu)

Default: auto

--batch-size INTEGER

Number of sequences per batch

Default: 4

--encoding-size INTEGER

Total amino acids per encoding batch

Default: 5000

Examples:

# Basic encoding
genome_entropy encode3di --input proteins.json --output 3di.json

# Use GPU with larger batches
genome_entropy encode3di \
    --input proteins.json \
    --output 3di.json \
    --device cuda \
    --batch-size 8 \
    --encoding-size 10000

# Force CPU usage
genome_entropy encode3di \
    --input proteins.json \
    --output 3di.json \
    --device cpu
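
`--batch-size` limits how many sequences go into one batch, while `--encoding-size` limits the total number of amino acids in it. The sketch below shows one way two such limits could be combined. It is an assumption about the batching strategy, not the tool's actual implementation, and `make_batches` is a hypothetical name.

```python
from typing import Iterable, List


def make_batches(proteins: Iterable[str], batch_size: int = 4,
                 encoding_size: int = 5000) -> List[List[str]]:
    """Greedily group sequences so that each batch holds at most `batch_size`
    sequences and at most `encoding_size` amino acids in total.

    A single sequence longer than `encoding_size` gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for seq in proteins:
        too_many = len(current) >= batch_size
        too_long = current_len + len(seq) > encoding_size
        # Close the running batch before adding a sequence that would break a limit.
        if current and (too_many or too_long):
            batches.append(current)
            current, current_len = [], 0
        current.append(seq)
        current_len += len(seq)
    if current:
        batches.append(current)
    return batches


proteins = ["M" * 3000, "M" * 1500, "M" * 1000, "M" * 200, "M" * 100]
for batch in make_batches(proteins):
    print([len(s) for s in batch])
# [3000, 1500]
# [1000, 200, 100]
```

Under this scheme, raising `--encoding-size` helps most with many long proteins, while `--batch-size` becomes the binding limit when proteins are short.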

entropy

Calculate Shannon entropy at all representation levels.

Usage:

genome_entropy entropy [OPTIONS]

Required Options:

--input, -i PATH: Input JSON file with 3Di records
--output, -o PATH: Output JSON file with entropy report

Optional Options:

--normalize: Normalize entropy by its maximum for the alphabet (the logarithm of the alphabet size), so that values fall in [0, 1]

Examples:

# Calculate entropy
genome_entropy entropy --input 3di.json --output entropy.json

# Calculate normalized entropy
genome_entropy entropy \
    --input 3di.json \
    --output entropy.json \
    --normalize
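
To make the effect of `--normalize` concrete, here is a minimal sketch of Shannon entropy and its normalized form. It assumes base-2 logarithms (bits) and a per-sequence calculation; the tool's exact conventions may differ. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy in bits per symbol: H = sum(-p * log2(p))."""
    if not sequence:
        return 0.0
    total = len(sequence)
    probs = [n / total for n in Counter(sequence).values()]
    return sum(-p * math.log2(p) for p in probs)


def normalized_entropy(sequence: str, alphabet_size: int) -> float:
    """Divide by the maximum entropy, log2(alphabet_size), to scale into [0, 1]."""
    if alphabet_size < 2:
        return 0.0
    return shannon_entropy(sequence) / math.log2(alphabet_size)


print(shannon_entropy("ACGT"))        # 2.0 bits: four equally frequent symbols
print(normalized_entropy("ACGT", 4))  # 1.0: the maximum for a 4-letter alphabet
```

Normalization makes values comparable across levels with different alphabet sizes, for example 4 symbols for DNA versus 20 for amino acids or 3Di states.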

download

Pre-download a ProstT5 model to the local cache so that later runs do not need to fetch it.

Usage:

genome_entropy download [OPTIONS]

Optional Options:

--model, -m TEXT

Model name to download

Default: Rostlab/ProstT5_fp16

Examples:

# Download default model
genome_entropy download

# Download specific model
genome_entropy download --model Rostlab/ProstT5

estimate-tokens

Estimate the optimal encoding size (the value to pass to --encoding-size) for your device.

Usage:

genome_entropy estimate-tokens [OPTIONS]

Optional Options:

--device, -d TEXT

Device to test (auto, cuda, mps, cpu)

Default: auto

--model, -m TEXT

ProstT5 model name

Default: Rostlab/ProstT5_fp16

--start INTEGER

Starting encoding size to test

Default: 3000

--end INTEGER

Ending encoding size to test

Default: 10000

--step INTEGER

Step size for testing

Default: 1000

--trials INTEGER

Number of trials per size

Default: 3

Examples:

# Basic estimation
genome_entropy estimate-tokens

# Custom range for powerful GPU
genome_entropy estimate-tokens \
    --device cuda \
    --start 5000 \
    --end 20000 \
    --step 2000

# Test CPU limits
genome_entropy estimate-tokens --device cpu
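
The `--start`, `--end` and `--step` options define a grid of encoding sizes, and each size is tried `--trials` times. The sketch below lists the sizes such a grid would contain. It assumes `--end` is inclusive, which is not confirmed here, and `candidate_sizes` is a hypothetical helper.

```python
from typing import List


def candidate_sizes(start: int = 3000, end: int = 10000, step: int = 1000) -> List[int]:
    """Encoding sizes to test, assuming `end` is inclusive."""
    return list(range(start, end + 1, step))


sizes = candidate_sizes()
trials = 3
print(sizes)                # [3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
print(len(sizes) * trials)  # 24 encoding runs with the defaults
```

Under the same assumption, the custom range above (5000 to 20000 in steps of 2000) would stop at 19000, because 20000 is not a multiple of the step away from the start.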

Common Workflows

Standard Analysis

# Complete pipeline with logging
genome_entropy --log-file analysis.log run \
    --input genome.fasta \
    --output results.json \
    --table 11 \
    --min-aa 30 \
    --device auto

Step-by-Step Analysis

# Step 1: Find ORFs
genome_entropy orf --input genome.fasta --output orfs.json --table 11

# Step 2: Translate
genome_entropy translate --input orfs.json --output proteins.json --table 11

# Step 3: Encode to 3Di
genome_entropy encode3di \
    --input proteins.json \
    --output 3di.json \
    --device cuda \
    --batch-size 8

# Step 4: Calculate entropy
genome_entropy entropy --input 3di.json --output entropy.json --normalize
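
The same four steps can be chained from Python, with each step reading the file the previous one wrote. This is a sketch that uses only the options documented above; `pipeline_commands` and `run_pipeline` are hypothetical helpers, and the intermediate file names are the ones used in the shell example.

```python
import subprocess
from pathlib import Path
from typing import List


def pipeline_commands(fasta: str, workdir: str, table: int = 11) -> List[List[str]]:
    """Build the four step commands, each consuming the previous step's output."""
    out = Path(workdir)
    orfs, proteins = str(out / "orfs.json"), str(out / "proteins.json")
    three_di, entropy = str(out / "3di.json"), str(out / "entropy.json")
    return [
        ["genome_entropy", "orf", "--input", fasta, "--output", orfs,
         "--table", str(table)],
        ["genome_entropy", "translate", "--input", orfs, "--output", proteins,
         "--table", str(table)],
        ["genome_entropy", "encode3di", "--input", proteins, "--output", three_di],
        ["genome_entropy", "entropy", "--input", three_di, "--output", entropy],
    ]


def run_pipeline(fasta: str, workdir: str) -> None:
    """Run the steps in order; check=True stops at the first non-zero exit code."""
    for argv in pipeline_commands(fasta, workdir):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works without the tool.
    for argv in pipeline_commands("genome.fasta", "."):
        print(" ".join(argv))
```

Passing the same `--table` value to both `orf` and `translate`, as the shell example does, keeps ORF detection and translation on one genetic code.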

Optimizing Performance

# First, find optimal encoding size
genome_entropy estimate-tokens --device cuda

# Then use it in the pipeline
genome_entropy run \
    --input genome.fasta \
    --output results.json \
    --device cuda \
    --encoding-size 15000  # Use recommended value from estimate-tokens

Exit Codes

The CLI returns the following exit codes:

0: Success
1: General error
2: User error (bad arguments, missing file)
3: Runtime error (model failure, GPU error)

Example:

# Check exit code
genome_entropy run --input genome.fasta --output results.json
echo $?  # Should print 0 on success
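
A wrapper script can turn these codes into readable messages. The sketch below maps the four documented codes and falls back to a generic message for anything else. `describe_exit_code` and `run_and_report` are hypothetical helpers, not part of genome_entropy.

```python
import subprocess
from typing import List

# Exit codes as listed in the table above.
EXIT_MEANINGS = {
    0: "success",
    1: "general error",
    2: "user error (bad arguments, missing file)",
    3: "runtime error (model failure, GPU error)",
}


def describe_exit_code(code: int) -> str:
    """Return a readable description for a genome_entropy exit code."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_and_report(argv: List[str]) -> int:
    """Run a command, print what its exit code means, and return the code."""
    result = subprocess.run(argv)
    print(f"{argv[0]} finished: {describe_exit_code(result.returncode)}")
    return result.returncode
```

For example, `run_and_report(["genome_entropy", "run", "--input", "genome.fasta", "--output", "results.json"])` would print the outcome and return the code, so the caller can decide whether to retry or abort.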

Genetic Code Tables

Common NCBI genetic code tables:

Table	Description
1	Standard genetic code
11	Bacterial, archaeal, plant plastid (default)
4	Mold, protozoan, coelenterate mitochondrial
2	Vertebrate mitochondrial
5	Invertebrate mitochondrial

See complete list: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
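
The choice of table changes where ORFs end, because the tables differ in which codons act as stops. The sketch below hard-codes the stop codons of the five tables listed above (taken from the NCBI definitions) and counts how many codons are read before the first stop. It is an illustration only, not how genome_entropy finds ORFs, and `codons_before_stop` is a hypothetical helper.

```python
# Stop codons per NCBI table; in tables 2, 4 and 5, TGA codes for Trp instead of stop.
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},
    11: {"TAA", "TAG", "TGA"},
    4: {"TAA", "TAG"},
    2: {"TAA", "TAG", "AGA", "AGG"},
    5: {"TAA", "TAG"},
}


def codons_before_stop(dna: str, table: int) -> int:
    """Count codons read in frame 0 before the first stop codon."""
    stops = STOP_CODONS[table]
    count = 0
    for i in range(0, len(dna) - 2, 3):
        if dna[i:i + 3].upper() in stops:
            break
        count += 1
    return count


seq = "ATGAAATGAGGGTAA"  # codons: ATG AAA TGA GGG TAA
print(codons_before_stop(seq, 11))  # 2: TGA is a stop in table 11
print(codons_before_stop(seq, 4))   # 4: TGA is read through in table 4
```

Tables 1 and 11 share the same stop codons and differ mainly in which start codons are allowed, which this sketch does not model.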

Environment Variables

GET_ORFS_PATH

Path to get_orfs binary if not in PATH

Example: export GET_ORFS_PATH=/usr/local/bin/get_orfs

TRANSFORMERS_CACHE

HuggingFace cache directory for models

Default: ~/.cache/huggingface/

CUDA_VISIBLE_DEVICES

Select specific GPU(s)

Example: export CUDA_VISIBLE_DEVICES=0
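
When genome_entropy is launched from Python rather than a shell, the same variables can be set for that one child process instead of being exported globally. This is a minimal sketch; `child_environment` is a hypothetical helper, and the paths are the example values from above.

```python
import os
from typing import Dict, Mapping, Optional


def child_environment(gpu: Optional[str] = None,
                      get_orfs_path: Optional[str] = None,
                      base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the current (or given) environment and override selected variables."""
    env = dict(os.environ if base is None else base)
    if gpu is not None:
        env["CUDA_VISIBLE_DEVICES"] = gpu
    if get_orfs_path is not None:
        env["GET_ORFS_PATH"] = get_orfs_path
    return env


env = child_environment(gpu="0", get_orfs_path="/usr/local/bin/get_orfs")
print(env["CUDA_VISIBLE_DEVICES"], env["GET_ORFS_PATH"])
```

The dictionary can then be passed as `subprocess.run(argv, env=env, check=True)`, leaving the parent process's environment untouched.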

Next Steps

Read the User Guide for detailed pipeline documentation
See the API Reference for Python API usage
For performance optimization, see Token Size Estimation for 3Di Encoding