Changelog
All notable changes to this project will be documented in this file.
[0.1.0] - 2026-01-19
Initial release of genome_entropy.
Added
Complete pipeline: DNA → ORF → Protein → 3Di → Entropy
ORF finding using external get_orfs binary
Protein translation with all NCBI genetic code tables
3Di encoding via ProstT5 model
Shannon entropy calculation at every level of the pipeline
Modular CLI with individual commands:
genome_entropy run - Complete pipeline
genome_entropy orf - Find ORFs
genome_entropy translate - Translate to proteins
genome_entropy encode3di - Encode to 3Di
genome_entropy entropy - Calculate entropy
genome_entropy download - Pre-download models
genome_entropy estimate-tokens - Estimate optimal encoding size
GPU acceleration support:
CUDA (NVIDIA GPUs)
MPS (Apple Silicon)
CPU fallback
Comprehensive logging system:
Configurable log levels
File or STDOUT output
Progress tracking
Token size estimation for optimal GPU utilization
Batch processing for efficient encoding
JSON I/O for structured data
FASTA reading and writing
Complete test suite with unit and integration tests
Comprehensive documentation
Example data and scripts
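As an illustration of the Shannon entropy entry above, here is a minimal sketch of how entropy over a sequence of symbols (nucleotides, amino acids, or 3Di states) might be computed. The function name `shannon_entropy` and the `normalize` flag are hypothetical and are not taken from the genome_entropy API. The normalization shown, dividing by log2 of the number of distinct symbols observed, is one common convention and is an assumption here; the project may instead normalize by the full alphabet size.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str, normalize: bool = False) -> float:
    """Return the Shannon entropy of a sequence in bits.

    Hypothetical helper for illustration only; not the project's API.
    """
    if not sequence:
        return 0.0

    counts = Counter(sequence)
    total = len(sequence)

    # H = sum over symbols of -p * log2(p), with p the symbol frequency.
    entropy = sum(
        -(n / total) * math.log2(n / total) for n in counts.values()
    )

    if normalize:
        # Assumption: scale by the maximum entropy attainable with the
        # symbols actually observed, giving a value in [0, 1].
        distinct = len(counts)
        return entropy / math.log2(distinct) if distinct > 1 else 0.0

    return entropy
```

Under these assumptions, `shannon_entropy("ACGT")` gives 2.0 bits, since four equally frequent symbols carry log2(4) bits each.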
Features
Auto-detection of best available device (CUDA/MPS/CPU)
Graceful fallback on GPU memory errors
Support for all NCBI genetic code tables
Customizable ORF length filtering
Normalized and non-normalized entropy
Type hints throughout codebase
Google-style docstrings
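The device auto-detection listed above could look roughly like the following sketch, which prefers CUDA, then MPS, then CPU. The function name `pick_device` is hypothetical and not taken from the genome_entropy codebase. The sketch assumes PyTorch is the backend and falls back to CPU when it is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", in that order of preference.

    Hypothetical helper for illustration only; not the project's API.
    """
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is no accelerator to use.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"
```

The returned string can be passed to `torch.device(...)` when loading a model.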
Known Limitations
Requires external get_orfs binary
ProstT5 model is large (~2 GB)
3Di encoding is memory-intensive
Integration tests are not run in CI
[Unreleased]
Planned features for future releases:
Additional encoder models
Parallel processing for large datasets
Streaming mode for very large files
Additional entropy metrics
Web interface
Pre-built binaries for get_orfs