genome_entropy Documentation
Welcome to the documentation for genome_entropy, a bioinformatics pipeline that converts DNA sequences → ORFs → proteins → 3Di structural tokens and computes Shannon entropy at each representation level.
Overview
genome_entropy enables researchers to:
- Extract Open Reading Frames (ORFs) from DNA sequences
- Translate ORFs to protein sequences using customizable genetic codes
- Predict structural alphabet tokens (3Di) directly from protein sequences using ProstT5
- Calculate and compare Shannon entropy at DNA, ORF, protein, and 3Di levels
- Run on a GPU (CUDA or MPS) when one is available, falling back to CPU otherwise
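To make the entropy step concrete, here is a minimal sketch of per-symbol Shannon entropy, H = -Σ p(x) log2 p(x), applied to a string. The function name `shannon_entropy` and the choice of base-2 logarithms (bits) are assumptions for illustration; they are not taken from the genome_entropy API, which may normalize or report entropy differently.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Return the Shannon entropy of `sequence` in bits per symbol.

    Hypothetical helper for illustration only; not part of genome_entropy.
    """
    if not sequence:
        return 0.0
    total = len(sequence)
    # Frequency of each distinct symbol (nucleotide, amino acid, or 3Di token).
    counts = Counter(sequence)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# The same function applies to any representation level, because each
# level is simply a string over a different alphabet.
dna = "ATGGCCAAATAG"
protein = "MAK"
print(f"DNA:     {shannon_entropy(dna):.3f} bits/symbol")
print(f"Protein: {shannon_entropy(protein):.3f} bits/symbol")
```

A sequence made of one repeated symbol has entropy 0, and a sequence that uses all four nucleotides equally often reaches the DNA maximum of 2 bits per symbol. The maxima differ between alphabets (log2 of the alphabet size), which is worth keeping in mind when comparing levels.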
Key Features
- 🧬 ORF Finding
Extract Open Reading Frames from DNA sequences using customizable genetic codes
- 🔄 Translation
Convert ORFs to protein sequences with support for all NCBI genetic code tables
- 🏗️ 3Di Encoding
Predict structural alphabet tokens directly from protein sequences using ProstT5
- 📊 Entropy Analysis
Calculate Shannon entropy at DNA, ORF, protein, and 3Di levels
- ⚡ GPU Acceleration
Auto-detect and use CUDA, MPS (Apple Silicon), or CPU
- 🔧 Modular CLI
Run complete pipeline or individual steps
- 📝 Comprehensive Logging
Configurable log levels and output to file or STDOUT
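The device auto-detection described above can be sketched as follows. This is an assumption about how such a check is commonly written with PyTorch, not the package's actual code; the function name `pick_device` is hypothetical. It prefers CUDA, then MPS, then CPU, and also falls back to CPU if PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", in that order of preference.

    Hypothetical helper for illustration only; not part of genome_entropy.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate, so report CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(f"Selected device: {pick_device()}")
```

The returned string can be passed to `torch.device(...)` or to a model's `.to(...)` call. Checking CUDA before MPS is a design choice in this sketch; in practice a single machine rarely offers both.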
Citation
If you use this software, please cite:
ProstT5: Heinzinger et al. (2023), “ProstT5: Bilingual Language Model for Protein Sequence and Structure”
get_orfs: https://github.com/linsalrob/get_orfs
pygenetic-code: https://github.com/linsalrob/genetic_codes
License
MIT License - see LICENSE file for details.