API Reference
This page documents the Python API for genome_entropy. You can use these modules directly in your Python code for more fine-grained control over the pipeline.
Core Modules
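- genome_entropy.orf: ORF finding utilities.
- genome_entropy.translate: Translation utilities.
- genome_entropy.encode3di: 3Di encoding utilities.
- genome_entropy.entropy: Entropy calculation utilities.
- genome_entropy.pipeline: Pipeline orchestration.
- genome_entropy.io: I/O utilities for genome_entropy.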
ORF Finding
ORF finding utilities.
Types
- class genome_entropy.orf.types.OrfRecord(parent_id, orf_id, start, end, strand, frame, nt_sequence, aa_sequence, table_id, has_start_codon, has_stop_codon, in_genbank=False)[source]
Bases: object
Represents a single Open Reading Frame (ORF).
- Parameters:
- strand
Strand orientation (‘+’ or ‘-’)
- Type:
Literal[‘+’, ‘-’]
- __init__(parent_id, orf_id, start, end, strand, frame, nt_sequence, aa_sequence, table_id, has_start_codon, has_stop_codon, in_genbank=False)
Finder
ORF finder wrapper using the get_orfs binary.
- genome_entropy.orf.finder.find_orfs(sequences, table_id=11, min_nt_length=90, binary_path='get_orfs')[source]
Find ORFs in DNA sequences using the get_orfs binary.
This function wraps the external get_orfs binary (https://github.com/linsalrob/get_orfs). The binary must be installed and available on PATH, or its location given via binary_path.
- Parameters:
- Returns:
List of OrfRecord objects
- Raises:
OrfFinderError – If the get_orfs binary is not found or fails
- Return type:
List[OrfRecord]
Translation
Translation utilities.
Translator
Translation of nucleotide sequences to amino acids.
- class genome_entropy.translate.translator.ProteinRecord(orf, aa_sequence, aa_length)[source]
Bases: object
Represents a translated protein from an ORF.
- orf
The OrfRecord that was translated
- genome_entropy.translate.translator.translate_orf(orf, table_id=11)[source]
Translate an ORF to a protein sequence.
Uses the pygenetic-code library for translation with NCBI genetic codes. Ambiguous codons (containing N or other IUPAC codes) are translated to ‘X’.
- Parameters:
- Returns:
ProteinRecord with translated sequence
- Raises:
TranslationError – If translation fails
- Return type:
ProteinRecord
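The behaviour described for translate_orf can be illustrated with a small standalone sketch. It does not use pygenetic-code: it hard-codes the amino-acid assignments of NCBI table 11 and maps any codon that is not made of plain A, C, G and T to ‘X’. The names TABLE_11 and translate_sketch are hypothetical, and silently dropping a trailing partial codon is an assumption of this example rather than documented library behaviour.

```python
from itertools import product

# Amino acids for NCBI translation table 11, listed in TCAG codon order
# (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE_11 = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(nt_sequence: str) -> str:
    """Translate a nucleotide string codon by codon (hypothetical helper)."""
    seq = nt_sequence.upper().replace("U", "T")
    # Assumption: a trailing partial codon is ignored.
    usable = len(seq) - len(seq) % 3
    # Codons containing N or any other IUPAC ambiguity code are not in the
    # table, so they fall back to 'X'.
    return "".join(
        TABLE_11.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

For example, translate_sketch("ATGNNNTAA") gives "MX*": a start methionine, one ambiguous residue, and a stop.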
3Di Encoding
3Di encoding utilities.
- class genome_entropy.encode3di.ProstT5ThreeDiEncoder(model_name='gbouras13/modernprost-base', device=None)[source]
Bases: object
Encoder for converting amino acid sequences to 3Di structural tokens.
Uses the ProstT5 model from HuggingFace to predict 3Di tokens directly from protein sequences without requiring 3D structures.
- __init__(model_name='gbouras13/modernprost-base', device=None)[source]
Initialize the ProstT5 encoder.
- Parameters:
- Raises:
ModelError – If PyTorch or Transformers are not installed
DeviceError – If specified device is not available
- token_budget_batches(aa_sequences, token_budget)[source]
Yield batches of sequences (with original indices) under an approximate token budget.
Optimized strategy to address the problem of isolated long sequences:
1. Keep original indices.
2. Sort by length to minimize padding within each batch.
3. For each batch:
- Start with long sequences from the end (largest first).
- Add long sequences until adding another would exceed the budget.
- Fill the remaining budget with short sequences from the beginning.
This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.
- Parameters:
aa_sequences (Sequence[str]) – Unordered amino acid sequences.
token_budget (int) – Maximum approximate “tokens” per batch.
- Yields:
List[IndexedSeq] – A batch of (original_index, sequence) records.
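The batching strategy above can be sketched in a few lines. This is an illustration, not the library's implementation: it assumes the token cost of a sequence is simply its length, and that a sequence longer than the whole budget is emitted alone in its own batch. The function name token_budget_batches_sketch and the local IndexedSeq definition are hypothetical stand-ins.

```python
from typing import Iterator, List, NamedTuple, Sequence


class IndexedSeq(NamedTuple):
    idx: int  # position in the caller's input list
    seq: str


def token_budget_batches_sketch(
    aa_sequences: Sequence[str], token_budget: int
) -> Iterator[List[IndexedSeq]]:
    # Keep original indices, then sort shortest -> longest.
    items = sorted(
        (IndexedSeq(i, s) for i, s in enumerate(aa_sequences)),
        key=lambda record: len(record.seq),
    )
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        # Always take the longest remaining sequence (assumption: even if it
        # exceeds the budget on its own, so that every sequence is emitted).
        batch = [items[hi]]
        used = len(items[hi].seq)
        hi -= 1
        # Add further long sequences while they still fit.
        while lo <= hi and used + len(items[hi].seq) <= token_budget:
            batch.append(items[hi])
            used += len(items[hi].seq)
            hi -= 1
        # Fill the remaining budget with short sequences from the front.
        while lo <= hi and used + len(items[lo].seq) <= token_budget:
            batch.append(items[lo])
            used += len(items[lo].seq)
            lo += 1
        yield batch
```

Pairing each long sequence with short ones is what keeps long proteins from ending up in batches of their own when there is still budget left.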
- encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]
Encode amino acid sequences to 3Di tokens.
- Parameters:
aa_sequences (List[str]) – List of amino acid sequences. Note: amino acid sequences are expected to be upper-case, while 3Di sequences need to be lower-case.
encoding_size (int) – Maximum size (approx. amino acids) to encode per GPU
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.
- Returns:
List of 3Di token sequences (one per input sequence)
- Raises:
EncodingError – If encoding fails
- Return type:
List[str]
- encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]
Encode protein records to 3Di records.
- Parameters:
proteins (List[ProteinRecord]) – List of ProteinRecord objects
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.
- Returns:
List of ThreeDiRecord objects
- Return type:
List[ThreeDiRecord]
- class genome_entropy.encode3di.ModernProstThreeDiEncoder(model_name, device=None, use_accelerate=False)[source]
Bases: object
Encoder for converting amino acid sequences to 3Di structural tokens.
Uses ModernProst models (gbouras13/modernprost-base or modernprost-profiles) from HuggingFace to predict 3Di tokens directly from protein sequences.
Based on implementation from phold: https://github.com/gbouras13/phold/blob/main/src/phold/features/predict_3Di.py
- __init__(model_name, device=None, use_accelerate=False)[source]
Initialize the ModernProst encoder.
- Parameters:
- Raises:
ModelError – If PyTorch or Transformers are not installed
DeviceError – If specified device is not available
- token_budget_batches(aa_sequences, token_budget)[source]
Yield batches of sequences (with original indices) under an approximate token budget.
Optimized strategy to address the problem of isolated long sequences:
1. Keep original indices.
2. Sort by length to minimize padding within each batch.
3. For each batch:
- Start with long sequences from the end (largest first).
- Add long sequences until adding another would exceed the budget.
- Fill the remaining budget with short sequences from the beginning.
This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.
- Parameters:
aa_sequences (Sequence[str]) – Unordered amino acid sequences.
token_budget (int) – Maximum approximate “tokens” per batch.
- Yields:
List[IndexedSeq] – A batch of (original_index, sequence) records.
- encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]
Encode amino acid sequences to 3Di tokens.
- Parameters:
aa_sequences (List[str]) – List of amino acid sequences (upper-case).
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use accelerate for multi-GPU parallel encoding
gpu_ids (List[int] | None) – Optional list of GPU IDs (currently unused with accelerate)
multi_gpu_encoder (Any | None) – Optional pre-initialized encoder (for backward compatibility)
- Returns:
List of 3Di token sequences (one per input sequence)
- Raises:
EncodingError – If encoding fails
- Return type:
List[str]
- encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]
Encode protein records to 3Di records.
- Parameters:
proteins (List[ProteinRecord]) – List of ProteinRecord objects
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance.
- Returns:
List of ThreeDiRecord objects
- Return type:
List[ThreeDiRecord]
- class genome_entropy.encode3di.ThreeDiRecord(protein, three_di, method, model_name, inference_device)[source]
Bases: object
Represents a 3Di structural encoding of a protein.
- Parameters:
protein (ProteinRecord)
three_di (str)
method (Literal['prostt5_aa2fold'])
model_name (str)
inference_device (str)
- protein
The ProteinRecord that was encoded
- method
Method used for encoding (always “prostt5_aa2fold”)
- Type:
Literal[‘prostt5_aa2fold’]
- protein: ProteinRecord
- class genome_entropy.encode3di.IndexedSeq(idx, seq)[source]
Bases: object
A sequence paired with its original position in the input list.
- genome_entropy.encode3di.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)[source]
Estimate optimal token size for GPU encoding by testing increasing lengths.
This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.
- Parameters:
encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding
start_length (int) – Starting total length to test (default: 3000)
end_length (int) – Maximum total length to test (default: 10000)
step (int) – Increment between test lengths (default: 1000)
num_trials (int) – Number of trials per length for robustness (default: 3)
base_protein_length (int) – Approximate length of individual proteins (default: 100)
- Returns:
Dictionary with estimation results, containing:
‘max_length’: Maximum length successfully encoded
‘recommended_token_size’: Recommended token budget (90% of max)
‘trials_per_length’: Dictionary of successful trials per length
‘device’: Device used for testing
- Raises:
ValueError – If encoder doesn’t have required attributes or torch not available
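The probing loop described above can be sketched without a GPU. In this illustration the encoder is replaced by a hypothetical try_encode callback that receives a total length and raises MemoryError on failure (standing in for the out-of-memory error caught by the real function). Treating a length as successful only when every trial passes, and stopping at the first failing length, are assumptions of the sketch; the ‘device’ entry is omitted.

```python
from typing import Callable, Dict


def estimate_token_size_sketch(
    try_encode: Callable[[int], None],
    start_length: int = 3000,
    end_length: int = 10000,
    step: int = 1000,
    num_trials: int = 3,
) -> Dict[str, object]:
    """Probe increasing total lengths until an encoding attempt fails."""
    trials_per_length: Dict[int, int] = {}
    max_length = 0
    for length in range(start_length, end_length + 1, step):
        successes = 0
        for _ in range(num_trials):
            try:
                try_encode(length)  # hypothetical: encode `length` residues
            except MemoryError:
                break
            successes += 1
        trials_per_length[length] = successes
        if successes < num_trials:
            break  # assumption: stop at the first length that fails
        max_length = length
    return {
        "max_length": max_length,
        # 90% of the largest length that succeeded, in integer arithmetic.
        "recommended_token_size": max_length * 9 // 10,
        "trials_per_length": trials_per_length,
    }
```

The 10% margin leaves headroom for batches whose padding or composition costs more memory than the random test proteins did.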
- genome_entropy.encode3di.generate_random_protein(length, seed=None)[source]
Generate a random protein sequence of specified length.
- genome_entropy.encode3di.generate_combined_proteins(target_length, base_length=100, seed=None)[source]
Generate multiple shorter proteins that combine to target length.
Types
Data types for 3Di encoding.
- class genome_entropy.encode3di.types.ThreeDiRecord(protein, three_di, method, model_name, inference_device)[source]
Bases: object
Represents a 3Di structural encoding of a protein.
- Parameters:
protein (ProteinRecord)
three_di (str)
method (Literal['prostt5_aa2fold'])
model_name (str)
inference_device (str)
- protein
The ProteinRecord that was encoded
- method
Method used for encoding (always “prostt5_aa2fold”)
- Type:
Literal[‘prostt5_aa2fold’]
- protein: ProteinRecord
Encoder
ProstT5-based encoder for amino acid to 3Di structural token conversion.
- class genome_entropy.encode3di.encoder.ProstT5ThreeDiEncoder(model_name='gbouras13/modernprost-base', device=None)[source]
Bases: object
Encoder for converting amino acid sequences to 3Di structural tokens.
Uses the ProstT5 model from HuggingFace to predict 3Di tokens directly from protein sequences without requiring 3D structures.
- __init__(model_name='gbouras13/modernprost-base', device=None)[source]
Initialize the ProstT5 encoder.
- Parameters:
- Raises:
ModelError – If PyTorch or Transformers are not installed
DeviceError – If specified device is not available
- token_budget_batches(aa_sequences, token_budget)[source]
Yield batches of sequences (with original indices) under an approximate token budget.
Optimized strategy to address the problem of isolated long sequences:
1. Keep original indices.
2. Sort by length to minimize padding within each batch.
3. For each batch:
- Start with long sequences from the end (largest first).
- Add long sequences until adding another would exceed the budget.
- Fill the remaining budget with short sequences from the beginning.
This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.
- Parameters:
aa_sequences (Sequence[str]) – Unordered amino acid sequences.
token_budget (int) – Maximum approximate “tokens” per batch.
- Yields:
List[IndexedSeq] – A batch of (original_index, sequence) records.
- encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]
Encode amino acid sequences to 3Di tokens.
- Parameters:
aa_sequences (List[str]) – List of amino acid sequences. Note: amino acid sequences are expected to be upper-case, while 3Di sequences need to be lower-case.
encoding_size (int) – Maximum size (approx. amino acids) to encode per GPU
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.
- Returns:
List of 3Di token sequences (one per input sequence)
- Raises:
EncodingError – If encoding fails
- Return type:
List[str]
- encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]
Encode protein records to 3Di records.
- Parameters:
proteins (List[ProteinRecord]) – List of ProteinRecord objects
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.
- Returns:
List of ThreeDiRecord objects
- Return type:
List[ThreeDiRecord]
Encoding Functions
Core encoding functions for amino acid to 3Di conversion.
- genome_entropy.encode3di.encoding.preprocess_sequences(aa_sequences)[source]
Preprocess amino acid sequences for ProstT5 encoding.
- genome_entropy.encode3di.encoding.format_seconds(seconds)[source]
Format seconds as H:MM:SS (or M:SS for < 1 hour).
- genome_entropy.encode3di.encoding.get_memory_info()[source]
Get current CUDA memory allocation and reservation in GB.
- genome_entropy.encode3di.encoding.process_batches(batches_iter, encode_batch_fn, total_sequences, total_batches)[source]
Process batches of sequences and return results in original order.
- Parameters:
- Returns:
List of encoded 3Di sequences in original input order
- Raises:
EncodingError – If encoding fails
RuntimeError – If some sequences were not encoded
- Return type:
List[str]
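Restoring the original input order after batching can be sketched as follows. This is an assumption-laden illustration, not the library code: batches are taken to be lists of (original_index, sequence) pairs, encode_batch_fn is any callable that returns one output per input, and the progress-reporting arguments are left out. The name process_batches_sketch is hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def process_batches_sketch(
    batches: Iterable[Sequence[Tuple[int, str]]],
    encode_batch_fn: Callable[[List[str]], List[str]],
    total_sequences: int,
) -> List[str]:
    # One slot per input sequence, filled in by original index.
    results: List[Optional[str]] = [None] * total_sequences
    for batch in batches:
        indices = [idx for idx, _ in batch]
        encoded = encode_batch_fn([seq for _, seq in batch])
        for idx, three_di in zip(indices, encoded):
            results[idx] = three_di
    # Mirror the documented RuntimeError for sequences that were never encoded.
    missing = [i for i, value in enumerate(results) if value is None]
    if missing:
        raise RuntimeError(f"{len(missing)} sequence(s) were not encoded")
    return [value for value in results if value is not None]
```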
- genome_entropy.encode3di.encoding.encode(aa_sequences, encode_batch_fn, token_budget_batches_fn, encoding_size)[source]
Encode amino acid sequences to 3Di tokens.
This is a standalone encoding function that orchestrates the encoding pipeline.
- Parameters:
aa_sequences (List[str]) – List of amino acid sequences (uppercase, standard 20 AAs)
encode_batch_fn (Callable[[List[str]], List[str]]) – Function that encodes a batch of preprocessed sequences
token_budget_batches_fn (Callable[[List[str], int], Iterator[Any]]) – Function that batches sequences under token budget
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
- Returns:
List of 3Di token sequences (one per input sequence)
- Raises:
EncodingError – If encoding fails
- Return type:
List[str]
Token Estimator
Token size estimation for optimal GPU memory usage in 3Di encoding.
- genome_entropy.encode3di.token_estimator.generate_random_protein(length, seed=None)[source]
Generate a random protein sequence of specified length.
- genome_entropy.encode3di.token_estimator.generate_combined_proteins(target_length, base_length=100, seed=None)[source]
Generate multiple shorter proteins that combine to target length.
- genome_entropy.encode3di.token_estimator.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)[source]
Estimate optimal token size for GPU encoding by testing increasing lengths.
This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.
- Parameters:
encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding
start_length (int) – Starting total length to test (default: 3000)
end_length (int) – Maximum total length to test (default: 10000)
step (int) – Increment between test lengths (default: 1000)
num_trials (int) – Number of trials per length for robustness (default: 3)
base_protein_length (int) – Approximate length of individual proteins (default: 100)
- Returns:
Dictionary with estimation results, containing:
‘max_length’: Maximum length successfully encoded
‘recommended_token_size’: Recommended token budget (90% of max)
‘trials_per_length’: Dictionary of successful trials per length
‘device’: Device used for testing
- Raises:
ValueError – If encoder doesn’t have required attributes or torch not available
Entropy Calculation
Entropy calculation utilities.
Shannon Entropy
Shannon entropy calculation for sequences.
- class genome_entropy.entropy.shannon.EntropyReport(dna_entropy_global, orf_nt_entropy, protein_aa_entropy, three_di_entropy, alphabet_sizes)[source]
Bases: object
Report containing entropy values at different representation levels.
- Parameters:
- genome_entropy.entropy.shannon.shannon_entropy(sequence, alphabet=None, normalize=False)[source]
Calculate Shannon entropy of a sequence.
Shannon entropy: H = -Σ(p_i × log₂(p_i)) where p_i is the frequency of symbol i.
- Parameters:
- Returns:
Shannon entropy value (bits).
- Returns 0.0 for empty sequences
- Returns normalized entropy in [0, 1] if normalize=True
- Return type:
float
Examples
>>> shannon_entropy("AAAA")
0.0
>>> shannon_entropy("ACGT")
2.0
>>> shannon_entropy("ACGT", normalize=True, alphabet=set("ACGT"))
1.0
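For readers who want to see the formula spelled out, here is a minimal standalone sketch of the calculation. It is not the library source. Two details are assumptions: normalization divides by log₂ of the alphabet size, and when no alphabet is given the set of observed symbols is used instead. The name shannon_entropy_sketch is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def shannon_entropy_sketch(
    sequence: str,
    alphabet: Optional[Iterable[str]] = None,
    normalize: bool = False,
) -> float:
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    total = len(sequence)
    # H = sum(p * log2(1 / p)), which equals -sum(p * log2(p)).
    entropy = sum(
        (count / total) * math.log2(total / count) for count in counts.values()
    )
    if not normalize:
        return entropy
    # Assumption: normalize by the maximum entropy log2(|alphabet|).
    size = len(set(alphabet)) if alphabet is not None else len(counts)
    return entropy / math.log2(size) if size > 1 else 0.0
```

With four equally frequent symbols each p_i is 0.25, so H = 4 × 0.25 × log₂(4) = 2 bits, matching the second doctest above.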
- genome_entropy.entropy.shannon.calculate_sequence_entropy(sequence, alphabet=None, normalize=False)[source]
Calculate entropy for a biological sequence.
Convenience wrapper around shannon_entropy that handles common preprocessing (e.g., converting to uppercase).
Pipeline
Pipeline orchestration.
- class genome_entropy.pipeline.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)[source]
Bases: object
Result of running the complete DNA to 3Di pipeline.
- Parameters:
input_id (str)
input_dna_length (int)
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)
- orfs
List of ORFs found in the sequence
- Type:
- proteins
List of translated proteins
- three_dis
List of 3Di encoded structures
- Type:
- entropy
Entropy report for all representations
- proteins: List[ProteinRecord]
- three_dis: List[ThreeDiRecord]
- entropy: EntropyReport
- __init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)
- Parameters:
input_id (str)
input_dna_length (int)
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)
- Return type:
None
- genome_entropy.pipeline.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)[source]
Run the complete DNA to 3Di pipeline with entropy calculation.
Pipeline steps:
1. Read FASTA file or GenBank file
2. Find ORFs in all 6 reading frames
3. Translate ORFs to proteins
4. Encode proteins to 3Di structural tokens
5. Calculate entropy at all levels
6. Optionally match ORFs to GenBank CDS annotations
7. Optionally write results to JSON
- Parameters:
input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.
table_id (int) – NCBI genetic code table ID
min_aa_len (int) – Minimum protein length in amino acids
model_name (str) – ProstT5 model name
compute_entropy (bool) – Whether to compute entropy values
output_json (str | Path | None) – Optional path to save results as JSON
device (str | None) – Device for 3Di encoding (“cuda”, “mps”, “cpu”, or None for auto) Ignored if use_multi_gpu is True.
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.
encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.
- Returns:
List of PipelineResult objects (one per input sequence)
- Raises:
PipelineError – If any pipeline step fails
ValueError – If neither input_fasta nor genbank_file is provided
- Return type:
List[PipelineResult]
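A typical call might look like the sketch below. It uses only the parameters and PipelineResult attributes documented on this page, but it is untested against a live installation: running it requires genome_entropy, the get_orfs binary, and the model weights to be available. The helper name summarize_orfs is hypothetical.

```python
from typing import List, Tuple


def summarize_orfs(fasta_path: str) -> List[Tuple[str, int, int]]:
    """Run the pipeline on one FASTA file and summarise each input sequence."""
    # Imported inside the function so this sketch can be loaded even where
    # genome_entropy is not installed.
    from genome_entropy.pipeline import run_pipeline

    results = run_pipeline(
        input_fasta=fasta_path,
        table_id=11,    # NCBI genetic code table
        min_aa_len=30,  # skip proteins shorter than 30 amino acids
        device="cpu",   # or "cuda" / "mps" / None for auto-selection
    )
    # One PipelineResult per input sequence.
    return [(r.input_id, r.input_dna_length, len(r.orfs)) for r in results]
```

Each returned tuple holds the sequence ID, its DNA length, and the number of ORFs found.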
- genome_entropy.pipeline.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)[source]
Calculate entropy at all representation levels.
- Parameters:
dna_sequence (str) – Original DNA sequence
proteins (List[ProteinRecord]) – List of protein records
three_dis (List[ThreeDiRecord]) – List of 3Di records
- Returns:
EntropyReport with entropy values
- Return type:
EntropyReport
- class genome_entropy.pipeline.UnifiedPipelineResult(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)[source]
Bases: object
Result of running the complete DNA to 3Di pipeline (unified format).
This is the new format that eliminates redundancy by using a single dictionary of features keyed by orf_id, instead of separate parallel lists for orfs, proteins, and three_dis.
- Parameters:
- features
Dictionary mapping orf_id to UnifiedFeature objects
- Type:
- features: Dict[str, UnifiedFeature]
- class genome_entropy.pipeline.UnifiedFeature(orf_id, location, dna, protein, three_di, metadata, entropy)[source]
Bases: object
Unified representation of a biological feature (ORF and derived data).
This structure consolidates all information about a single ORF into one hierarchical object, eliminating the redundancy present in the old format where ORF data was duplicated in proteins list and protein data was duplicated in three_dis list.
- Parameters:
orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)
- location
Genomic coordinates
- dna
DNA sequence information
- protein
Protein sequence information
- three_di
3Di structural encoding
- metadata
Additional metadata
- entropy
Entropy values at all representation levels
- location: FeatureLocation
- dna: FeatureDNA
- protein: FeatureProtein
- three_di: FeatureThreeDi
- metadata: FeatureMetadata
- entropy: FeatureEntropy
- __init__(orf_id, location, dna, protein, three_di, metadata, entropy)
- Parameters:
orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)
- Return type:
None
- class genome_entropy.pipeline.FeatureLocation(start, end, strand, frame)[source]
Bases: object
Genomic location of a feature (ORF).
- strand
Strand orientation (‘+’ or ‘-’)
- Type:
Literal[‘+’, ‘-’]
- class genome_entropy.pipeline.FeatureDNA(nt_sequence, length)[source]
Bases: object
DNA-level information for a feature.
- class genome_entropy.pipeline.FeatureProtein(aa_sequence, length)[source]
Bases: object
Protein-level information for a feature.
- class genome_entropy.pipeline.FeatureThreeDi(encoding, length, method, model_name, inference_device)[source]
Bases: object
3Di structural encoding for a feature.
- class genome_entropy.pipeline.FeatureMetadata(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)[source]
Bases: object
Metadata about a feature.
- Parameters:
- class genome_entropy.pipeline.FeatureEntropy(dna_entropy, protein_entropy, three_di_entropy)[source]
Bases: object
Entropy values at different representation levels for a feature.
Runner
End-to-end pipeline orchestration for DNA to 3Di with entropy calculation.
- class genome_entropy.pipeline.runner.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)[source]
Bases: object
Result of running the complete DNA to 3Di pipeline.
- Parameters:
input_id (str)
input_dna_length (int)
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)
- orfs
List of ORFs found in the sequence
- Type:
- proteins
List of translated proteins
- three_dis
List of 3Di encoded structures
- Type:
- entropy
Entropy report for all representations
- proteins: List[ProteinRecord]
- three_dis: List[ThreeDiRecord]
- entropy: EntropyReport
- __init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)
- Parameters:
input_id (str)
input_dna_length (int)
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)
- Return type:
None
- genome_entropy.pipeline.runner.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)[source]
Run the complete DNA to 3Di pipeline with entropy calculation.
Pipeline steps:
1. Read FASTA file or GenBank file
2. Find ORFs in all 6 reading frames
3. Translate ORFs to proteins
4. Encode proteins to 3Di structural tokens
5. Calculate entropy at all levels
6. Optionally match ORFs to GenBank CDS annotations
7. Optionally write results to JSON
- Parameters:
input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.
table_id (int) – NCBI genetic code table ID
min_aa_len (int) – Minimum protein length in amino acids
model_name (str) – ProstT5 model name
compute_entropy (bool) – Whether to compute entropy values
output_json (str | Path | None) – Optional path to save results as JSON
device (str | None) – Device for 3Di encoding (“cuda”, “mps”, “cpu”, or None for auto) Ignored if use_multi_gpu is True.
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.
encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.
- Returns:
List of PipelineResult objects (one per input sequence)
- Raises:
PipelineError – If any pipeline step fails
ValueError – If neither input_fasta nor genbank_file is provided
- Return type:
List[PipelineResult]
- genome_entropy.pipeline.runner.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)[source]
Calculate entropy at all representation levels.
- Parameters:
dna_sequence (str) – Original DNA sequence
proteins (List[ProteinRecord]) – List of protein records
three_dis (List[ThreeDiRecord]) – List of 3Di records
- Returns:
EntropyReport with entropy values
- Return type:
EntropyReport
I/O
I/O utilities for genome_entropy.
FASTA I/O
FASTA file reading and writing utilities.
- genome_entropy.io.fasta.read_fasta(fasta_path)[source]
Read a FASTA file and return a dictionary of sequence_id -> sequence.
Automatically detects and handles gzipped files (ending in .gz).
- Parameters:
fasta_path (str | Path) – Path to FASTA file (plain text or gzipped)
- Returns:
Dictionary mapping sequence IDs to sequences
- Raises:
FileNotFoundError – If the FASTA file doesn’t exist
ValueError – If the FASTA file is malformed
- Return type:
Dict[str, str]
- genome_entropy.io.fasta.read_fasta_iter(fasta_path)[source]
Read a FASTA file and yield (sequence_id, sequence) tuples.
Memory-efficient iterator for large FASTA files. Automatically detects and handles gzipped files (ending in .gz).
- Parameters:
fasta_path (str | Path) – Path to FASTA file (plain text or gzipped)
- Yields:
Tuples of (sequence_id, sequence)
- Raises:
FileNotFoundError – If the FASTA file doesn’t exist
ValueError – If the FASTA file is malformed
- Return type:
Iterator[Tuple[str, str]]
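A gzip-aware FASTA iterator of the kind described here can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library code: the sequence ID is taken to be the first whitespace-delimited token of the header line, blank lines are skipped, and compression is detected purely from the .gz suffix. The name read_fasta_iter_sketch is hypothetical.

```python
import gzip
from pathlib import Path
from typing import Iterator, List, Optional, Tuple, Union


def read_fasta_iter_sketch(
    fasta_path: Union[str, Path],
) -> Iterator[Tuple[str, str]]:
    path = Path(fasta_path)
    if not path.exists():
        raise FileNotFoundError(f"FASTA file not found: {path}")
    # Pick the opener from the file suffix; both are used in text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    seq_id: Optional[str] = None
    chunks: List[str] = []
    with opener(path, "rt") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                header = line[1:].split()
                if not header:
                    raise ValueError("FASTA header without a sequence ID")
                seq_id, chunks = header[0], []
            elif seq_id is None:
                raise ValueError("sequence data found before the first header")
            else:
                chunks.append(line)
    # Emit the final record, which has no following header to trigger it.
    if seq_id is not None:
        yield seq_id, "".join(chunks)
```

Wrapping the iterator in dict(...) gives the same shape of result that read_fasta is documented to return.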
JSON I/O
JSON serialization for data models.
- genome_entropy.io.jsonio.to_json_dict(obj)[source]
Convert a dataclass object to a JSON-serializable dictionary.
Recursively handles nested dataclasses, lists, and dictionaries.
- genome_entropy.io.jsonio.convert_pipeline_result_to_unified(pipeline_result)[source]
Convert PipelineResult to UnifiedPipelineResult format.
This function transforms the old redundant format (separate orfs, proteins, three_dis lists) into the new unified format where each feature appears exactly once with all its related data organized hierarchically.
OLD FORMAT PROBLEM:
The old format had three parallel lists:
- orfs: [ORF1, ORF2, …]
- proteins: [{orf: ORF1, aa_seq: …}, {orf: ORF2, aa_seq: …}, …]
- three_dis: [{protein: {orf: ORF1, …}, 3di: …}, …]
This caused:
1. ORF data duplicated 3 times (in orfs, inside proteins, inside three_dis)
2. Protein data duplicated 2 times (in proteins, inside three_dis)
3. ~2-3x larger files due to redundancy
4. Risk of inconsistency if data differs between copies
NEW UNIFIED FORMAT:
Single features dictionary with hierarchical organization:
features: {
  “orf_1”: {
    location: {start, end, strand, frame},
    dna: {sequence, length},
    protein: {sequence, length},
    three_di: {encoding, length, method, model, device},
    metadata: {parent_id, table_id, has_start, has_stop, in_genbank},
    entropy: {dna_entropy, protein_entropy, three_di_entropy}
  }
}
Benefits:
1. Each piece of information stored exactly once
2. 40-50% smaller file sizes
3. Direct O(1) access by orf_id
4. Clear hierarchical organization matching biological concepts
5. Single source of truth - no inconsistency possible
- Parameters:
pipeline_result – PipelineResult object or list of PipelineResult objects
- Returns:
UnifiedPipelineResult object or list of UnifiedPipelineResult objects
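The reshaping from nested records to a features dictionary can be illustrated with plain dictionaries. This sketch is not the library's converter: it works on dicts rather than dataclasses, covers only the location, dna, protein and three_di groups, and the function name unify_sketch is hypothetical. The field names follow the OrfRecord, ProteinRecord and ThreeDiRecord signatures documented on this page.

```python
from typing import Any, Dict, List


def unify_sketch(three_dis: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Build a features dict keyed by orf_id from nested 3Di records."""
    features: Dict[str, Dict[str, Any]] = {}
    for record in three_dis:
        protein = record["protein"]  # nested copy of the protein record
        orf = protein["orf"]         # nested copy of the ORF record
        features[orf["orf_id"]] = {
            "orf_id": orf["orf_id"],
            "location": {
                key: orf[key] for key in ("start", "end", "strand", "frame")
            },
            "dna": {
                "nt_sequence": orf["nt_sequence"],
                "length": len(orf["nt_sequence"]),
            },
            "protein": {
                "aa_sequence": protein["aa_sequence"],
                "length": len(protein["aa_sequence"]),
            },
            "three_di": {
                "encoding": record["three_di"],
                "length": len(record["three_di"]),
            },
        }
    return features
```

Because every 3Di record already nests its protein and ORF, walking that one list is enough to recover each feature exactly once.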
- genome_entropy.io.jsonio.write_json(data, output_path, indent=2)[source]
Write data to a JSON file.
Automatically handles dataclass objects by converting them to dictionaries. If data contains PipelineResult objects, they are automatically converted to the new unified format to eliminate redundancy. Automatically compresses output if filename ends with .gz.
AUTOMATIC CONVERSION:
This function transparently converts old-format PipelineResult objects to the new unified format. This means:
Users don’t need to manually call convert_pipeline_result_to_unified()
All JSON output from the pipeline automatically uses the new format
The conversion happens only once during serialization
No changes needed to pipeline code or user scripts
MAPPING: Old Keys → New Structure
- OLD FORMAT:
orfs[i].orf_id → features[orf_id].orf_id
orfs[i].start → features[orf_id].location.start
orfs[i].nt_sequence → features[orf_id].dna.nt_sequence
proteins[i].aa_sequence → features[orf_id].protein.aa_sequence
three_dis[i].three_di → features[orf_id].three_di.encoding
entropy.orf_nt_entropy[id] → features[id].entropy.dna_entropy
- NEW FORMAT adds:
schema_version: “2.0.0” (for compatibility tracking)
features: dict (replaces orfs, proteins, three_dis lists)
Hierarchical organization (location, dna, protein, three_di, metadata, entropy)
- Parameters:
data – Data to write (dataclass, dict, list, etc.)
output_path – Path to output JSON file (plain text or .gz for compressed)
indent – Indentation level for pretty printing (default: 2)
- genome_entropy.io.jsonio.read_json(input_path)[source]
Read JSON data from a file.
Automatically detects and handles gzipped files (ending in .gz).
- Parameters:
input_path (str | Path) – Path to input JSON file (plain text or gzipped)
- Returns:
Parsed JSON data (dict, list, etc.)
- Raises:
FileNotFoundError – If the JSON file doesn’t exist
json.JSONDecodeError – If the file contains invalid JSON
- Return type:
Configuration
Configuration and constants for genome_entropy.
Errors
Custom exceptions for genome_entropy.
- exception genome_entropy.errors.OrfEntropyError[source]
Bases: Exception
Base exception for genome_entropy package.
- exception genome_entropy.errors.ConfigurationError[source]
Bases: OrfEntropyError
Raised when there’s a configuration error.
- exception genome_entropy.errors.InputError[source]
Bases: OrfEntropyError
Raised when input data is invalid or cannot be processed.
- exception genome_entropy.errors.OrfFinderError[source]
Bases: OrfEntropyError
Raised when ORF finding fails.
- exception genome_entropy.errors.TranslationError[source]
Bases: OrfEntropyError
Raised when translation fails.
- exception genome_entropy.errors.EncodingError[source]
Bases: OrfEntropyError
Raised when 3Di encoding fails.
- exception genome_entropy.errors.ModelError[source]
Bases: OrfEntropyError
Raised when model loading or inference fails.
- exception genome_entropy.errors.DeviceError[source]
Bases: OrfEntropyError
Raised when device selection or initialization fails.
- exception genome_entropy.errors.PipelineError[source]
Bases: OrfEntropyError
Raised when the pipeline orchestration fails.
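Because every exception listed here derives from OrfEntropyError, a handler for the base class also catches the specific errors. The sketch below uses local stand-in classes with the same names so it runs without the package installed; the `classify` helper is hypothetical and exists only to show which handler fires.

```python
class OrfEntropyError(Exception):
    """Stand-in for the package's base exception."""


class OrfFinderError(OrfEntropyError):
    """Stand-in for the ORF-finding error."""


class EncodingError(OrfEntropyError):
    """Stand-in for the 3Di encoding error."""


def classify(exc: Exception) -> str:
    """Report which handler catches exc when handlers are ordered specific-first."""
    try:
        raise exc
    except OrfFinderError:
        return "orf-finder handler"
    except OrfEntropyError:
        # Any other package error falls through to the base-class handler
        return "base handler"
    except Exception:
        return "unrelated"


print(classify(OrfFinderError("get_orfs not found")))  # orf-finder handler
print(classify(EncodingError("model failed")))         # base handler
```

Ordering matters: placing the OrfEntropyError handler first would shadow the more specific one.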
Logging
Centralized logging configuration for genome_entropy.
This module provides a single source for configuring logging throughout the application. It supports:
- Multiple log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- Output to file or STDOUT
- Consistent format across all modules
- genome_entropy.logging_config.configure_logging(level=20, log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', date_format='%Y-%m-%d %H:%M:%S', force=False)[source]
Configure logging for the entire application.
This should be called once at application startup (e.g., in CLI main).
- Parameters:
level (int | str) – Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) as int or string
log_file (str | Path | None) – Optional path to log file. If None, logs to STDOUT
log_format (str) – Format string for log messages
date_format (str) – Format string for timestamps
force (bool) – If True, reconfigure even if already configured
- Return type:
None
Examples
>>> configure_logging(level=logging.DEBUG, log_file="app.log")
>>> configure_logging(level="INFO")  # Log to STDOUT
>>> configure_logging(level="DEBUG", log_file=None)  # Debug to STDOUT
- genome_entropy.logging_config.get_logger(name)[source]
Get a logger instance for a module.
This is the preferred way to get loggers in the application.
- Parameters:
name (str) – Name of the logger (usually __name__ of the module)
- Returns:
Configured logger instance
- Return type:
logging.Logger
Example
>>> logger = get_logger(__name__)
>>> logger.info("Processing started")
- genome_entropy.logging_config.is_configured()[source]
Check if logging has been configured.
- Returns:
True if configure_logging() has been called
- Return type:
bool
- genome_entropy.logging_config.get_log_file()[source]
Get the current log file path.
- Returns:
Path to log file, or None if logging to STDOUT
- Return type:
Path | None
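The default log_format and date_format shown in the configure_logging signature are ordinary standard-library logging format strings, and the default level=20 is logging.INFO. The sketch below shows what those defaults produce using only the standard library; `make_sketch_logger` is a hypothetical helper, not part of genome_entropy, and it writes to an in-memory stream so the output can be inspected.

```python
import io
import logging

# Defaults copied from the configure_logging signature
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_sketch_logger(name: str, stream: io.StringIO, level: int = logging.INFO) -> logging.Logger:
    """Build a logger writing to stream with the documented default formats (hypothetical helper)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger


buffer = io.StringIO()
log = make_sketch_logger("sketch.module", buffer)
log.debug("hidden at INFO level")  # below the threshold, so not written
log.info("Processing started")
print(buffer.getvalue(), end="")
```

Each emitted line has the shape `<timestamp> - sketch.module - INFO - Processing started`.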
Usage Examples
ORF Finding
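from genome_entropy.io.fasta import read_fasta
from genome_entropy.orf.finder import find_orfs
# find_orfs takes sequences, not a file path, so load the FASTA first
# (assumes read_fasta returns them in the form find_orfs expects)
sequences = read_fasta("genome.fasta")
# Find ORFs in the loaded sequences
orfs = find_orfs(
    sequences,
    table_id=11,
    min_nt_length=90
)
# Examine results
for orf in orfs:
    print(f"ORF {orf.orf_id}: {orf.start}-{orf.end} ({orf.strand})")
    print(f" Nucleotide: {orf.nt_sequence[:50]}...")
    print(f" Amino acid: {orf.aa_sequence[:50]}...")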
Translation
from genome_entropy.translate.translator import translate_orfs
# Translate ORFs
proteins = translate_orfs(orfs, table_id=11)
for protein in proteins:
    print(f"Protein from {protein.orf.orf_id}: {protein.aa_sequence}")
    print(f" Length: {protein.aa_length} amino acids")
3Di Encoding
from genome_entropy.encode3di import ProstT5ThreeDiEncoder
# Initialize encoder
encoder = ProstT5ThreeDiEncoder(
    model_name="Rostlab/ProstT5_fp16",
    device="auto"  # Auto-detect CUDA/MPS/CPU
)
# Encode proteins to 3Di
aa_sequences = [p.aa_sequence for p in proteins]
three_di_tokens = encoder.encode(
    aa_sequences,
    batch_size=4,
    encoding_size=5000
)
for i, tokens in enumerate(three_di_tokens):
    print(f"Protein {i}: {tokens[:50]}...")
Token Estimation
from genome_entropy.encode3di import ProstT5ThreeDiEncoder, estimate_token_size
# Initialize encoder
encoder = ProstT5ThreeDiEncoder()
# Find optimal encoding size
results = estimate_token_size(
    encoder=encoder,
    start_length=3000,
    end_length=10000,
    step=1000,
    num_trials=3
)
print(f"Recommended encoding size: {results['recommended_token_size']} AA")
# Use in encoding (aa_sequences is the list of protein sequences to encode)
three_di = encoder.encode(
    aa_sequences,
    encoding_size=results['recommended_token_size']
)
Shannon Entropy
from genome_entropy.entropy.shannon import shannon_entropy, calculate_sequence_entropy
# Calculate basic entropy
dna = "ATCGATCGATCG"
entropy = shannon_entropy(dna)
print(f"DNA entropy: {entropy:.2f} bits")
# Normalized entropy
normalized = shannon_entropy(
    dna,
    alphabet=set("ACGT"),
    normalize=True
)
print(f"Normalized: {normalized:.2f}")
# For biological sequences
protein = "MKKYTLFLGLLGLVAAGTLWGLSACCA"
protein_entropy = calculate_sequence_entropy(protein)
print(f"Protein entropy: {protein_entropy:.2f} bits")
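For reference, Shannon entropy in bits is H = -Σ p_i · log2(p_i), where p_i is the relative frequency of symbol i; normalizing divides H by log2 of the alphabet size, which gives a value between 0 and 1. The sketch below implements that textbook definition with the standard library. `shannon_entropy_sketch` is a hypothetical name, and the library's own function may treat edge cases (empty input, symbols outside the alphabet) differently.

```python
import math
from collections import Counter
from typing import Optional, Set


def shannon_entropy_sketch(
    sequence: str,
    alphabet: Optional[Set[str]] = None,
    normalize: bool = False,
) -> float:
    """Shannon entropy in bits, optionally divided by log2(alphabet size)."""
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    total = len(sequence)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    if normalize:
        # Assumption: fall back to the observed symbols when no alphabet is given
        size = len(alphabet) if alphabet else len(counts)
        return entropy / math.log2(size) if size > 1 else 0.0
    return entropy


dna = "ATCGATCGATCG"  # four symbols, each with frequency 3/12
print(f"{shannon_entropy_sketch(dna):.2f} bits")  # 2.00 bits
print(f"{shannon_entropy_sketch(dna, alphabet=set('ACGT'), normalize=True):.2f}")  # 1.00
```

With four equally frequent symbols every p_i is 0.25, so H reaches its maximum of log2(4) = 2 bits and the normalized value is 1.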
Complete Pipeline
from pathlib import Path
from genome_entropy.pipeline.runner import run_pipeline
# Run complete pipeline
results = run_pipeline(
    input_fasta=Path("genome.fasta"),
    output_json=Path("results.json"),
    table_id=11,
    min_aa_len=30,
    model_name="Rostlab/ProstT5_fp16",
    device="auto",
    compute_entropy=True
)
# Process results
for result in results:
    print(f"Sequence: {result.input_id}")
    print(f" DNA length: {result.input_dna_length}")
    print(f" ORFs found: {len(result.orfs)}")
    print(f" DNA entropy: {result.entropy.dna_entropy_global:.2f}")
    for orf_id, entropy in result.entropy.protein_aa_entropy.items():
        print(f" Protein {orf_id} entropy: {entropy:.2f}")
I/O Operations
from genome_entropy.io.fasta import read_fasta, write_fasta
from genome_entropy.io.jsonio import write_json, read_json
# Read FASTA
sequences = read_fasta("genome.fasta")
for seq_id, seq in sequences:
    print(f"{seq_id}: {len(seq)} bp")
# Write FASTA
output_sequences = [
    ("seq1", "ATCGATCG"),
    ("seq2", "GCTAGCTA")
]
write_fasta("output.fasta", output_sequences)
# Write/read JSON (a .gz suffix enables compression)
data = {"key": "value", "results": [1, 2, 3]}
write_json(data, "output.json")
loaded = read_json("output.json")
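The JSON helpers are documented as converting dataclasses to dictionaries and compressing when the filename ends in .gz. A minimal standard-library sketch of that behavior follows. The names `write_json_sketch`, `read_json_sketch`, and `Point` are hypothetical, and this is not the library's implementation (it does not perform the unified-format conversion).

```python
import gzip
import json
import tempfile
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path
from typing import Any


def _to_jsonable(obj: Any) -> Any:
    """json.dump fallback: turn dataclass instances into dicts."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def write_json_sketch(data: Any, output_path: Path, indent: int = 2) -> None:
    # Pick the opener from the file suffix; both accept text mode
    opener = gzip.open if str(output_path).endswith(".gz") else open
    with opener(output_path, "wt", encoding="utf-8") as handle:
        json.dump(data, handle, indent=indent, default=_to_jsonable)


def read_json_sketch(input_path: Path) -> Any:
    opener = gzip.open if str(input_path).endswith(".gz") else open
    with opener(input_path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


@dataclass
class Point:  # toy dataclass used only for this demonstration
    x: int
    y: int


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.json.gz"
    write_json_sketch({"point": Point(1, 2)}, path)
    is_gzipped = path.read_bytes()[:2] == b"\x1f\x8b"  # gzip magic number
    loaded_demo = read_json_sketch(path)

print(is_gzipped, loaded_demo)
```

The dataclass comes back as a plain dictionary, so any code that reads the file works with dicts and lists rather than the original record types.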
Error Handling
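from genome_entropy.errors import (
    OrfEntropyError,
    OrfFinderError,
    TranslationError,
    EncodingError
)
from genome_entropy.orf.finder import find_orfs
try:
    # sequences holds the DNA sequences loaded earlier with read_fasta
    orfs = find_orfs(sequences, table_id=11)
# Catch the specific error first, then the package-wide base class
except OrfFinderError as e:
    print(f"ORF finding failed: {e}")
except OrfEntropyError as e:
    print(f"General error: {e}")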
Custom Logging
from genome_entropy.logging_config import configure_logging, get_logger
# Configure logging
configure_logging(level="DEBUG", log_file="debug.log")
# Get logger for your module (get_logger is the preferred way)
logger = get_logger(__name__)
logger.info("Starting analysis")
logger.debug("Detailed debug information")
Advanced: Custom Batching
from genome_entropy.encode3di.encoder import ProstT5ThreeDiEncoder
encoder = ProstT5ThreeDiEncoder()
# Create batches with token budget
sequences = ["MKKYTLFLG", "ACDEFGHIK", ...]
batches = encoder.token_budget_batches(
    sequences,
    max_total_length=5000
)
# Process each batch
all_results = []
for batch in batches:
    batch_results = encoder._encode_batch(batch)
    all_results.extend(batch_results)
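The idea behind a token budget is to group sequences so that the summed length of each batch stays under a limit, which keeps memory use per batch roughly bounded. The sketch below shows one simple greedy way to do this. `token_budget_batches_sketch` is a hypothetical function, and the encoder's own method may differ (for example by sorting sequences by length or tracking original indices).

```python
from typing import Iterator, List


def token_budget_batches_sketch(
    sequences: List[str], max_total_length: int
) -> Iterator[List[str]]:
    """Greedily group sequences so each batch's summed length stays within the budget."""
    batch: List[str] = []
    batch_length = 0
    for seq in sequences:
        # Start a new batch when adding this sequence would exceed the budget
        if batch and batch_length + len(seq) > max_total_length:
            yield batch
            batch, batch_length = [], 0
        batch.append(seq)
        batch_length += len(seq)
    if batch:
        yield batch


demo = ["MKKYTLFLG", "ACDEFGHIK", "MK", "ACDEFGHIKLMNPQRSTVWY"]  # lengths 9, 9, 2, 20
for group in token_budget_batches_sketch(demo, max_total_length=20):
    print([len(s) for s in group])
# [9, 9, 2]
# [20]
```

In this sketch a sequence longer than the budget still gets a batch of its own rather than being dropped or truncated.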
Data Classes
OrfRecord
from dataclasses import dataclass
from typing import Literal
@dataclass
class OrfRecord:
    parent_id: str           # Source sequence ID
    orf_id: str              # Unique ORF identifier
    start: int               # 0-based, inclusive
    end: int                 # 0-based, exclusive
    strand: Literal["+", "-"]
    frame: int               # 0, 1, 2
    nt_sequence: str         # Nucleotide sequence
    aa_sequence: str         # Amino acid sequence
    table_id: int            # NCBI translation table
    has_start_codon: bool
    has_stop_codon: bool
    in_genbank: bool = False # Optional flag, defaults to False
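The start and end fields use 0-based, half-open coordinates, which is the same convention as Python slicing. The sketch below shows what that implies for a forward-strand ORF on a toy sequence. The helper names are hypothetical, and how the library reports coordinates for the reverse strand is not covered here.

```python
from typing import Tuple


def orf_length(start: int, end: int) -> int:
    """Length in nucleotides of a 0-based, half-open interval [start, end)."""
    return end - start


def to_one_based_inclusive(start: int, end: int) -> Tuple[int, int]:
    """Convert [start, end) to the 1-based, inclusive form common in annotation formats."""
    return start + 1, end


parent = "CCATGAAATAAGG"  # toy parent sequence
start, end = 2, 11        # forward-strand ORF: ATG AAA TAA

nt_sequence = parent[start:end]  # slicing matches the coordinate convention
print(nt_sequence)                         # ATGAAATAA
print(orf_length(start, end))              # 9
print(to_one_based_inclusive(start, end))  # (3, 11)
```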
ThreeDiRecord
@dataclass
class ThreeDiRecord:
    orf_id: str
    three_di: str  # 3Di token sequence
    method: Literal["prostt5_aa2fold"]
    model_name: str
    inference_device: str  # "cuda", "mps", or "cpu"
EntropyReport
@dataclass
class EntropyReport:
    dna_entropy_global: float
    orf_nt_entropy: dict[str, float]  # orf_id → entropy
    protein_aa_entropy: dict[str, float]
    three_di_entropy: dict[str, float]
    alphabet_sizes: dict[str, int]
Type Hints
All modules use comprehensive type hints for better IDE support and type checking:
from typing import List, Dict, Optional, Tuple
from pathlib import Path
def find_orfs(
    sequences: Dict[str, str],  # assumed here: sequence ID -> DNA sequence
    table_id: int = 11,
    min_nt_length: int = 90,
    binary_path: str = "get_orfs"
) -> List[OrfRecord]:
    ...
Next Steps
See User Guide for conceptual overview
Check CLI Commands Reference for command-line usage
Read Development Guide for contributing