API Reference

This page documents the Python API for genome_entropy. You can use these modules directly in your Python code for finer-grained control over the pipeline.

Core Modules

`genome_entropy.orf`	ORF finding utilities.
`genome_entropy.translate`	Translation utilities.
`genome_entropy.encode3di`	3Di encoding utilities.
`genome_entropy.entropy`	Entropy calculation utilities.
`genome_entropy.pipeline`	Pipeline orchestration.
`genome_entropy.io`	I/O utilities for genome_entropy.

ORF Finding

ORF finding utilities.

Types

class genome_entropy.orf.types.OrfRecord(parent_id, orf_id, start, end, strand, frame, nt_sequence, aa_sequence, table_id, has_start_codon, has_stop_codon, in_genbank=False)[source]

Bases: object

Represents a single Open Reading Frame (ORF).

Parameters:

parent_id (str)
orf_id (str)
start (int)
end (int)
strand (Literal['+', '-'])
frame (int)
nt_sequence (str)
aa_sequence (str)
table_id (int)
has_start_codon (bool)
has_stop_codon (bool)
in_genbank (bool)

parent_id

ID of the parent DNA sequence

Type:: str

orf_id

Unique identifier for this ORF

Type:: str

start

0-based start position (inclusive)

Type:: int

end

0-based end position (exclusive)

Type:: int

strand

Strand orientation ('+' or '-')

Type:: Literal['+', '-']

frame

Reading frame (0, 1, or 2)

Type:: int

nt_sequence

Nucleotide sequence of the ORF

Type:: str

aa_sequence

Amino acid sequence of the ORF

Type:: str

table_id

NCBI genetic code table ID used

Type:: int

has_start_codon

Whether the ORF has a start codon

Type:: bool

has_stop_codon

Whether the ORF has a stop codon

Type:: bool

in_genbank

Whether this ORF matches a CDS annotated in GenBank

Type:: bool

parent_id: str

orf_id: str

start: int

end: int

strand: Literal['+', '-']

frame: int

nt_sequence: str

aa_sequence: str

table_id: int

has_start_codon: bool

has_stop_codon: bool

in_genbank: bool = False

__post_init__()[source]

Validate ORF attributes.

Return type:: None

__init__(parent_id, orf_id, start, end, strand, frame, nt_sequence, aa_sequence, table_id, has_start_codon, has_stop_codon, in_genbank=False)

Parameters:

parent_id (str)
orf_id (str)
start (int)
end (int)
strand (Literal['+', '-'])
frame (int)
nt_sequence (str)
aa_sequence (str)
table_id (int)
has_start_codon (bool)
has_stop_codon (bool)
in_genbank (bool)

Return type:

None
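The coordinate convention documented for start and end (0-based, start inclusive, end exclusive) matches Python slicing. The following is a minimal sketch with a made-up parent sequence and plus-strand coordinates. How minus-strand coordinates relate to nt_sequence is not stated on this page, so that case is deliberately left out.

```python
# Hypothetical parent sequence and coordinates, for illustration only.
parent = "CCATGAAATAGGG"
start, end = 2, 11  # 0-based, start inclusive, end exclusive

# Half-open coordinates map directly onto a Python slice.
nt_sequence = parent[start:end]

# The length is end - start, with no off-by-one correction.
orf_length = end - start

print(nt_sequence, orf_length)
```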

Finder

ORF finder wrapper using get_orfs binary.

genome_entropy.orf.finder.find_orfs(sequences, table_id=11, min_nt_length=90, binary_path='get_orfs')[source]

Find ORFs in DNA sequences using get_orfs binary.

This function wraps the external get_orfs binary (https://github.com/linsalrob/get_orfs). The binary must be installed and available in PATH or specified via binary_path.

Parameters:

sequences (Dict[str, str]) – Dictionary mapping sequence IDs to DNA sequences
table_id (int) – NCBI genetic code table ID (default: 11, bacterial)
min_nt_length (int) – Minimum ORF length in nucleotides (default: 90)
binary_path (str) – Path to the get_orfs binary (default: 'get_orfs', resolved via PATH)

Returns:

List of OrfRecord objects

Raises:

OrfFinderError – If get_orfs binary is not found or fails

Return type:

List[OrfRecord]

genome_entropy.orf.finder.reverse_complement(seq)[source]

Return the reverse complement of a DNA sequence.

Parameters:: seq (str)
Return type:: str
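For readers unfamiliar with the operation, a reverse complement can be sketched in a few lines. This is an illustrative stand-in, not the package's implementation; the function name and the handling of N and lower-case bases are assumptions.

```python
# Translation table mapping each base to its complement; N stays N.
# Lower-case handling is an assumption made for this sketch.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("ATGC"))
```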

Translation

Translation utilities.

Translator

Translation of nucleotide sequences to amino acids.

class genome_entropy.translate.translator.ProteinRecord(orf, aa_sequence, aa_length)[source]

Bases: object

Represents a translated protein from an ORF.

Parameters:

orf (OrfRecord)
aa_sequence (str)
aa_length (int)

orf

The OrfRecord that was translated

Type:: genome_entropy.orf.types.OrfRecord

aa_sequence

The amino acid sequence

Type:: str

aa_length

Length of the amino acid sequence

Type:: int

orf: OrfRecord

aa_sequence: str

aa_length: int

__post_init__()[source]

Validate protein attributes.

Return type:: None

__init__(orf, aa_sequence, aa_length)

Parameters:

orf (OrfRecord)
aa_sequence (str)
aa_length (int)

Return type:

None

genome_entropy.translate.translator.translate_orf(orf, table_id=11)[source]

Translate an ORF to a protein sequence.

Uses the pygenetic-code library for translation with NCBI genetic codes. Ambiguous codons (containing N or other IUPAC codes) are translated to ‘X’.

Parameters:

orf (OrfRecord) – OrfRecord to translate
table_id (int) – NCBI genetic code table ID (default: 11, bacterial)

Returns:

ProteinRecord with translated sequence

Raises:

TranslationError – If translation fails

Return type:

ProteinRecord
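The rule that ambiguous codons become 'X' can be illustrated without the pygenetic-code library. The sketch below builds the amino acid assignments of NCBI table 11 by hand and is only an approximation of what translate_orf does: the function name is hypothetical, alternative start codons are not handled, and whether the real function emits '*' for a stop codon is not stated on this page.

```python
from itertools import product

# NCBI codon ordering: bases T, C, A, G with the third position varying fastest.
_BASES = "TCAG"
# Amino acid assignments for NCBI table 11 (identical to table 1 for residues).
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate_sketch(nt_sequence: str) -> str:
    """Translate codon by codon; any codon not in the table becomes 'X'."""
    seq = nt_sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


print(translate_sketch("ATGAAANNNTAA"))
```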

genome_entropy.translate.translator.translate_orfs(orfs, table_id=11)[source]

Translate multiple ORFs to protein sequences.

Parameters:

orfs (List[OrfRecord]) – List of OrfRecord objects to translate
table_id (int) – NCBI genetic code table ID

Returns:

List of ProteinRecord objects

Return type:

List[ProteinRecord]

3Di Encoding

3Di encoding utilities.

class genome_entropy.encode3di.ProstT5ThreeDiEncoder(model_name='gbouras13/modernprost-base', device=None)[source]

Bases: object

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses the ProstT5 model from HuggingFace to predict 3Di tokens directly from protein sequences without requiring 3D structures.

Parameters:

model_name (str)
device (str | None)

__init__(model_name='gbouras13/modernprost-base', device=None)[source]

Initialize the ProstT5 encoder.

Parameters:

model_name (str) – HuggingFace model identifier
device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)

Raises:

ModelError – If PyTorch or Transformers are not installed
DeviceError – If specified device is not available

token_budget_batches(aa_sequences, token_budget)[source]

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:

1. Keep original indices.
2. Sort by length to minimize padding within each batch.
3. For each batch:
   - Start with long sequences from the end (largest first).
   - Add long sequences until adding another would exceed the budget.
   - Fill the remaining budget with short sequences from the beginning.

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:

aa_sequences (Sequence[str]) – Unordered amino acid sequences.
token_budget (int) – Maximum approximate “tokens” per batch.

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records.

Return type:

Iterator[List[IndexedSeq]]
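The batching strategy described for token_budget_batches can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the cost of a batch is taken to be the sum of its sequence lengths, a sequence longer than the budget is placed in a batch of its own, and the names ending in Sketch are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class IndexedSeqSketch:
    idx: int  # position in the caller's input list
    seq: str


def token_budget_batches_sketch(
    aa_sequences: Sequence[str], token_budget: int
) -> Iterator[List[IndexedSeqSketch]]:
    # Indices sorted by sequence length, shortest first.
    order = sorted(range(len(aa_sequences)), key=lambda i: len(aa_sequences[i]))
    lo, hi = 0, len(order) - 1

    def take(pos: int, batch: List[IndexedSeqSketch]) -> int:
        """Append the sequence at sorted position pos and return its length."""
        i = order[pos]
        batch.append(IndexedSeqSketch(i, aa_sequences[i]))
        return len(aa_sequences[i])

    while lo <= hi:
        batch: List[IndexedSeqSketch] = []
        # Always take the longest remaining sequence, even if it alone
        # exceeds the budget, so that every sequence ends up in a batch.
        used = take(hi, batch)
        hi -= 1
        # Add further long sequences while they still fit.
        while lo <= hi and used + len(aa_sequences[order[hi]]) <= token_budget:
            used += take(hi, batch)
            hi -= 1
        # Fill what is left of the budget with the shortest sequences.
        while lo <= hi and used + len(aa_sequences[order[lo]]) <= token_budget:
            used += take(lo, batch)
            lo += 1
        yield batch
```

Because each record carries its original index, the caller can restore the input order after encoding regardless of how the batches were composed.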

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode amino acid sequences to 3Di tokens.

Parameters:

aa_sequences (List[str]) – List of amino acid sequences. Note: amino acid sequences are expected to be upper-case, while 3Di sequences need to be lower-case.
encoding_size (int) – Maximum size (approx. amino acids) to encode per GPU
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode protein records to 3Di records.

Parameters:

proteins (List[ProteinRecord]) – List of ProteinRecord objects
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

class genome_entropy.encode3di.ModernProstThreeDiEncoder(model_name, device=None, use_accelerate=False)[source]

Bases: object

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses ModernProst models (gbouras13/modernprost-base or modernprost-profiles) from HuggingFace to predict 3Di tokens directly from protein sequences.

Based on implementation from phold: https://github.com/gbouras13/phold/blob/main/src/phold/features/predict_3Di.py

Parameters:

model_name (str)
device (str | None)
use_accelerate (bool)

__init__(model_name, device=None, use_accelerate=False)[source]

Initialize the ModernProst encoder.

Parameters:

model_name (str) – HuggingFace model identifier (gbouras13/modernprost-base or modernprost-profiles)
device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)
use_accelerate (bool) – If True, use HuggingFace accelerate for multi-GPU support

Raises:

ModelError – If PyTorch or Transformers are not installed
DeviceError – If specified device is not available

token_budget_batches(aa_sequences, token_budget)[source]

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:

1. Keep original indices.
2. Sort by length to minimize padding within each batch.
3. For each batch:
   - Start with long sequences from the end (largest first).
   - Add long sequences until adding another would exceed the budget.
   - Fill the remaining budget with short sequences from the beginning.

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:

aa_sequences (Sequence[str]) – Unordered amino acid sequences.
token_budget (int) – Maximum approximate “tokens” per batch.

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records.

Return type:

Iterator[List[IndexedSeq]]

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode amino acid sequences to 3Di tokens.

Parameters:

aa_sequences (List[str]) – List of amino acid sequences (upper-case).
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use accelerate for multi-GPU parallel encoding
gpu_ids (List[int] | None) – Optional list of GPU IDs (currently unused with accelerate)
multi_gpu_encoder (Any | None) – Optional pre-initialized encoder (for backward compatibility)

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode protein records to 3Di records.

Parameters:

proteins (List[ProteinRecord]) – List of ProteinRecord objects
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

class genome_entropy.encode3di.ThreeDiRecord(protein, three_di, method, model_name, inference_device)[source]

Bases: object

Represents a 3Di structural encoding of a protein.

Parameters:

protein (ProteinRecord)
three_di (str)
method (Literal['prostt5_aa2fold'])
model_name (str)
inference_device (str)

protein

The ProteinRecord that was encoded

Type:: genome_entropy.translate.translator.ProteinRecord

three_di

The 3Di token sequence

Type:: str

method

Method used for encoding (always “prostt5_aa2fold”)

Type:: Literal[‘prostt5_aa2fold’]

model_name

Name of the ProstT5 model used

Type:: str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:: str

protein: ProteinRecord

three_di: str

method: Literal['prostt5_aa2fold']

model_name: str

inference_device: str

__init__(protein, three_di, method, model_name, inference_device)

Parameters:

protein (ProteinRecord)
three_di (str)
method (Literal['prostt5_aa2fold'])
model_name (str)
inference_device (str)

Return type:

None

class genome_entropy.encode3di.IndexedSeq(idx, seq)[source]

Bases: object

A sequence paired with its original position in the input list.

Parameters:

idx (int)
seq (str)

idx: int

seq: str

__init__(idx, seq)

Parameters:

idx (int)
seq (str)

Return type:

None

genome_entropy.encode3di.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)[source]

Estimate optimal token size for GPU encoding by testing increasing lengths.

This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.

Parameters:

encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding
start_length (int) – Starting total length to test (default: 3000)
end_length (int) – Maximum total length to test (default: 10000)
step (int) – Increment between test lengths (default: 1000)
num_trials (int) – Number of trials per length for robustness (default: 3)
base_protein_length (int) – Approximate length of individual proteins (default: 100)

Returns:

Dictionary with estimation results:

'max_length': Maximum length successfully encoded
'recommended_token_size': Recommended token budget (90% of max)
'trials_per_length': Dictionary of successful trials per length
'device': Device used for testing

Return type:

dict

Raises:

ValueError – If encoder doesn’t have required attributes or torch not available
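The search that estimate_token_size performs can be sketched as a simple ramp: try increasing total lengths, stop at the first out-of-memory failure, and recommend 90% of the last length that worked. The version below is a simplified illustration. It uses Python's built-in MemoryError and a fake encoder in place of a real GPU, omits the repeated trials per length, and all names in it are hypothetical.

```python
from typing import Callable, Dict


def estimate_token_size_sketch(
    try_encode: Callable[[int], None],
    start_length: int = 3000,
    end_length: int = 10000,
    step: int = 1000,
) -> Dict[str, int]:
    """Return the largest tested length that encoded without running out of memory."""
    max_length = 0
    for total_length in range(start_length, end_length + 1, step):
        try:
            try_encode(total_length)
        except MemoryError:
            break  # larger inputs would fail as well
        max_length = total_length
    # Integer arithmetic keeps the 90% recommendation exact.
    return {"max_length": max_length, "recommended_token_size": max_length * 9 // 10}


def fake_encode(total_length: int) -> None:
    # Hypothetical stand-in for a GPU that runs out of memory above 6500 residues.
    if total_length > 6500:
        raise MemoryError("simulated out-of-memory")


estimate = estimate_token_size_sketch(fake_encode)
print(estimate)
```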

genome_entropy.encode3di.generate_random_protein(length, seed=None)[source]

Generate a random protein sequence of specified length.

Parameters:

length (int) – Length of the protein sequence
seed (int | None) – Random seed for reproducibility (optional)

Returns:

Random protein sequence using the 20 standard amino acids

Return type:

str

genome_entropy.encode3di.generate_combined_proteins(target_length, base_length=100, seed=None)[source]

Generate multiple shorter proteins that combine to target length.

Parameters:

target_length (int) – Total target length across all proteins
base_length (int) – Approximate length of each individual protein
seed (int | None) – Random seed for reproducibility (optional)

Returns:

List of protein sequences that total approximately target_length

Return type:

List[str]
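The two generator helpers can be approximated as shown below. This is a sketch under assumptions rather than the package's code: it draws residues uniformly, uses fixed-size chunks plus one remainder chunk to reach the target length, and the function names are hypothetical.

```python
import random
from typing import List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_protein_sketch(length: int, seed: Optional[int] = None) -> str:
    """Draw residues uniformly at random; the same seed gives the same sequence."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def combined_proteins_sketch(
    target_length: int, base_length: int = 100, seed: Optional[int] = None
) -> List[str]:
    """Split target_length into base_length chunks plus one shorter remainder."""
    rng = random.Random(seed)
    lengths = [base_length] * (target_length // base_length)
    if target_length % base_length:
        lengths.append(target_length % base_length)
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(n)) for n in lengths]
```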

Types

Data types for 3Di encoding.

class genome_entropy.encode3di.types.ThreeDiRecord(protein, three_di, method, model_name, inference_device)[source]

Bases: object

Represents a 3Di structural encoding of a protein.

Parameters:

protein (ProteinRecord)
three_di (str)
method (Literal['prostt5_aa2fold'])
model_name (str)
inference_device (str)

protein

The ProteinRecord that was encoded

Type:: genome_entropy.translate.translator.ProteinRecord

three_di

The 3Di token sequence

Type:: str

method

Method used for encoding (always “prostt5_aa2fold”)

Type:: Literal[‘prostt5_aa2fold’]

model_name

Name of the ProstT5 model used

Type:: str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:: str

protein: ProteinRecord

three_di: str

method: Literal['prostt5_aa2fold']

model_name: str

inference_device: str

__init__(protein, three_di, method, model_name, inference_device)

Parameters:

protein (ProteinRecord)
three_di (str)
method (Literal['prostt5_aa2fold'])
model_name (str)
inference_device (str)

Return type:

None

class genome_entropy.encode3di.types.IndexedSeq(idx, seq)[source]

Bases: object

A sequence paired with its original position in the input list.

Parameters:

idx (int)
seq (str)

idx: int

seq: str

__init__(idx, seq)

Parameters:

idx (int)
seq (str)

Return type:

None

Encoder

ProstT5-based encoder for amino acid to 3Di structural token conversion.

class genome_entropy.encode3di.encoder.ProstT5ThreeDiEncoder(model_name='gbouras13/modernprost-base', device=None)[source]

Bases: object

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses the ProstT5 model from HuggingFace to predict 3Di tokens directly from protein sequences without requiring 3D structures.

Parameters:

model_name (str)
device (str | None)

__init__(model_name='gbouras13/modernprost-base', device=None)[source]

Initialize the ProstT5 encoder.

Parameters:

model_name (str) – HuggingFace model identifier
device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)

Raises:

ModelError – If PyTorch or Transformers are not installed
DeviceError – If specified device is not available

token_budget_batches(aa_sequences, token_budget)[source]

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:

1. Keep original indices.
2. Sort by length to minimize padding within each batch.
3. For each batch:
   - Start with long sequences from the end (largest first).
   - Add long sequences until adding another would exceed the budget.
   - Fill the remaining budget with short sequences from the beginning.

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:

aa_sequences (Sequence[str]) – Unordered amino acid sequences.
token_budget (int) – Maximum approximate “tokens” per batch.

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records.

Return type:

Iterator[List[IndexedSeq]]

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode amino acid sequences to 3Di tokens.

Parameters:

aa_sequences (List[str]) – List of amino acid sequences. Note: amino acid sequences are expected to be upper-case, while 3Di sequences need to be lower-case.
encoding_size (int) – Maximum size (approx. amino acids) to encode per GPU
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode protein records to 3Di records.

Parameters:

proteins (List[ProteinRecord]) – List of ProteinRecord objects
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

Encoding Functions

Core encoding functions for amino acid to 3Di conversion.

genome_entropy.encode3di.encoding.preprocess_sequences(aa_sequences)[source]

Preprocess amino acid sequences for ProstT5 encoding.

Parameters:: aa_sequences (List[str]) – List of raw amino acid sequences
Returns:: List of preprocessed sequences ready for ProstT5 model
Return type:: List[str]

genome_entropy.encode3di.encoding.format_seconds(seconds)[source]

Format seconds as H:MM:SS (or M:SS for < 1 hour).

Parameters:: seconds (float)
Return type:: str
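The documented output format (H:MM:SS, or M:SS below one hour) can be reproduced with two divmod calls. A minimal sketch with a hypothetical function name; truncating fractional seconds is an assumption, since the rounding behaviour is not described here.

```python
def format_seconds_sketch(seconds: float) -> str:
    """Format a duration as H:MM:SS, or M:SS when it is shorter than one hour."""
    total = int(seconds)  # assumption: fractional seconds are truncated
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


print(format_seconds_sketch(75), format_seconds_sketch(3725))
```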

genome_entropy.encode3di.encoding.get_memory_info()[source]

Get current CUDA memory allocation and reservation in GB.

Returns:: Tuple of (allocated_gb, reserved_gb). Returns (0, 0) if CUDA not available.
Return type:: Tuple[float, float]

genome_entropy.encode3di.encoding.process_batches(batches_iter, encode_batch_fn, total_sequences, total_batches)[source]

Process batches of sequences and return results in original order.

Parameters:

batches_iter (Iterator[Any]) – Iterator yielding batches of IndexedSeq objects
encode_batch_fn (Callable[[List[str]], List[str]]) – Function to encode a batch of sequences
total_sequences (int) – Total number of sequences being processed
total_batches (int) – Total number of batches to process

Returns:

List of encoded 3Di sequences in original input order

Raises:

EncodingError – If encoding fails
RuntimeError – If some sequences were not encoded

Return type:

List[str]
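Restoring the original order after batched encoding comes down to writing each result into the slot given by its original index. The sketch below illustrates that idea with plain (index, sequence) tuples in place of IndexedSeq objects and a toy encoder. It is not the package's implementation, and it drops the total_batches argument, which is presumably only needed for progress reporting.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def process_batches_sketch(
    batches: Iterable[List[Tuple[int, str]]],
    encode_batch_fn: Callable[[List[str]], List[str]],
    total_sequences: int,
) -> List[str]:
    """Encode each batch and place every result at its original index."""
    results: List[Optional[str]] = [None] * total_sequences
    for batch in batches:
        encoded = encode_batch_fn([seq for _, seq in batch])
        for (idx, _), out in zip(batch, encoded):
            results[idx] = out
    # Any slot still empty means a sequence never appeared in a batch.
    missing = [i for i, r in enumerate(results) if r is None]
    if missing:
        raise RuntimeError(f"{len(missing)} sequences were not encoded")
    return [r for r in results if r is not None]


# Toy encoder: lower-casing stands in for the real 3Di prediction.
toy_batches = [[(2, "CC"), (0, "AA")], [(1, "BB")]]
ordered = process_batches_sketch(toy_batches, lambda b: [s.lower() for s in b], 3)
print(ordered)
```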

genome_entropy.encode3di.encoding.encode(aa_sequences, encode_batch_fn, token_budget_batches_fn, encoding_size)[source]

Encode amino acid sequences to 3Di tokens.

This is a standalone encoding function that orchestrates the encoding pipeline.

Parameters:

aa_sequences (List[str]) – List of amino acid sequences (uppercase, standard 20 AAs)
encode_batch_fn (Callable[[List[str]], List[str]]) – Function that encodes a batch of preprocessed sequences
token_budget_batches_fn (Callable[[List[str], int], Iterator[Any]]) – Function that batches sequences under token budget
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

Token Estimator

Token size estimation for optimal GPU memory usage in 3Di encoding.

genome_entropy.encode3di.token_estimator.generate_random_protein(length, seed=None)[source]

Generate a random protein sequence of specified length.

Parameters:

length (int) – Length of the protein sequence
seed (int | None) – Random seed for reproducibility (optional)

Returns:

Random protein sequence using the 20 standard amino acids

Return type:

str

genome_entropy.encode3di.token_estimator.generate_combined_proteins(target_length, base_length=100, seed=None)[source]

Generate multiple shorter proteins that combine to target length.

Parameters:

target_length (int) – Total target length across all proteins
base_length (int) – Approximate length of each individual protein
seed (int | None) – Random seed for reproducibility (optional)

Returns:

List of protein sequences that total approximately target_length

Return type:

List[str]

genome_entropy.encode3di.token_estimator.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)[source]

Estimate optimal token size for GPU encoding by testing increasing lengths.

This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.

Parameters:

encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding
start_length (int) – Starting total length to test (default: 3000)
end_length (int) – Maximum total length to test (default: 10000)
step (int) – Increment between test lengths (default: 1000)
num_trials (int) – Number of trials per length for robustness (default: 3)
base_protein_length (int) – Approximate length of individual proteins (default: 100)

Returns:

Dictionary with estimation results:

'max_length': Maximum length successfully encoded
'recommended_token_size': Recommended token budget (90% of max)
'trials_per_length': Dictionary of successful trials per length
'device': Device used for testing

Return type:

dict

Raises:

ValueError – If encoder doesn’t have required attributes or torch not available

Entropy Calculation

Entropy calculation utilities.

Shannon Entropy

Shannon entropy calculation for sequences.

class genome_entropy.entropy.shannon.EntropyReport(dna_entropy_global, orf_nt_entropy, protein_aa_entropy, three_di_entropy, alphabet_sizes)[source]

Bases: object

Report containing entropy values at different representation levels.

Parameters:

dna_entropy_global (float)
orf_nt_entropy (Dict[str, float])
protein_aa_entropy (Dict[str, float])
three_di_entropy (Dict[str, float])
alphabet_sizes (Dict[str, int])

dna_entropy_global

Entropy of the entire input DNA sequence

Type:: float

orf_nt_entropy

Dictionary mapping ORF IDs to their nucleotide entropy

Type:: Dict[str, float]

protein_aa_entropy

Dictionary mapping ORF IDs to their amino acid entropy

Type:: Dict[str, float]

three_di_entropy

Dictionary mapping ORF IDs to their 3Di token entropy

Type:: Dict[str, float]

alphabet_sizes

Dictionary with alphabet sizes for each representation

Type:: Dict[str, int]

dna_entropy_global: float

orf_nt_entropy: Dict[str, float]

protein_aa_entropy: Dict[str, float]

three_di_entropy: Dict[str, float]

alphabet_sizes: Dict[str, int]

__init__(dna_entropy_global, orf_nt_entropy, protein_aa_entropy, three_di_entropy, alphabet_sizes)

Parameters:

dna_entropy_global (float)
orf_nt_entropy (Dict[str, float])
protein_aa_entropy (Dict[str, float])
three_di_entropy (Dict[str, float])
alphabet_sizes (Dict[str, int])

Return type:

None

genome_entropy.entropy.shannon.shannon_entropy(sequence, alphabet=None, normalize=False)[source]

Calculate Shannon entropy of a sequence.

Shannon entropy: H = -Σ(p_i × log₂(p_i)), where p_i is the relative frequency of symbol i in the sequence.

Parameters:

sequence (str) – String to calculate entropy for
alphabet (Set[str] | None) – Optional set of symbols in the alphabet for normalization
normalize (bool) – If True, normalize entropy by max possible entropy (log₂|alphabet|)

Returns:

Shannon entropy value (bits).

Returns 0.0 for empty sequences.
Returns normalized entropy in [0, 1] if normalize=True.

Return type:

float

Examples

>>> shannon_entropy("AAAA")
0.0
>>> shannon_entropy("ACGT")
2.0
>>> shannon_entropy("ACGT", normalize=True, alphabet=set("ACGT"))
1.0
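The formula and the examples above can be reproduced with a short stand-alone function. This is a sketch, not the package's code. In particular, falling back to the number of observed symbols when normalize=True and no alphabet is given is an assumption made here, as is returning 0.0 for a one-symbol alphabet.

```python
from collections import Counter
from math import log2
from typing import Optional, Set


def shannon_entropy_sketch(
    sequence: str, alphabet: Optional[Set[str]] = None, normalize: bool = False
) -> float:
    """H = sum over symbols of p * log2(1 / p), in bits."""
    if not sequence:
        return 0.0
    n = len(sequence)
    counts = Counter(sequence)
    # p = c / n, so log2(1 / p) = log2(n / c).
    h = sum((c / n) * log2(n / c) for c in counts.values())
    if normalize:
        # Assumption: without an explicit alphabet, use the observed symbols.
        size = len(alphabet) if alphabet else len(counts)
        return h / log2(size) if size > 1 else 0.0
    return h


print(shannon_entropy_sketch("ACGT"))
```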

genome_entropy.entropy.shannon.calculate_sequence_entropy(sequence, alphabet=None, normalize=False)[source]

Calculate entropy for a biological sequence.

Convenience wrapper around shannon_entropy that handles common preprocessing (e.g., converting to uppercase).

Parameters:

sequence (str) – Biological sequence (DNA, protein, 3Di tokens)
alphabet (Set[str] | None) – Optional alphabet for normalization
normalize (bool) – Whether to normalize by alphabet size

Returns:

Shannon entropy in bits (or normalized to [0, 1])

Return type:

float

genome_entropy.entropy.shannon.calculate_entropies_for_sequences(sequences, alphabet=None, normalize=False)[source]

Calculate entropy for multiple sequences.

Parameters:

sequences (Dict[str, str]) – Dictionary mapping IDs to sequences
alphabet (Set[str] | None) – Optional alphabet for normalization
normalize (bool) – Whether to normalize by alphabet size

Returns:

Dictionary mapping IDs to entropy values

Return type:

Dict[str, float]

Pipeline

Pipeline orchestration.

class genome_entropy.pipeline.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)[source]

Bases: object

Result of running the complete DNA to 3Di pipeline.

Parameters:

input_id (str)
input_dna_length (int)
orfs (List[OrfRecord])
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)

input_id

ID of the input DNA sequence

Type:: str

input_dna_length

Length of the input DNA sequence

Type:: int

orfs

List of ORFs found in the sequence

Type:: List[genome_entropy.orf.types.OrfRecord]

proteins

List of translated proteins

Type:: List[genome_entropy.translate.translator.ProteinRecord]

three_dis

List of 3Di encoded structures

Type:: List[genome_entropy.encode3di.types.ThreeDiRecord]

entropy

Entropy report for all representations

Type:: genome_entropy.entropy.shannon.EntropyReport

input_id: str

input_dna_length: int

orfs: List[OrfRecord]

proteins: List[ProteinRecord]

three_dis: List[ThreeDiRecord]

entropy: EntropyReport

__init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)

Parameters:

input_id (str)
input_dna_length (int)
orfs (List[OrfRecord])
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)

Return type:

None

genome_entropy.pipeline.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)[source]

Run the complete DNA to 3Di pipeline with entropy calculation.

Pipeline steps:

1. Read FASTA file or GenBank file
2. Find ORFs in all 6 reading frames
3. Translate ORFs to proteins
4. Encode proteins to 3Di structural tokens
5. Calculate entropy at all levels
6. Optionally match ORFs to GenBank CDS annotations
7. Optionally write results to JSON

Parameters:

input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.
table_id (int) – NCBI genetic code table ID
min_aa_len (int) – Minimum protein length in amino acids
model_name (str) – ProstT5 model name
compute_entropy (bool) – Whether to compute entropy values
output_json (str | Path | None) – Optional path to save results as JSON
device (str | None) – Device for 3Di encoding (“cuda”, “mps”, “cpu”, or None for auto). Ignored if use_multi_gpu is True.
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.
encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.

Returns:

List of PipelineResult objects (one per input sequence)

Raises:

PipelineError – If any pipeline step fails
ValueError – If neither input_fasta nor genbank_file is provided

Return type:

List[PipelineResult]
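Since run_pipeline returns one PipelineResult per input sequence, a common next step is to flatten the results into summary rows. The helper below is a hypothetical sketch that relies only on the attribute names documented for PipelineResult and EntropyReport. It is exercised with a stand-in object holding made-up values so that it runs without the package, the external get_orfs binary, or a model download. Treating entropy as possibly None is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def summarize_results(results: Iterable[Any]) -> List[Dict[str, Any]]:
    """Build one summary row per PipelineResult-like object."""
    rows = []
    for result in results:
        entropy = getattr(result, "entropy", None)
        rows.append({
            "input_id": result.input_id,
            "dna_length": result.input_dna_length,
            "n_orfs": len(result.orfs),
            "dna_entropy": entropy.dna_entropy_global if entropy is not None else None,
        })
    return rows


# Stand-in with made-up values. With the package installed, the list would
# instead come from genome_entropy.pipeline.run_pipeline(input_fasta=...).
fake_result = SimpleNamespace(
    input_id="contig_1",
    input_dna_length=1200,
    orfs=["orf_a", "orf_b"],
    entropy=SimpleNamespace(dna_entropy_global=1.98),
)
print(summarize_results([fake_result]))
```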

genome_entropy.pipeline.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)[source]

Calculate entropy at all representation levels.

Parameters:

dna_sequence (str) – Original DNA sequence
orfs (List[OrfRecord]) – List of ORF records
proteins (List[ProteinRecord]) – List of protein records
three_dis (List[ThreeDiRecord]) – List of 3Di records

Returns:

EntropyReport with entropy values

Return type:

EntropyReport

class genome_entropy.pipeline.UnifiedPipelineResult(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)[source]

Bases: object

Result of running the complete DNA to 3Di pipeline (unified format).

This is the new format that eliminates redundancy by using a single dictionary of features keyed by orf_id, instead of separate parallel lists for orfs, proteins, and three_dis.

Parameters:

schema_version (str)
input_id (str)
input_dna_length (int)
dna_entropy_global (float)
alphabet_sizes (Dict[str, int])
features (Dict[str, UnifiedFeature])

schema_version

Version of the output schema (for compatibility tracking)

Type:: str

input_id

ID of the input DNA sequence

Type:: str

input_dna_length

Length of the input DNA sequence

Type:: int

dna_entropy_global

Entropy of the entire input DNA sequence

Type:: float

alphabet_sizes

Dictionary with alphabet sizes for each representation

Type:: Dict[str, int]

features

Dictionary mapping orf_id to UnifiedFeature objects

Type:: Dict[str, genome_entropy.pipeline.types.UnifiedFeature]

schema_version: str

input_id: str

input_dna_length: int

dna_entropy_global: float

alphabet_sizes: Dict[str, int]

features: Dict[str, UnifiedFeature]

__init__(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)

Parameters:

schema_version (str)
input_id (str)
input_dna_length (int)
dna_entropy_global (float)
alphabet_sizes (Dict[str, int])
features (Dict[str, UnifiedFeature])

Return type:

None
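
The keyed-by-orf_id layout described above can be illustrated with a minimal sketch. `MiniFeature` and `MiniResult` are hypothetical stand-ins that mirror only a few of the documented fields; they are not part of genome_entropy, and the real classes carry more data.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MiniFeature:
    """Hypothetical stand-in for UnifiedFeature (only two fields kept)."""
    orf_id: str
    protein_entropy: float


@dataclass
class MiniResult:
    """Hypothetical stand-in for UnifiedPipelineResult."""
    input_id: str
    features: Dict[str, MiniFeature]


result = MiniResult(
    input_id="contig_1",
    features={
        "orf_1": MiniFeature("orf_1", 3.9),
        "orf_2": MiniFeature("orf_2", 4.1),
    },
)

# Direct lookup by orf_id, with no need to scan parallel lists.
feature = result.features["orf_2"]

# Iterating the dictionary visits each feature exactly once.
highest = max(result.features.values(), key=lambda f: f.protein_entropy)
```

Because every feature lives under a single key, per-ORF data never has to be re-joined by list index.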

class genome_entropy.pipeline.UnifiedFeature(orf_id, location, dna, protein, three_di, metadata, entropy)[source]

Bases: object

Unified representation of a biological feature (ORF and derived data).

This structure consolidates all information about a single ORF into one hierarchical object, eliminating the redundancy present in the old format where ORF data was duplicated in proteins list and protein data was duplicated in three_dis list.

Parameters:

orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)

orf_id

Unique identifier for this feature

Type:: str

location

Genomic coordinates

Type:: genome_entropy.pipeline.types.FeatureLocation

dna

DNA sequence information

Type:: genome_entropy.pipeline.types.FeatureDNA

protein

Protein sequence information

Type:: genome_entropy.pipeline.types.FeatureProtein

three_di

3Di structural encoding

Type:: genome_entropy.pipeline.types.FeatureThreeDi

metadata

Additional metadata

Type:: genome_entropy.pipeline.types.FeatureMetadata

entropy

Entropy values at all representation levels

Type:: genome_entropy.pipeline.types.FeatureEntropy

orf_id: str

location: FeatureLocation

dna: FeatureDNA

protein: FeatureProtein

three_di: FeatureThreeDi

metadata: FeatureMetadata

entropy: FeatureEntropy

__init__(orf_id, location, dna, protein, three_di, metadata, entropy)

Parameters:

orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)

Return type:

None

class genome_entropy.pipeline.FeatureLocation(start, end, strand, frame)[source]

Bases: object

Genomic location of a feature (ORF).

Parameters:

start (int)
end (int)
strand (Literal['+', '-'])
frame (int)

start

0-based start position (inclusive)

Type:: int

end

0-based end position (exclusive)

Type:: int

strand

Strand orientation ('+' or '-')

Type:: Literal['+', '-']

frame

Reading frame (0, 1, or 2)

Type:: int

start: int

end: int

strand: Literal['+', '-']

frame: int

__init__(start, end, strand, frame)

Parameters:

start (int)
end (int)
strand (Literal['+', '-'])
frame (int)

Return type:

None
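
Since start is inclusive and end is exclusive, the coordinates map directly onto Python slicing. The sketch below assumes the coordinates refer to the forward strand of the parent sequence; how minus-strand features are reported is not stated here, so that case is left out. `feature_length` and `slice_feature` are hypothetical helper names.

```python
def feature_length(start: int, end: int) -> int:
    """Length of a 0-based, half-open interval [start, end)."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    return end - start


def slice_feature(sequence: str, start: int, end: int) -> str:
    """Return the forward-strand bases covered by [start, end)."""
    return sequence[start:end]


dna = "CCATGAAATAGGG"
orf_nt = slice_feature(dna, 2, 11)  # bases at indices 2..10
```

With half-open intervals, `end - start` is the length, and adjacent features can share a boundary without overlapping.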

class genome_entropy.pipeline.FeatureDNA(nt_sequence, length)[source]

Bases: object

DNA-level information for a feature.

Parameters:

nt_sequence (str)
length (int)

nt_sequence

Nucleotide sequence

Type:: str

length

Length of nucleotide sequence

Type:: int

nt_sequence: str

length: int

__init__(nt_sequence, length)

Parameters:

nt_sequence (str)
length (int)

Return type:

None

class genome_entropy.pipeline.FeatureProtein(aa_sequence, length)[source]

Bases: object

Protein-level information for a feature.

Parameters:

aa_sequence (str)
length (int)

aa_sequence

Amino acid sequence

Type:: str

length

Length of amino acid sequence

Type:: int

aa_sequence: str

length: int

__init__(aa_sequence, length)

Parameters:

aa_sequence (str)
length (int)

Return type:

None

class genome_entropy.pipeline.FeatureThreeDi(encoding, length, method, model_name, inference_device)[source]

Bases: object

3Di structural encoding for a feature.

Parameters:

encoding (str)
length (int)
method (str)
model_name (str)
inference_device (str)

encoding

3Di token sequence

Type:: str

length

Length of 3Di sequence

Type:: int

method

Method used for encoding (e.g., “prostt5_aa2fold”)

Type:: str

model_name

Name of the model used

Type:: str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:: str

encoding: str

length: int

method: str

model_name: str

inference_device: str

__init__(encoding, length, method, model_name, inference_device)

Parameters:

encoding (str)
length (int)
method (str)
model_name (str)
inference_device (str)

Return type:

None

class genome_entropy.pipeline.FeatureMetadata(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)[source]

Bases: object

Metadata about a feature.

Parameters:

parent_id (str)
table_id (int)
has_start_codon (bool)
has_stop_codon (bool)
in_genbank (bool)

parent_id

ID of the parent DNA sequence

Type:: str

table_id

NCBI genetic code table ID used

Type:: int

has_start_codon

Whether the ORF has a start codon

Type:: bool

has_stop_codon

Whether the ORF has a stop codon

Type:: bool

in_genbank

Whether this ORF matches a CDS annotated in GenBank

Type:: bool

parent_id: str

table_id: int

has_start_codon: bool

has_stop_codon: bool

in_genbank: bool

__init__(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)

Parameters:

parent_id (str)
table_id (int)
has_start_codon (bool)
has_stop_codon (bool)
in_genbank (bool)

Return type:

None

class genome_entropy.pipeline.FeatureEntropy(dna_entropy, protein_entropy, three_di_entropy)[source]

Bases: object

Entropy values at different representation levels for a feature.

Parameters:

dna_entropy (float)
protein_entropy (float)
three_di_entropy (float)

dna_entropy

Shannon entropy of nucleotide sequence

Type:: float

protein_entropy

Shannon entropy of amino acid sequence

Type:: float

three_di_entropy

Shannon entropy of 3Di encoding

Type:: float

dna_entropy: float

protein_entropy: float

three_di_entropy: float

__init__(dna_entropy, protein_entropy, three_di_entropy)

Parameters:

dna_entropy (float)
protein_entropy (float)
three_di_entropy (float)

Return type:

None

Runner

End-to-end pipeline orchestration for DNA to 3Di with entropy calculation.

class genome_entropy.pipeline.runner.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)[source]

Bases: object

Result of running the complete DNA to 3Di pipeline.

Parameters:

input_id (str)
input_dna_length (int)
orfs (List[OrfRecord])
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)

input_id

ID of the input DNA sequence

Type:: str

input_dna_length

Length of the input DNA sequence

Type:: int

orfs

List of ORFs found in the sequence

Type:: List[genome_entropy.orf.types.OrfRecord]

proteins

List of translated proteins

Type:: List[genome_entropy.translate.translator.ProteinRecord]

three_dis

List of 3Di encoded structures

Type:: List[genome_entropy.encode3di.types.ThreeDiRecord]

entropy

Entropy report for all representations

Type:: genome_entropy.entropy.shannon.EntropyReport

input_id: str

input_dna_length: int

orfs: List[OrfRecord]

proteins: List[ProteinRecord]

three_dis: List[ThreeDiRecord]

entropy: EntropyReport

__init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)

Parameters:

input_id (str)
input_dna_length (int)
orfs (List[OrfRecord])
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)

Return type:

None

genome_entropy.pipeline.runner.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)[source]

Run the complete DNA to 3Di pipeline with entropy calculation.

Pipeline steps:

1. Read FASTA file or GenBank file
2. Find ORFs in all 6 reading frames
3. Translate ORFs to proteins
4. Encode proteins to 3Di structural tokens
5. Calculate entropy at all levels
6. Optionally match ORFs to GenBank CDS annotations
7. Optionally write results to JSON

Parameters:

input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.
table_id (int) – NCBI genetic code table ID
min_aa_len (int) – Minimum protein length in amino acids
model_name (str) – ProstT5 model name
compute_entropy (bool) – Whether to compute entropy values
output_json (str | Path | None) – Optional path to save results as JSON
device (str | None) – Device for 3Di encoding (“cuda”, “mps”, “cpu”, or None for auto). Ignored if use_multi_gpu is True.
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.
encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.

Returns:

List of PipelineResult objects (one per input sequence)

Raises:

PipelineError – If any pipeline step fails
ValueError – If neither input_fasta nor genbank_file is provided

Return type:

List[PipelineResult]
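
The parameter descriptions imply three valid input combinations plus one error case. The sketch below spells them out; `resolve_input_mode` is a hypothetical helper written for illustration, and the actual validation inside run_pipeline may be structured differently.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def resolve_input_mode(input_fasta: Optional[PathLike] = None,
                       genbank_file: Optional[PathLike] = None) -> str:
    """Hypothetical helper naming which documented input mode applies."""
    if input_fasta is None and genbank_file is None:
        # Mirrors the documented ValueError for missing inputs.
        raise ValueError("provide input_fasta, genbank_file, or both")
    if input_fasta is not None and genbank_file is not None:
        return "fasta+genbank"  # FASTA sequences, GenBank CDS annotations
    if genbank_file is not None:
        return "genbank-only"   # DNA extracted from the GenBank file
    return "fasta-only"


mode = resolve_input_mode(input_fasta="genome.fasta")
```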

genome_entropy.pipeline.runner.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)[source]

Calculate entropy at all representation levels.

Parameters:

dna_sequence (str) – Original DNA sequence
orfs (List[OrfRecord]) – List of ORF records
proteins (List[ProteinRecord]) – List of protein records
three_dis (List[ThreeDiRecord]) – List of 3Di records

Returns:

EntropyReport with entropy values

Return type:

EntropyReport

I/O

I/O utilities for genome_entropy.

FASTA I/O

FASTA file reading and writing utilities.

genome_entropy.io.fasta.read_fasta(fasta_path)[source]

Read a FASTA file and return a dictionary of sequence_id -> sequence.

Automatically detects and handles gzipped files (ending in .gz).

Parameters:

fasta_path (str | Path) – Path to FASTA file (plain text or gzipped)

Returns:

Dictionary mapping sequence IDs to sequences

Raises:

FileNotFoundError – If the FASTA file doesn’t exist
ValueError – If the FASTA file is malformed

Return type:

Dict[str, str]
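
One common way to get the suffix-based gzip handling described above is to pick the opener from the file extension. The sketch below is a standard-library illustration under that assumption; `read_fasta_sketch` is a hypothetical name, and the library's own parsing and error checks may differ.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Dict


def read_fasta_sketch(fasta_path) -> Dict[str, str]:
    """Hypothetical reader: uses gzip.open for paths ending in .gz."""
    path = Path(fasta_path)
    opener = gzip.open if path.suffix == ".gz" else open
    sequences: Dict[str, str] = {}
    current_id = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited word as the ID.
                current_id = line[1:].split()[0]
                sequences[current_id] = ""
            elif current_id is None:
                raise ValueError("sequence data before first header")
            else:
                sequences[current_id] += line
    return sequences


with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "demo.fasta.gz"
    with gzip.open(gz_path, "wt") as handle:
        handle.write(">seq1 demo\nATCG\nATCG\n>seq2\nGGCC\n")
    parsed = read_fasta_sketch(gz_path)
```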

genome_entropy.io.fasta.read_fasta_iter(fasta_path)[source]

Read a FASTA file and yield (sequence_id, sequence) tuples.

Memory-efficient iterator for large FASTA files. Automatically detects and handles gzipped files (ending in .gz).

Parameters:

fasta_path (str | Path) – Path to FASTA file (plain text or gzipped)

Yields:

Tuples of (sequence_id, sequence)

Raises:

FileNotFoundError – If the FASTA file doesn’t exist
ValueError – If the FASTA file is malformed

Return type:

Iterator[Tuple[str, str]]

genome_entropy.io.fasta.write_fasta(sequences, output_path, line_width=80)[source]

Write sequences to a FASTA file.

Automatically compresses output if filename ends with .gz.

Parameters:

sequences (Dict[str, str]) – Dictionary mapping sequence IDs to sequences
output_path (str | Path) – Path to output FASTA file (plain text or .gz for compressed)
line_width (int) – Maximum line width for sequence lines (default: 80)

Return type:

None
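
The effect of line_width can be sketched with plain slicing. `wrap_sequence` and `format_record` are hypothetical helpers, not functions exported by genome_entropy; they only show how a sequence might be split so that no sequence line exceeds the requested width.

```python
from typing import List


def wrap_sequence(sequence: str, line_width: int = 80) -> List[str]:
    """Split a sequence into chunks of at most line_width characters."""
    if line_width < 1:
        raise ValueError("line_width must be positive")
    return [
        sequence[i:i + line_width]
        for i in range(0, len(sequence), line_width)
    ]


def format_record(seq_id: str, sequence: str, line_width: int = 80) -> str:
    """Format one FASTA record: a header line plus wrapped sequence lines."""
    lines = [f">{seq_id}", *wrap_sequence(sequence, line_width)]
    return "\n".join(lines) + "\n"


record = format_record("seq1", "ATCGATCGAT", line_width=4)
```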

JSON I/O

JSON serialization for data models.

genome_entropy.io.jsonio.to_json_dict(obj)[source]

Convert a dataclass object to a JSON-serializable dictionary.

Recursively handles nested dataclasses, lists, and dictionaries.

Parameters:: obj (Any) – Object to convert (typically a dataclass instance)
Returns:: JSON-serializable dictionary
Return type:: Any
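
A recursive conversion of this kind can be sketched with the standard dataclasses module. `to_json_dict_sketch` is a hypothetical re-implementation for illustration only; the library function may handle additional types (for example Path objects) that this sketch does not.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, List


def to_json_dict_sketch(obj: Any) -> Any:
    """Hypothetical recursive converter for dataclasses, lists, and dicts."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return {
            f.name: to_json_dict_sketch(getattr(obj, f.name))
            for f in fields(obj)
        }
    if isinstance(obj, (list, tuple)):
        return [to_json_dict_sketch(item) for item in obj]
    if isinstance(obj, dict):
        return {str(k): to_json_dict_sketch(v) for k, v in obj.items()}
    return obj  # str, int, float, bool, and None pass through unchanged


@dataclass
class Inner:
    value: int


@dataclass
class Outer:
    name: str
    items: List[Inner]


converted = to_json_dict_sketch(Outer("demo", [Inner(1), Inner(2)]))
```

For dataclasses that contain only dataclasses, lists, dicts, and primitives, the standard `dataclasses.asdict` produces a similar nested result.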

genome_entropy.io.jsonio.convert_pipeline_result_to_unified(pipeline_result)[source]

Convert PipelineResult to UnifiedPipelineResult format.

This function transforms the old redundant format (separate orfs, proteins, three_dis lists) into the new unified format where each feature appears exactly once with all its related data organized hierarchically.

OLD FORMAT PROBLEM:

The old format had three parallel lists:

orfs: [ORF1, ORF2, …]
proteins: [{orf: ORF1, aa_seq: …}, {orf: ORF2, aa_seq: …}, …]
three_dis: [{protein: {orf: ORF1, …}, 3di: …}, …]

This caused:

1. ORF data duplicated 3 times (in orfs, inside proteins, inside three_dis)
2. Protein data duplicated 2 times (in proteins, inside three_dis)
3. ~2-3x larger files due to redundancy
4. Risk of inconsistency if data differs between copies

NEW UNIFIED FORMAT:

Single features dictionary with hierarchical organization:

features: {
    "orf_1": {
        location: {start, end, strand, frame},
        dna: {nt_sequence, length},
        protein: {aa_sequence, length},
        three_di: {encoding, length, method, model_name, inference_device},
        metadata: {parent_id, table_id, has_start_codon, has_stop_codon, in_genbank},
        entropy: {dna_entropy, protein_entropy, three_di_entropy}
    }
}

Benefits:

1. Each piece of information stored exactly once
2. 40-50% smaller file sizes
3. Direct O(1) access by orf_id
4. Clear hierarchical organization matching biological concepts
5. Single source of truth - no inconsistency possible

Parameters:: pipeline_result – PipelineResult object or list of PipelineResult objects
Returns:: UnifiedPipelineResult object or list of UnifiedPipelineResult objects
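
The reshaping from parallel lists to a keyed dictionary can be sketched with plain dicts. This is an illustration under the assumption that the three old-format lists are index-aligned; `unify` is a hypothetical function, and it copies only a few fields rather than the full documented structure.

```python
from typing import Any, Dict, List


def unify(orfs: List[Dict[str, Any]],
          proteins: List[Dict[str, Any]],
          three_dis: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Merge three parallel lists into one dict keyed by orf_id."""
    features: Dict[str, Dict[str, Any]] = {}
    for orf, protein, three_di in zip(orfs, proteins, three_dis):
        features[orf["orf_id"]] = {
            "location": {"start": orf["start"], "end": orf["end"]},
            "protein": {"aa_sequence": protein["aa_sequence"]},
            "three_di": {"encoding": three_di["three_di"]},
        }
    return features


# Old-style input: the ORF is nested again inside each later record.
orfs = [{"orf_id": "orf_1", "start": 0, "end": 9}]
proteins = [{"orf": orfs[0], "aa_sequence": "MK"}]
three_dis = [{"protein": proteins[0], "three_di": "dv"}]

features = unify(orfs, proteins, three_dis)
```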

genome_entropy.io.jsonio.write_json(data, output_path, indent=2)[source]

Write data to a JSON file.

Automatically handles dataclass objects by converting them to dictionaries. If data contains PipelineResult objects, they are automatically converted to the new unified format to eliminate redundancy. Automatically compresses output if filename ends with .gz.

AUTOMATIC CONVERSION:

This function transparently converts old-format PipelineResult objects to the new unified format. This means:

Users don’t need to manually call convert_pipeline_result_to_unified()
All JSON output from the pipeline automatically uses the new format
The conversion happens only once during serialization
No changes needed to pipeline code or user scripts

MAPPING: Old Keys → New Structure

OLD FORMAT:

orfs[i].orf_id → features[orf_id].orf_id
orfs[i].start → features[orf_id].location.start
orfs[i].nt_sequence → features[orf_id].dna.nt_sequence
proteins[i].aa_sequence → features[orf_id].protein.aa_sequence
three_dis[i].three_di → features[orf_id].three_di.encoding
entropy.orf_nt_entropy[orf_id] → features[orf_id].entropy.dna_entropy

NEW FORMAT adds:

schema_version: “2.0.0” (for compatibility tracking)
features: dict (replaces orfs, proteins, three_dis lists)
Hierarchical organization (location, dna, protein, three_di, metadata, entropy)

Parameters:

data (Any) – Data to write (dataclass, dict, list, etc.)
output_path (str | Path) – Path to output JSON file (plain text or .gz for compressed)
indent (int) – Indentation level for pretty printing (default: 2)

Return type:

None

genome_entropy.io.jsonio.read_json(input_path)[source]

Read JSON data from a file.

Automatically detects and handles gzipped files (ending in .gz).

Parameters:

input_path (str | Path) – Path to input JSON file (plain text or gzipped)

Returns:

Parsed JSON data (dict, list, etc.)

Raises:

FileNotFoundError – If the JSON file doesn’t exist
json.JSONDecodeError – If the file contains invalid JSON

Return type:

Any
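
A gzip-aware JSON round trip can be sketched with the standard library alone. `write_json_sketch` and `read_json_sketch` are hypothetical names; the sketch assumes compression is chosen purely from the .gz suffix and omits the dataclass conversion that the documented write_json performs.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any


def _opener(path: Path):
    """Choose gzip.open for .gz paths, built-in open otherwise."""
    return gzip.open if path.suffix == ".gz" else open


def write_json_sketch(data: Any, output_path, indent: int = 2) -> None:
    path = Path(output_path)
    with _opener(path)(path, "wt", encoding="utf-8") as handle:
        json.dump(data, handle, indent=indent)


def read_json_sketch(input_path) -> Any:
    path = Path(input_path)
    with _opener(path)(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.json.gz"
    payload = {"schema_version": "2.0.0", "features": {}}
    write_json_sketch(payload, target)
    restored = read_json_sketch(target)
    # Gzip files begin with the magic bytes 0x1f 0x8b.
    magic = target.read_bytes()[:2]
```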

Configuration

Configuration and constants for genome_entropy.

Errors

Custom exceptions for genome_entropy.

exception genome_entropy.errors.OrfEntropyError[source]

Bases: Exception

Base exception for genome_entropy package.

exception genome_entropy.errors.ConfigurationError[source]

Bases: OrfEntropyError

Raised when there’s a configuration error.

exception genome_entropy.errors.InputError[source]

Bases: OrfEntropyError

Raised when input data is invalid or cannot be processed.

exception genome_entropy.errors.OrfFinderError[source]

Bases: OrfEntropyError

Raised when ORF finding fails.

exception genome_entropy.errors.TranslationError[source]

Bases: OrfEntropyError

Raised when translation fails.

exception genome_entropy.errors.EncodingError[source]

Bases: OrfEntropyError

Raised when 3Di encoding fails.

exception genome_entropy.errors.ModelError[source]

Bases: OrfEntropyError

Raised when model loading or inference fails.

exception genome_entropy.errors.DeviceError[source]

Bases: OrfEntropyError

Raised when device selection or initialization fails.

exception genome_entropy.errors.PipelineError[source]

Bases: OrfEntropyError

Raised when the pipeline orchestration fails.
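
Because every exception above derives from OrfEntropyError, callers can catch a specific failure first and fall back to the base class. The sketch below uses local stand-in classes with the same names and inheritance so that it runs without the package installed; `classify` is a hypothetical helper.

```python
class OrfEntropyError(Exception):
    """Stand-in for the documented base exception."""


class OrfFinderError(OrfEntropyError):
    """Stand-in for the documented ORF-finding exception."""


class EncodingError(OrfEntropyError):
    """Stand-in for the documented 3Di-encoding exception."""


def classify(exc: Exception) -> str:
    """Return a label showing which except clause handles exc."""
    try:
        raise exc
    except OrfFinderError:
        return "orf-finder"   # most specific clause first
    except OrfEntropyError:
        return "package"      # any other genome_entropy-style error
    except Exception:
        return "other"


labels = [
    classify(e)
    for e in (OrfFinderError("x"), EncodingError("y"), ValueError("z"))
]
```

Ordering matters: placing the base-class clause first would swallow the more specific errors.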

Logging

Centralized logging configuration for genome_entropy.

This module provides a single source for configuring logging throughout the application. It supports:

Multiple log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
Output to file or STDOUT
Consistent format across all modules

genome_entropy.logging_config.configure_logging(level=20, log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', date_format='%Y-%m-%d %H:%M:%S', force=False)[source]

Configure logging for the entire application.

This should be called once at application startup (e.g., in CLI main).

Parameters:

level (int | str) – Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) as int or string
log_file (str | Path | None) – Optional path to log file. If None, logs to STDOUT
log_format (str) – Format string for log messages
date_format (str) – Format string for timestamps
force (bool) – If True, reconfigure even if already configured

Return type:

None
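
Since level accepts either an int or a string, some normalization has to happen before the value reaches the logging module. The sketch below shows one way to do that with the standard library; `normalize_level` is a hypothetical helper, and the library may resolve names differently.

```python
import logging
from typing import Union


def normalize_level(level: Union[int, str]) -> int:
    """Convert a level name such as "DEBUG", or an int, into an int level."""
    if isinstance(level, int):
        return level
    # getLevelName maps a registered name to its number; for unknown
    # names it returns a string, which is rejected below.
    resolved = logging.getLevelName(level.upper())
    if not isinstance(resolved, int):
        raise ValueError(f"unknown log level: {level!r}")
    return resolved


debug_level = normalize_level("debug")
info_level = normalize_level(logging.INFO)
```

The documented default of level=20 corresponds to logging.INFO.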

Examples

>>> import logging
>>> configure_logging(level=logging.DEBUG, log_file="app.log")
>>> configure_logging(level="INFO")  # Log to STDOUT
>>> configure_logging(level="DEBUG", log_file=None)  # Debug to STDOUT

genome_entropy.logging_config.get_logger(name)[source]

Get a logger instance for a module.

This is the preferred way to get loggers in the application.

Parameters:: name (str) – Name of the logger (usually __name__ of the module)
Returns:: Configured logger instance
Return type:: Logger

Example

>>> logger = get_logger(__name__)
>>> logger.info("Processing started")

genome_entropy.logging_config.is_configured()[source]

Check if logging has been configured.

Returns:: True if configure_logging() has been called
Return type:: bool

genome_entropy.logging_config.get_log_file()[source]

Get the current log file path.

Returns:: Path to log file, or None if logging to STDOUT
Return type:: Path | None

genome_entropy.logging_config.get_log_level()[source]

Get the current logging level.

Returns:: Current logging level as integer
Return type:: int

genome_entropy.logging_config.set_log_level(level)[source]

Change the logging level at runtime.

Parameters:: level (int | str) – New logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
Return type:: None

Example

>>> import logging
>>> set_log_level("DEBUG")
>>> set_log_level(logging.WARNING)

Usage Examples

ORF Finding

from genome_entropy.orf.finder import find_orfs

# Find ORFs in a FASTA file
orfs = find_orfs(
    fasta_path="genome.fasta",
    table_id=11,
    min_length_nt=90
)

# Examine results
for orf in orfs:
    print(f"ORF {orf.orf_id}: {orf.start}-{orf.end} ({orf.strand})")
    print(f"  Nucleotide: {orf.nt_sequence[:50]}...")
    print(f"  Amino acid: {orf.aa_sequence[:50]}...")

Translation

from genome_entropy.translate.translator import translate_orfs

# Translate ORFs
proteins = translate_orfs(orfs, table_id=11)

for protein in proteins:
    print(f"Protein from {protein.orf.orf_id}: {protein.aa_sequence}")
    print(f"  Length: {protein.aa_length} amino acids")

3Di Encoding

from genome_entropy.encode3di import ProstT5ThreeDiEncoder

# Initialize encoder
encoder = ProstT5ThreeDiEncoder(
    model_name="Rostlab/ProstT5_fp16",
    device="auto"  # Auto-detect CUDA/MPS/CPU
)

# Encode proteins to 3Di
aa_sequences = [p.aa_sequence for p in proteins]
three_di_tokens = encoder.encode(
    aa_sequences,
    batch_size=4,
    encoding_size=5000
)

for i, tokens in enumerate(three_di_tokens):
    print(f"Protein {i}: {tokens[:50]}...")

Token Estimation

from genome_entropy.encode3di import ProstT5ThreeDiEncoder, estimate_token_size

# Initialize encoder
encoder = ProstT5ThreeDiEncoder()

# Find optimal encoding size
results = estimate_token_size(
    encoder=encoder,
    start_length=3000,
    end_length=10000,
    step=1000,
    num_trials=3
)

print(f"Recommended encoding size: {results['recommended_token_size']} AA")

# Use in encoding (example input; replace with your own amino acid sequences)
sequences = ["MKKYTLFLG", "ACDEFGHIK"]
three_di = encoder.encode(
    sequences,
    encoding_size=results['recommended_token_size']
)

Shannon Entropy

from genome_entropy.entropy.shannon import shannon_entropy, calculate_sequence_entropy

# Calculate basic entropy
dna = "ATCGATCGATCG"
entropy = shannon_entropy(dna)
print(f"DNA entropy: {entropy:.2f} bits")

# Normalized entropy
normalized = shannon_entropy(
    dna,
    alphabet=set("ACGT"),
    normalize=True
)
print(f"Normalized: {normalized:.2f}")

# For biological sequences
protein = "MKKYTLFLGLLGLVAAGTLWGLSACCA"
protein_entropy = calculate_sequence_entropy(protein)
print(f"Protein entropy: {protein_entropy:.2f} bits")
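
For readers who want to see the arithmetic behind these calls, the following is a standalone sketch of Shannon entropy in bits. It assumes that normalization means dividing by log2 of the alphabet size; `shannon_entropy_sketch` is a hypothetical re-implementation and may not match the library's handling of edge cases.

```python
import math
from collections import Counter
from typing import Optional, Set


def shannon_entropy_sketch(sequence: str,
                           alphabet: Optional[Set[str]] = None,
                           normalize: bool = False) -> float:
    """Shannon entropy in bits; optionally divided by log2(alphabet size)."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    if normalize:
        # Assumption: fall back to the observed symbols if no alphabet given.
        size = len(alphabet) if alphabet else len(counts)
        return entropy / math.log2(size) if size > 1 else 0.0
    return entropy


raw = shannon_entropy_sketch("ATCGATCGATCG")
scaled = shannon_entropy_sketch("ATCGATCGATCG", set("ACGT"), normalize=True)
```

In "ATCGATCGATCG" each of the four bases has frequency 1/4, so the entropy is 2 bits, which is the maximum for a four-letter alphabet and normalizes to 1.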

Complete Pipeline

from pathlib import Path
from genome_entropy.pipeline.runner import run_pipeline

# Run complete pipeline
results = run_pipeline(
    input_fasta=Path("genome.fasta"),
    output_json=Path("results.json"),
    table_id=11,
    min_aa_len=30,
    model_name="Rostlab/ProstT5_fp16",
    device=None,  # None auto-detects CUDA/MPS/CPU
    compute_entropy=True
)

# Process results
for result in results:
    print(f"Sequence: {result.input_id}")
    print(f"  DNA length: {result.input_dna_length}")
    print(f"  ORFs found: {len(result.orfs)}")
    print(f"  DNA entropy: {result.entropy.dna_entropy_global:.2f}")

    for orf_id, entropy in result.entropy.protein_aa_entropy.items():
        print(f"  Protein {orf_id} entropy: {entropy:.2f}")

I/O Operations

from genome_entropy.io.fasta import read_fasta, write_fasta
from genome_entropy.io.jsonio import write_json, read_json

# Read FASTA (returns a dict mapping sequence IDs to sequences)
sequences = read_fasta("genome.fasta")
for seq_id, seq in sequences.items():
    print(f"{seq_id}: {len(seq)} bp")

# Write FASTA (sequences first, then the output path)
output_sequences = {
    "seq1": "ATCGATCG",
    "seq2": "GCTAGCTA",
}
write_fasta(output_sequences, "output.fasta")

# Write/read JSON
data = {"key": "value", "results": [1, 2, 3]}
write_json(data, "output.json")
loaded = read_json("output.json")

Error Handling

from genome_entropy.orf.finder import find_orfs
from genome_entropy.errors import (
    OrfEntropyError,
    OrfFinderError,
    TranslationError,
    EncodingError
)

try:
    orfs = find_orfs("genome.fasta", table_id=11)
except OrfFinderError as e:
    print(f"ORF finding failed: {e}")
except OrfEntropyError as e:
    print(f"General error: {e}")

Custom Logging

from genome_entropy.logging_config import configure_logging, get_logger

# Configure logging
configure_logging(level="DEBUG", log_file="debug.log")

# Get a logger for your module (get_logger is the documented preferred way)
logger = get_logger(__name__)
logger.info("Starting analysis")
logger.debug("Detailed debug information")

Advanced: Custom Batching

from genome_entropy.encode3di.encoder import ProstT5ThreeDiEncoder

encoder = ProstT5ThreeDiEncoder()

# Create batches with token budget
sequences = ["MKKYTLFLG", "ACDEFGHIK"]  # ... more sequences
batches = encoder.token_budget_batches(
    sequences,
    max_total_length=5000
)

# Process each batch
all_results = []
for batch in batches:
    batch_results = encoder._encode_batch(batch)
    all_results.extend(batch_results)
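
The idea of batching under a length budget can be sketched independently of the encoder. The greedy grouping below is an assumption about how such batching might work; `budget_batches` is a hypothetical function, and the encoder's own method may sort or split sequences differently.

```python
from typing import Iterator, List


def budget_batches(sequences: List[str],
                   max_total_length: int) -> Iterator[List[str]]:
    """Greedily group sequences so each batch's summed length fits the budget."""
    batch: List[str] = []
    batch_length = 0
    for seq in sequences:
        # Start a new batch when adding seq would exceed the budget.
        if batch and batch_length + len(seq) > max_total_length:
            yield batch
            batch, batch_length = [], 0
        batch.append(seq)
        batch_length += len(seq)
    if batch:
        yield batch  # A single over-budget sequence still gets its own batch.


batches = list(
    budget_batches(["AAAA", "CC", "GGGGG", "T"], max_total_length=6)
)
```

Keeping the summed length per batch bounded is what limits peak memory use during encoding, at the cost of more batches for long inputs.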

Data Classes

OrfRecord

@dataclass
class OrfRecord:
    parent_id: str          # Source sequence ID
    orf_id: str             # Unique ORF identifier
    start: int              # 0-based, inclusive
    end: int                # 0-based, exclusive
    strand: Literal["+","-"]
    frame: int              # 0, 1, 2
    nt_sequence: str        # Nucleotide sequence
    aa_sequence: str        # Amino acid sequence
    table_id: int           # NCBI translation table
    has_start_codon: bool
    has_stop_codon: bool
    in_genbank: bool = False  # Matches a CDS annotated in GenBank

ThreeDiRecord

@dataclass
class ThreeDiRecord:
    orf_id: str
    three_di: str           # 3Di token sequence
    method: Literal["prostt5_aa2fold"]
    model_name: str
    inference_device: str   # "cuda", "mps", or "cpu"

EntropyReport

@dataclass
class EntropyReport:
    dna_entropy_global: float
    orf_nt_entropy: dict[str, float]     # orf_id → entropy
    protein_aa_entropy: dict[str, float]
    three_di_entropy: dict[str, float]
    alphabet_sizes: dict[str, int]

Type Hints

All modules use comprehensive type hints for better IDE support and type checking:

from typing import List, Dict, Optional, Tuple
from pathlib import Path

def find_orfs(
    fasta_path: Path | str,
    table_id: int = 11,
    min_length_nt: int = 90
) -> List[OrfRecord]:
    ...

Next Steps

See the User Guide for a conceptual overview
Check the CLI Commands Reference for command-line usage
Read the Development Guide for contributing