API Reference

This page documents the Python API for genome_entropy. Import these modules directly in your Python code when you need finer-grained control over the pipeline than the command-line workflow provides.

Core Modules

genome_entropy.orf

ORF finding utilities.

genome_entropy.translate

Translation utilities.

genome_entropy.encode3di

3Di encoding utilities.

genome_entropy.entropy

Entropy calculation utilities.

genome_entropy.pipeline

Pipeline orchestration.

genome_entropy.io

I/O utilities for genome_entropy.

ORF Finding

ORF finding utilities.

Types

class genome_entropy.orf.types.OrfRecord(parent_id, orf_id, start, end, strand, frame, nt_sequence, aa_sequence, table_id, has_start_codon, has_stop_codon, in_genbank=False)

Bases: object

Represents a single Open Reading Frame (ORF).

Parameters:
parent_id

ID of the parent DNA sequence

Type:

str

orf_id

Unique identifier for this ORF

Type:

str

start

0-based start position (inclusive)

Type:

int

end

0-based end position (exclusive)

Type:

int

strand

Strand orientation (‘+’ or ‘-’)

Type:

Literal[‘+’, ‘-’]

frame

Reading frame (0, 1, or 2)

Type:

int

nt_sequence

Nucleotide sequence of the ORF

Type:

str

aa_sequence

Amino acid sequence of the ORF

Type:

str

table_id

NCBI genetic code table ID used

Type:

int

has_start_codon

Whether the ORF has a start codon

Type:

bool

has_stop_codon

Whether the ORF has a stop codon

Type:

bool

in_genbank

Whether this ORF matches a CDS annotated in GenBank

Type:

bool

parent_id: str
orf_id: str
start: int
end: int
strand: Literal['+', '-']
frame: int
nt_sequence: str
aa_sequence: str
table_id: int
has_start_codon: bool
has_stop_codon: bool
in_genbank: bool = False
__post_init__()

Validate ORF attributes.

Return type:

None

__init__(parent_id, orf_id, start, end, strand, frame, nt_sequence, aa_sequence, table_id, has_start_codon, has_stop_codon, in_genbank=False)
Return type:

None

Finder

ORF finder wrapper using get_orfs binary.

genome_entropy.orf.finder.find_orfs(sequences, table_id=11, min_nt_length=90, binary_path='get_orfs')

Find ORFs in DNA sequences using get_orfs binary.

This function wraps the external get_orfs binary (https://github.com/linsalrob/get_orfs). The binary must be installed and available in PATH or specified via binary_path.

Parameters:
  • sequences (Dict[str, str]) – Dictionary mapping sequence IDs to DNA sequences

  • table_id (int) – NCBI genetic code table ID (default: 11, bacterial)

  • min_nt_length (int) – Minimum ORF length in nucleotides (default: 90)

  • binary_path (str) – Path to get_orfs binary (default: from config/environment)

Returns:

List of OrfRecord objects

Raises:

OrfFinderError – If get_orfs binary is not found or fails

Return type:

List[OrfRecord]

genome_entropy.orf.finder.reverse_complement(seq)

Return the reverse complement of a DNA sequence.

Parameters:

seq (str)

Return type:

str
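The 0-based, half-open coordinates of OrfRecord and the reverse_complement helper can be illustrated with a minimal sketch. This is not the library's implementation: reverse_complement_sketch and orf_nt_sequence are hypothetical names, only A, C, G, T and N are complemented, and the sketch assumes that start and end index the forward strand even for '-' ORFs, which this page does not state.

```python
# Complement table for the unambiguous bases plus N (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def orf_nt_sequence(parent: str, start: int, end: int, strand: str) -> str:
    """Slice the 0-based, half-open interval [start, end) from the parent.

    Assumes start/end refer to the forward strand even for '-' ORFs.
    """
    segment = parent[start:end]
    return segment if strand == "+" else reverse_complement_sketch(segment)


parent = "AATGAAATAGCC"
print(orf_nt_sequence(parent, 1, 10, "+"))  # ATGAAATAG
print(reverse_complement_sketch("ATGC"))    # GCAT
```

Because the end position is exclusive, end - start equals the ORF length in nucleotides.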

Translation

Translation utilities.

Translator

Translation of nucleotide sequences to amino acids.

class genome_entropy.translate.translator.ProteinRecord(orf, aa_sequence, aa_length)

Bases: object

Represents a translated protein from an ORF.

Parameters:
orf

The OrfRecord that was translated

Type:

genome_entropy.orf.types.OrfRecord

aa_sequence

The amino acid sequence

Type:

str

aa_length

Length of the amino acid sequence

Type:

int

orf: OrfRecord
aa_sequence: str
aa_length: int
__post_init__()

Validate protein attributes.

Return type:

None

__init__(orf, aa_sequence, aa_length)
Return type:

None

genome_entropy.translate.translator.translate_orf(orf, table_id=11)

Translate an ORF to a protein sequence.

Uses the pygenetic-code library for translation with NCBI genetic codes. Ambiguous codons (containing N or other IUPAC codes) are translated to ‘X’.

Parameters:
  • orf (OrfRecord) – OrfRecord to translate

  • table_id (int) – NCBI genetic code table ID (default: from config)

Returns:

ProteinRecord with translated sequence

Raises:

TranslationError – If translation fails

Return type:

ProteinRecord

genome_entropy.translate.translator.translate_orfs(orfs, table_id=11)

Translate multiple ORFs to protein sequences.

Parameters:
  • orfs (List[OrfRecord]) – List of OrfRecord objects to translate

  • table_id (int) – NCBI genetic code table ID

Returns:

List of ProteinRecord objects

Return type:

List[ProteinRecord]
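The rule that ambiguous codons become ‘X’ can be illustrated with a small, self-contained sketch. It does not use or reproduce the pygenetic-code API: translate_sketch is a hypothetical name, the codon table is the standard amino acid assignment (which NCBI table 11 shares with table 1), and dropping a trailing incomplete codon is an assumption of the sketch rather than documented behaviour.

```python
from itertools import product

# Standard codon-to-amino-acid assignment in NCBI "TCAG" order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate_sketch(nt_sequence: str) -> str:
    """Translate codon by codon; any codon not in the table becomes 'X'."""
    seq = nt_sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate_sketch("ATGAAATAA"))  # MK*
print(translate_sketch("ATGNNNTGG"))  # MXW
```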

3Di Encoding

3Di encoding utilities.

class genome_entropy.encode3di.ProstT5ThreeDiEncoder(model_name='gbouras13/modernprost-base', device=None)

Bases: object

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses the ProstT5 model from HuggingFace to predict 3Di tokens directly from protein sequences without requiring 3D structures.

Parameters:
  • model_name (str)

  • device (str | None)

__init__(model_name='gbouras13/modernprost-base', device=None)

Initialize the ProstT5 encoder.

Parameters:
  • model_name (str) – HuggingFace model identifier

  • device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)

Raises:
  • ModelError – If PyTorch or Transformers are not installed

  • DeviceError – If specified device is not available

token_budget_batches(aa_sequences, token_budget)

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:
  1. Keep original indices.

  2. Sort by length to minimize padding within each batch.

  3. For each batch:
       • Start with long sequences from the end (largest first)

       • Add long sequences until adding another would exceed the budget

       • Fill the remaining budget with short sequences from the beginning

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:
  • aa_sequences (Sequence[str]) – Unordered amino acid sequences

  • token_budget (int) – Maximum approximate “tokens” per batch

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records

Return type:

Iterator[List[IndexedSeq]]
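The batching strategy described for token_budget_batches can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: token_budget_batches_sketch is a hypothetical name, the local IndexedSeq is a stand-in with the documented idx and seq fields, the cost of a batch is taken to be the sum of its sequence lengths (padding is ignored), and a single sequence longer than the budget is emitted as its own batch.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class IndexedSeq:
    """Stand-in for the documented record: a sequence plus its input position."""
    idx: int
    seq: str


def token_budget_batches_sketch(
    aa_sequences: Sequence[str], token_budget: int
) -> Iterator[List[IndexedSeq]]:
    # Sort by length while remembering each sequence's original position.
    items = sorted(
        (IndexedSeq(i, s) for i, s in enumerate(aa_sequences)),
        key=lambda rec: len(rec.seq),
    )
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        # Always take the longest remaining sequence, even if it alone
        # exceeds the budget, so that the loop makes progress.
        batch = [items[hi]]
        used = len(items[hi].seq)
        hi -= 1
        # Add further long sequences while they still fit.
        while lo <= hi and used + len(items[hi].seq) <= token_budget:
            used += len(items[hi].seq)
            batch.append(items[hi])
            hi -= 1
        # Fill the remaining budget with short sequences from the front.
        while lo <= hi and used + len(items[lo].seq) <= token_budget:
            used += len(items[lo].seq)
            batch.append(items[lo])
            lo += 1
        yield batch


lengths = [8, 2, 5, 1, 3]
batches = list(token_budget_batches_sketch(["A" * n for n in lengths], 10))
print([[rec.idx for rec in batch] for batch in batches])  # [[0, 3], [2, 4, 1]]
```

Keeping idx on every record is what allows the caller to restore the original input order after the batches have been encoded.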

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)

Encode amino acid sequences to 3Di tokens.

Parameters:
  • aa_sequences (List[str]) – List of amino acid sequences. Note: amino acid sequences are expected to be upper-case, while 3Di sequences need to be lower-case.

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per GPU

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.

  • multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)

Encode protein records to 3Di records.

Parameters:
  • proteins (List[ProteinRecord]) – List of ProteinRecord objects

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding

  • multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

class genome_entropy.encode3di.ModernProstThreeDiEncoder(model_name, device=None, use_accelerate=False)

Bases: object

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses ModernProst models (gbouras13/modernprost-base or modernprost-profiles) from HuggingFace to predict 3Di tokens directly from protein sequences.

Based on implementation from phold: https://github.com/gbouras13/phold/blob/main/src/phold/features/predict_3Di.py

Parameters:
  • model_name (str)

  • device (str | None)

  • use_accelerate (bool)

__init__(model_name, device=None, use_accelerate=False)

Initialize the ModernProst encoder.

Parameters:
  • model_name (str) – HuggingFace model identifier (gbouras13/modernprost-base or modernprost-profiles)

  • device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)

  • use_accelerate (bool) – If True, use HuggingFace accelerate for multi-GPU support

Raises:
  • ModelError – If PyTorch or Transformers are not installed

  • DeviceError – If specified device is not available

token_budget_batches(aa_sequences, token_budget)

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:
  1. Keep original indices.

  2. Sort by length to minimize padding within each batch.

  3. For each batch:
       • Start with long sequences from the end (largest first)

       • Add long sequences until adding another would exceed the budget

       • Fill the remaining budget with short sequences from the beginning

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:
  • aa_sequences (Sequence[str]) – Unordered amino acid sequences

  • token_budget (int) – Maximum approximate “tokens” per batch

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records

Return type:

Iterator[List[IndexedSeq]]

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)

Encode amino acid sequences to 3Di tokens.

Parameters:
  • aa_sequences (List[str]) – List of amino acid sequences (upper-case).

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

  • use_multi_gpu (bool) – If True, use accelerate for multi-GPU parallel encoding

  • gpu_ids (List[int] | None) – Optional list of GPU IDs (currently unused with accelerate)

  • multi_gpu_encoder (Any | None) – Optional pre-initialized encoder (for backward compatibility)

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)

Encode protein records to 3Di records.

Parameters:
  • proteins (List[ProteinRecord]) – List of ProteinRecord objects

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding

  • multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

class genome_entropy.encode3di.ThreeDiRecord(protein, three_di, method, model_name, inference_device)

Bases: object

Represents a 3Di structural encoding of a protein.

Parameters:
protein

The ProteinRecord that was encoded

Type:

genome_entropy.translate.translator.ProteinRecord

three_di

The 3Di token sequence

Type:

str

method

Method used for encoding (always “prostt5_aa2fold”)

Type:

Literal[‘prostt5_aa2fold’]

model_name

Name of the ProstT5 model used

Type:

str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:

str

protein: ProteinRecord
three_di: str
method: Literal['prostt5_aa2fold']
model_name: str
inference_device: str
__init__(protein, three_di, method, model_name, inference_device)
Return type:

None

class genome_entropy.encode3di.IndexedSeq(idx, seq)

Bases: object

A sequence paired with its original position in the input list.

Parameters:
idx: int
seq: str
__init__(idx, seq)
Return type:

None

genome_entropy.encode3di.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)

Estimate optimal token size for GPU encoding by testing increasing lengths.

This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.

Parameters:
  • encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding

  • start_length (int) – Starting total length to test (default: 3000)

  • end_length (int) – Maximum total length to test (default: 10000)

  • step (int) – Increment between test lengths (default: 1000)

  • num_trials (int) – Number of trials per length for robustness (default: 3)

  • base_protein_length (int) – Approximate length of individual proteins (default: 100)

Returns:

Dictionary with estimation results:
  • ‘max_length’: Maximum length successfully encoded

  • ‘recommended_token_size’: Recommended token budget (90% of max)

  • ‘trials_per_length’: Dictionary of successful trials per length

  • ‘device’: Device used for testing

Raises:

ValueError – If encoder doesn’t have required attributes or torch not available
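The search described for estimate_token_size can be sketched without a GPU. The sketch below is an assumption-laden illustration, not the library's code: estimate_budget_sketch and try_encode are hypothetical names, the built-in MemoryError stands in for the GPU out-of-memory error, and the search stops at the first length for which any trial fails.

```python
from typing import Any, Callable, Dict


def estimate_budget_sketch(
    try_encode: Callable[[int], None],
    start_length: int = 3000,
    end_length: int = 10000,
    step: int = 1000,
    num_trials: int = 3,
) -> Dict[str, Any]:
    """Probe increasing total lengths until an encoding attempt runs out of memory."""
    max_length = 0
    trials_per_length: Dict[int, int] = {}
    for length in range(start_length, end_length + 1, step):
        successes = 0
        for _ in range(num_trials):
            try:
                try_encode(length)  # hypothetical: encode `length` residues
                successes += 1
            except MemoryError:     # stand-in for a GPU out-of-memory error
                break
        trials_per_length[length] = successes
        if successes < num_trials:
            break                   # larger lengths would fail as well
        max_length = length
    return {
        "max_length": max_length,
        "recommended_token_size": max_length * 9 // 10,  # 90% of max
        "trials_per_length": trials_per_length,
    }


def fake_encode(length: int) -> None:
    """Pretend the device runs out of memory above 6500 residues."""
    if length > 6500:
        raise MemoryError


print(estimate_budget_sketch(fake_encode))
```

The 10% margin below the largest successful length leaves headroom for batches whose memory use is higher than that of the random test proteins.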

genome_entropy.encode3di.generate_random_protein(length, seed=None)

Generate a random protein sequence of specified length.

Parameters:
  • length (int) – Length of the protein sequence

  • seed (int | None) – Random seed for reproducibility (optional)

Returns:

Random protein sequence using the 20 standard amino acids

Return type:

str

genome_entropy.encode3di.generate_combined_proteins(target_length, base_length=100, seed=None)

Generate multiple shorter proteins that combine to target length.

Parameters:
  • target_length (int) – Total target length across all proteins

  • base_length (int) – Approximate length of each individual protein

  • seed (int | None) – Random seed for reproducibility (optional)

Returns:

List of protein sequences that total approximately target_length

Return type:

List[str]
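A minimal sketch of seeded random protein generation, assuming a uniform draw over the 20 standard amino acids. The names random_protein_sketch and combined_proteins_sketch are hypothetical, and the sketch hits target_length exactly, whereas the documented function only promises an approximate total.

```python
import random
from typing import List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_protein_sketch(length: int, seed: Optional[int] = None) -> str:
    """Draw `length` residues uniformly at random; a seed makes it repeatable."""
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def combined_proteins_sketch(
    target_length: int, base_length: int = 100, seed: Optional[int] = None
) -> List[str]:
    """Split target_length into chunks of at most base_length residues."""
    rng = random.Random(seed)
    proteins = []
    remaining = target_length
    while remaining > 0:
        length = min(base_length, remaining)
        proteins.append("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
        remaining -= length
    return proteins


proteins = combined_proteins_sketch(350, base_length=100, seed=42)
print([len(p) for p in proteins])  # [100, 100, 100, 50]
```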

Types

Data types for 3Di encoding.

class genome_entropy.encode3di.types.ThreeDiRecord(protein, three_di, method, model_name, inference_device)

Bases: object

Represents a 3Di structural encoding of a protein.

Parameters:
protein

The ProteinRecord that was encoded

Type:

genome_entropy.translate.translator.ProteinRecord

three_di

The 3Di token sequence

Type:

str

method

Method used for encoding (always “prostt5_aa2fold”)

Type:

Literal[‘prostt5_aa2fold’]

model_name

Name of the ProstT5 model used

Type:

str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:

str

protein: ProteinRecord
three_di: str
method: Literal['prostt5_aa2fold']
model_name: str
inference_device: str
__init__(protein, three_di, method, model_name, inference_device)
Return type:

None

class genome_entropy.encode3di.types.IndexedSeq(idx, seq)

Bases: object

A sequence paired with its original position in the input list.

Parameters:
idx: int
seq: str
__init__(idx, seq)
Return type:

None

Encoder

ProstT5-based encoder for amino acid to 3Di structural token conversion.

class genome_entropy.encode3di.encoder.ProstT5ThreeDiEncoder(model_name='gbouras13/modernprost-base', device=None)

Bases: object

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses the ProstT5 model from HuggingFace to predict 3Di tokens directly from protein sequences without requiring 3D structures.

Parameters:
  • model_name (str)

  • device (str | None)

__init__(model_name='gbouras13/modernprost-base', device=None)

Initialize the ProstT5 encoder.

Parameters:
  • model_name (str) – HuggingFace model identifier

  • device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)

Raises:
  • ModelError – If PyTorch or Transformers are not installed

  • DeviceError – If specified device is not available

token_budget_batches(aa_sequences, token_budget)

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:
  1. Keep original indices.

  2. Sort by length to minimize padding within each batch.

  3. For each batch:
       • Start with long sequences from the end (largest first)

       • Add long sequences until adding another would exceed the budget

       • Fill the remaining budget with short sequences from the beginning

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:
  • aa_sequences (Sequence[str]) – Unordered amino acid sequences

  • token_budget (int) – Maximum approximate “tokens” per batch

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records

Return type:

Iterator[List[IndexedSeq]]

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)

Encode amino acid sequences to 3Di tokens.

Parameters:
  • aa_sequences (List[str]) – List of amino acid sequences. Note: amino acid sequences are expected to be upper-case, while 3Di sequences need to be lower-case.

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per GPU

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.

  • multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)

Encode protein records to 3Di records.

Parameters:
  • proteins (List[ProteinRecord]) – List of ProteinRecord objects

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding

  • multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

Encoding Functions

Core encoding functions for amino acid to 3Di conversion.

genome_entropy.encode3di.encoding.preprocess_sequences(aa_sequences)

Preprocess amino acid sequences for ProstT5 encoding.

Parameters:

aa_sequences (List[str]) – List of raw amino acid sequences

Returns:

List of preprocessed sequences ready for ProstT5 model

Return type:

List[str]

genome_entropy.encode3di.encoding.format_seconds(seconds)

Format seconds as H:MM:SS (or M:SS for < 1 hour).

Parameters:

seconds (float)

Return type:

str
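The H:MM:SS / M:SS convention can be sketched as below. format_seconds_sketch is a hypothetical name, and truncating fractional seconds (rather than rounding) is an assumption of the sketch.

```python
def format_seconds_sketch(seconds: float) -> str:
    """Render a duration as H:MM:SS, or M:SS when it is shorter than one hour."""
    total = int(seconds)  # drop fractional seconds (assumption)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


print(format_seconds_sketch(75))    # 1:15
print(format_seconds_sketch(3725))  # 1:02:05
```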

genome_entropy.encode3di.encoding.get_memory_info()

Get current CUDA memory allocation and reservation in GB.

Returns:

Tuple of (allocated_gb, reserved_gb). Returns (0, 0) if CUDA not available.

Return type:

Tuple[float, float]
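One way to obtain such a tuple is through PyTorch's CUDA memory counters. The sketch below is an assumption about how this could be done, not the library's code: memory_info_sketch is a hypothetical name, and treating 1 GB as 1024³ bytes is a choice made here.

```python
from typing import Tuple


def memory_info_sketch() -> Tuple[float, float]:
    """Return (allocated_gb, reserved_gb), or (0.0, 0.0) without CUDA."""
    try:
        import torch
    except ImportError:
        return (0.0, 0.0)  # PyTorch not installed
    if not torch.cuda.is_available():
        return (0.0, 0.0)
    gib = 1024 ** 3  # bytes per GB as used in this sketch
    allocated = torch.cuda.memory_allocated() / gib  # memory held by tensors
    reserved = torch.cuda.memory_reserved() / gib    # memory held by the caching allocator
    return (allocated, reserved)


print(memory_info_sketch())
```

Reserved memory is normally at least as large as allocated memory, because PyTorch's caching allocator keeps freed blocks for reuse.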

genome_entropy.encode3di.encoding.process_batches(batches_iter, encode_batch_fn, total_sequences, total_batches)

Process batches of sequences and return results in original order.

Parameters:
  • batches_iter (Iterator[Any]) – Iterator yielding batches of IndexedSeq objects

  • encode_batch_fn (Callable[[List[str]], List[str]]) – Function to encode a batch of sequences

  • total_sequences (int) – Total number of sequences being processed

  • total_batches (int) – Total number of batches to process

Returns:

List of encoded 3Di sequences in original input order

Return type:

List[str]
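Restoring the original input order after batched encoding can be sketched as follows. This is illustrative only: process_batches_sketch is a hypothetical name, the local IndexedSeq is a stand-in with the documented idx and seq fields, the total_batches argument (presumably used for progress reporting) is omitted, and raising ValueError for missing results is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class IndexedSeq:
    """Stand-in for the documented record: a sequence plus its input position."""
    idx: int
    seq: str


def process_batches_sketch(
    batches: Iterable[List[IndexedSeq]],
    encode_batch_fn: Callable[[List[str]], List[str]],
    total_sequences: int,
) -> List[str]:
    """Encode each batch and write results back to their original positions."""
    results: List[Optional[str]] = [None] * total_sequences
    for batch in batches:
        encoded = encode_batch_fn([item.seq for item in batch])
        for item, three_di in zip(batch, encoded):
            results[item.idx] = three_di
    missing = [i for i, value in enumerate(results) if value is None]
    if missing:
        raise ValueError(f"no encoding produced for input positions {missing}")
    return [value for value in results if value is not None]


# A toy "encoder" that lower-cases its input, mimicking lower-case 3Di output.
batches = [[IndexedSeq(2, "CCC"), IndexedSeq(0, "A")], [IndexedSeq(1, "GG")]]
print(process_batches_sketch(batches, lambda seqs: [s.lower() for s in seqs], 3))
# ['a', 'gg', 'ccc']
```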

genome_entropy.encode3di.encoding.encode(aa_sequences, encode_batch_fn, token_budget_batches_fn, encoding_size)

Encode amino acid sequences to 3Di tokens.

This is a standalone encoding function that orchestrates the encoding pipeline.

Parameters:
  • aa_sequences (List[str]) – List of amino acid sequences (uppercase, standard 20 AAs)

  • encode_batch_fn (Callable[[List[str]], List[str]]) – Function that encodes a batch of preprocessed sequences

  • token_budget_batches_fn (Callable[[List[str], int], Iterator[Any]]) – Function that batches sequences under token budget

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

Token Estimator

Token size estimation for optimal GPU memory usage in 3Di encoding.

genome_entropy.encode3di.token_estimator.generate_random_protein(length, seed=None)

Generate a random protein sequence of specified length.

Parameters:
  • length (int) – Length of the protein sequence

  • seed (int | None) – Random seed for reproducibility (optional)

Returns:

Random protein sequence using the 20 standard amino acids

Return type:

str

genome_entropy.encode3di.token_estimator.generate_combined_proteins(target_length, base_length=100, seed=None)

Generate multiple shorter proteins that combine to target length.

Parameters:
  • target_length (int) – Total target length across all proteins

  • base_length (int) – Approximate length of each individual protein

  • seed (int | None) – Random seed for reproducibility (optional)

Returns:

List of protein sequences that total approximately target_length

Return type:

List[str]

genome_entropy.encode3di.token_estimator.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)

Estimate optimal token size for GPU encoding by testing increasing lengths.

This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.

Parameters:
  • encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding

  • start_length (int) – Starting total length to test (default: 3000)

  • end_length (int) – Maximum total length to test (default: 10000)

  • step (int) – Increment between test lengths (default: 1000)

  • num_trials (int) – Number of trials per length for robustness (default: 3)

  • base_protein_length (int) – Approximate length of individual proteins (default: 100)

Returns:

Dictionary with estimation results:
  • ‘max_length’: Maximum length successfully encoded

  • ‘recommended_token_size’: Recommended token budget (90% of max)

  • ‘trials_per_length’: Dictionary of successful trials per length

  • ‘device’: Device used for testing

Raises:

ValueError – If encoder doesn’t have required attributes or torch not available

Entropy Calculation

Entropy calculation utilities.

Shannon Entropy

Shannon entropy calculation for sequences.

class genome_entropy.entropy.shannon.EntropyReport(dna_entropy_global, orf_nt_entropy, protein_aa_entropy, three_di_entropy, alphabet_sizes)

Bases: object

Report containing entropy values at different representation levels.

Parameters:
dna_entropy_global

Entropy of the entire input DNA sequence

Type:

float

orf_nt_entropy

Dictionary mapping ORF IDs to their nucleotide entropy

Type:

Dict[str, float]

protein_aa_entropy

Dictionary mapping ORF IDs to their amino acid entropy

Type:

Dict[str, float]

three_di_entropy

Dictionary mapping ORF IDs to their 3Di token entropy

Type:

Dict[str, float]

alphabet_sizes

Dictionary with alphabet sizes for each representation

Type:

Dict[str, int]

dna_entropy_global: float
orf_nt_entropy: Dict[str, float]
protein_aa_entropy: Dict[str, float]
three_di_entropy: Dict[str, float]
alphabet_sizes: Dict[str, int]
__init__(dna_entropy_global, orf_nt_entropy, protein_aa_entropy, three_di_entropy, alphabet_sizes)
Return type:

None

genome_entropy.entropy.shannon.shannon_entropy(sequence, alphabet=None, normalize=False)

Calculate Shannon entropy of a sequence.

Shannon entropy is defined as H = -Σ(p_i × log₂(p_i)), where p_i is the relative frequency of symbol i in the sequence.

Parameters:
  • sequence (str) – String to calculate entropy for

  • alphabet (Set[str] | None) – Optional set of symbols in the alphabet for normalization

  • normalize (bool) – If True, normalize entropy by max possible entropy (log₂|alphabet|)

Returns:

Shannon entropy value (bits):
  • Returns 0.0 for empty sequences

  • Returns normalized entropy in [0, 1] if normalize=True

Return type:

float

Examples

>>> shannon_entropy("AAAA")
0.0
>>> shannon_entropy("ACGT")
2.0
>>> shannon_entropy("ACGT", normalize=True, alphabet=set("ACGT"))
1.0

genome_entropy.entropy.shannon.calculate_sequence_entropy(sequence, alphabet=None, normalize=False)

Calculate entropy for a biological sequence.

Convenience wrapper around shannon_entropy that handles common preprocessing (e.g., converting to uppercase).

Parameters:
  • sequence (str) – Biological sequence (DNA, protein, 3Di tokens)

  • alphabet (Set[str] | None) – Optional alphabet for normalization

  • normalize (bool) – Whether to normalize by alphabet size

Returns:

Shannon entropy in bits (or normalized to [0, 1])

Return type:

float
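The formula H = -Σ(p_i × log₂(p_i)) and the uppercase preprocessing can be sketched in a few lines. This is an illustration consistent with the documented examples, not the library's source: the function names are hypothetical, and falling back to the observed symbols when normalize=True is requested without an alphabet is an assumption.

```python
import math
from collections import Counter
from typing import Optional, Set


def shannon_entropy_sketch(
    sequence: str, alphabet: Optional[Set[str]] = None, normalize: bool = False
) -> float:
    """Shannon entropy in bits, optionally divided by log2 of the alphabet size."""
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    n = len(sequence)
    # Sum -p * log2(p) over the relative frequency p of each observed symbol.
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    if normalize:
        # Assumption: without an explicit alphabet, use the observed symbols.
        size = len(alphabet) if alphabet else len(counts)
        return entropy / math.log2(size) if size > 1 else 0.0
    return entropy


def sequence_entropy_sketch(
    sequence: str, alphabet: Optional[Set[str]] = None, normalize: bool = False
) -> float:
    """Upper-case the sequence first so that 'a' and 'A' count as one symbol."""
    return shannon_entropy_sketch(sequence.upper(), alphabet, normalize)


print(shannon_entropy_sketch("ACGT"))                                           # 2.0
print(shannon_entropy_sketch("ACGT", alphabet=set("ACGT"), normalize=True))     # 1.0
print(sequence_entropy_sketch("acgtACGT"))                                      # 2.0
```

For a DNA alphabet of four symbols the maximum is log₂(4) = 2 bits, which is why "ACGT" normalizes to 1.0.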

genome_entropy.entropy.shannon.calculate_entropies_for_sequences(sequences, alphabet=None, normalize=False)

Calculate entropy for multiple sequences.

Parameters:
  • sequences (Dict[str, str]) – Dictionary mapping IDs to sequences

  • alphabet (Set[str] | None) – Optional alphabet for normalization

  • normalize (bool) – Whether to normalize by alphabet size

Returns:

Dictionary mapping IDs to entropy values

Return type:

Dict[str, float]

Pipeline

Pipeline orchestration.

class genome_entropy.pipeline.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)

Bases: object

Result of running the complete DNA to 3Di pipeline.

Parameters:
input_id

ID of the input DNA sequence

Type:

str

input_dna_length

Length of the input DNA sequence

Type:

int

orfs

List of ORFs found in the sequence

Type:

List[genome_entropy.orf.types.OrfRecord]

proteins

List of translated proteins

Type:

List[genome_entropy.translate.translator.ProteinRecord]

three_dis

List of 3Di encoded structures

Type:

List[genome_entropy.encode3di.types.ThreeDiRecord]

entropy

Entropy report for all representations

Type:

genome_entropy.entropy.shannon.EntropyReport

input_id: str
input_dna_length: int
orfs: List[OrfRecord]
proteins: List[ProteinRecord]
three_dis: List[ThreeDiRecord]
entropy: EntropyReport
__init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)
Return type:

None

genome_entropy.pipeline.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)

Run the complete DNA to 3Di pipeline with entropy calculation.

Pipeline steps:
  1. Read FASTA file or GenBank file

  2. Find ORFs in all 6 reading frames

  3. Translate ORFs to proteins

  4. Encode proteins to 3Di structural tokens

  5. Calculate entropy at all levels

  6. Optionally match ORFs to GenBank CDS annotations

  7. Optionally write results to JSON

Parameters:
  • input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.

  • table_id (int) – NCBI genetic code table ID

  • min_aa_len (int) – Minimum protein length in amino acids

  • model_name (str) – ProstT5 model name

  • compute_entropy (bool) – Whether to compute entropy values

  • output_json (str | Path | None) – Optional path to save results as JSON

  • device (str | None) – Device for 3Di encoding (“cuda”, “mps”, “cpu”, or None for auto). Ignored if use_multi_gpu is True.

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.

  • genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.

  • encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.

Returns:

List of PipelineResult objects (one per input sequence)

Return type:

List[PipelineResult]
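Since run_pipeline returns one PipelineResult per input sequence, a typical next step is to summarize the list. The sketch below relies only on the documented attribute names; summarize_results is a hypothetical helper, the demo objects are stand-ins so the example runs without the package, a model, or a get_orfs binary, and it assumes compute_entropy=True so that the entropy report is populated.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def summarize_results(results: Iterable[Any]) -> List[Dict[str, Any]]:
    """Collect one summary row per PipelineResult-like object."""
    rows = []
    for result in results:
        rows.append({
            "input_id": result.input_id,
            "dna_length": result.input_dna_length,
            "n_orfs": len(result.orfs),
            "dna_entropy": result.entropy.dna_entropy_global,
        })
    return rows


# With the package installed, results would come from a call such as:
#     from genome_entropy.pipeline import run_pipeline
#     results = run_pipeline(input_fasta="genome.fasta", device="cpu")
# ("genome.fasta" is a placeholder path.)
demo = [
    SimpleNamespace(
        input_id="seq1",
        input_dna_length=1200,
        orfs=["orf_1", "orf_2"],  # stand-ins for OrfRecord objects
        entropy=SimpleNamespace(dna_entropy_global=1.98),
    )
]
print(summarize_results(demo))
```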

genome_entropy.pipeline.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)

Calculate entropy at all representation levels.

Returns:

EntropyReport with entropy values

Return type:

EntropyReport

class genome_entropy.pipeline.UnifiedPipelineResult(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)

Bases: object

Result of running the complete DNA to 3Di pipeline (unified format).

This is the new format that eliminates redundancy by using a single dictionary of features keyed by orf_id, instead of separate parallel lists for orfs, proteins, and three_dis.

Parameters:
schema_version

Version of the output schema (for compatibility tracking)

Type:

str

input_id

ID of the input DNA sequence

Type:

str

input_dna_length

Length of the input DNA sequence

Type:

int

dna_entropy_global

Entropy of the entire input DNA sequence

Type:

float

alphabet_sizes

Dictionary with alphabet sizes for each representation

Type:

Dict[str, int]

features

Dictionary mapping orf_id to UnifiedFeature objects

Type:

Dict[str, genome_entropy.pipeline.types.UnifiedFeature]

schema_version: str
input_id: str
input_dna_length: int
dna_entropy_global: float
alphabet_sizes: Dict[str, int]
features: Dict[str, UnifiedFeature]
__init__(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)
Return type:

None

class genome_entropy.pipeline.UnifiedFeature(orf_id, location, dna, protein, three_di, metadata, entropy)

Bases: object

Unified representation of a biological feature (ORF and derived data).

This structure consolidates all information about a single ORF into one hierarchical object, eliminating the redundancy present in the old format where ORF data was duplicated in proteins list and protein data was duplicated in three_dis list.

Parameters:
orf_id

Unique identifier for this feature

Type:

str

location

Genomic coordinates

Type:

genome_entropy.pipeline.types.FeatureLocation

dna

DNA sequence information

Type:

genome_entropy.pipeline.types.FeatureDNA

protein

Protein sequence information

Type:

genome_entropy.pipeline.types.FeatureProtein

three_di

3Di structural encoding

Type:

genome_entropy.pipeline.types.FeatureThreeDi

metadata

Additional metadata

Type:

genome_entropy.pipeline.types.FeatureMetadata

entropy

Entropy values at all representation levels

Type:

genome_entropy.pipeline.types.FeatureEntropy

orf_id: str
location: FeatureLocation
dna: FeatureDNA
protein: FeatureProtein
three_di: FeatureThreeDi
metadata: FeatureMetadata
entropy: FeatureEntropy
__init__(orf_id, location, dna, protein, three_di, metadata, entropy)
Return type:

None

class genome_entropy.pipeline.FeatureLocation(start, end, strand, frame)[source]

Bases: object

Genomic location of a feature (ORF).

Parameters:
start

0-based start position (inclusive)

Type:

int

end

0-based end position (exclusive)

Type:

int

strand

Strand orientation (‘+’ or ‘-’)

Type:

Literal[‘+’, ‘-’]

frame

Reading frame (0, 1, or 2)

Type:

int

start: int
end: int
strand: Literal['+', '-']
frame: int
__init__(start, end, strand, frame)
Return type:

None
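As a minimal sketch of how 0-based, half-open coordinates behave, the snippet below slices a feature out of its parent sequence. The extract_feature helper is hypothetical and not part of genome_entropy. It also assumes that start and end index the forward strand even for ‘-’ features, which this page does not confirm.

```python
# Hypothetical helper for illustration; not part of genome_entropy.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_feature(parent: str, start: int, end: int, strand: str) -> str:
    """Slice [start, end) from the parent and orient it by strand."""
    segment = parent[start:end]  # half-open: position `end` is excluded
    if strand == "-":
        # Assumption: coordinates refer to the forward strand, so a
        # minus-strand feature is the reverse complement of the slice.
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment


parent = "AAATGCCCTAAGG"
plus = extract_feature(parent, 2, 11, "+")
minus = extract_feature(parent, 2, 11, "-")
print(plus, len(plus))
print(minus)
```

With this convention the feature length is simply end - start, with no off-by-one correction.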

class genome_entropy.pipeline.FeatureDNA(nt_sequence, length)[source]

Bases: object

DNA-level information for a feature.

Parameters:
  • nt_sequence (str)

  • length (int)

nt_sequence

Nucleotide sequence

Type:

str

length

Length of nucleotide sequence

Type:

int

nt_sequence: str
length: int
__init__(nt_sequence, length)
Parameters:
  • nt_sequence (str)

  • length (int)

Return type:

None

class genome_entropy.pipeline.FeatureProtein(aa_sequence, length)[source]

Bases: object

Protein-level information for a feature.

Parameters:
  • aa_sequence (str)

  • length (int)

aa_sequence

Amino acid sequence

Type:

str

length

Length of amino acid sequence

Type:

int

aa_sequence: str
length: int
__init__(aa_sequence, length)
Parameters:
  • aa_sequence (str)

  • length (int)

Return type:

None

class genome_entropy.pipeline.FeatureThreeDi(encoding, length, method, model_name, inference_device)[source]

Bases: object

3Di structural encoding for a feature.

Parameters:
  • encoding (str)

  • length (int)

  • method (str)

  • model_name (str)

  • inference_device (str)

encoding

3Di token sequence

Type:

str

length

Length of 3Di sequence

Type:

int

method

Method used for encoding (e.g., “prostt5_aa2fold”)

Type:

str

model_name

Name of the model used

Type:

str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:

str

encoding: str
length: int
method: str
model_name: str
inference_device: str
__init__(encoding, length, method, model_name, inference_device)
Parameters:
  • encoding (str)

  • length (int)

  • method (str)

  • model_name (str)

  • inference_device (str)

Return type:

None

class genome_entropy.pipeline.FeatureMetadata(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)[source]

Bases: object

Metadata about a feature.

Parameters:
  • parent_id (str)

  • table_id (int)

  • has_start_codon (bool)

  • has_stop_codon (bool)

  • in_genbank (bool)

parent_id

ID of the parent DNA sequence

Type:

str

table_id

NCBI genetic code table ID used

Type:

int

has_start_codon

Whether the ORF has a start codon

Type:

bool

has_stop_codon

Whether the ORF has a stop codon

Type:

bool

in_genbank

Whether this ORF matches a CDS annotated in GenBank

Type:

bool

parent_id: str
table_id: int
has_start_codon: bool
has_stop_codon: bool
in_genbank: bool
__init__(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)
Parameters:
  • parent_id (str)

  • table_id (int)

  • has_start_codon (bool)

  • has_stop_codon (bool)

  • in_genbank (bool)

Return type:

None

class genome_entropy.pipeline.FeatureEntropy(dna_entropy, protein_entropy, three_di_entropy)[source]

Bases: object

Entropy values at different representation levels for a feature.

Parameters:
dna_entropy

Shannon entropy of nucleotide sequence

Type:

float

protein_entropy

Shannon entropy of amino acid sequence

Type:

float

three_di_entropy

Shannon entropy of 3Di encoding

Type:

float

dna_entropy: float
protein_entropy: float
three_di_entropy: float
__init__(dna_entropy, protein_entropy, three_di_entropy)
Return type:

None
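The three entropy fields are Shannon entropies of the corresponding strings. A minimal sketch of that calculation follows, assuming base-2 logarithms (bits per symbol) and symbol frequencies taken from the sequence itself. The function name shannon_entropy_bits is hypothetical, and this is not necessarily how genome_entropy.entropy.shannon computes its values.

```python
import math
from collections import Counter


def shannon_entropy_bits(sequence: str) -> float:
    """Shannon entropy in bits per symbol: H = sum(p * log2(1 / p))."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    # p * log2(1 / p) is the same as -p * log2(p) and avoids a negative zero.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


uniform = shannon_entropy_bits("ACGTACGT")   # four equally frequent symbols
constant = shannon_entropy_bits("AAAAAAAA")  # a single repeated symbol
print(uniform, constant)
```

A sequence that uses four symbols equally often reaches the maximum of 2 bits for a four-letter alphabet, while a homopolymer has zero entropy.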

Runner

End-to-end pipeline orchestration for DNA to 3Di with entropy calculation.

class genome_entropy.pipeline.runner.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)[source]

Bases: object

Result of running the complete DNA to 3Di pipeline.

Parameters:
input_id

ID of the input DNA sequence

Type:

str

input_dna_length

Length of the input DNA sequence

Type:

int

orfs

List of ORFs found in the sequence

Type:

List[genome_entropy.orf.types.OrfRecord]

proteins

List of translated proteins

Type:

List[genome_entropy.translate.translator.ProteinRecord]

three_dis

List of 3Di encoded structures

Type:

List[genome_entropy.encode3di.types.ThreeDiRecord]

entropy

Entropy report for all representations

Type:

genome_entropy.entropy.shannon.EntropyReport

input_id: str
input_dna_length: int
orfs: List[OrfRecord]
proteins: List[ProteinRecord]
three_dis: List[ThreeDiRecord]
entropy: EntropyReport
__init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)
Return type:

None

genome_entropy.pipeline.runner.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)[source]

Run the complete DNA to 3Di pipeline with entropy calculation.

Pipeline steps:

  1. Read FASTA file or GenBank file

  2. Find ORFs in all 6 reading frames

  3. Translate ORFs to proteins

  4. Encode proteins to 3Di structural tokens

  5. Calculate entropy at all levels

  6. Optionally match ORFs to GenBank CDS annotations

  7. Optionally write results to JSON

Parameters:
  • input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.

  • table_id (int) – NCBI genetic code table ID

  • min_aa_len (int) – Minimum protein length in amino acids

  • model_name (str) – ProstT5 model name

  • compute_entropy (bool) – Whether to compute entropy values

  • output_json (str | Path | None) – Optional path to save results as JSON

  • device (str | None) – Device for 3Di encoding (“cuda”, “mps”, “cpu”, or None for auto). Ignored if use_multi_gpu is True.

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.

  • genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.

  • encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.

Returns:

List of PipelineResult objects (one per input sequence)

Return type:

List[PipelineResult]
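Step 2 of the pipeline searches all six reading frames. The sketch below only illustrates what those six frames are: three codon offsets on the forward strand and three on the reverse complement. The six_frames and reverse_complement helpers are hypothetical. The package itself delegates ORF finding to the get_orfs binary, so this is not its implementation.

```python
# Illustrative helpers only; genome_entropy uses the get_orfs binary instead.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict[tuple[str, int], str]:
    """Map (strand, frame) to the sequence trimmed to whole codons."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            usable = len(seq) - frame
            usable -= usable % 3  # drop a trailing partial codon
            frames[(strand, frame)] = seq[frame:frame + usable]
    return frames


frames = six_frames("ATGAAATAGC")
for key, seq in frames.items():
    print(key, seq)
```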

genome_entropy.pipeline.runner.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)[source]

Calculate entropy at all representation levels.

Returns:

EntropyReport with entropy values

Return type:

EntropyReport

I/O

I/O utilities for genome_entropy.

FASTA I/O

FASTA file reading and writing utilities.

genome_entropy.io.fasta.read_fasta(fasta_path)[source]

Read a FASTA file and return a dictionary of sequence_id -> sequence.

Automatically detects and handles gzipped files (ending in .gz).

Parameters:

fasta_path (str | Path) – Path to FASTA file (plain text or gzipped)

Returns:

Dictionary mapping sequence IDs to sequences

Return type:

Dict[str, str]

genome_entropy.io.fasta.read_fasta_iter(fasta_path)[source]

Read a FASTA file and yield (sequence_id, sequence) tuples.

Memory-efficient iterator for large FASTA files. Automatically detects and handles gzipped files (ending in .gz).

Parameters:

fasta_path (str | Path) – Path to FASTA file (plain text or gzipped)

Yields:

Tuples of (sequence_id, sequence)

Return type:

Iterator[Tuple[str, str]]
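To make the streaming and gzip behaviour concrete, here is a rough stand-alone sketch of a FASTA iterator that picks its opener from the file suffix. The iter_fasta name is hypothetical, and treating the first whitespace-delimited header token as the sequence ID is an assumption. The actual read_fasta_iter may differ in both respects.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterator, List, Optional, Tuple


def iter_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs one record at a time."""
    opener = gzip.open if path.suffix == ".gz" else open
    seq_id: Optional[str] = None
    chunks: List[str] = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                # Assumption: the ID is the first token of the header line.
                header = line[1:].split()
                seq_id, chunks = (header[0] if header else ""), []
            else:
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "demo.fasta.gz"
    with gzip.open(fasta, "wt") as out:
        out.write(">seq1 demo\nATCG\nATCG\n>seq2\nGCTA\n")
    records = list(iter_fasta(fasta))

print(records)
```

Because records are yielded as they are completed, only one sequence needs to be held in memory at a time.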

genome_entropy.io.fasta.write_fasta(sequences, output_path, line_width=80)[source]

Write sequences to a FASTA file.

Automatically compresses output if filename ends with .gz.

Parameters:
  • sequences (Dict[str, str]) – Dictionary mapping sequence IDs to sequences

  • output_path (str | Path) – Path to output FASTA file (plain text or .gz for compressed)

  • line_width (int) – Maximum line width for sequence lines (default: 80)

Return type:

None
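The line_width parameter controls how sequence lines are wrapped. A small sketch of that wrapping, using a hypothetical format_fasta_record helper rather than write_fasta itself:

```python
def format_fasta_record(seq_id: str, sequence: str, line_width: int = 80) -> str:
    """Render one FASTA record with sequence lines of at most line_width."""
    lines = [f">{seq_id}"]
    for i in range(0, len(sequence), line_width):
        lines.append(sequence[i:i + line_width])
    return "\n".join(lines) + "\n"


record = format_fasta_record("seq1", "ATCGATCGAT", line_width=4)
print(record, end="")
```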

JSON I/O

JSON serialization for data models.

genome_entropy.io.jsonio.to_json_dict(obj)[source]

Convert a dataclass object to a JSON-serializable dictionary.

Recursively handles nested dataclasses, lists, and dictionaries.

Parameters:

obj (Any) – Object to convert (typically a dataclass instance)

Returns:

JSON-serializable dictionary

Return type:

Any
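The recursive conversion described above can be sketched with the standard dataclasses module. The to_plain function and the Location and Feature classes below are illustrative stand-ins, not the library's own to_json_dict or pipeline types.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


def to_plain(obj: Any) -> Any:
    """Recursively convert dataclasses, lists, and dicts to JSON-ready values."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: to_plain(getattr(obj, f.name)) for f in fields(obj)}
    if isinstance(obj, (list, tuple)):
        return [to_plain(item) for item in obj]
    if isinstance(obj, dict):
        return {str(key): to_plain(value) for key, value in obj.items()}
    return obj  # str, int, float, bool, None pass through unchanged


@dataclass
class Location:  # illustrative stand-in, not FeatureLocation itself
    start: int
    end: int


@dataclass
class Feature:  # illustrative stand-in
    orf_id: str
    location: Location


payload = to_plain({"features": [Feature("orf_1", Location(0, 90))]})
print(json.dumps(payload, indent=2))
```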

genome_entropy.io.jsonio.convert_pipeline_result_to_unified(pipeline_result)[source]

Convert PipelineResult to UnifiedPipelineResult format.

This function transforms the old redundant format (separate orfs, proteins, three_dis lists) into the new unified format where each feature appears exactly once with all its related data organized hierarchically.

OLD FORMAT PROBLEM:

The old format had three parallel lists:

  • orfs: [ORF1, ORF2, …]

  • proteins: [{orf: ORF1, aa_seq: …}, {orf: ORF2, aa_seq: …}, …]

  • three_dis: [{protein: {orf: ORF1, …}, 3di: …}, …]

This caused:

  1. ORF data duplicated 3 times (in orfs, inside proteins, inside three_dis)

  2. Protein data duplicated 2 times (in proteins, inside three_dis)

  3. ~2-3x larger files due to redundancy

  4. Risk of inconsistency if data differs between copies

NEW UNIFIED FORMAT:

Single features dictionary with hierarchical organization:

features: {
    "orf_1": {
        location: {start, end, strand, frame},
        dna: {nt_sequence, length},
        protein: {aa_sequence, length},
        three_di: {encoding, length, method, model_name, inference_device},
        metadata: {parent_id, table_id, has_start_codon, has_stop_codon, in_genbank},
        entropy: {dna_entropy, protein_entropy, three_di_entropy}
    }
}

Benefits:

  1. Each piece of information stored exactly once

  2. 40-50% smaller file sizes

  3. Direct O(1) access by orf_id

  4. Clear hierarchical organization matching biological concepts

  5. Single source of truth, so copies cannot disagree

Parameters:

pipeline_result – PipelineResult object or list of PipelineResult objects

Returns:

UnifiedPipelineResult object or list of UnifiedPipelineResult objects
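The restructuring can be sketched with plain dictionaries. The unify function below is hypothetical, and the nested shapes of the old records (a protein carrying its orf, a 3Di record carrying its protein) are assumptions based on the description above. The real function operates on the package's dataclasses and also fills in metadata and entropy.

```python
def unify(orfs, proteins, three_dis):
    """Merge three parallel lists (plain dicts here) into one features dict."""
    # Index the derived records by ORF ID so each ORF is visited once.
    aa_by_id = {p["orf"]["orf_id"]: p["aa_sequence"] for p in proteins}
    tdi_by_id = {t["protein"]["orf"]["orf_id"]: t["three_di"] for t in three_dis}
    features = {}
    for orf in orfs:
        oid = orf["orf_id"]
        aa = aa_by_id.get(oid, "")
        tdi = tdi_by_id.get(oid, "")
        features[oid] = {
            "orf_id": oid,
            "location": {k: orf[k] for k in ("start", "end", "strand", "frame")},
            "dna": {"nt_sequence": orf["nt_sequence"], "length": len(orf["nt_sequence"])},
            "protein": {"aa_sequence": aa, "length": len(aa)},
            "three_di": {"encoding": tdi, "length": len(tdi)},
        }
    return features


# Toy records; the values are made up for illustration.
orf = {"orf_id": "orf_1", "start": 0, "end": 9, "strand": "+", "frame": 0,
       "nt_sequence": "ATGAAATAG"}
protein = {"orf": orf, "aa_sequence": "MK"}
three_di = {"protein": protein, "three_di": "dv"}

features = unify([orf], [protein], [three_di])
print(features["orf_1"]["protein"])
```

Each ORF's data now appears once, keyed by orf_id, instead of being repeated inside the protein and 3Di records.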

genome_entropy.io.jsonio.write_json(data, output_path, indent=2)[source]

Write data to a JSON file.

Automatically handles dataclass objects by converting them to dictionaries. If data contains PipelineResult objects, they are automatically converted to the new unified format to eliminate redundancy. Automatically compresses output if filename ends with .gz.

AUTOMATIC CONVERSION:

This function transparently converts old-format PipelineResult objects to the new unified format. This means:

  1. Users don’t need to manually call convert_pipeline_result_to_unified()

  2. All JSON output from the pipeline automatically uses the new format

  3. The conversion happens only once during serialization

  4. No changes needed to pipeline code or user scripts

MAPPING: Old Keys → New Structure

OLD FORMAT:
  • orfs[i].orf_id → features[orf_id].orf_id

  • orfs[i].start → features[orf_id].location.start

  • orfs[i].nt_sequence → features[orf_id].dna.nt_sequence

  • proteins[i].aa_sequence → features[orf_id].protein.aa_sequence

  • three_dis[i].three_di → features[orf_id].three_di.encoding

  • entropy.orf_nt_entropy[id] → features[id].entropy.dna_entropy

NEW FORMAT adds:
  • schema_version: “2.0.0” (for compatibility tracking)

  • features: dict (replaces orfs, proteins, three_dis lists)

  • Hierarchical organization (location, dna, protein, three_di, metadata, entropy)

Parameters:
  • data – Data to write (dataclass, dict, list, etc.)

  • output_path – Path to output JSON file (plain text or .gz for compressed)

  • indent – Indentation level for pretty printing (default: 2)

Return type:

None

genome_entropy.io.jsonio.read_json(input_path)[source]

Read JSON data from a file.

Automatically detects and handles gzipped files (ending in .gz).

Parameters:

input_path (str | Path) – Path to input JSON file (plain text or gzipped)

Returns:

Parsed JSON data (dict, list, etc.)

Return type:

Any
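A rough illustration of suffix-based gzip handling for JSON, using only the standard library. The helper names dump_json_auto and load_json_auto are hypothetical, and the real write_json and read_json may detect compression differently.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any


def _opener(path: Path):
    """Pick gzip.open for .gz paths and the built-in open otherwise."""
    return gzip.open if path.suffix == ".gz" else open


def dump_json_auto(data: Any, path: Path, indent: int = 2) -> None:
    with _opener(path)(path, "wt", encoding="utf-8") as handle:
        json.dump(data, handle, indent=indent)


def load_json_auto(path: Path) -> Any:
    with _opener(path)(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.json.gz"
    dump_json_auto({"schema_version": "2.0.0", "features": {}}, target)
    restored = load_json_auto(target)
    # gzip streams start with the magic bytes 0x1f 0x8b
    is_gzip = target.read_bytes()[:2] == b"\x1f\x8b"

print(restored, is_gzip)
```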

Configuration

Configuration and constants for genome_entropy.

Errors

Custom exceptions for genome_entropy.

exception genome_entropy.errors.OrfEntropyError[source]

Bases: Exception

Base exception for genome_entropy package.

exception genome_entropy.errors.ConfigurationError[source]

Bases: OrfEntropyError

Raised when there’s a configuration error.

exception genome_entropy.errors.InputError[source]

Bases: OrfEntropyError

Raised when input data is invalid or cannot be processed.

exception genome_entropy.errors.OrfFinderError[source]

Bases: OrfEntropyError

Raised when ORF finding fails.

exception genome_entropy.errors.TranslationError[source]

Bases: OrfEntropyError

Raised when translation fails.

exception genome_entropy.errors.EncodingError[source]

Bases: OrfEntropyError

Raised when 3Di encoding fails.

exception genome_entropy.errors.ModelError[source]

Bases: OrfEntropyError

Raised when model loading or inference fails.

exception genome_entropy.errors.DeviceError[source]

Bases: OrfEntropyError

Raised when device selection or initialization fails.

exception genome_entropy.errors.PipelineError[source]

Bases: OrfEntropyError

Raised when the pipeline orchestration fails.

Logging

Centralized logging configuration for genome_entropy.

This module provides a single source for configuring logging throughout the application. It supports:

  • Multiple log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)

  • Output to file or STDOUT

  • Consistent format across all modules

genome_entropy.logging_config.configure_logging(level=20, log_file=None, log_format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', date_format='%Y-%m-%d %H:%M:%S', force=False)[source]

Configure logging for the entire application.

This should be called once at application startup (e.g., in CLI main).

Parameters:
  • level (int | str) – Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) as int or string

  • log_file (str | Path | None) – Optional path to log file. If None, logs to STDOUT

  • log_format (str) – Format string for log messages

  • date_format (str) – Format string for timestamps

  • force (bool) – If True, reconfigure even if already configured

Return type:

None

Examples

>>> import logging
>>> configure_logging(level=logging.DEBUG, log_file="app.log")
>>> configure_logging(level="INFO")  # Log to STDOUT
>>> configure_logging(level="DEBUG", log_file=None)  # Debug to STDOUT
genome_entropy.logging_config.get_logger(name)[source]

Get a logger instance for a module.

This is the preferred way to get loggers in the application.

Parameters:

name (str) – Name of the logger (usually __name__ of the module)

Returns:

Configured logger instance

Return type:

Logger

Example

>>> logger = get_logger(__name__)
>>> logger.info("Processing started")
genome_entropy.logging_config.is_configured()[source]

Check if logging has been configured.

Returns:

True if configure_logging() has been called

Return type:

bool

genome_entropy.logging_config.get_log_file()[source]

Get the current log file path.

Returns:

Path to log file, or None if logging to STDOUT

Return type:

Path | None

genome_entropy.logging_config.get_log_level()[source]

Get the current logging level.

Returns:

Current logging level as integer

Return type:

int
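The level arguments in this module accept either an int or a string, and the default level=20 of configure_logging() equals logging.INFO. The sketch below shows one way such a level could be normalized and how a file-or-STDOUT handler could be chosen with the standard logging module. Both helper names are hypothetical, and this is not the module's actual code.

```python
import logging
import sys
from typing import Optional, Union


def resolve_level(level: Union[int, str]) -> int:
    """Accept logging.DEBUG or "DEBUG" and return the numeric level."""
    if isinstance(level, int):
        return level
    value = getattr(logging, level.upper(), None)
    if not isinstance(value, int):
        raise ValueError(f"Unknown log level: {level!r}")
    return value


def make_handler(log_file: Optional[str] = None) -> logging.Handler:
    """File handler when a path is given, otherwise a STDOUT stream handler."""
    if log_file is not None:
        return logging.FileHandler(log_file)
    return logging.StreamHandler(sys.stdout)


print(resolve_level("info"), resolve_level(logging.DEBUG))
```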

genome_entropy.logging_config.set_log_level(level)[source]

Change the logging level at runtime.

Parameters:

level (int | str) – New logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

Return type:

None

Example

>>> import logging
>>> set_log_level("DEBUG")
>>> set_log_level(logging.WARNING)

Usage Examples

ORF Finding

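from genome_entropy.io.fasta import read_fasta
from genome_entropy.orf.finder import find_orfs

# Load sequences first; find_orfs takes sequences rather than a FASTA path
# (assumed here to accept the ID -> sequence dict that read_fasta returns)
sequences = read_fasta("genome.fasta")

# Find ORFs
orfs = find_orfs(
    sequences,
    table_id=11,
    min_nt_length=90
)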

# Examine results
for orf in orfs:
    print(f"ORF {orf.orf_id}: {orf.start}-{orf.end} ({orf.strand})")
    print(f"  Nucleotide: {orf.nt_sequence[:50]}...")
    print(f"  Amino acid: {orf.aa_sequence[:50]}...")

Translation

from genome_entropy.translate.translator import translate_orfs

# Translate ORFs
proteins = translate_orfs(orfs, table_id=11)

for protein in proteins:
    print(f"Protein from {protein.orf.orf_id}: {protein.aa_sequence}")
    print(f"  Length: {protein.aa_length} amino acids")

3Di Encoding

from genome_entropy.encode3di import ProstT5ThreeDiEncoder

# Initialize encoder
encoder = ProstT5ThreeDiEncoder(
    model_name="Rostlab/ProstT5_fp16",
    device="auto"  # Auto-detect CUDA/MPS/CPU
)

# Encode proteins to 3Di
aa_sequences = [p.aa_sequence for p in proteins]
three_di_tokens = encoder.encode(
    aa_sequences,
    batch_size=4,
    encoding_size=5000
)

for i, tokens in enumerate(three_di_tokens):
    print(f"Protein {i}: {tokens[:50]}...")

Token Estimation

from genome_entropy.encode3di import ProstT5ThreeDiEncoder, estimate_token_size

# Initialize encoder
encoder = ProstT5ThreeDiEncoder()

# Find optimal encoding size
results = estimate_token_size(
    encoder=encoder,
    start_length=3000,
    end_length=10000,
    step=1000,
    num_trials=3
)

print(f"Recommended encoding size: {results['recommended_token_size']} AA")

# Use in encoding (aa_sequences: protein strings, as built in the 3Di Encoding example)
three_di = encoder.encode(
    aa_sequences,
    encoding_size=results['recommended_token_size']
)

Shannon Entropy

from genome_entropy.entropy.shannon import shannon_entropy, calculate_sequence_entropy

# Calculate basic entropy
dna = "ATCGATCGATCG"
entropy = shannon_entropy(dna)
print(f"DNA entropy: {entropy:.2f} bits")

# Normalized entropy
normalized = shannon_entropy(
    dna,
    alphabet=set("ACGT"),
    normalize=True
)
print(f"Normalized: {normalized:.2f}")

# For biological sequences
protein = "MKKYTLFLGLLGLVAAGTLWGLSACCA"
protein_entropy = calculate_sequence_entropy(protein)
print(f"Protein entropy: {protein_entropy:.2f} bits")

Complete Pipeline

from pathlib import Path
from genome_entropy.pipeline.runner import run_pipeline

# Run complete pipeline
results = run_pipeline(
    input_fasta=Path("genome.fasta"),
    output_json=Path("results.json"),
    table_id=11,
    min_aa_len=30,
    model_name="Rostlab/ProstT5_fp16",
    device=None,  # None selects the device automatically
    compute_entropy=True
)

# Process results
for result in results:
    print(f"Sequence: {result.input_id}")
    print(f"  DNA length: {result.input_dna_length}")
    print(f"  ORFs found: {len(result.orfs)}")
    print(f"  DNA entropy: {result.entropy.dna_entropy_global:.2f}")

    for orf_id, entropy in result.entropy.protein_aa_entropy.items():
        print(f"  Protein {orf_id} entropy: {entropy:.2f}")

I/O Operations

from genome_entropy.io.fasta import read_fasta, write_fasta
from genome_entropy.io.jsonio import write_json, read_json

# Read FASTA (returns a dict of sequence_id -> sequence)
sequences = read_fasta("genome.fasta")
for seq_id, seq in sequences.items():
    print(f"{seq_id}: {len(seq)} bp")

# Write FASTA (sequences first, then the output path)
output_sequences = {
    "seq1": "ATCGATCG",
    "seq2": "GCTAGCTA"
}
write_fasta(output_sequences, "output.fasta")

# Write/read JSON
data = {"key": "value", "results": [1, 2, 3]}
write_json(data, "output.json")
loaded = read_json("output.json")

Error Handling

from genome_entropy.errors import (
    OrfEntropyError,
    OrfFinderError,
    TranslationError,
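    EncodingError
)
from genome_entropy.io.fasta import read_fasta
from genome_entropy.orf.finder import find_orfs

try:
    # find_orfs takes loaded sequences, not a FASTA path
    orfs = find_orfs(read_fasta("genome.fasta"), table_id=11)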
except OrfFinderError as e:
    print(f"ORF finding failed: {e}")
except OrfEntropyError as e:
    print(f"General error: {e}")

Custom Logging

from genome_entropy.logging_config import configure_logging, get_logger

# Configure logging
configure_logging(level="DEBUG", log_file="debug.log")

# Get logger for your module (get_logger is the documented preferred way)
logger = get_logger(__name__)
logger.info("Starting analysis")
logger.debug("Detailed debug information")

Advanced: Custom Batching

from genome_entropy.encode3di.encoder import ProstT5ThreeDiEncoder

encoder = ProstT5ThreeDiEncoder()

# Create batches with token budget
sequences = ["MKKYTLFLG", "ACDEFGHIK", ...]
batches = encoder.token_budget_batches(
    sequences,
    max_total_length=5000
)

# Process each batch
all_results = []
for batch in batches:
    batch_results = encoder._encode_batch(batch)
    all_results.extend(batch_results)
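To show what batching under a length budget means, here is a minimal greedy sketch. The budget_batches function is hypothetical and only illustrates the idea. Whether the encoder's own token_budget_batches sorts sequences, counts tokens differently, or handles over-long sequences another way is not stated on this page.

```python
from typing import Iterator, List


def budget_batches(sequences: List[str], max_total_length: int) -> Iterator[List[str]]:
    """Group sequences in order so each batch stays within the length budget."""
    batch: List[str] = []
    used = 0
    for seq in sequences:
        # Close the current batch if adding this sequence would exceed the budget.
        if batch and used + len(seq) > max_total_length:
            yield batch
            batch, used = [], 0
        batch.append(seq)  # an over-budget sequence still gets its own batch
        used += len(seq)
    if batch:
        yield batch


demo = ["A" * 4, "C" * 3, "D" * 5, "E" * 12, "F" * 2]
batch_lengths = [[len(s) for s in b] for b in budget_batches(demo, max_total_length=8)]
print(batch_lengths)
```

The greedy pass keeps input order, which makes it easy to map results back to the original sequences.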

Data Classes

OrfRecord

@dataclass
class OrfRecord:
    parent_id: str          # Source sequence ID
    orf_id: str             # Unique ORF identifier
    start: int              # 0-based, inclusive
    end: int                # 0-based, exclusive
    strand: Literal["+","-"]
    frame: int              # 0, 1, 2
    nt_sequence: str        # Nucleotide sequence
    aa_sequence: str        # Amino acid sequence
    table_id: int           # NCBI translation table
    has_start_codon: bool
    has_stop_codon: bool
    in_genbank: bool = False  # Matches a CDS annotated in GenBank

ThreeDiRecord

@dataclass
class ThreeDiRecord:
    orf_id: str
    three_di: str           # 3Di token sequence
    method: Literal["prostt5_aa2fold"]
    model_name: str
    inference_device: str   # "cuda", "mps", or "cpu"

EntropyReport

@dataclass
class EntropyReport:
    dna_entropy_global: float
    orf_nt_entropy: dict[str, float]     # orf_id → entropy
    protein_aa_entropy: dict[str, float]
    three_di_entropy: dict[str, float]
    alphabet_sizes: dict[str, int]

Type Hints

All modules use comprehensive type hints for better IDE support and type checking:

from typing import List, Dict, Optional, Tuple
from pathlib import Path

def find_orfs(
    sequences: Dict[str, str],  # sequence ID -> DNA sequence (assumed type)
    table_id: int = 11,
    min_nt_length: int = 90,
    binary_path: str = "get_orfs"
) -> List[OrfRecord]:
    ...

Next Steps