genome_entropy.pipeline

Pipeline orchestration.

class genome_entropy.pipeline.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)[source]

Result of running the complete DNA to 3Di pipeline.

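Parameters:
  • input_id (str)

  • input_dna_length (int)

  • orfs (List[OrfRecord])

  • proteins (List[ProteinRecord])

  • three_dis (List[ThreeDiRecord])

  • entropy (EntropyReport)
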
input_id

ID of the input DNA sequence

Type:

str

input_dna_length

Length of the input DNA sequence

Type:

int

orfs

List of ORFs found in the sequence

Type:

List[genome_entropy.orf.types.OrfRecord]

proteins

List of translated proteins

Type:

List[genome_entropy.translate.translator.ProteinRecord]

three_dis

List of 3Di encoded structures

Type:

List[genome_entropy.encode3di.types.ThreeDiRecord]

entropy

Entropy report for all representations

Type:

genome_entropy.entropy.shannon.EntropyReport

input_id: str
input_dna_length: int
orfs: List[OrfRecord]
proteins: List[ProteinRecord]
three_dis: List[ThreeDiRecord]
entropy: EntropyReport
__init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)
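Parameters:
  • input_id (str)

  • input_dna_length (int)

  • orfs (List[OrfRecord])

  • proteins (List[ProteinRecord])

  • three_dis (List[ThreeDiRecord])

  • entropy (EntropyReport)
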
Return type:

None

genome_entropy.pipeline.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)[source]

Run the complete DNA to 3Di pipeline with entropy calculation.

Pipeline steps:

1. Read FASTA file or GenBank file
2. Find ORFs in all 6 reading frames
3. Translate ORFs to proteins
4. Encode proteins to 3Di structural tokens
5. Calculate entropy at all levels
6. Optionally match ORFs to GenBank CDS annotations
7. Optionally write results to JSON

Parameters:
  • input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.

  • table_id (int) – NCBI genetic code table ID

  • min_aa_len (int) – Minimum protein length in amino acids

  • model_name (str) – ProstT5 model name

  • compute_entropy (bool) – Whether to compute entropy values

  • output_json (str | Path | None) – Optional path to save results as JSON

  • device (str | None) – Device for 3Di encoding (“cuda”, “mps”, “cpu”, or None for auto). Ignored if use_multi_gpu is True.

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.

  • genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.

  • encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.

Returns:

List of PipelineResult objects (one per input sequence)

Raises:
Return type:

List[PipelineResult]
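
The call pattern described above can be sketched as follows. This is a minimal illustration, not the package's own example: pipeline_kwargs is a hypothetical helper, "example.fasta" is a placeholder path, and the check that at least one input is supplied is an assumption drawn from the note that input_fasta is optional when genbank_file is provided. The actual call only runs if the package is installed and the file exists.

```python
from pathlib import Path


def pipeline_kwargs(fasta=None, genbank=None, table_id=11, min_aa_len=30):
    """Hypothetical helper: assemble keyword arguments for run_pipeline."""
    # Assumption: at least one of the two inputs has to be supplied.
    if fasta is None and genbank is None:
        raise ValueError("provide input_fasta, genbank_file, or both")
    return {
        "input_fasta": Path(fasta) if fasta is not None else None,
        "genbank_file": Path(genbank) if genbank is not None else None,
        "table_id": table_id,
        "min_aa_len": min_aa_len,
        "device": None,  # None asks for automatic device selection
    }


kwargs = pipeline_kwargs(fasta="example.fasta")  # placeholder path

try:
    from genome_entropy.pipeline import run_pipeline
except ImportError:
    run_pipeline = None  # package not installed; skip the actual run

if run_pipeline is not None and kwargs["input_fasta"].exists():
    # One PipelineResult is returned per input sequence.
    for result in run_pipeline(**kwargs):
        print(result.input_id, result.input_dna_length, len(result.orfs))
```

Keeping the arguments in a dictionary makes it easy to log or reuse the exact configuration of a run.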

genome_entropy.pipeline.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)[source]

Calculate entropy at all representation levels.

Parameters:
Returns:

EntropyReport with entropy values

Return type:

EntropyReport
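
To make "entropy at all representation levels" concrete, here is a minimal sketch of Shannon entropy over symbol frequencies, applied to a DNA string, a protein string, and a 3Di string. It assumes base-2 logarithms (bits per symbol); the package may use a different base or normalization. shannon_entropy and entropy_by_level are illustrative names, not part of genome_entropy, and the returned dictionary is not the EntropyReport layout.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy, in bits per symbol, of the symbol frequencies."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_by_level(dna, proteins, three_dis):
    """Hypothetical summary: one value for the DNA, one per derived string."""
    return {
        "dna": shannon_entropy(dna),
        "protein": [shannon_entropy(p) for p in proteins],
        "three_di": [shannon_entropy(t) for t in three_dis],
    }


# Toy inputs chosen only to exercise the function.
report = entropy_by_level("ACGTACGT", ["MKKL"], ["dvvv"])
print(report)
```

A sequence that uses every symbol equally often reaches the maximum for its alphabet, while a homopolymer scores zero.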

class genome_entropy.pipeline.UnifiedPipelineResult(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)[source]

Result of running the complete DNA to 3Di pipeline (unified format).

This is the new format that eliminates redundancy by using a single dictionary of features keyed by orf_id, instead of separate parallel lists for orfs, proteins, and three_dis.

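Parameters:
  • schema_version (str)

  • input_id (str)

  • input_dna_length (int)

  • dna_entropy_global (float)

  • alphabet_sizes (Dict[str, int])

  • features (Dict[str, UnifiedFeature])
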
schema_version

Version of the output schema (for compatibility tracking)

Type:

str

input_id

ID of the input DNA sequence

Type:

str

input_dna_length

Length of the input DNA sequence

Type:

int

dna_entropy_global

Entropy of the entire input DNA sequence

Type:

float

alphabet_sizes

Dictionary with alphabet sizes for each representation

Type:

Dict[str, int]

features

Dictionary mapping orf_id to UnifiedFeature objects

Type:

Dict[str, genome_entropy.pipeline.types.UnifiedFeature]

schema_version: str
input_id: str
input_dna_length: int
dna_entropy_global: float
alphabet_sizes: Dict[str, int]
features: Dict[str, UnifiedFeature]
__init__(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)
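Parameters:
  • schema_version (str)

  • input_id (str)

  • input_dna_length (int)

  • dna_entropy_global (float)

  • alphabet_sizes (Dict[str, int])

  • features (Dict[str, UnifiedFeature])
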
Return type:

None
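
The difference between parallel lists and a single dictionary keyed by orf_id can be sketched with plain dictionaries standing in for the record types. The field names of the stand-in records, the "seq1_orf_1" identifier, and the unify helper are assumptions made for illustration; they are not taken from the package.

```python
def unify(orfs, proteins, three_dis):
    """Merge three parallel record lists into one dict keyed by orf_id."""
    features = {}
    # The old layout relies on list position to link the three records.
    for orf, protein, three_di in zip(orfs, proteins, three_dis):
        features[orf["orf_id"]] = {
            "dna": orf["nt_sequence"],
            "protein": protein["aa_sequence"],
            "three_di": three_di["encoding"],
        }
    return features


# Stand-in records; real ones would come from the pipeline.
orfs = [{"orf_id": "seq1_orf_1", "nt_sequence": "ATGAAATAA"}]
proteins = [{"aa_sequence": "MK"}]
three_dis = [{"encoding": "dv"}]

features = unify(orfs, proteins, three_dis)
print(features["seq1_orf_1"]["protein"])
```

With the keyed layout, everything about one ORF is reachable from a single lookup instead of three index-aligned lists.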

class genome_entropy.pipeline.UnifiedFeature(orf_id, location, dna, protein, three_di, metadata, entropy)[source]

Unified representation of a biological feature (ORF and derived data).

This structure consolidates all information about a single ORF into one hierarchical object, eliminating the redundancy present in the old format where ORF data was duplicated in proteins list and protein data was duplicated in three_dis list.

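Parameters:
  • orf_id (str)

  • location (FeatureLocation)

  • dna (FeatureDNA)

  • protein (FeatureProtein)

  • three_di (FeatureThreeDi)

  • metadata (FeatureMetadata)

  • entropy (FeatureEntropy)
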
orf_id

Unique identifier for this feature

Type:

str

location

Genomic coordinates

Type:

genome_entropy.pipeline.types.FeatureLocation

dna

DNA sequence information

Type:

genome_entropy.pipeline.types.FeatureDNA

protein

Protein sequence information

Type:

genome_entropy.pipeline.types.FeatureProtein

three_di

3Di structural encoding

Type:

genome_entropy.pipeline.types.FeatureThreeDi

metadata

Additional metadata

Type:

genome_entropy.pipeline.types.FeatureMetadata

entropy

Entropy values at all representation levels

Type:

genome_entropy.pipeline.types.FeatureEntropy

orf_id: str
location: FeatureLocation
dna: FeatureDNA
protein: FeatureProtein
three_di: FeatureThreeDi
metadata: FeatureMetadata
entropy: FeatureEntropy
__init__(orf_id, location, dna, protein, three_di, metadata, entropy)
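Parameters:
  • orf_id (str)

  • location (FeatureLocation)

  • dna (FeatureDNA)

  • protein (FeatureProtein)

  • three_di (FeatureThreeDi)

  • metadata (FeatureMetadata)

  • entropy (FeatureEntropy)
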
Return type:

None
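
As a rough picture of the hierarchical layout, the sketch below builds a reduced feature from local stand-in dataclasses and converts it to nested JSON. Location, Entropy, and Feature are simplified stand-ins declared here, not the package's classes, and the JSON produced is not claimed to match what run_pipeline writes to output_json.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Location:  # stand-in for FeatureLocation
    start: int
    end: int
    strand: str
    frame: int


@dataclass
class Entropy:  # stand-in for FeatureEntropy
    dna_entropy: float
    protein_entropy: float
    three_di_entropy: float


@dataclass
class Feature:  # reduced stand-in for UnifiedFeature
    orf_id: str
    location: Location
    entropy: Entropy


# Illustrative values only.
feature = Feature("seq1_orf_1", Location(0, 9, "+", 0), Entropy(1.5, 1.0, 1.0))

# asdict recurses into nested dataclasses, giving one nested mapping.
payload = json.dumps(asdict(feature), indent=2)
print(payload)
```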

class genome_entropy.pipeline.FeatureLocation(start, end, strand, frame)[source]

Genomic location of a feature (ORF).

Parameters:
  • start (int)

  • end (int)

  • strand (Literal['+', '-'])

  • frame (int)

start

0-based start position (inclusive)

Type:

int

end

0-based end position (exclusive)

Type:

int

strand

Strand orientation ('+' or '-')

Type:

Literal['+', '-']

frame

Reading frame (0, 1, 2, or 3)

Type:

int

start: int
end: int
strand: Literal['+', '-']
frame: int
__init__(start, end, strand, frame)
Parameters:
  • start (int)

  • end (int)

  • strand (Literal['+', '-'])

  • frame (int)

Return type:

None
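
Because start is inclusive and end is exclusive, the coordinates map directly onto Python slicing, and end - start gives the feature length. The sketch below shows this, together with one plausible treatment of the '-' strand. It assumes that coordinates always refer to the forward strand of the parent sequence and that a '-' feature is read as the reverse complement of that slice; the package may define this differently. Both function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def feature_sequence(parent: str, start: int, end: int, strand: str) -> str:
    """Slice a 0-based, half-open interval and orient it by strand."""
    segment = parent[start:end]
    # Assumption: '-' features are the reverse complement of the slice.
    return segment if strand == "+" else reverse_complement(segment)


parent = "AAACCGTTT"
forward = feature_sequence(parent, 3, 6, "+")
reverse = feature_sequence(parent, 3, 6, "-")
print(forward, reverse, 6 - 3)
```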

class genome_entropy.pipeline.FeatureDNA(nt_sequence, length)[source]

DNA-level information for a feature.

Parameters:
  • nt_sequence (str)

  • length (int)

nt_sequence

Nucleotide sequence

Type:

str

length

Length of nucleotide sequence

Type:

int

nt_sequence: str
length: int
__init__(nt_sequence, length)
Parameters:
  • nt_sequence (str)

  • length (int)

Return type:

None

class genome_entropy.pipeline.FeatureProtein(aa_sequence, length)[source]

Protein-level information for a feature.

Parameters:
  • aa_sequence (str)

  • length (int)

aa_sequence

Amino acid sequence

Type:

str

length

Length of amino acid sequence

Type:

int

aa_sequence: str
length: int
__init__(aa_sequence, length)
Parameters:
  • aa_sequence (str)

  • length (int)

Return type:

None

class genome_entropy.pipeline.FeatureThreeDi(encoding, length, method, model_name, inference_device)[source]

3Di structural encoding for a feature.

Parameters:
  • encoding (str)

  • length (int)

  • method (str)

  • model_name (str)

  • inference_device (str)

encoding

3Di token sequence

Type:

str

length

Length of 3Di sequence

Type:

int

method

Method used for encoding (e.g., “prostt5_aa2fold”)

Type:

str

model_name

Name of the model used

Type:

str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:

str

encoding: str
length: int
method: str
model_name: str
inference_device: str
__init__(encoding, length, method, model_name, inference_device)
Parameters:
  • encoding (str)

  • length (int)

  • method (str)

  • model_name (str)

  • inference_device (str)

Return type:

None

class genome_entropy.pipeline.FeatureMetadata(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)[source]

Metadata about a feature.

Parameters:
  • parent_id (str)

  • table_id (int)

  • has_start_codon (bool)

  • has_stop_codon (bool)

  • in_genbank (bool)

parent_id

ID of the parent DNA sequence

Type:

str

table_id

NCBI genetic code table ID used

Type:

int

has_start_codon

Whether the ORF has a start codon

Type:

bool

has_stop_codon

Whether the ORF has a stop codon

Type:

bool

in_genbank

Whether this ORF matches a CDS annotated in GenBank

Type:

bool

parent_id: str
table_id: int
has_start_codon: bool
has_stop_codon: bool
in_genbank: bool
__init__(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)
Parameters:
  • parent_id (str)

  • table_id (int)

  • has_start_codon (bool)

  • has_stop_codon (bool)

  • in_genbank (bool)

Return type:

None
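
The metadata flags lend themselves to simple filtering of a features dictionary. The sketch below selects ORFs that have both a start and a stop codon and that match a GenBank CDS. SimpleNamespace objects stand in for real feature objects, and the helper names and orf_id values are made up for the example.

```python
from types import SimpleNamespace


def annotated_complete_orfs(features):
    """Return orf_ids with start and stop codons that match a GenBank CDS."""
    return [
        orf_id
        for orf_id, feature in features.items()
        if feature.metadata.in_genbank
        and feature.metadata.has_start_codon
        and feature.metadata.has_stop_codon
    ]


def stub(has_start, has_stop, in_genbank):
    """Build a stand-in feature carrying only the metadata flags."""
    return SimpleNamespace(
        metadata=SimpleNamespace(
            has_start_codon=has_start,
            has_stop_codon=has_stop,
            in_genbank=in_genbank,
        )
    )


features = {
    "orf_a": stub(True, True, True),
    "orf_b": stub(True, False, True),   # no stop codon
    "orf_c": stub(True, True, False),   # not annotated in GenBank
}
selected = annotated_complete_orfs(features)
print(selected)
```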

class genome_entropy.pipeline.FeatureEntropy(dna_entropy, protein_entropy, three_di_entropy)[source]

Entropy values at different representation levels for a feature.

Parameters:
  • dna_entropy (float)

  • protein_entropy (float)

  • three_di_entropy (float)

dna_entropy

Shannon entropy of nucleotide sequence

Type:

float

protein_entropy

Shannon entropy of amino acid sequence

Type:

float

three_di_entropy

Shannon entropy of 3Di encoding

Type:

float

dna_entropy: float
protein_entropy: float
three_di_entropy: float
__init__(dna_entropy, protein_entropy, three_di_entropy)
Parameters:
  • dna_entropy (float)

  • protein_entropy (float)

  • three_di_entropy (float)

Return type:

None
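
Raw entropy values at the three levels are not directly comparable, because the maximum possible entropy depends on alphabet size. One common way to compare them is to divide by the logarithm of the alphabet size, as sketched below. This assumes the stored values are in bits; the alphabet sizes and dictionary keys shown are illustrative assumptions, not values read from the package, and normalized_entropy is a hypothetical helper.

```python
import math


def normalized_entropy(entropy_bits: float, alphabet_size: int) -> float:
    """Scale an entropy in bits to [0, 1] by the alphabet's maximum entropy."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    return entropy_bits / math.log2(alphabet_size)


# Assumed sizes: 4 nucleotides, 20 amino acids, 20 3Di states.
alphabet_sizes = {"dna": 4, "protein": 20, "three_di": 20}

# Illustrative entropies: each level at its theoretical maximum.
raw = {level: math.log2(size) for level, size in alphabet_sizes.items()}

scaled = {
    level: normalized_entropy(raw[level], alphabet_sizes[level])
    for level in raw
}
print(scaled)
```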

Modules

runner

End-to-end pipeline orchestration for DNA to 3Di with entropy calculation.

types

Unified data types for pipeline output format.