genome_entropy.pipeline.types

Unified data types for pipeline output format.

This module defines the unified feature structure that eliminates redundancy by consolidating ORF, protein, and 3Di data into a single hierarchical format.

The unified structure addresses the redundancy of the old format, where:
  • The old proteins list duplicated entire ORF objects

  • The old three_dis list duplicated entire protein objects (which contained ORFs)

  • Each level repeated sequences, coordinates, and metadata

The new structure stores each piece of biological information exactly once, organized hierarchically by biological concept.
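The difference between the two layouts can be sketched with plain dictionaries. This is only an illustration: the key names of the old format and every value below are assumptions made for the example, not output of the pipeline.

```python
import json

# Hypothetical ORF record; field names and values are illustrative only.
orf = {"orf_id": "orf_1", "start": 0, "end": 9, "strand": "+",
       "nt_sequence": "ATGAAATAA"}

# Old-style layout: each level embeds a full copy of the level below it.
protein = {"orf": orf, "aa_sequence": "MK"}
three_di = {"protein": protein, "encoding": "dv"}
old_result = {"orfs": [orf], "proteins": [protein], "three_dis": [three_di]}

# Unified-style layout: one entry per ORF, each fact stored once.
new_result = {
    "features": {
        "orf_1": {
            "location": {"start": 0, "end": 9, "strand": "+"},
            "dna": {"nt_sequence": "ATGAAATAA"},
            "protein": {"aa_sequence": "MK"},
            "three_di": {"encoding": "dv"},
        }
    }
}

# Count how often the nucleotide sequence appears once serialized.
old_copies = json.dumps(old_result).count("ATGAAATAA")
new_copies = json.dumps(new_result).count("ATGAAATAA")
print(old_copies, new_copies)
```

In the nested layout the nucleotide sequence is serialized once per list that embeds the ORF, while the unified layout carries it a single time.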

Classes

FeatureDNA(nt_sequence, length)

DNA-level information for a feature.

FeatureEntropy(dna_entropy, protein_entropy, ...)

Entropy values at different representation levels for a feature.

FeatureLocation(start, end, strand, frame)

Genomic location of a feature (ORF).

FeatureMetadata(parent_id, table_id, ...)

Metadata about a feature.

FeatureProtein(aa_sequence, length)

Protein-level information for a feature.

FeatureThreeDi(encoding, length, method, ...)

3Di structural encoding for a feature.

UnifiedFeature(orf_id, location, dna, ...)

Unified representation of a biological feature (ORF and derived data).

UnifiedPipelineResult(schema_version, ...)

Result of running the complete DNA to 3Di pipeline (unified format).

class genome_entropy.pipeline.types.FeatureLocation(start, end, strand, frame)

Genomic location of a feature (ORF).

Parameters:
  • start (int)

  • end (int)

  • strand (Literal['+', '-'])

  • frame (int)

start

0-based start position (inclusive)

Type:

int

end

0-based end position (exclusive)

Type:

int

strand

Strand orientation ('+' or '-')

Type:

Literal['+', '-']

frame

Reading frame (0, 1, 2, or 3)

Type:

int

start: int
end: int
strand: Literal['+', '-']
frame: int
__init__(start, end, strand, frame)
Parameters:
  • start (int)

  • end (int)

  • strand (Literal['+', '-'])

  • frame (int)

Return type:

None
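Because start is inclusive and end is exclusive, the coordinates behave like a Python slice. The sketch below uses a local stand-in dataclass that mirrors the documented fields rather than importing the real class; the helper names span_length and slice_forward are hypothetical, and the frame value is illustrative, since how frame is derived is not stated here. Only the forward strand is shown, because the handling of '-' strand coordinates is not specified in this reference.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class FeatureLocation:
    """Local stand-in mirroring the documented fields (not the real class)."""
    start: int                 # 0-based, inclusive
    end: int                   # 0-based, exclusive
    strand: Literal["+", "-"]
    frame: int


def span_length(loc: FeatureLocation) -> int:
    """Number of bases covered by a half-open [start, end) interval."""
    return loc.end - loc.start


def slice_forward(parent: str, loc: FeatureLocation) -> str:
    """Extract the forward-strand bases covered by the location."""
    return parent[loc.start:loc.end]


parent = "CCATGAAATAAGG"  # toy parent sequence
loc = FeatureLocation(start=2, end=11, strand="+", frame=2)

print(span_length(loc))            # number of bases in the span
print(slice_forward(parent, loc))  # the bases themselves
```

With half-open coordinates, end - start is the span length and no off-by-one adjustment is needed when slicing.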

class genome_entropy.pipeline.types.FeatureDNA(nt_sequence, length)

DNA-level information for a feature.

Parameters:
  • nt_sequence (str)

  • length (int)

nt_sequence

Nucleotide sequence

Type:

str

length

Length of nucleotide sequence

Type:

int

nt_sequence: str
length: int
__init__(nt_sequence, length)
Parameters:
  • nt_sequence (str)

  • length (int)

Return type:

None

class genome_entropy.pipeline.types.FeatureProtein(aa_sequence, length)

Protein-level information for a feature.

Parameters:
  • aa_sequence (str)

  • length (int)

aa_sequence

Amino acid sequence

Type:

str

length

Length of amino acid sequence

Type:

int

aa_sequence: str
length: int
__init__(aa_sequence, length)
Parameters:
  • aa_sequence (str)

  • length (int)

Return type:

None

class genome_entropy.pipeline.types.FeatureThreeDi(encoding, length, method, model_name, inference_device)

3Di structural encoding for a feature.

Parameters:
  • encoding (str)

  • length (int)

  • method (str)

  • model_name (str)

  • inference_device (str)

encoding

3Di token sequence

Type:

str

length

Length of 3Di sequence

Type:

int

method

Method used for encoding (e.g., “prostt5_aa2fold”)

Type:

str

model_name

Name of the model used

Type:

str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:

str

encoding: str
length: int
method: str
model_name: str
inference_device: str
__init__(encoding, length, method, model_name, inference_device)
Parameters:
  • encoding (str)

  • length (int)

  • method (str)

  • model_name (str)

  • inference_device (str)

Return type:

None

class genome_entropy.pipeline.types.FeatureMetadata(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)

Metadata about a feature.

Parameters:
  • parent_id (str)

  • table_id (int)

  • has_start_codon (bool)

  • has_stop_codon (bool)

  • in_genbank (bool)

parent_id

ID of the parent DNA sequence

Type:

str

table_id

NCBI genetic code table ID used

Type:

int

has_start_codon

Whether the ORF has a start codon

Type:

bool

has_stop_codon

Whether the ORF has a stop codon

Type:

bool

in_genbank

Whether this ORF matches a CDS annotated in GenBank

Type:

bool

parent_id: str
table_id: int
has_start_codon: bool
has_stop_codon: bool
in_genbank: bool
__init__(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)
Parameters:
  • parent_id (str)

  • table_id (int)

  • has_start_codon (bool)

  • has_stop_codon (bool)

  • in_genbank (bool)

Return type:

None

class genome_entropy.pipeline.types.FeatureEntropy(dna_entropy, protein_entropy, three_di_entropy)

Entropy values at different representation levels for a feature.

Parameters:
  • dna_entropy (float)

  • protein_entropy (float)

  • three_di_entropy (float)

dna_entropy

Shannon entropy of nucleotide sequence

Type:

float

protein_entropy

Shannon entropy of amino acid sequence

Type:

float

three_di_entropy

Shannon entropy of 3Di encoding

Type:

float

dna_entropy: float
protein_entropy: float
three_di_entropy: float
__init__(dna_entropy, protein_entropy, three_di_entropy)
Parameters:
  • dna_entropy (float)

  • protein_entropy (float)

  • three_di_entropy (float)

Return type:

None
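For orientation, Shannon entropy over the symbols of a sequence can be sketched as follows. This is a minimal illustration, not the pipeline's own implementation: the function name shannon_entropy is hypothetical, and the choice of log base 2 (bits per symbol) with no normalization is an assumption, since this reference does not state either.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy in bits per symbol (base 2 is an assumption here)."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    # H = sum over symbols of p * log2(1 / p), with p = count / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("ACGT"))  # four equally frequent symbols
print(shannon_entropy("AAAA"))  # a single repeated symbol
```

The same function applies unchanged to a nucleotide string, an amino acid string, or a 3Di token string, because it only looks at symbol frequencies.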

class genome_entropy.pipeline.types.UnifiedFeature(orf_id, location, dna, protein, three_di, metadata, entropy)

Unified representation of a biological feature (ORF and derived data).

This structure consolidates all information about a single ORF into one hierarchical object. It eliminates the redundancy of the old format, where ORF data was duplicated in the proteins list and protein data was duplicated in the three_dis list.

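Parameters:
  • orf_id (str)

  • location (FeatureLocation)

  • dna (FeatureDNA)

  • protein (FeatureProtein)

  • three_di (FeatureThreeDi)

  • metadata (FeatureMetadata)

  • entropy (FeatureEntropy)

orf_id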

Unique identifier for this feature

Type:

str

location

Genomic coordinates

Type:

genome_entropy.pipeline.types.FeatureLocation

dna

DNA sequence information

Type:

genome_entropy.pipeline.types.FeatureDNA

protein

Protein sequence information

Type:

genome_entropy.pipeline.types.FeatureProtein

three_di

3Di structural encoding

Type:

genome_entropy.pipeline.types.FeatureThreeDi

metadata

Additional metadata

Type:

genome_entropy.pipeline.types.FeatureMetadata

entropy

Entropy values at all representation levels

Type:

genome_entropy.pipeline.types.FeatureEntropy

orf_id: str
location: FeatureLocation
dna: FeatureDNA
protein: FeatureProtein
three_di: FeatureThreeDi
metadata: FeatureMetadata
entropy: FeatureEntropy
__init__(orf_id, location, dna, protein, three_di, metadata, entropy)
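Parameters:
  • orf_id (str)

  • location (FeatureLocation)

  • dna (FeatureDNA)

  • protein (FeatureProtein)

  • three_di (FeatureThreeDi)

  • metadata (FeatureMetadata)

  • entropy (FeatureEntropy)

Return type: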

None
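The composition described above can be sketched with local stand-in dataclasses that mirror the documented fields. The real classes live in genome_entropy.pipeline.types and are assumed here to behave like plain dataclasses; all values, including the model name and the entropy numbers, are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Literal


# Local stand-ins mirroring the documented fields (not the real classes).
@dataclass
class FeatureLocation:
    start: int
    end: int
    strand: Literal["+", "-"]
    frame: int


@dataclass
class FeatureDNA:
    nt_sequence: str
    length: int


@dataclass
class FeatureProtein:
    aa_sequence: str
    length: int


@dataclass
class FeatureThreeDi:
    encoding: str
    length: int
    method: str
    model_name: str
    inference_device: str


@dataclass
class FeatureMetadata:
    parent_id: str
    table_id: int
    has_start_codon: bool
    has_stop_codon: bool
    in_genbank: bool


@dataclass
class FeatureEntropy:
    dna_entropy: float
    protein_entropy: float
    three_di_entropy: float


@dataclass
class UnifiedFeature:
    orf_id: str
    location: FeatureLocation
    dna: FeatureDNA
    protein: FeatureProtein
    three_di: FeatureThreeDi
    metadata: FeatureMetadata
    entropy: FeatureEntropy


nt, aa, tokens = "ATGAAATAA", "MK", "dv"  # toy sequences, not real output

feature = UnifiedFeature(
    orf_id="orf_1",
    location=FeatureLocation(start=0, end=9, strand="+", frame=0),
    dna=FeatureDNA(nt_sequence=nt, length=len(nt)),
    protein=FeatureProtein(aa_sequence=aa, length=len(aa)),
    three_di=FeatureThreeDi(encoding=tokens, length=len(tokens),
                            method="prostt5_aa2fold",
                            model_name="example-model",  # placeholder name
                            inference_device="cpu"),
    metadata=FeatureMetadata(parent_id="contig_1", table_id=11,
                             has_start_codon=True, has_stop_codon=True,
                             in_genbank=False),
    entropy=FeatureEntropy(0.0, 0.0, 0.0),  # placeholder values
)

# Each fact is reachable through exactly one attribute path.
print(feature.location.strand, feature.dna.length, feature.protein.aa_sequence)
```

Every level of the ORF (coordinates, DNA, protein, 3Di, metadata, entropy) hangs off a single object, so no sequence or coordinate has to be copied between records.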

class genome_entropy.pipeline.types.UnifiedPipelineResult(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)

Result of running the complete DNA to 3Di pipeline (unified format).

This is the new format that eliminates redundancy by using a single dictionary of features keyed by orf_id, instead of separate parallel lists for orfs, proteins, and three_dis.

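Parameters:
  • schema_version (str)

  • input_id (str)

  • input_dna_length (int)

  • dna_entropy_global (float)

  • alphabet_sizes (Dict[str, int])

  • features (Dict[str, UnifiedFeature])

schema_version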

Version of the output schema (for compatibility tracking)

Type:

str

input_id

ID of the input DNA sequence

Type:

str

input_dna_length

Length of the input DNA sequence

Type:

int

dna_entropy_global

Entropy of the entire input DNA sequence

Type:

float

alphabet_sizes

Dictionary with alphabet sizes for each representation

Type:

Dict[str, int]

features

Dictionary mapping orf_id to UnifiedFeature objects

Type:

Dict[str, genome_entropy.pipeline.types.UnifiedFeature]

schema_version: str
input_id: str
input_dna_length: int
dna_entropy_global: float
alphabet_sizes: Dict[str, int]
features: Dict[str, UnifiedFeature]
__init__(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)
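Parameters:
  • schema_version (str)

  • input_id (str)

  • input_dna_length (int)

  • dna_entropy_global (float)

  • alphabet_sizes (Dict[str, int])

  • features (Dict[str, UnifiedFeature])

Return type: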

None
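Keying features by orf_id makes lookups and filters direct. The sketch below uses local stand-ins, with UnifiedFeature trimmed to the two fields the example needs (the documented class has more). The helper make_feature, the schema version string, the alphabet_sizes key names, and the use of dataclasses.asdict for serialization are all assumptions made for illustration; the library may serialize results differently.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class FeatureMetadata:  # stand-in with the documented fields
    parent_id: str
    table_id: int
    has_start_codon: bool
    has_stop_codon: bool
    in_genbank: bool


@dataclass
class UnifiedFeature:  # trimmed stand-in: the documented class has more fields
    orf_id: str
    metadata: FeatureMetadata


@dataclass
class UnifiedPipelineResult:  # stand-in with the documented fields
    schema_version: str
    input_id: str
    input_dna_length: int
    dna_entropy_global: float
    alphabet_sizes: Dict[str, int]
    features: Dict[str, UnifiedFeature]


def make_feature(orf_id: str, in_genbank: bool) -> UnifiedFeature:
    """Hypothetical helper that builds a toy feature for this example."""
    meta = FeatureMetadata("contig_1", 11, True, True, in_genbank)
    return UnifiedFeature(orf_id, meta)


toy_features = [make_feature("orf_1", True), make_feature("orf_2", False)]

result = UnifiedPipelineResult(
    schema_version="2.0.0",          # made-up version string
    input_id="contig_1",
    input_dna_length=1200,           # illustrative value
    dna_entropy_global=1.98,         # illustrative value
    alphabet_sizes={"dna": 4, "protein": 20, "three_di": 20},  # keys assumed
    features={f.orf_id: f for f in toy_features},
)

# Direct lookup by orf_id instead of scanning parallel lists.
feature = result.features["orf_2"]

# Filter on metadata, e.g. ORFs that match a GenBank CDS annotation.
annotated_ids = sorted(
    fid for fid, f in result.features.items() if f.metadata.in_genbank
)

# asdict() recurses through nested dataclasses and dict values.
as_json = json.dumps(asdict(result), indent=2)
round_trip = json.loads(as_json)
print(annotated_ids)
```

Because the dictionary key and the feature's orf_id carry the same identifier, a consumer can go from an ID found elsewhere (a log line, a table row) straight to the full feature record.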