genome_entropy.pipeline.types

Unified data types for pipeline output format.

This module defines the unified feature structure that eliminates redundancy by consolidating ORF, protein, and 3Di data into a single hierarchical format.

The unified structure addresses the problem where:

- The old proteins list duplicated entire ORF objects
- The old three_dis list duplicated entire protein objects (which contained ORFs)
- Each level repeated sequences, coordinates, and metadata

The new structure stores each piece of biological information exactly once, organized hierarchically by biological concept.
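
To make the redundancy concrete, the sketch below contrasts the two layouts using plain dictionaries. The key names in the `old` layout (`orfs`, `proteins`, `three_dis`, `orf`, `protein`) are assumptions inferred from the description above, not the actual legacy schema. The `new` layout uses the field names documented on this page but is abbreviated, and the sequences are made-up placeholders.

```python
import json

# Illustrative only: legacy key names are assumed, sequences are placeholders.
nt = "ATGAAATAA"

# Old layout (assumed shape): each level embeds the full object below it.
orf = {"orf_id": "orf_1", "nt_sequence": nt, "start": 0, "end": 9}
protein = {"orf": orf, "aa_sequence": "MK"}
three_di = {"protein": protein, "encoding": "dv"}
old = {"orfs": [orf], "proteins": [protein], "three_dis": [three_di]}

# New layout (abbreviated): one entry per ORF, each fact stored once.
new = {
    "features": {
        "orf_1": {
            "location": {"start": 0, "end": 9, "strand": "+", "frame": 0},
            "dna": {"nt_sequence": nt, "length": 9},
            "protein": {"aa_sequence": "MK", "length": 2},
            "three_di": {"encoding": "dv", "length": 2},
        }
    }
}


def count_occurrences(obj, needle):
    """Count how often a string appears in the JSON serialization of obj."""
    return json.dumps(obj).count(needle)


print(count_occurrences(old, nt))  # nucleotide sequence repeated at every level
print(count_occurrences(new, nt))  # nucleotide sequence stored once
```

In this toy case the serialized old layout carries the nucleotide sequence three times (once per list), while the unified layout carries it once.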

Classes

`FeatureDNA`(nt_sequence, length)	DNA-level information for a feature.
`FeatureEntropy`(dna_entropy, protein_entropy, ...)	Entropy values at different representation levels for a feature.
`FeatureLocation`(start, end, strand, frame)	Genomic location of a feature (ORF).
`FeatureMetadata`(parent_id, table_id, ...)	Metadata about a feature.
`FeatureProtein`(aa_sequence, length)	Protein-level information for a feature.
`FeatureThreeDi`(encoding, length, method, ...)	3Di structural encoding for a feature.
`UnifiedFeature`(orf_id, location, dna, ...)	Unified representation of a biological feature (ORF and derived data).
`UnifiedPipelineResult`(schema_version, ...)	Result of running the complete DNA to 3Di pipeline (unified format).

class genome_entropy.pipeline.types.FeatureLocation(start, end, strand, frame)

Genomic location of a feature (ORF).

Parameters:

start (int)
end (int)
strand (Literal['+', '-'])
frame (int)

start

0-based start position (inclusive)

Type:: int

end

0-based end position (exclusive)

Type:: int

strand

Strand orientation (‘+’ or ‘-’)

Type:: Literal[‘+’, ‘-’]

frame

Reading frame (0, 1, 2, or 3)

Type:: int

start: int

end: int

strand: Literal['+', '-']

frame: int

__init__(start, end, strand, frame)

Parameters:

start (int)
end (int)
strand (Literal['+', '-'])
frame (int)

Return type:

None
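
Because `start` is inclusive and `end` is exclusive, the coordinates behave like a Python slice. The sketch below uses a stand-in dataclass that mirrors the documented fields (it is not the library class itself), and the helper names `feature_length` and `slice_parent` are hypothetical. It covers only the ‘+’ strand, since this page does not state how ‘-’ strand coordinates relate to the stored sequence.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class FeatureLocation:
    """Stand-in mirroring the documented fields; not the library class."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    strand: Literal["+", "-"]
    frame: int


def feature_length(loc: FeatureLocation) -> int:
    """Half-open coordinates: the length is simply end - start."""
    return loc.end - loc.start


def slice_parent(parent_seq: str, loc: FeatureLocation) -> str:
    """Extract the forward-strand span; the coordinates map directly to a slice."""
    return parent_seq[loc.start:loc.end]


parent = "CCATGAAATAAGG"  # made-up parent sequence
# The frame value here is chosen for illustration only.
loc = FeatureLocation(start=2, end=11, strand="+", frame=2)

print(feature_length(loc))        # 9
print(slice_parent(parent, loc))  # ATGAAATAA
```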

class genome_entropy.pipeline.types.FeatureDNA(nt_sequence, length)

DNA-level information for a feature.

Parameters:

nt_sequence (str)
length (int)

nt_sequence

Nucleotide sequence

Type:: str

length

Length of nucleotide sequence

Type:: int

nt_sequence: str

length: int

__init__(nt_sequence, length)

Parameters:

nt_sequence (str)
length (int)

Return type:

None

class genome_entropy.pipeline.types.FeatureProtein(aa_sequence, length)

Protein-level information for a feature.

Parameters:

aa_sequence (str)
length (int)

aa_sequence

Amino acid sequence

Type:: str

length

Length of amino acid sequence

Type:: int

aa_sequence: str

length: int

__init__(aa_sequence, length)

Parameters:

aa_sequence (str)
length (int)

Return type:

None

class genome_entropy.pipeline.types.FeatureThreeDi(encoding, length, method, model_name, inference_device)

3Di structural encoding for a feature.

Parameters:

encoding (str)
length (int)
method (str)
model_name (str)
inference_device (str)

encoding

3Di token sequence

Type:: str

length

Length of 3Di sequence

Type:: int

method

Method used for encoding (e.g., “prostt5_aa2fold”)

Type:: str

model_name

Name of the model used

Type:: str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type:: str

encoding: str

length: int

method: str

model_name: str

inference_device: str

__init__(encoding, length, method, model_name, inference_device)

Parameters:

encoding (str)
length (int)
method (str)
model_name (str)
inference_device (str)

Return type:

None

class genome_entropy.pipeline.types.FeatureMetadata(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)

Metadata about a feature.

Parameters:

parent_id (str)
table_id (int)
has_start_codon (bool)
has_stop_codon (bool)
in_genbank (bool)

parent_id

ID of the parent DNA sequence

Type:: str

table_id

NCBI genetic code table ID used

Type:: int

has_start_codon

Whether the ORF has a start codon

Type:: bool

has_stop_codon

Whether the ORF has a stop codon

Type:: bool

in_genbank

Whether this ORF matches a CDS annotated in GenBank

Type:: bool

parent_id: str

table_id: int

has_start_codon: bool

has_stop_codon: bool

in_genbank: bool

__init__(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)

Parameters:

parent_id (str)
table_id (int)
has_start_codon (bool)
has_stop_codon (bool)
in_genbank (bool)

Return type:

None

class genome_entropy.pipeline.types.FeatureEntropy(dna_entropy, protein_entropy, three_di_entropy)

Entropy values at different representation levels for a feature.

Parameters:

dna_entropy (float)
protein_entropy (float)
three_di_entropy (float)

dna_entropy

Shannon entropy of nucleotide sequence

Type:: float

protein_entropy

Shannon entropy of amino acid sequence

Type:: float

three_di_entropy

Shannon entropy of 3Di encoding

Type:: float

dna_entropy: float

protein_entropy: float

three_di_entropy: float

__init__(dna_entropy, protein_entropy, three_di_entropy)

Parameters:

dna_entropy (float)
protein_entropy (float)
three_di_entropy (float)

Return type:

None
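
For reference, Shannon entropy over a sequence can be computed from per-symbol frequencies. The sketch below is a generic implementation, not the function the pipeline uses. It assumes a base-2 logarithm (bits per symbol) and no normalization, neither of which is specified on this page, and the name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy in bits per symbol (base 2 is an assumption)."""
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    total = len(sequence)
    # H = sum over symbols of p * log2(1 / p), with p = count / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("ACGT"))  # 2.0, four equally frequent symbols
print(shannon_entropy("AAAA"))  # 0.0, a single repeated symbol
```

The same function applies unchanged to nucleotide, amino acid, and 3Di strings, because it depends only on symbol frequencies.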

class genome_entropy.pipeline.types.UnifiedFeature(orf_id, location, dna, protein, three_di, metadata, entropy)

Unified representation of a biological feature (ORF and derived data).

This structure consolidates all information about a single ORF into one hierarchical object, eliminating the redundancy of the old format, in which ORF data was duplicated in the proteins list and protein data was duplicated in the three_dis list.

Parameters:

orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)

orf_id

Unique identifier for this feature

Type:: str

location

Genomic coordinates

Type:: genome_entropy.pipeline.types.FeatureLocation

dna

DNA sequence information

Type:: genome_entropy.pipeline.types.FeatureDNA

protein

Protein sequence information

Type:: genome_entropy.pipeline.types.FeatureProtein

three_di

3Di structural encoding

Type:: genome_entropy.pipeline.types.FeatureThreeDi

metadata

Additional metadata

Type:: genome_entropy.pipeline.types.FeatureMetadata

entropy

Entropy values at all representation levels

Type:: genome_entropy.pipeline.types.FeatureEntropy

orf_id: str

location: FeatureLocation

dna: FeatureDNA

protein: FeatureProtein

three_di: FeatureThreeDi

metadata: FeatureMetadata

entropy: FeatureEntropy

__init__(orf_id, location, dna, protein, three_di, metadata, entropy)

Parameters:

orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)

Return type:

None
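
The sketch below assembles one feature from its parts to show how the hierarchy fits together. The classes are stand-in dataclasses that mirror the documented fields; whether the real classes are dataclasses is an assumption based on their generated `__init__` signatures. All values (sequences, `model_name`, IDs, entropy numbers) are placeholders, not real pipeline output.

```python
from dataclasses import asdict, dataclass
from typing import Literal


# Stand-ins mirroring the documented fields; not the library classes.
@dataclass
class FeatureLocation:
    start: int
    end: int
    strand: Literal["+", "-"]
    frame: int


@dataclass
class FeatureDNA:
    nt_sequence: str
    length: int


@dataclass
class FeatureProtein:
    aa_sequence: str
    length: int


@dataclass
class FeatureThreeDi:
    encoding: str
    length: int
    method: str
    model_name: str
    inference_device: str


@dataclass
class FeatureMetadata:
    parent_id: str
    table_id: int
    has_start_codon: bool
    has_stop_codon: bool
    in_genbank: bool


@dataclass
class FeatureEntropy:
    dna_entropy: float
    protein_entropy: float
    three_di_entropy: float


@dataclass
class UnifiedFeature:
    orf_id: str
    location: FeatureLocation
    dna: FeatureDNA
    protein: FeatureProtein
    three_di: FeatureThreeDi
    metadata: FeatureMetadata
    entropy: FeatureEntropy


nt = "ATGAAATAA"  # placeholder sequence
feature = UnifiedFeature(
    orf_id="orf_1",
    location=FeatureLocation(start=0, end=9, strand="+", frame=0),
    dna=FeatureDNA(nt_sequence=nt, length=len(nt)),
    protein=FeatureProtein(aa_sequence="MK", length=2),
    three_di=FeatureThreeDi(
        encoding="dv",
        length=2,
        method="prostt5_aa2fold",
        model_name="example-model",  # hypothetical name
        inference_device="cpu",
    ),
    metadata=FeatureMetadata(
        parent_id="seq_1",
        table_id=11,
        has_start_codon=True,
        has_stop_codon=True,
        in_genbank=False,
    ),
    # Placeholder values, not computed from the sequences above.
    entropy=FeatureEntropy(dna_entropy=0.0, protein_entropy=0.0, three_di_entropy=0.0),
)

# asdict() recurses into nested dataclasses, giving a JSON-ready dict.
nested = asdict(feature)
print(sorted(nested))
```

Each sequence appears in exactly one place in `nested`: the nucleotide sequence under `dna`, the amino acid sequence under `protein`, and the 3Di tokens under `three_di`.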

class genome_entropy.pipeline.types.UnifiedPipelineResult(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)

Result of running the complete DNA to 3Di pipeline (unified format).

This is the new format that eliminates redundancy by using a single dictionary of features keyed by orf_id, instead of separate parallel lists for orfs, proteins, and three_dis.

Parameters:

schema_version (str)
input_id (str)
input_dna_length (int)
dna_entropy_global (float)
alphabet_sizes (Dict[str, int])
features (Dict[str, UnifiedFeature])

schema_version

Version of the output schema (for compatibility tracking)

Type:: str

input_id

ID of the input DNA sequence

Type:: str

input_dna_length

Length of the input DNA sequence

Type:: int

dna_entropy_global

Entropy of the entire input DNA sequence

Type:: float

alphabet_sizes

Dictionary with alphabet sizes for each representation

Type:: Dict[str, int]

features

Dictionary mapping orf_id to UnifiedFeature objects

Type:: Dict[str, genome_entropy.pipeline.types.UnifiedFeature]

schema_version: str

input_id: str

input_dna_length: int

dna_entropy_global: float

alphabet_sizes: Dict[str, int]

features: Dict[str, UnifiedFeature]

__init__(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)

Parameters:

schema_version (str)
input_id (str)
input_dna_length (int)
dna_entropy_global (float)
alphabet_sizes (Dict[str, int])
features (Dict[str, UnifiedFeature])

Return type:

None
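
Keying `features` by `orf_id` makes lookups and filters direct, with no need to align parallel lists by index. The sketch below works on a plain nested dictionary, assuming the result has been converted to that form (for example after JSON serialization); whether the library provides such a conversion helper is not stated on this page. The `schema_version` value, the keys inside `alphabet_sizes`, and all numbers are placeholders, and each feature is abbreviated to the fields the example needs.

```python
import json

# Placeholder data: values and alphabet_sizes key names are assumptions.
result = {
    "schema_version": "2.0.0",
    "input_id": "seq_1",
    "input_dna_length": 30,
    "dna_entropy_global": 1.9,
    "alphabet_sizes": {"dna": 4, "protein": 20, "three_di": 20},
    "features": {
        # Features abbreviated: only location and metadata are shown.
        "orf_1": {
            "location": {"start": 0, "end": 9, "strand": "+", "frame": 0},
            "metadata": {"parent_id": "seq_1", "in_genbank": True},
        },
        "orf_2": {
            "location": {"start": 12, "end": 27, "strand": "-", "frame": 0},
            "metadata": {"parent_id": "seq_1", "in_genbank": False},
        },
    },
}

# Direct lookup by ORF identifier.
feature = result["features"]["orf_1"]

# Filter: IDs of ORFs that match a CDS annotated in GenBank.
annotated = [
    orf_id
    for orf_id, f in result["features"].items()
    if f["metadata"]["in_genbank"]
]

# Filter: IDs of ORFs on the reverse strand.
minus_strand = [
    orf_id
    for orf_id, f in result["features"].items()
    if f["location"]["strand"] == "-"
]

# The nested dictionary survives a JSON round trip unchanged.
round_tripped = json.loads(json.dumps(result))

print(annotated)
print(minus_strand)
```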