genome_entropy.pipeline.types
Unified data types for pipeline output format.
This module defines the unified feature structure that eliminates redundancy by consolidating ORF, protein, and 3Di data into a single hierarchical format.
The unified structure addresses the problem where:
- The old proteins list duplicated entire ORF objects
- The old three_dis list duplicated entire protein objects (which contained ORFs)
- Each level repeated sequences, coordinates, and metadata
The new structure stores each piece of biological information exactly once, organized hierarchically by biological concept.
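As a rough illustration of the redundancy described above, the following sketch builds one ORF in both shapes using plain dictionaries and counts how often its nucleotide sequence appears once serialized. The key names, the toy data, and the exact layout of the old format are assumptions made for illustration; only the nesting pattern (protein entries containing ORFs, 3Di entries containing proteins) is taken from the description.

```python
import json

NT = "ATGGCCTAA"  # toy ORF sequence (hypothetical data)

# Assumed shape of the old output: three parallel lists, each nesting the previous level.
orf = {"orf_id": "orf_1", "nt_sequence": NT, "start": 0, "end": 9, "strand": "+", "frame": 0}
protein = {"aa_sequence": "MA", "orf": orf}
three_di = {"encoding": "DV", "protein": protein}
old_format = {"orfs": [orf], "proteins": [protein], "three_dis": [three_di]}

# Unified shape: one entry per ORF, with each piece of information stored once.
unified_format = {
    "features": {
        "orf_1": {
            "location": {"start": 0, "end": 9, "strand": "+", "frame": 0},
            "dna": {"nt_sequence": NT, "length": 9},
            "protein": {"aa_sequence": "MA", "length": 2},
            "three_di": {"encoding": "DV", "length": 2},
        }
    }
}

# Count how many times the nucleotide sequence is written out in each serialization.
old_copies = json.dumps(old_format).count(NT)
new_copies = json.dumps(unified_format).count(NT)
print(old_copies, new_copies)
```

In this toy case the old shape writes the sequence once per list, while the unified shape writes it once in total.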
Classes
| FeatureDNA | DNA-level information for a feature. |
| FeatureEntropy | Entropy values at different representation levels for a feature. |
| FeatureLocation | Genomic location of a feature (ORF). |
| FeatureMetadata | Metadata about a feature. |
| FeatureProtein | Protein-level information for a feature. |
| FeatureThreeDi | 3Di structural encoding for a feature. |
| UnifiedFeature | Unified representation of a biological feature (ORF and derived data). |
| UnifiedPipelineResult | Result of running the complete DNA to 3Di pipeline (unified format). |
- class genome_entropy.pipeline.types.FeatureLocation(start, end, strand, frame)[source]
Genomic location of a feature (ORF).
- strand
Strand orientation ('+' or '-')
- Type:
Literal['+', '-']
- class genome_entropy.pipeline.types.FeatureDNA(nt_sequence, length)[source]
DNA-level information for a feature.
- class genome_entropy.pipeline.types.FeatureProtein(aa_sequence, length)[source]
Protein-level information for a feature.
- class genome_entropy.pipeline.types.FeatureThreeDi(encoding, length, method, model_name, inference_device)[source]
3Di structural encoding for a feature.
- class genome_entropy.pipeline.types.FeatureMetadata(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)[source]
Metadata about a feature.
- Parameters:
- class genome_entropy.pipeline.types.FeatureEntropy(dna_entropy, protein_entropy, three_di_entropy)[source]
Entropy values at different representation levels for a feature.
- class genome_entropy.pipeline.types.UnifiedFeature(orf_id, location, dna, protein, three_di, metadata, entropy)[source]
Unified representation of a biological feature (ORF and derived data).
This structure consolidates all information about a single ORF into one hierarchical object, eliminating the redundancy of the old format, where ORF data was duplicated in the proteins list and protein data was duplicated in the three_dis list.
- Parameters:
orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)
- location
Genomic coordinates
- dna
DNA sequence information
- protein
Protein sequence information
- three_di
3Di structural encoding
- metadata
Additional metadata
- entropy
Entropy values at all representation levels
- location: FeatureLocation
- dna: FeatureDNA
- protein: FeatureProtein
- three_di: FeatureThreeDi
- metadata: FeatureMetadata
- entropy: FeatureEntropy
- __init__(orf_id, location, dna, protein, three_di, metadata, entropy)
- Parameters:
orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)
- Return type:
None
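A minimal sketch of how the pieces above fit together, using stand-in dataclasses that mirror the documented constructor signatures rather than importing the real module. Whether the real classes are dataclasses is an assumption, as are all field types other than `orf_id` (str) and `strand` (Literal); the sequence, identifiers, method and model names, and entropy values are placeholders, not pipeline output.

```python
from dataclasses import asdict, dataclass
from typing import Literal

# Stand-ins mirroring the documented signatures; most field types are assumed.

@dataclass
class FeatureLocation:
    start: int
    end: int
    strand: Literal["+", "-"]
    frame: int

@dataclass
class FeatureDNA:
    nt_sequence: str
    length: int

@dataclass
class FeatureProtein:
    aa_sequence: str
    length: int

@dataclass
class FeatureThreeDi:
    encoding: str
    length: int
    method: str
    model_name: str
    inference_device: str

@dataclass
class FeatureMetadata:
    parent_id: str
    table_id: int
    has_start_codon: bool
    has_stop_codon: bool
    in_genbank: bool

@dataclass
class FeatureEntropy:
    dna_entropy: float
    protein_entropy: float
    three_di_entropy: float

@dataclass
class UnifiedFeature:
    orf_id: str
    location: FeatureLocation
    dna: FeatureDNA
    protein: FeatureProtein
    three_di: FeatureThreeDi
    metadata: FeatureMetadata
    entropy: FeatureEntropy

# Build one feature; every value below is a placeholder chosen for illustration.
feature = UnifiedFeature(
    orf_id="orf_1",
    location=FeatureLocation(start=0, end=9, strand="+", frame=0),
    dna=FeatureDNA(nt_sequence="ATGGCCTAA", length=9),
    protein=FeatureProtein(aa_sequence="MA", length=2),
    three_di=FeatureThreeDi(encoding="DV", length=2, method="example-method",
                            model_name="example-model", inference_device="cpu"),
    metadata=FeatureMetadata(parent_id="contig_1", table_id=11,
                             has_start_codon=True, has_stop_codon=True,
                             in_genbank=False),
    entropy=FeatureEntropy(dna_entropy=0.0, protein_entropy=0.0,
                           three_di_entropy=0.0),
)

# asdict() recursively converts the nested dataclasses into a plain dictionary.
nested = asdict(feature)
print(sorted(nested))
```

Each biological level sits under its own key, so a consumer reads `feature.dna.nt_sequence` or `feature.location.strand` without touching a second list.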
- class genome_entropy.pipeline.types.UnifiedPipelineResult(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)[source]
Result of running the complete DNA to 3Di pipeline (unified format).
This is the new format that eliminates redundancy by using a single dictionary of features keyed by orf_id, instead of separate parallel lists for orfs, proteins, and three_dis.
- Parameters:
- features
Dictionary mapping orf_id to UnifiedFeature objects
- Type:
Dict[str, UnifiedFeature]
- features: Dict[str, UnifiedFeature]
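The keyed-dictionary layout can be sketched as follows. This is a stand-in class that mirrors the documented constructor arguments, not the real implementation: apart from `features`, the field types are assumptions, plain dictionaries stand in for UnifiedFeature objects to keep the example short, and every value (including the `alphabet_sizes` keys) is a placeholder.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class UnifiedPipelineResult:
    """Stand-in mirroring the documented constructor arguments."""
    schema_version: str             # assumed type
    input_id: str                   # assumed type
    input_dna_length: int           # assumed type
    dna_entropy_global: float       # assumed type
    alphabet_sizes: Dict[str, int]  # assumed type
    features: Dict[str, Any]        # documented as Dict[str, UnifiedFeature]

result = UnifiedPipelineResult(
    schema_version="0.0.0",   # placeholder value
    input_id="contig_1",
    input_dna_length=30,
    dna_entropy_global=0.0,   # placeholder, not a computed entropy
    alphabet_sizes={"dna": 4, "protein": 20, "three_di": 20},
    features={
        # Plain dicts stand in for UnifiedFeature objects here.
        "orf_1": {"strand": "+", "aa_sequence": "MA"},
        "orf_2": {"strand": "-", "aa_sequence": "MK"},
    },
)

# Direct lookup by ORF identifier, with no index bookkeeping across parallel lists.
orf_2 = result.features["orf_2"]

# Iterating yields each identifier together with all data for that ORF.
minus_strand_ids = [
    orf_id for orf_id, feat in result.features.items() if feat["strand"] == "-"
]
print(orf_2["aa_sequence"], minus_strand_ids)
```

Keying by orf_id means the identifier itself links the DNA, protein, and 3Di data, so there is no need to keep three lists aligned by position.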