genome_entropy.pipeline
Pipeline orchestration.
- class genome_entropy.pipeline.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)[source]
Result of running the complete DNA to 3Di pipeline.
- Parameters:
input_id (str)
input_dna_length (int)
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)
- orfs
List of ORFs found in the sequence
- proteins
List of translated proteins
- three_dis
List of 3Di encoded structures
- Type:
List[ThreeDiRecord]
- entropy
Entropy report for all representations
- proteins: List[ProteinRecord]
- three_dis: List[ThreeDiRecord]
- entropy: EntropyReport
- __init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)
- Parameters:
input_id (str)
input_dna_length (int)
proteins (List[ProteinRecord])
three_dis (List[ThreeDiRecord])
entropy (EntropyReport)
- Return type:
None
- genome_entropy.pipeline.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)[source]
Run the complete DNA to 3Di pipeline with entropy calculation.
Pipeline steps:
1. Read FASTA file or GenBank file
2. Find ORFs in all 6 reading frames
3. Translate ORFs to proteins
4. Encode proteins to 3Di structural tokens
5. Calculate entropy at all levels
6. Optionally match ORFs to GenBank CDS annotations
7. Optionally write results to JSON
- Parameters:
input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.
table_id (int) – NCBI genetic code table ID
min_aa_len (int) – Minimum protein length in amino acids
model_name (str) – ProstT5 model name
compute_entropy (bool) – Whether to compute entropy values
output_json (str | Path | None) – Optional path to save results as JSON
device (str | None) – Device for 3Di encoding ("cuda", "mps", "cpu", or None for auto-detection). Ignored if use_multi_gpu is True.
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.
encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.
- Returns:
List of PipelineResult objects (one per input sequence)
- Raises:
PipelineError – If any pipeline step fails
ValueError – If neither input_fasta nor genbank_file is provided
- Return type:
List[PipelineResult]
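Steps 2 and 3 of run_pipeline (six-frame ORF search and translation) can be illustrated with a minimal standalone sketch. This is not the library's implementation: the helper names (reverse_complement, translate, six_frame_orfs) are hypothetical, and it assumes an ORF is any stop-free stretch of at least min_aa_len residues, which may differ from how genome_entropy defines ORFs. The codon-to-amino-acid mapping is the standard one, which NCBI table 11 shares.

```python
from typing import Dict, List, Tuple

# Codon table in the usual NCBI layout (bases ordered T, C, A, G).
# Table 11 uses the same codon-to-amino-acid mapping as the standard code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna: str, min_aa_len: int = 30) -> List[Tuple[str, int, str]]:
    """Return (strand, frame, protein) for stop-free stretches >= min_aa_len.

    Assumption: an ORF is any run of residues between stop codons.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            # Splitting on '*' separates the stretches between stop codons.
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_aa_len:
                    results.append((strand, frame, segment))
    return results
```

Calling six_frame_orfs(sequence, min_aa_len=30) mirrors the default min_aa_len of run_pipeline; the real pipeline additionally records coordinates and start/stop codon flags for each ORF.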
- genome_entropy.pipeline.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)[source]
Calculate entropy at all representation levels.
- Parameters:
dna_sequence (str) – Original DNA sequence
proteins (List[ProteinRecord]) – List of protein records
three_dis (List[ThreeDiRecord]) – List of 3Di records
- Returns:
EntropyReport with entropy values
- Return type:
EntropyReport
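The documentation above does not state which entropy measure is computed. As a minimal sketch, assuming Shannon entropy in bits per symbol, the same calculation can be applied to a DNA string, a protein string, or a 3Di string. The function name shannon_entropy is hypothetical and is not part of genome_entropy.

```python
from collections import Counter
import math


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy in bits per symbol; 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    # H = sum over symbols of p * log2(1 / p), with p = count / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# The same function works at every representation level.
dna_entropy = shannon_entropy("ACGT")      # four equally frequent symbols
protein_entropy = shannon_entropy("MKKL")  # illustrative protein fragment
```

Under this assumption, a sequence that uses four symbols equally often has an entropy of 2 bits per symbol, and a sequence made of a single repeated symbol has an entropy of 0.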
- class genome_entropy.pipeline.UnifiedPipelineResult(schema_version, input_id, input_dna_length, dna_entropy_global, alphabet_sizes, features)[source]
Result of running the complete DNA to 3Di pipeline (unified format).
This is the new format that eliminates redundancy by using a single dictionary of features keyed by orf_id, instead of separate parallel lists for orfs, proteins, and three_dis.
- features
Dictionary mapping orf_id to UnifiedFeature objects
- Type:
Dict[str, UnifiedFeature]
- features: Dict[str, UnifiedFeature]
- class genome_entropy.pipeline.UnifiedFeature(orf_id, location, dna, protein, three_di, metadata, entropy)[source]
Unified representation of a biological feature (ORF and derived data).
This structure consolidates all information about a single ORF into one hierarchical object. It removes the redundancy of the old format, in which ORF data was duplicated in the proteins list and protein data was duplicated in the three_dis list.
- Parameters:
orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)
- location
Genomic coordinates
- dna
DNA sequence information
- protein
Protein sequence information
- three_di
3Di structural encoding
- metadata
Additional metadata
- entropy
Entropy values at all representation levels
- location: FeatureLocation
- dna: FeatureDNA
- protein: FeatureProtein
- three_di: FeatureThreeDi
- metadata: FeatureMetadata
- entropy: FeatureEntropy
- __init__(orf_id, location, dna, protein, three_di, metadata, entropy)
- Parameters:
orf_id (str)
location (FeatureLocation)
dna (FeatureDNA)
protein (FeatureProtein)
three_di (FeatureThreeDi)
metadata (FeatureMetadata)
entropy (FeatureEntropy)
- Return type:
None
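The difference between the old parallel-list format and the unified format can be sketched with simplified stand-in classes. Everything here is hypothetical and much smaller than the real types: OldOrf, OldProtein, OldThreeDi, FlatFeature, and unify are illustration names only, the field types are assumptions, and the 3Di string is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OldOrf:
    """Hypothetical stand-in for an ORF record."""
    orf_id: str
    nt_sequence: str


@dataclass
class OldProtein:
    """Old format: each protein record repeats the ORF it came from."""
    orf: OldOrf
    aa_sequence: str


@dataclass
class OldThreeDi:
    """Old format: each 3Di record repeats the protein (and so the ORF)."""
    protein: OldProtein
    encoding: str


@dataclass
class FlatFeature:
    """Simplified stand-in for a unified feature: one entry per ORF."""
    orf_id: str
    nt_sequence: str
    aa_sequence: str
    encoding: str


def unify(three_dis: List[OldThreeDi]) -> Dict[str, FlatFeature]:
    """Collapse nested old-style records into one dict keyed by orf_id."""
    features: Dict[str, FlatFeature] = {}
    for record in three_dis:
        orf = record.protein.orf
        features[orf.orf_id] = FlatFeature(
            orf_id=orf.orf_id,
            nt_sequence=orf.nt_sequence,
            aa_sequence=record.protein.aa_sequence,
            encoding=record.encoding,
        )
    return features


orf = OldOrf(orf_id="orf_1", nt_sequence="ATGAAA")
protein = OldProtein(orf=orf, aa_sequence="MK")
three_di = OldThreeDi(protein=protein, encoding="dp")  # placeholder tokens
features = unify([three_di])
```

Looking up features["orf_1"] then returns the DNA, protein, and 3Di data for that ORF in one place, which is the access pattern the features dictionary of UnifiedPipelineResult is described as providing.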
- class genome_entropy.pipeline.FeatureLocation(start, end, strand, frame)[source]
Genomic location of a feature (ORF).
- strand
Strand orientation ('+' or '-')
- Type:
Literal['+', '-']
- class genome_entropy.pipeline.FeatureDNA(nt_sequence, length)[source]
DNA-level information for a feature.
- class genome_entropy.pipeline.FeatureProtein(aa_sequence, length)[source]
Protein-level information for a feature.
- class genome_entropy.pipeline.FeatureThreeDi(encoding, length, method, model_name, inference_device)[source]
3Di structural encoding for a feature.
- class genome_entropy.pipeline.FeatureMetadata(parent_id, table_id, has_start_codon, has_stop_codon, in_genbank)[source]
Metadata about a feature.
- class genome_entropy.pipeline.FeatureEntropy(dna_entropy, protein_entropy, three_di_entropy)[source]
Entropy values at different representation levels for a feature.
Modules
- End-to-end pipeline orchestration for DNA to 3Di with entropy calculation.
- Unified data types for pipeline output format.