genome_entropy.pipeline.runner

End-to-end pipeline orchestration for DNA to 3Di with entropy calculation.

Functions

calculate_pipeline_entropy(dna_sequence, ...)

Calculate entropy at all representation levels.

run_pipeline([input_fasta, table_id, ...])

Run the complete DNA to 3Di pipeline with entropy calculation.

Classes

PipelineResult(input_id, input_dna_length, ...)

Result of running the complete DNA to 3Di pipeline.

class genome_entropy.pipeline.runner.PipelineResult(input_id, input_dna_length, orfs, proteins, three_dis, entropy)

Result of running the complete DNA to 3Di pipeline.

Parameters:
  • input_id (str) – ID of the input DNA sequence

  • input_dna_length (int) – Length of the input DNA sequence

  • orfs (List[genome_entropy.orf.types.OrfRecord]) – List of ORFs found in the sequence

  • proteins (List[genome_entropy.translate.translator.ProteinRecord]) – List of translated proteins

  • three_dis (List[genome_entropy.encode3di.types.ThreeDiRecord]) – List of 3Di encoded structures

  • entropy (genome_entropy.entropy.shannon.EntropyReport) – Entropy report for all representations

input_id: str
input_dna_length: int
orfs: List[OrfRecord]
proteins: List[ProteinRecord]
three_dis: List[ThreeDiRecord]
entropy: EntropyReport
__init__(input_id, input_dna_length, orfs, proteins, three_dis, entropy)
Return type:

None
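
The field layout above can be illustrated with a stand-in class. This is a minimal sketch, not the package's own definition: PipelineResultSketch and describe are hypothetical names, the record types are replaced with Any, and it is only an assumption that the real class behaves like a plain dataclass.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class PipelineResultSketch:
    """Hypothetical stand-in mirroring the documented PipelineResult fields."""

    input_id: str
    input_dna_length: int
    orfs: List[Any]       # OrfRecord objects in the real package
    proteins: List[Any]   # ProteinRecord objects in the real package
    three_dis: List[Any]  # ThreeDiRecord objects in the real package
    entropy: Any          # EntropyReport in the real package


def describe(result) -> str:
    """Build a one-line summary using only the documented attributes."""
    return (
        f"{result.input_id}: {result.input_dna_length} bp, "
        f"{len(result.orfs)} ORFs, {len(result.proteins)} proteins, "
        f"{len(result.three_dis)} 3Di records"
    )


# Placeholder strings stand in for the real record objects.
example = PipelineResultSketch(
    input_id="seq1",
    input_dna_length=1200,
    orfs=["orf_a", "orf_b"],
    proteins=["prot_a", "prot_b"],
    three_dis=["3di_a", "3di_b"],
    entropy=None,
)
print(describe(example))  # seq1: 1200 bp, 2 ORFs, 2 proteins, 2 3Di records
```

Because describe reads only the documented attribute names, it should work unchanged on a real PipelineResult.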

genome_entropy.pipeline.runner.run_pipeline(input_fasta=None, table_id=11, min_aa_len=30, model_name='gbouras13/modernprost-base', compute_entropy=True, output_json=None, device=None, use_multi_gpu=False, gpu_ids=None, genbank_file=None, encoding_size=None)

Run the complete DNA to 3Di pipeline with entropy calculation.

Pipeline steps:

  1. Read the FASTA or GenBank file
  2. Find ORFs in all 6 reading frames
  3. Translate ORFs to proteins
  4. Encode proteins to 3Di structural tokens
  5. Calculate entropy at all levels
  6. Optionally match ORFs to GenBank CDS annotations
  7. Optionally write results to JSON

Parameters:
  • input_fasta (str | Path | None) – Path to input FASTA file. Optional if genbank_file is provided.

  • table_id (int) – NCBI genetic code table ID

  • min_aa_len (int) – Minimum protein length in amino acids

  • model_name (str) – ProstT5 model name

  • compute_entropy (bool) – Whether to compute entropy values

  • output_json (str | Path | None) – Optional path to save results as JSON

  • device (str | None) – Device for 3Di encoding (“cuda”, “mps”, “cpu”, or None for auto). Ignored if use_multi_gpu is True.

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.

  • genbank_file (str | Path | None) – Optional path to GenBank file. If provided alone, extracts DNA sequences from it. Can be combined with input_fasta to use FASTA sequences with GenBank CDS annotations.

  • encoding_size (int | None) – Maximum size (approx. amino acids) to encode per batch. If None, uses DEFAULT_ENCODING_SIZE from config.

Returns:

List of PipelineResult objects (one per input sequence)

Return type:

List[PipelineResult]
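
A call to run_pipeline and a pass over its return value might look like the sketch below. The helper run_and_summarize and the stub fake_runner are hypothetical and exist only so the example runs without the package or a model download. In real use, the runner argument would be genome_entropy.pipeline.runner.run_pipeline. The keyword arguments are taken from the signature above, and "example.fasta" is a placeholder path.

```python
from types import SimpleNamespace


def run_and_summarize(fasta_path, runner):
    """Call a run_pipeline-compatible callable and count ORFs per input sequence."""
    results = runner(
        input_fasta=fasta_path,
        table_id=11,           # NCBI genetic code table ID (documented default)
        min_aa_len=30,         # minimum protein length in amino acids
        compute_entropy=True,
        device="cpu",          # or "cuda", "mps", or None for auto
    )
    # One PipelineResult per input sequence, keyed here by its input_id.
    return {r.input_id: len(r.orfs) for r in results}


def fake_runner(**kwargs):
    """Hypothetical stub that returns objects shaped like PipelineResult."""
    return [
        SimpleNamespace(input_id="contig_1", orfs=["orf_a", "orf_b"]),
        SimpleNamespace(input_id="contig_2", orfs=[]),
    ]


summary = run_and_summarize("example.fasta", fake_runner)
print(summary)  # {'contig_1': 2, 'contig_2': 0}
```

Passing the runner as an argument keeps the summary logic separate from the expensive encoding step, which also makes it easy to test.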

genome_entropy.pipeline.runner.calculate_pipeline_entropy(dna_sequence, orfs, proteins, three_dis)

Calculate entropy at all representation levels.

Returns:

EntropyReport with entropy values

Return type:

EntropyReport
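
The EntropyReport type lives in genome_entropy.entropy.shannon, so the values are presumably Shannon entropies over the symbols of each representation. The sketch below shows that calculation for a single string. It is an illustration only: shannon_entropy is a hypothetical helper, and the log base, any normalization, and how the package aggregates across ORFs, proteins and 3Di records are assumptions not stated on this page.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy in bits per symbol: H = sum over symbols of -p * log2(p)."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Toy strings for each representation level (not real pipeline output).
levels = {
    "dna": "ATGGCCAAATAG",
    "protein": "MAK",
    "three_di": "dvl",
}
for name, seq in levels.items():
    print(name, round(shannon_entropy(seq), 3))

print(shannon_entropy("ACGT"))  # 2.0 (four equally frequent symbols)
```

A uniform distribution over k symbols gives log2(k) bits, so the upper bound differs between the 4-letter DNA alphabet and the 20-letter protein and 3Di alphabets.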