genome_entropy.encode3di.prostt5

ProstT5 encoder for amino acid to 3Di structural token conversion.

This module maintains backward compatibility by re-exporting classes and functions that have been moved to separate modules for better organization.

class genome_entropy.encode3di.prostt5.ProstT5ThreeDiEncoder(model_name='gbouras13/modernprost-base', device=None)[source]

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses the ProstT5 model from HuggingFace to predict 3Di tokens directly from protein sequences without requiring 3D structures.

Parameters:
  • model_name (str)

  • device (str | None)

__init__(model_name='gbouras13/modernprost-base', device=None)[source]

Initialize the ProstT5 encoder.

Parameters:
  • model_name (str) – HuggingFace model identifier

  • device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)

Raises:
  • ModelError – If PyTorch or Transformers are not installed

  • DeviceError – If specified device is not available
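The device handling described above can be pictured with a minimal sketch. It is an illustration under stated assumptions and not the library's implementation: `resolve_device`, the `available` argument, and the local `DeviceError` class are hypothetical stand-ins, and the preference order "cuda", then "mps", then "cpu" is assumed rather than documented.

```python
from typing import Optional, Set


class DeviceError(RuntimeError):
    """Local stand-in for the library's DeviceError (the real class may differ)."""


def resolve_device(device: Optional[str], available: Set[str]) -> str:
    """Pick a device name, given the set of devices that are usable.

    `available` is a hypothetical input; a real encoder would query the
    deep-learning backend instead of receiving this set from the caller.
    """
    if device is None:
        # Auto-detect. Assumed preference order: GPU first, CPU last.
        for candidate in ("cuda", "mps", "cpu"):
            if candidate in available:
                return candidate
        raise DeviceError("no usable device found")
    if device not in available:
        # An explicitly requested device that is missing is an error.
        raise DeviceError(f"requested device {device!r} is not available")
    return device
```

Passing `device=None` lets the encoder choose, while naming a device makes an unavailable choice fail early instead of falling back silently.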

token_budget_batches(aa_sequences, token_budget)[source]

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:
  1. Keep original indices.

  2. Sort by length to minimize padding within each batch.

  3. For each batch:
     - Start with long sequences from the end (largest first).
     - Add long sequences until adding another would exceed the budget.
     - Fill the remaining budget with short sequences from the beginning.

  4. This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:
  • aa_sequences (Sequence[str]) – Unordered amino acid sequences.

  • token_budget (int) – Maximum approximate number of “tokens” per batch.

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records.

Return type:

Iterator[List[IndexedSeq]]
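The batching strategy described above can be sketched as follows. This is a minimal illustration, not the library's code. It assumes that a sequence's token cost equals its length (padding is ignored) and that a sequence longer than the budget still gets a batch of its own. `token_budget_batches_sketch` and the local `IndexedSeq` dataclass are stand-ins written for this example.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class IndexedSeq:
    """Local stand-in: a sequence plus its position in the input list."""
    idx: int
    seq: str


def token_budget_batches_sketch(
    aa_sequences: Sequence[str], token_budget: int
) -> Iterator[List[IndexedSeq]]:
    # Steps 1 and 2: keep original indices, then sort by length (shortest first).
    items = sorted(
        (IndexedSeq(i, s) for i, s in enumerate(aa_sequences)),
        key=lambda rec: len(rec.seq),
    )
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        # Step 3a: seed the batch with the longest remaining sequence.
        # Assumption: it is accepted even if it alone exceeds the budget.
        batch = [items[hi]]
        used = len(items[hi].seq)
        hi -= 1
        # Step 3b: add further long sequences while they still fit.
        while lo <= hi and used + len(items[hi].seq) <= token_budget:
            batch.append(items[hi])
            used += len(items[hi].seq)
            hi -= 1
        # Step 3c: fill the remaining budget with the shortest sequences.
        while lo <= hi and used + len(items[lo].seq) <= token_budget:
            batch.append(items[lo])
            used += len(items[lo].seq)
            lo += 1
        yield batch
```

With lengths 9, 2, 5, 1 and 4 and a budget of 10, the sketch pairs the length-9 sequence with the length-1 sequence, so the long sequence does not occupy a batch alone.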

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode amino acid sequences to 3Di tokens.

Parameters:
  • aa_sequences (List[str]) – List of amino acid sequences. Note: amino acid sequences are expected to be upper-case, while 3Di sequences need to be lower-case.

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per GPU

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.

  • multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]
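Because batches are formed out of input order while the result holds one entry per input sequence, the outputs have to be written back by original index. The sketch below shows one way this could work. It is not the library's code: `encode_in_input_order` and `fake_encode_batch` are hypothetical names, and the fake encoder only lower-cases its input as a placeholder for a real model call.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def encode_in_input_order(
    n_inputs: int,
    batches: Iterable[List[Tuple[int, str]]],
    encode_batch: Callable[[List[str]], List[str]],
) -> List[str]:
    """Run `encode_batch` on each batch and return outputs in input order."""
    results: List[Optional[str]] = [None] * n_inputs
    for batch in batches:
        outputs = encode_batch([seq for _, seq in batch])
        # Write each output back to the slot of its original index.
        for (idx, _), out in zip(batch, outputs):
            results[idx] = out
    if any(r is None for r in results):
        raise ValueError("some input sequences were never encoded")
    return [r for r in results if r is not None]


def fake_encode_batch(seqs: List[str]) -> List[str]:
    # Placeholder only: a real encoder would predict 3Di tokens here.
    return [s.lower() for s in seqs]
```

A call such as `encode_in_input_order(4, batches, fake_encode_batch)` returns four strings whose positions match the input list, regardless of how the batches were ordered.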

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode protein records to 3Di records.

Parameters:
  • proteins (List[ProteinRecord]) – List of ProteinRecord objects

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding

  • multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

class genome_entropy.encode3di.prostt5.ThreeDiRecord(protein, three_di, method, model_name, inference_device)[source]

Represents a 3Di structural encoding of a protein.

Parameters:
  • protein (genome_entropy.translate.translator.ProteinRecord) – The ProteinRecord that was encoded

  • three_di (str) – The 3Di token sequence

  • method (Literal['prostt5_aa2fold']) – Method used for encoding (always “prostt5_aa2fold”)

  • model_name (str) – Name of the ProstT5 model used

  • inference_device (str) – Device used for inference (“cuda”, “mps”, or “cpu”)

protein: ProteinRecord
three_di: str
method: Literal['prostt5_aa2fold']
model_name: str
inference_device: str
__init__(protein, three_di, method, model_name, inference_device)
Return type:

None

class genome_entropy.encode3di.prostt5.IndexedSeq(idx, seq)[source]

A sequence paired with its original position in the input list.

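Parameters:
  • idx (int) – Original position of the sequence in the input list

  • seq (str) – The sequence itself
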
idx: int
seq: str
__init__(idx, seq)
Return type:

None