genome_entropy.encode3di.modernprost

ModernProst encoder for amino acid to 3Di structural token conversion.

This module implements an encoder for gbouras13/modernprost models, adapted from the phold implementation.

Note: ModernProst models require transformers >= 4.47.0 for ModernBert support. Multi-GPU support uses the HuggingFace accelerate library.

Classes

ModernProstThreeDiEncoder(model_name[, ...])

Encoder for converting amino acid sequences to 3Di structural tokens.

class genome_entropy.encode3di.modernprost.ModernProstThreeDiEncoder(model_name, device=None, use_accelerate=False)[source]

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses ModernProst models (gbouras13/modernprost-base or modernprost-profiles) from HuggingFace to predict 3Di tokens directly from protein sequences.

Based on implementation from phold: https://github.com/gbouras13/phold/blob/main/src/phold/features/predict_3Di.py

Parameters:
  • model_name (str)

  • device (str | None)

  • use_accelerate (bool)

__init__(model_name, device=None, use_accelerate=False)[source]

Initialize the ModernProst encoder.

Parameters:
  • model_name (str) – HuggingFace model identifier (gbouras13/modernprost-base or modernprost-profiles)

  • device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)

  • use_accelerate (bool) – If True, use HuggingFace accelerate for multi-GPU support

Raises:
  • ModelError – If PyTorch or Transformers are not installed

  • DeviceError – If specified device is not available
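
The device handling described above can be illustrated with a minimal sketch. It is not the encoder's actual code: `pick_device` is a hypothetical helper, the cuda-then-mps-then-cpu preference order is an assumption, `ValueError` stands in for the module's `DeviceError`, and falling back to "cpu" when PyTorch is missing is a simplification (the real constructor is documented to raise `ModelError` in that case).

```python
from typing import Optional


def pick_device(requested: Optional[str] = None) -> str:
    """Return a usable device name; hypothetical stand-in for the encoder's logic."""
    try:
        import torch  # optional here so the sketch also runs without PyTorch
    except ImportError:
        torch = None

    # Collect the devices that are usable on this machine.
    available = {"cpu"}
    if torch is not None:
        if torch.cuda.is_available():
            available.add("cuda")
        mps = getattr(torch.backends, "mps", None)  # absent in old PyTorch builds
        if mps is not None and mps.is_available():
            available.add("mps")

    if requested is None:
        # Auto-detect. The preference order is an assumption of this sketch.
        for name in ("cuda", "mps", "cpu"):
            if name in available:
                return name

    if requested not in available:
        # The real encoder is documented to raise DeviceError instead.
        raise ValueError(f"Device {requested!r} is not available")
    return requested


chosen = pick_device()
```

Passing an explicit name such as `pick_device("cpu")` skips auto-detection and only checks that the named device is usable.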

token_budget_batches(aa_sequences, token_budget)[source]

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:
  1. Keep original indices.

  2. Sort by length to minimize padding within each batch.

  3. For each batch:
       • Start with long sequences from the end (largest first).
       • Add long sequences until adding another would exceed the budget.
       • Fill the remaining budget with short sequences from the beginning.

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:
  • aa_sequences (Sequence[str]) – Unordered amino acid sequences.

  • token_budget (int) – Maximum approximate “tokens” per batch.

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records.

Return type:

Iterator[List[IndexedSeq]]
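
The batching strategy described for this method can be sketched as follows. This is a reading of the documented steps, not the library's source: `token_budget_batches_sketch` is a hypothetical name, `IndexedSeq` is assumed to be an `(original_index, sequence)` tuple, and a batch's "tokens" are assumed to be the sum of its sequence lengths (the real method may count padded lengths instead).

```python
from typing import Iterator, List, Sequence, Tuple

# Assumption: IndexedSeq is an (original_index, sequence) pair.
IndexedSeq = Tuple[int, str]


def token_budget_batches_sketch(
    aa_sequences: Sequence[str], token_budget: int
) -> Iterator[List[IndexedSeq]]:
    # Keep the original indices, then sort by length (shortest first).
    indexed = sorted(enumerate(aa_sequences), key=lambda rec: len(rec[1]))
    lo, hi = 0, len(indexed) - 1

    while lo <= hi:
        # Always take the longest remaining sequence, even if it alone
        # exceeds the budget, so that every sequence ends up in a batch.
        batch = [indexed[hi]]
        used = len(indexed[hi][1])
        hi -= 1

        # Add further long sequences from the end while they still fit.
        while lo <= hi and used + len(indexed[hi][1]) <= token_budget:
            batch.append(indexed[hi])
            used += len(indexed[hi][1])
            hi -= 1

        # Fill the remaining budget with short sequences from the start.
        while lo <= hi and used + len(indexed[lo][1]) <= token_budget:
            batch.append(indexed[lo])
            used += len(indexed[lo][1])
            lo += 1

        yield batch


demo = ["A" * 8, "A" * 2, "A" * 5, "A" * 1, "A" * 9]
batches = list(token_budget_batches_sketch(demo, token_budget=10))
batch_indices = [[idx for idx, _ in batch] for batch in batches]
print(batch_indices)
```

With a budget of 10, the sequences of length 9 and 8 each start their own batch and are topped up with the sequences of length 1 and 2, which leaves only the length-5 sequence for a third batch.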

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode amino acid sequences to 3Di tokens.

Parameters:
  • aa_sequences (List[str]) – List of amino acid sequences (upper-case).

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

  • use_multi_gpu (bool) – If True, use accelerate for multi-GPU parallel encoding

  • gpu_ids (List[int] | None) – Optional list of GPU IDs (currently unused with accelerate)

  • multi_gpu_encoder (Any | None) – Optional pre-initialized encoder (for backward compatibility)

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]
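
Because batching reorders sequences by length while `encode` returns one result per input sequence, the outputs have to be written back by original index. The sketch below shows one way this could work under stated assumptions: `encode_in_input_order` and `placeholder_model` are hypothetical names, the batching is a simple greedy pass rather than the method's own strategy, and the placeholder merely lower-cases its input instead of predicting 3Di tokens.

```python
from typing import Callable, List, Optional, Sequence


def encode_in_input_order(
    aa_sequences: Sequence[str],
    encoding_size: int,
    encode_batch: Callable[[List[str]], List[str]],
) -> List[str]:
    """Encode in length-sorted batches, returning results in input order."""
    results: List[Optional[str]] = [None] * len(aa_sequences)
    # Visit sequences from shortest to longest, remembering original positions.
    order = sorted(range(len(aa_sequences)), key=lambda i: len(aa_sequences[i]))

    def flush(batch_ids: List[int]) -> None:
        # Run one batch and store each output at its original index.
        outputs = encode_batch([aa_sequences[i] for i in batch_ids])
        for i, out in zip(batch_ids, outputs):
            results[i] = out

    batch_ids: List[int] = []
    used = 0
    for i in order:
        n = len(aa_sequences[i])
        # Close the current batch before it would exceed the size limit.
        if batch_ids and used + n > encoding_size:
            flush(batch_ids)
            batch_ids, used = [], 0
        batch_ids.append(i)
        used += n
    if batch_ids:
        flush(batch_ids)

    return [r if r is not None else "" for r in results]


def placeholder_model(batch: List[str]) -> List[str]:
    # Stand-in for the real model call; NOT a 3Di prediction.
    return [seq.lower() for seq in batch]


seqs = ["MKTAYIAK", "MK", "MKTAY"]
tokens = encode_in_input_order(seqs, encoding_size=8, encode_batch=placeholder_model)
```

The result list lines up with `seqs` position by position even though the two shorter sequences were processed first, in a separate batch from the longest one.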

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode protein records to 3Di records.

Parameters:
  • proteins (List[ProteinRecord]) – List of ProteinRecord objects

  • encoding_size (int) – Maximum size (approx. amino acids) to encode per batch

  • use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available

  • gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding

  • multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]