genome_entropy.encode3di

3Di encoding utilities.

class genome_entropy.encode3di.ProstT5ThreeDiEncoder(model_name='gbouras13/modernprost-base', device=None)[source]

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses the ProstT5 model from HuggingFace to predict 3Di tokens directly from protein sequences without requiring 3D structures.

Parameters:

model_name (str)
device (str | None)

__init__(model_name='gbouras13/modernprost-base', device=None)[source]

Initialize the ProstT5 encoder.

Parameters:

model_name (str) – HuggingFace model identifier
device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)

Raises:

ModelError – If PyTorch or Transformers are not installed
DeviceError – If specified device is not available

token_budget_batches(aa_sequences, token_budget)[source]

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:

Keep original indices.
Sort by length to minimize padding within each batch.
For each batch:
- Start with long sequences from the end (largest first).
- Add long sequences until adding another would exceed the budget.
- Fill the remaining budget with short sequences from the beginning.

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:

aa_sequences (Sequence[str]) – Unordered amino acid sequences.
token_budget (int) – Maximum approximate “tokens” per batch.

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records.

Return type:

Iterator[List[IndexedSeq]]
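
The batching strategy described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the name token_budget_batches_sketch and the stand-in IndexedSeq dataclass are hypothetical, the cost of a batch is assumed to be the sum of its sequence lengths, and a single sequence longer than the budget is assumed to run alone in its own batch.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class IndexedSeq:
    """Stand-in for the documented IndexedSeq(idx, seq) record."""

    idx: int
    seq: str


def token_budget_batches_sketch(
    aa_sequences: Sequence[str], token_budget: int
) -> Iterator[List[IndexedSeq]]:
    # Keep original indices, then sort by length (shortest first).
    items = sorted(
        (IndexedSeq(i, s) for i, s in enumerate(aa_sequences)),
        key=lambda rec: len(rec.seq),
    )
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        # Always take the longest remaining sequence, even if it alone
        # exceeds the budget (assumption: oversized sequences run solo).
        batch = [items[hi]]
        used = len(items[hi].seq)
        hi -= 1
        # Add further long sequences while they still fit.
        while lo <= hi and used + len(items[hi].seq) <= token_budget:
            batch.append(items[hi])
            used += len(items[hi].seq)
            hi -= 1
        # Fill the remaining budget with short sequences from the front.
        while lo <= hi and used + len(items[lo].seq) <= token_budget:
            batch.append(items[lo])
            used += len(items[lo].seq)
            lo += 1
        yield batch
```

Pairing each long sequence with short ones from the other end of the sorted list is what keeps long proteins from being left over in nearly empty batches.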

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode amino acid sequences to 3Di tokens.

Parameters:

aa_sequences (List[str]) – List of amino acid sequences. Note: amino acid sequences are expected to be upper-case, while 3Di sequences need to be lower-case.
encoding_size (int) – Maximum size (approx. amino acids) to encode per GPU
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding. If None and use_multi_gpu=True, auto-discover available GPUs.
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]
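
Because encode expects upper-case amino acid sequences, callers may want to normalize input first. The helper below is a hypothetical sketch and not part of genome_entropy; it only strips surrounding whitespace and upper-cases each sequence, and it does not validate the alphabet.

```python
from typing import List


def normalize_aa_sequences(aa_sequences: List[str]) -> List[str]:
    """Strip surrounding whitespace and upper-case each sequence."""
    return [seq.strip().upper() for seq in aa_sequences]
```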

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode protein records to 3Di records.

Parameters:

proteins (List[ProteinRecord]) – List of ProteinRecord objects
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance. If provided, this encoder will be reused instead of creating a new one. This is important for efficiency when processing multiple sequences.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

class genome_entropy.encode3di.ModernProstThreeDiEncoder(model_name, device=None, use_accelerate=False)[source]

Encoder for converting amino acid sequences to 3Di structural tokens.

Uses ModernProst models (gbouras13/modernprost-base or modernprost-profiles) from HuggingFace to predict 3Di tokens directly from protein sequences.

Based on implementation from phold: https://github.com/gbouras13/phold/blob/main/src/phold/features/predict_3Di.py

Parameters:

model_name (str)
device (str | None)
use_accelerate (bool)

__init__(model_name, device=None, use_accelerate=False)[source]

Initialize the ModernProst encoder.

Parameters:

model_name (str) – HuggingFace model identifier (gbouras13/modernprost-base or modernprost-profiles)
device (str | None) – Device to use (“cuda”, “mps”, “cpu”, or None for auto-detect)
use_accelerate (bool) – If True, use HuggingFace accelerate for multi-GPU support

Raises:

ModelError – If PyTorch or Transformers are not installed
DeviceError – If specified device is not available

token_budget_batches(aa_sequences, token_budget)[source]

Yield batches of sequences (with original indices) under an approximate token budget.

Optimized strategy to address the problem of isolated long sequences:

Keep original indices.
Sort by length to minimize padding within each batch.
For each batch:
- Start with long sequences from the end (largest first).
- Add long sequences until adding another would exceed the budget.
- Fill the remaining budget with short sequences from the beginning.

This approach avoids ending up with long proteins that can’t be combined, resulting in better token budget utilization and fewer iterations.

Parameters:

aa_sequences (Sequence[str]) – Unordered amino acid sequences.
token_budget (int) – Maximum approximate “tokens” per batch.

Yields:

List[IndexedSeq] – A batch of (original_index, sequence) records.

Return type:

Iterator[List[IndexedSeq]]

encode(aa_sequences, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode amino acid sequences to 3Di tokens.

Parameters:

aa_sequences (List[str]) – List of amino acid sequences (upper-case).
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use accelerate for multi-GPU parallel encoding
gpu_ids (List[int] | None) – Optional list of GPU IDs (currently unused with accelerate)
multi_gpu_encoder (Any | None) – Optional pre-initialized encoder (for backward compatibility)

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]

encode_proteins(proteins, encoding_size=10000, use_multi_gpu=False, gpu_ids=None, multi_gpu_encoder=None)[source]

Encode protein records to 3Di records.

Parameters:

proteins (List[ProteinRecord]) – List of ProteinRecord objects
encoding_size (int) – Maximum size (approx. amino acids) to encode per batch
use_multi_gpu (bool) – If True, use multi-GPU parallel encoding when available
gpu_ids (List[int] | None) – Optional list of GPU IDs to use for multi-GPU encoding
multi_gpu_encoder (Any | None) – Optional pre-initialized MultiGPUEncoder instance.

Returns:

List of ThreeDiRecord objects

Return type:

List[ThreeDiRecord]

class genome_entropy.encode3di.ThreeDiRecord(protein, three_di, method, model_name, inference_device)[source]

Represents a 3Di structural encoding of a protein.

Parameters:

protein (ProteinRecord)
three_di (str)
method (Literal['prostt5_aa2fold'])
model_name (str)
inference_device (str)

protein

The ProteinRecord that was encoded

Type: genome_entropy.translate.translator.ProteinRecord

three_di

The 3Di token sequence

Type: str

method

Method used for encoding (always “prostt5_aa2fold”)

Type: Literal['prostt5_aa2fold']

model_name

Name of the ProstT5 model used

Type: str

inference_device

Device used for inference (“cuda”, “mps”, or “cpu”)

Type: str

protein: ProteinRecord

three_di: str

method: Literal['prostt5_aa2fold']

model_name: str

inference_device: str

__init__(protein, three_di, method, model_name, inference_device)

Parameters:

protein (ProteinRecord)
three_di (str)
method (Literal['prostt5_aa2fold'])
model_name (str)
inference_device (str)

Return type:

None

class genome_entropy.encode3di.IndexedSeq(idx, seq)[source]

A sequence paired with its original position in the input list.

Parameters:

idx (int)
seq (str)

idx: int

seq: str

__init__(idx, seq)

Parameters:

idx (int)
seq (str)

Return type:

None
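
Since batches are built from length-sorted sequences, the idx field is what allows results to be returned in input order. The sketch below shows one way this could work; restore_input_order and encode_batch are hypothetical names, and plain (idx, seq) tuples stand in for IndexedSeq records.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def restore_input_order(
    batches: Iterable[List[Tuple[int, str]]],
    encode_batch: Callable[[List[str]], List[str]],
    n_inputs: int,
) -> List[str]:
    """Encode each batch and place every result at its original index."""
    results: List[Optional[str]] = [None] * n_inputs
    for batch in batches:
        encoded = encode_batch([seq for _, seq in batch])
        for (idx, _), out in zip(batch, encoded):
            results[idx] = out
    # Every input position must have received exactly one result.
    missing = [i for i, r in enumerate(results) if r is None]
    if missing:
        raise ValueError(f"no result for input positions {missing}")
    return [r for r in results if r is not None]
```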

genome_entropy.encode3di.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)[source]

Estimate optimal token size for GPU encoding by testing increasing lengths.

This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.

Parameters:

encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding
start_length (int) – Starting total length to test (default: 3000)
end_length (int) – Maximum total length to test (default: 10000)
step (int) – Increment between test lengths (default: 1000)
num_trials (int) – Number of trials per length for robustness (default: 3)
base_protein_length (int) – Approximate length of individual proteins (default: 100)

Returns:

Dictionary with estimation results, containing the keys:

'max_length': Maximum length successfully encoded
'recommended_token_size': Recommended token budget (90% of max)
'trials_per_length': Dictionary of successful trials per length
'device': Device used for testing

Return type:

dict

Raises:

ValueError – If encoder doesn’t have required attributes or torch not available
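
The probing procedure can be illustrated with a simplified loop. This is a sketch under stated assumptions rather than the actual implementation: try_encode is a hypothetical callable that attempts to encode sequences of a given total length, the built-in MemoryError stands in for the GPU out-of-memory error, probing stops at the first length where any trial fails, and the 'device' key is omitted.

```python
from typing import Callable, Dict, Optional


def estimate_token_size_sketch(
    try_encode: Callable[[int], None],
    start_length: int = 3000,
    end_length: int = 10000,
    step: int = 1000,
    num_trials: int = 3,
) -> Dict[str, object]:
    max_length: Optional[int] = None
    trials_per_length: Dict[int, int] = {}
    for length in range(start_length, end_length + 1, step):
        successes = 0
        for _ in range(num_trials):
            try:
                try_encode(length)
                successes += 1
            except MemoryError:
                # Stand-in for the GPU out-of-memory error.
                break
        trials_per_length[length] = successes
        if successes < num_trials:
            # Assumption: stop probing once a length is not reliable.
            break
        max_length = length
    # Recommend 90% of the largest length that passed every trial.
    recommended = max_length * 9 // 10 if max_length is not None else None
    return {
        "max_length": max_length,
        "recommended_token_size": recommended,
        "trials_per_length": trials_per_length,
    }
```

The 10% margin leaves headroom for memory use that varies between batches of the same nominal size.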

genome_entropy.encode3di.generate_random_protein(length, seed=None)[source]

Generate a random protein sequence of specified length.

Parameters:

length (int) – Length of the protein sequence
seed (int | None) – Random seed for reproducibility (optional)

Returns:

Random protein sequence using the 20 standard amino acids

Return type:

str
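
A function with this contract could look roughly like the sketch below. The name generate_random_protein_sketch is hypothetical, and uniform sampling over the 20 standard amino acids with a private random.Random instance is an assumption about the approach, not a description of the library's code.

```python
import random
from typing import Optional

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def generate_random_protein_sketch(length: int, seed: Optional[int] = None) -> str:
    # A dedicated generator keeps seeding local to this call.
    rng = random.Random(seed)
    return "".join(rng.choices(AMINO_ACIDS, k=length))
```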

genome_entropy.encode3di.generate_combined_proteins(target_length, base_length=100, seed=None)[source]

Generate multiple shorter proteins that combine to target length.

Parameters:

target_length (int) – Total target length across all proteins
base_length (int) – Approximate length of each individual protein
seed (int | None) – Random seed for reproducibility (optional)

Returns:

List of protein sequences that total approximately target_length

Return type:

List[str]
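
One simple way to build such a set is to emit proteins of base_length until the target is reached, with a shorter final protein for the remainder. The sketch below assumes that scheme, so its total matches target_length exactly, whereas the documented function only promises an approximate total. The function name is hypothetical.

```python
import random
from typing import List, Optional

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def generate_combined_proteins_sketch(
    target_length: int, base_length: int = 100, seed: Optional[int] = None
) -> List[str]:
    if base_length <= 0:
        raise ValueError("base_length must be positive")
    rng = random.Random(seed)
    proteins: List[str] = []
    remaining = target_length
    while remaining > 0:
        # The last protein is shorter when target_length is not a multiple
        # of base_length.
        n = min(base_length, remaining)
        proteins.append("".join(rng.choices(AMINO_ACIDS, k=n)))
        remaining -= n
    return proteins
```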

Modules

`encoder`	ProstT5-based encoder for amino acid to 3Di structural token conversion.
`encoding`	Core encoding functions for amino acid to 3Di conversion.
`gpu_utils`	GPU discovery and management utilities for multi-GPU encoding.
`modernprost`	ModernProst encoder for amino acid to 3Di structural token conversion.
`multi_gpu`	Multi-GPU asynchronous encoding for protein to 3Di conversion.
`prostt5`	ProstT5 encoder for amino acid to 3Di structural token conversion.
`token_estimator`	Token size estimation for optimal GPU memory usage in 3Di encoding.
`types`	Data types for 3Di encoding.