genome_entropy.encode3di.token_estimator

Token size estimation for optimal GPU memory usage in 3Di encoding.

Functions

`estimate_token_size`(encoder[, start_length, ...])	Estimate optimal token size for GPU encoding by testing increasing lengths.
`generate_combined_proteins`(target_length[, ...])	Generate multiple shorter proteins that combine to target length.
`generate_random_protein`(length[, seed])	Generate a random protein sequence of specified length.

genome_entropy.encode3di.token_estimator.generate_random_protein(length, seed=None)

Generate a random protein sequence of specified length.

Parameters:

length (int) – Length of the protein sequence
seed (int | None) – Random seed for reproducibility (optional)

Returns:

Random protein sequence using the 20 standard amino acids

Return type:

str
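As an illustration of the behaviour described above, here is a minimal sketch of seeded random sequence generation. It is not the library's implementation: `random_protein_sketch` and `AMINO_ACIDS` are hypothetical names, and the use of a local `random.Random` instance is an assumption about how reproducibility might be achieved.

```python
import random

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein_sketch(length, seed=None):
    """Hypothetical stand-in for generate_random_protein (assumed behaviour)."""
    # A local RNG keeps the global random state untouched; seed=None
    # falls back to system entropy, so results differ between calls.
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


seq = random_protein_sketch(12, seed=42)
print(len(seq), seq)
```

Passing the same seed twice should yield the same sequence, which is what makes trials repeatable across runs.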

genome_entropy.encode3di.token_estimator.generate_combined_proteins(target_length, base_length=100, seed=None)

Generate multiple shorter proteins that combine to target length.

Parameters:

target_length (int) – Total target length across all proteins
base_length (int) – Approximate length of each individual protein
seed (int | None) – Random seed for reproducibility (optional)

Returns:

List of protein sequences that total approximately target_length

Return type:

List[str]
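One way to picture the splitting described above is the sketch below. It is an assumption-laden illustration, not the library's code: `combined_proteins_sketch` is a hypothetical name, and this version cuts fixed-size chunks so the total matches `target_length` exactly, whereas the documented function only promises an approximate total.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def combined_proteins_sketch(target_length, base_length=100, seed=None):
    """Hypothetical stand-in for generate_combined_proteins (assumed behaviour)."""
    if base_length < 1:
        raise ValueError("base_length must be at least 1")
    rng = random.Random(seed)
    proteins = []
    remaining = target_length
    while remaining > 0:
        # The last chunk may be shorter than base_length.
        size = min(base_length, remaining)
        proteins.append("".join(rng.choice(AMINO_ACIDS) for _ in range(size)))
        remaining -= size
    return proteins


batch = combined_proteins_sketch(250, base_length=100, seed=0)
print([len(p) for p in batch])
```

Splitting the budget across many short sequences, rather than one long one, is closer to how a batch of real proteins would occupy GPU memory.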

genome_entropy.encode3di.token_estimator.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)

Estimate optimal token size for GPU encoding by testing increasing lengths.

This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.

Parameters:

encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding
start_length (int) – Starting total length to test (default: 3000)
end_length (int) – Maximum total length to test (default: 10000)
step (int) – Increment between test lengths (default: 1000)
num_trials (int) – Number of trials per length for robustness (default: 3)
base_protein_length (int) – Approximate length of individual proteins (default: 100)

Returns:

Dictionary with estimation results, containing the keys:

'max_length': Maximum length successfully encoded
'recommended_token_size': Recommended token budget (90% of max)
'trials_per_length': Dictionary of successful trials per length
'device': Device used for testing

Return type:

dict

Raises:

ValueError – If the encoder lacks required attributes or torch is not available
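The search described for estimate_token_size can be sketched without a GPU as follows. This is a simplified illustration under stated assumptions, not the actual implementation: `estimate_sketch`, `try_encode`, and `fake_encode` are hypothetical names, the built-in `MemoryError` stands in for the GPU out-of-memory error, `end_length` is assumed to be inclusive, the loop is assumed to stop at the first length that fails any trial, and the 'device' key is omitted.

```python
def estimate_sketch(try_encode, start_length=3000, end_length=10000,
                    step=1000, num_trials=3):
    """Hypothetical outline of the length search (assumed behaviour)."""
    max_length = 0
    trials_per_length = {}
    for length in range(start_length, end_length + 1, step):
        successes = 0
        for _ in range(num_trials):
            try:
                try_encode(length)  # stand-in for encoding a generated batch
                successes += 1
            except MemoryError:     # stand-in for the GPU out-of-memory error
                break
        trials_per_length[length] = successes
        if successes < num_trials:
            break                   # assumed: stop at the first failing length
        max_length = length
    return {
        "max_length": max_length,
        # 90% of the maximum, computed with integer arithmetic.
        "recommended_token_size": max_length * 9 // 10,
        "trials_per_length": trials_per_length,
    }


def fake_encode(length, capacity=6500):
    """Simulated encoder that runs out of memory above `capacity`."""
    if length > capacity:
        raise MemoryError(f"simulated OOM at length {length}")


result = estimate_sketch(fake_encode)
print(result["max_length"], result["recommended_token_size"])
```

Keeping the recommended budget below the observed maximum leaves headroom for memory fragmentation and for real sequences whose lengths differ from the synthetic ones.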