genome_entropy.encode3di.token_estimator
Token size estimation for optimal GPU memory usage in 3Di encoding.
Functions
estimate_token_size | Estimate optimal token size for GPU encoding by testing increasing lengths.
generate_combined_proteins | Generate multiple shorter proteins that combine to target length.
generate_random_protein | Generate a random protein sequence of specified length.
- genome_entropy.encode3di.token_estimator.generate_random_protein(length, seed=None)
Generate a random protein sequence of specified length.
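As a rough illustration of what such a generator might do, the sketch below draws residues uniformly from the 20 standard amino acids. The name `random_protein_sketch`, the uniform sampling, and the alphabet are assumptions made for illustration; the library's actual implementation may differ.

```python
import random
from typing import Optional

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein_sketch(length: int, seed: Optional[int] = None) -> str:
    """Return a random protein string of exactly `length` residues.

    Hypothetical stand-in, not the library function: residues are drawn
    uniformly, and a fixed seed makes the output reproducible.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
```

Passing the same `seed` twice yields the same sequence, which is useful when memory measurements need to be repeatable.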
- genome_entropy.encode3di.token_estimator.generate_combined_proteins(target_length, base_length=100, seed=None)
Generate multiple shorter proteins that combine to target length.
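One plausible way to reach a target total length is to emit proteins of `base_length` residues and let the last one absorb the remainder. The sketch below follows that idea; the name `combined_proteins_sketch`, the list-of-strings return type, and the remainder handling are assumptions, not the documented behavior of the library function.

```python
import random
from typing import List, Optional

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def combined_proteins_sketch(
    target_length: int, base_length: int = 100, seed: Optional[int] = None
) -> List[str]:
    """Split `target_length` residues across proteins of about `base_length`.

    Hypothetical stand-in: every protein has `base_length` residues except
    possibly the last, which takes whatever remains.
    """
    if target_length < 0 or base_length <= 0:
        raise ValueError("target_length must be >= 0 and base_length > 0")
    rng = random.Random(seed)
    proteins: List[str] = []
    remaining = target_length
    while remaining > 0:
        size = min(base_length, remaining)  # final chunk may be shorter
        proteins.append("".join(rng.choice(AMINO_ACIDS) for _ in range(size)))
        remaining -= size
    return proteins
```

Under these assumptions, `combined_proteins_sketch(250, base_length=100)` returns three sequences of lengths 100, 100, and 50.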
- genome_entropy.encode3di.token_estimator.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)
Estimate optimal token size for GPU encoding by testing increasing lengths.
This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.
- Parameters:
encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding
start_length (int) – Starting total length to test (default: 3000)
end_length (int) – Maximum total length to test (default: 10000)
step (int) – Increment between test lengths (default: 1000)
num_trials (int) – Number of trials per length for robustness (default: 3)
base_protein_length (int) – Approximate length of individual proteins (default: 100)
- Returns:
Dictionary with estimation results:
'max_length': Maximum length successfully encoded
'recommended_token_size': Recommended token budget (90% of max)
'trials_per_length': Dictionary of successful trials per length
'device': Device used for testing
- Return type:
dict
- Raises:
ValueError – If the encoder lacks required attributes or torch is not available
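The probing strategy described above (step through increasing lengths, stop at the first out-of-memory failure, recommend 90% of the last length that worked) can be sketched without a GPU. In the sketch below, `probe_max_length`, the `try_encode` callback, and `fake_encode` are hypothetical names, and Python's built-in `MemoryError` stands in for the `OutOfMemoryError` named in the description. The early stop on the first failing length and the omission of the 'device' key are also assumptions; the real function may behave differently.

```python
from typing import Callable, Dict


def probe_max_length(
    try_encode: Callable[[int], None],
    start_length: int = 3000,
    end_length: int = 10000,
    step: int = 1000,
    num_trials: int = 3,
) -> Dict[str, object]:
    """Find the largest tested length at which all trials succeed.

    `try_encode(total_length)` is a hypothetical callback that encodes a
    batch of that total length and raises MemoryError when it does not fit.
    """
    max_length = 0
    trials_per_length: Dict[int, int] = {}
    for length in range(start_length, end_length + 1, step):
        successes = 0
        for _ in range(num_trials):
            try:
                try_encode(length)
                successes += 1
            except MemoryError:
                break  # this length does not fit; skip the remaining trials
        trials_per_length[length] = successes
        if successes < num_trials:
            break  # assume larger lengths would need even more memory
        max_length = length
    return {
        "max_length": max_length,
        # 90% safety margin, computed with integer arithmetic
        "recommended_token_size": max_length * 9 // 10,
        "trials_per_length": trials_per_length,
    }


def fake_encode(total_length: int) -> None:
    """Stand-in encoder that 'runs out of memory' above 6500 residues."""
    if total_length > 6500:
        raise MemoryError("simulated out-of-memory")


result = probe_max_length(fake_encode)
```

With the simulated limit of 6500, the last length that passes is 6000, so `result["max_length"]` is 6000 and `result["recommended_token_size"]` is 5400. The 10% headroom leaves room for memory fluctuations between runs.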