genome_entropy.encode3di.token_estimator

Token size estimation for optimal GPU memory usage in 3Di encoding.

Functions

estimate_token_size(encoder[, start_length, ...])

Estimate optimal token size for GPU encoding by testing increasing lengths.

generate_combined_proteins(target_length[, ...])

Generate multiple shorter proteins that combine to target length.

generate_random_protein(length[, seed])

Generate a random protein sequence of specified length.

genome_entropy.encode3di.token_estimator.generate_random_protein(length, seed=None)[source]

Generate a random protein sequence of specified length.

Parameters:
  • length (int) – Length of the protein sequence

  • seed (int | None) – Random seed for reproducibility (optional)

Returns:

Random protein sequence using the 20 standard amino acids

Return type:

str
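The behaviour described above can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the name sketch_random_protein, the AMINO_ACIDS constant, and the use of random.Random are assumptions made for illustration. It only shows how a seeded generator could give reproducible sequences over the 20 standard amino acids.

```python
import random

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sketch_random_protein(length, seed=None):
    """Hypothetical stand-in for generate_random_protein (illustration only)."""
    # A private Random instance keeps the seed from affecting global state.
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


# The same seed yields the same sequence; the length matches the request.
seq_a = sketch_random_protein(50, seed=42)
seq_b = sketch_random_protein(50, seed=42)
```

Passing the same seed twice is what makes a memory test repeatable between runs.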

genome_entropy.encode3di.token_estimator.generate_combined_proteins(target_length, base_length=100, seed=None)[source]

Generate multiple shorter proteins that combine to target length.

Parameters:
  • target_length (int) – Total target length across all proteins

  • base_length (int) – Approximate length of each individual protein

  • seed (int | None) – Random seed for reproducibility (optional)

Returns:

List of protein sequences that total approximately target_length

Return type:

List[str]
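One way to picture the splitting is the sketch below. It is an assumption-laden illustration, not the library's code: sketch_combined_proteins is a hypothetical name, and it cuts the target into fixed chunks of base_length plus one shorter remainder, whereas the real function only promises lengths that are approximately base_length and a total that is approximately target_length.

```python
import random

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sketch_combined_proteins(target_length, base_length=100, seed=None):
    """Hypothetical stand-in for generate_combined_proteins (illustration only)."""
    rng = random.Random(seed)
    proteins = []
    remaining = target_length
    while remaining > 0:
        # Emit a full-size protein, or a shorter one to use up the remainder.
        size = min(base_length, remaining)
        proteins.append("".join(rng.choice(AMINO_ACIDS) for _ in range(size)))
        remaining -= size
    return proteins


proteins = sketch_combined_proteins(350, base_length=100, seed=0)
lengths = [len(p) for p in proteins]
```

In this sketch a target of 350 with a base length of 100 gives three proteins of 100 residues and one of 50.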

genome_entropy.encode3di.token_estimator.estimate_token_size(encoder, start_length=3000, end_length=10000, step=1000, num_trials=3, base_protein_length=100)[source]

Estimate optimal token size for GPU encoding by testing increasing lengths.

This function generates random protein sequences of increasing total length and attempts to encode them. It catches OutOfMemoryError to find the maximum length that can be encoded on the available GPU.

Parameters:
  • encoder (Any) – ProstT5ThreeDiEncoder instance to use for encoding

  • start_length (int) – Starting total length to test (default: 3000)

  • end_length (int) – Maximum total length to test (default: 10000)

  • step (int) – Increment between test lengths (default: 1000)

  • num_trials (int) – Number of trials per length for robustness (default: 3)

  • base_protein_length (int) – Approximate length of individual proteins (default: 100)

Returns:

Dictionary with estimation results:

  • ‘max_length’: Maximum length successfully encoded

  • ‘recommended_token_size’: Recommended token budget (90% of max)

  • ‘trials_per_length’: Dictionary of successful trials per length

  • ‘device’: Device used for testing

Return type:

dict

Raises:

ValueError – If the encoder lacks the required attributes or torch is not available
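The search described above can be sketched without a GPU. Everything in this block is an assumption for illustration: sketch_estimate and fake_encode are hypothetical names, the built-in MemoryError stands in for the GPU out-of-memory error, end_length is treated as inclusive, and the loop stops at the first length where any trial fails. The real function may differ on each of these points, and it also reports the device, which is omitted here.

```python
def sketch_estimate(encode, start_length=3000, end_length=10000,
                    step=1000, num_trials=3):
    """Hypothetical outline of the length search (illustration only)."""
    max_length = 0
    trials_per_length = {}
    for total in range(start_length, end_length + 1, step):
        successes = 0
        for _ in range(num_trials):
            try:
                encode(total)  # stand-in for encoding `total` residues
                successes += 1
            except MemoryError:
                break  # assumed policy: stop trying this length on failure
        trials_per_length[total] = successes
        if successes < num_trials:
            break  # assumed policy: longer inputs would also fail
        max_length = total
    return {
        "max_length": max_length,
        # 90% of the maximum, using integer arithmetic to avoid rounding.
        "recommended_token_size": max_length * 9 // 10,
        "trials_per_length": trials_per_length,
    }


def fake_encode(total_length):
    """Simulated encoder that runs out of memory above 6500 residues."""
    if total_length > 6500:
        raise MemoryError("simulated out-of-memory")


result = sketch_estimate(fake_encode)
```

With the simulated limit of 6500, the last length that passes all trials is 6000, so the recommended budget in this sketch is 5400.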