Token Size Estimation for 3Di Encoding
Overview
The encode3di module has been refactored to improve code organization and add token size estimation functionality. This allows users to find the optimal encoding size for their GPU when converting proteins to 3Di structural tokens.
Module Structure
The encode3di module is now organized into separate files for better clarity:
New Module Organization
src/genome_entropy/encode3di/
├── __init__.py # Public API exports
├── types.py # Data types (ThreeDiRecord, IndexedSeq)
├── encoder.py # ProstT5ThreeDiEncoder class
├── encoding.py # Core encoding functions
├── token_estimator.py # Token size estimation utilities
└── prostt5.py # Backward compatibility exports
Key Components
types.py: Data structures
ThreeDiRecord: Represents a 3Di structural encoding
IndexedSeq: Sequence with original position index
encoder.py: Main encoder class
ProstT5ThreeDiEncoder: Converts amino acids to 3Di tokens
token_budget_batches(): Batch sequences under token budget
_encode_batch(): Encode a single batch
encoding.py: Core encoding logic with reduced complexity
preprocess_sequences(): Prepare sequences for encoding
process_batches(): Process batches with progress tracking
format_seconds(): Format time durations
get_memory_info(): Get GPU memory usage
token_estimator.py: New token size estimation
generate_random_protein(): Generate random protein sequences
generate_combined_proteins(): Generate multiple proteins
estimate_token_size(): Find optimal token budget
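As an illustration of what the generator helpers do, here is a minimal sketch of producing random test proteins whose lengths add up to a target total. The names random_protein and combined_proteins, their parameters, and the uniform sampling over the 20 standard amino acids are assumptions made for this example; the actual generate_random_protein() and generate_combined_proteins() may use different signatures and a different residue distribution.

```python
import random

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length, rng=None):
    """Return a random amino-acid string of the given length (hypothetical helper)."""
    rng = rng or random.Random()
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def combined_proteins(total_length, base_length=100, rng=None):
    """Return proteins whose lengths sum to exactly total_length (hypothetical helper)."""
    rng = rng or random.Random()
    proteins = []
    remaining = total_length
    while remaining > 0:
        # The last protein is shorter when total_length is not a multiple of base_length.
        n = min(base_length, remaining)
        proteins.append(random_protein(n, rng))
        remaining -= n
    return proteins


proteins = combined_proteins(350, base_length=100, rng=random.Random(0))
print([len(p) for p in proteins])  # [100, 100, 100, 50]
```

Passing a seeded random.Random makes the generated sequences reproducible between runs, which can be useful when comparing estimates.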
Token Size Estimation
Purpose
The token size (encoding size) is the maximum number of amino acids encoded in each GPU batch. Setting it too high can cause out-of-memory (OOM) errors, while setting it too low leaves GPU capacity unused and slows encoding.
The token size estimator helps you find the optimal value for your GPU.
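To make the role of the token size concrete, the sketch below groups sequences greedily so that each batch stays within a residue budget. This is only an illustration of the idea: the function name batch_by_token_budget is hypothetical, and the real token_budget_batches() may order or group sequences differently (for example by sorting on length first).

```python
def batch_by_token_budget(sequences, budget):
    """Greedily group sequences so each batch holds at most `budget` residues.

    A single sequence longer than the budget is placed in a batch of its own.
    """
    batches, current, used = [], [], 0
    for seq in sequences:
        # Close the current batch if adding this sequence would exceed the budget.
        if current and used + len(seq) > budget:
            batches.append(current)
            current, used = [], 0
        current.append(seq)
        used += len(seq)
    if current:
        batches.append(current)
    return batches


seqs = ["A" * 400, "C" * 300, "D" * 500, "E" * 1200, "F" * 100]
batches = batch_by_token_budget(seqs, budget=1000)
print([[len(s) for s in b] for b in batches])
# [[400, 300], [500], [1200], [100]]
```

A larger budget yields fewer, fuller batches, which is why the largest budget that still fits in GPU memory is the one worth finding.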
Usage
Via CLI
# Basic usage
genome_entropy estimate-tokens
# Custom range and parameters
genome_entropy estimate-tokens --start 3000 --end 10000 --step 1000 --trials 3
# Specify device
genome_entropy estimate-tokens --device cuda --model Rostlab/ProstT5_fp16
Via Python API
from genome_entropy.encode3di import ProstT5ThreeDiEncoder, estimate_token_size
# Initialize encoder
encoder = ProstT5ThreeDiEncoder()
# Run estimation
results = estimate_token_size(
encoder=encoder,
start_length=3000,
end_length=10000,
step=1000,
num_trials=3,
base_protein_length=100,
)
# Use recommended token size
print(f"Recommended token size: {results['recommended_token_size']} AA")
# Use in encoding
encoder.encode(proteins, encoding_size=results['recommended_token_size'])
How It Works
Generates random proteins: Creates realistic protein sequences of varying lengths
Combines into batches: Uses the same batching logic as actual encoding
Tests encoding: Attempts to encode with increasing total lengths
Catches OOM errors: Detects when GPU memory is exhausted
Recommends size: Returns 90% of the maximum successfully encoded length as a safety margin
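The steps above can be sketched as a simple search loop. This is a simplified model under stated assumptions, not the actual estimate_token_size() implementation: the names estimate_max_length and fake_encode are hypothetical, the GPU is replaced by a stand-in that raises MemoryError above a fixed capacity, and the search stops at the first length that fails. With PyTorch, the exception to catch would more likely be torch.cuda.OutOfMemoryError.

```python
def estimate_max_length(try_encode, start, end, step, trials=3, safety_percent=90):
    """Return (max_length, recommended) for the largest length that passes all trials.

    `try_encode(length)` is expected to raise MemoryError when the length does not fit.
    """
    max_length = None
    for length in range(start, end + 1, step):
        try:
            for _ in range(trials):
                try_encode(length)
        except MemoryError:
            break  # stop at the first length that exhausts memory
        max_length = length
    if max_length is None:
        return None, None
    # Integer arithmetic keeps the recommendation exact.
    return max_length, max_length * safety_percent // 100


def fake_encode(length, capacity=7500):
    """Stand-in for a real GPU encode call: fails above a fixed capacity."""
    if length > capacity:
        raise MemoryError(f"simulated OOM at {length} residues")


print(estimate_max_length(fake_encode, start=3000, end=10000, step=1000))
# (7000, 6300)
```

In this simulation the 8000-residue attempt fails, so 7000 is the largest successful length and 6300 (90% of it) is the recommendation.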
Output
The estimator returns a dictionary with:
max_length: Maximum length successfully encoded
recommended_token_size: 90% of max for safety (recommended)
trials_per_length: Number of successful trials per length tested
device: Device used for testing
Backward Compatibility
All existing imports continue to work:
# Old style - still works
from genome_entropy.encode3di.prostt5 import ThreeDiRecord, ProstT5ThreeDiEncoder
# New style - also works
from genome_entropy.encode3di import ThreeDiRecord, ProstT5ThreeDiEncoder
Testing
# Run all tests (excluding integration)
pytest tests/ -k "not integration"
# Run token estimator tests specifically
pytest tests/test_token_estimator.py -v
Examples
See examples/token_estimation_example.py for complete working examples.