Token Size Estimation for 3Di Encoding

Overview

The encode3di module has been refactored to improve code organization and add token size estimation functionality. This allows users to find the optimal encoding size for their GPU when converting proteins to 3Di structural tokens.

Module Structure

The encode3di module is now organized into separate files for better clarity:

New Module Organization

src/genome_entropy/encode3di/
├── __init__.py          # Public API exports
├── types.py             # Data types (ThreeDiRecord, IndexedSeq)
├── encoder.py           # ProstT5ThreeDiEncoder class
├── encoding.py          # Core encoding functions
├── token_estimator.py   # Token size estimation utilities
└── prostt5.py          # Backward compatibility exports

Key Components

types.py: Data structures
- ThreeDiRecord: Represents a 3Di structural encoding
- IndexedSeq: Sequence with original position index
encoder.py: Main encoder class
- ProstT5ThreeDiEncoder: Converts amino acids to 3Di tokens
- token_budget_batches(): Batch sequences under token budget
- _encode_batch(): Encode a single batch
encoding.py: Core encoding logic with reduced complexity
- preprocess_sequences(): Prepare sequences for encoding
- process_batches(): Process batches with progress tracking
- format_seconds(): Format time durations
- get_memory_info(): Get GPU memory usage
token_estimator.py: New token size estimation
- generate_random_protein(): Generate random protein sequences
- generate_combined_proteins(): Generate multiple proteins
- estimate_token_size(): Find optimal token budget

Token Size Estimation

Purpose

The token size (encoding size) determines how many amino acids are encoded in each GPU batch. Setting this too high can cause Out of Memory errors, while setting it too low wastes GPU capacity.

The token size estimator helps you find the optimal value for your GPU.

Usage

Via CLI

# Basic usage
genome_entropy estimate-tokens

# Custom range and parameters
genome_entropy estimate-tokens --start 3000 --end 10000 --step 1000 --trials 3

# Specify device
genome_entropy estimate-tokens --device cuda --model Rostlab/ProstT5_fp16

Via Python API

from genome_entropy.encode3di import ProstT5ThreeDiEncoder, estimate_token_size

# Initialize encoder
encoder = ProstT5ThreeDiEncoder()

# Run estimation
results = estimate_token_size(
    encoder=encoder,
    start_length=3000,
    end_length=10000,
    step=1000,
    num_trials=3,
    base_protein_length=100,
)

# Use recommended token size
print(f"Recommended token size: {results['recommended_token_size']} AA")

# Use in encoding
encoder.encode(proteins, encoding_size=results['recommended_token_size'])

How It Works

Generates random proteins: Creates realistic protein sequences of varying lengths
Combines into batches: Uses the same batching logic as actual encoding
Tests encoding: Attempts to encode with increasing total lengths
Catches OOM errors: Detects when GPU memory is exhausted
Recommends size: Returns 90% of maximum for safety margin

Output

The estimator returns a dictionary with:

max_length: Maximum length successfully encoded
recommended_token_size: 90% of max for safety (recommended)
trials_per_length: Number of successful trials per length tested
device: Device used for testing

Backward Compatibility

All existing imports continue to work:

# Old style - still works
from genome_entropy.encode3di.prostt5 import ThreeDiRecord, ProstT5ThreeDiEncoder

# New style - also works
from genome_entropy.encode3di import ThreeDiRecord, ProstT5ThreeDiEncoder

Testing

# Run all tests (excluding integration)
pytest tests/ -k "not integration"

# Run token estimator tests specifically
pytest tests/test_token_estimator.py -v

Examples

See examples/token_estimation_example.py for complete working examples.