genome_entropy.encode3di.encoding

Core encoding functions for amino acid to 3Di conversion.

Functions

  • encode(aa_sequences, encode_batch_fn, ...) – Encode amino acid sequences to 3Di tokens.

  • format_seconds(seconds) – Format seconds as H:MM:SS (or M:SS for < 1 hour).

  • get_memory_info() – Get current CUDA memory allocation and reservation in GB.

  • preprocess_sequences(aa_sequences) – Preprocess amino acid sequences for ProstT5 encoding.

  • process_batches(batches_iter, ...) – Process batches of sequences and return results in original order.

genome_entropy.encode3di.encoding.preprocess_sequences(aa_sequences)

Preprocess amino acid sequences for ProstT5 encoding.

Parameters:

aa_sequences (List[str]) – List of raw amino acid sequences

Returns:

List of preprocessed sequences ready for ProstT5 model

Return type:

List[str]
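
The page does not spell out what the preprocessing does. As a rough illustration only, the sketch below follows the convention described in the public ProstT5 usage notes: rare residues (U, Z, O, B) are mapped to X, residues are separated by spaces, and an "<AA2fold>" prefix marks the amino-acid-to-3Di direction. The name preprocess_sketch is hypothetical, and the steps are an assumption about what preprocess_sequences might do, not its actual implementation.

```python
import re
from typing import List


def preprocess_sketch(aa_sequences: List[str]) -> List[str]:
    """Hypothetical stand-in for preprocess_sequences (assumed behavior)."""
    processed = []
    for seq in aa_sequences:
        # Assumption: rare or ambiguous residues are replaced with X.
        cleaned = re.sub(r"[UZOB]", "X", seq.upper())
        # Assumption: residues are space-separated and prefixed with the
        # ProstT5 direction token for amino acid -> 3Di translation.
        processed.append("<AA2fold> " + " ".join(cleaned))
    return processed


print(preprocess_sketch(["MKTU"]))
```

The real function may differ in details such as case handling or the exact prefix, so check the source before relying on this shape.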

genome_entropy.encode3di.encoding.format_seconds(seconds)

Format seconds as H:MM:SS (or M:SS for < 1 hour).

Parameters:

seconds (float) – Duration in seconds to format

Return type:

str
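
To make the two output shapes concrete, here is a minimal sketch of one way such a formatter could work. The name format_seconds_sketch is hypothetical, and truncating fractional seconds is an assumption; the library function may round differently.

```python
def format_seconds_sketch(seconds: float) -> str:
    """Hypothetical formatter: H:MM:SS, or M:SS when under one hour."""
    total = max(0, int(seconds))  # assumption: truncate fractions, clamp negatives
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


print(format_seconds_sketch(75))     # under one hour, M:SS form
print(format_seconds_sketch(3725))   # one hour or more, H:MM:SS form
```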

genome_entropy.encode3di.encoding.get_memory_info()

Get current CUDA memory allocation and reservation in GB.

Returns:

Tuple of (allocated_gb, reserved_gb). Returns (0, 0) if CUDA is not available.

Return type:

Tuple[float, float]
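
A query like this can be sketched with PyTorch's torch.cuda.memory_allocated() and torch.cuda.memory_reserved(), which report bytes for the current device. The function name get_memory_info_sketch is hypothetical, and two details are assumptions: that "GB" means 1024**3 bytes, and that a missing PyTorch install should be treated the same as missing CUDA.

```python
from typing import Tuple


def get_memory_info_sketch() -> Tuple[float, float]:
    """Hypothetical stand-in: (allocated_gb, reserved_gb), zeros without CUDA."""
    try:
        import torch
    except ImportError:
        # Assumption: no PyTorch is handled like no CUDA.
        return (0.0, 0.0)
    if not torch.cuda.is_available():
        return (0.0, 0.0)
    gib = 1024 ** 3  # assumption: "GB" is interpreted as 2**30 bytes
    allocated = torch.cuda.memory_allocated() / gib
    reserved = torch.cuda.memory_reserved() / gib
    return (allocated, reserved)


allocated_gb, reserved_gb = get_memory_info_sketch()
print(f"allocated={allocated_gb:.2f} GB, reserved={reserved_gb:.2f} GB")
```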

genome_entropy.encode3di.encoding.process_batches(batches_iter, encode_batch_fn, total_sequences, total_batches)

Process batches of sequences and return results in original order.

Parameters:
  • batches_iter (Iterator[Any]) – Iterator yielding batches of IndexedSeq objects

  • encode_batch_fn (Callable[[List[str]], List[str]]) – Function to encode a batch of sequences

  • total_sequences (int) – Total number of sequences being processed

  • total_batches (int) – Total number of batches to process

Returns:

List of encoded 3Di sequences in original input order

Return type:

List[str]
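
Returning results "in original order" matters because batches are often built out of input order, for example after sorting by length. The sketch below shows one way to restore order, assuming each batch item carries its original position. IndexedSeqSketch, its idx and seq fields, and process_batches_sketch are all hypothetical names; the real IndexedSeq type and the library's progress reporting are not shown on this page.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional


class IndexedSeqSketch(NamedTuple):
    """Hypothetical stand-in for IndexedSeq: a sequence plus its input index."""
    idx: int
    seq: str


def process_batches_sketch(
    batches_iter: Iterable[List[IndexedSeqSketch]],
    encode_batch_fn: Callable[[List[str]], List[str]],
    total_sequences: int,
) -> List[Optional[str]]:
    # Pre-size the result list so each output lands at its original index.
    results: List[Optional[str]] = [None] * total_sequences
    for batch in batches_iter:
        encoded = encode_batch_fn([item.seq for item in batch])
        for item, enc in zip(batch, encoded):
            results[item.idx] = enc
    missing = [i for i, value in enumerate(results) if value is None]
    if missing:
        raise ValueError(f"no result for input indices {missing}")
    return results


# Batches arrive out of input order; the stub encoder just lowercases.
batches = [
    [IndexedSeqSketch(1, "MKTAY"), IndexedSeqSketch(0, "MK")],
    [IndexedSeqSketch(2, "M")],
]
ordered = process_batches_sketch(iter(batches), lambda b: [s.lower() for s in b], 3)
print(ordered)
```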

genome_entropy.encode3di.encoding.encode(aa_sequences, encode_batch_fn, token_budget_batches_fn, encoding_size)

Encode amino acid sequences to 3Di tokens.

This is a standalone encoding function that orchestrates the encoding pipeline.

Parameters:
  • aa_sequences (List[str]) – List of amino acid sequences (uppercase, standard 20 AAs)

  • encode_batch_fn (Callable[[List[str]], List[str]]) – Function that encodes a batch of preprocessed sequences

  • token_budget_batches_fn (Callable[[List[str], int], Iterator[Any]]) – Function that batches sequences under token budget

  • encoding_size (int) – Approximate maximum number of amino acids to encode per batch

Returns:

List of 3Di token sequences (one per input sequence)

Raises:

EncodingError – If encoding fails

Return type:

List[str]
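
The token_budget_batches_fn argument is described only by its type, so the following is a toy illustration of what a function with that shape could look like: it groups sequences greedily so that each batch stays within a residue budget. The name token_budget_batches_sketch is hypothetical, counting one token per residue is an assumption, and the (index, sequence) tuples are only a stand-in for the IndexedSeq objects mentioned under process_batches.

```python
from typing import Iterator, List, Tuple


def token_budget_batches_sketch(
    sequences: List[str], budget: int
) -> Iterator[List[Tuple[int, str]]]:
    """Hypothetical batcher: greedy grouping under a per-batch residue budget."""
    batch: List[Tuple[int, str]] = []
    used = 0
    for idx, seq in enumerate(sequences):
        # Start a new batch when adding this sequence would exceed the budget.
        # A single over-budget sequence still gets a batch of its own.
        if batch and used + len(seq) > budget:
            yield batch
            batch, used = [], 0
        batch.append((idx, seq))
        used += len(seq)
    if batch:
        yield batch


demo_batches = list(token_budget_batches_sketch(["AAAA", "CC", "GGGGG", "T"], 6))
for demo_batch in demo_batches:
    print(demo_batch)
```

A function of this shape, together with a batch encoder matching Callable[[List[str]], List[str]], is what encode expects to receive; encoding_size would play the role of the budget.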