genome_entropy.io.genbank

GenBank file reading and parsing utilities.

Functions

extract_cds_features(genbank_path)

Extract CDS features from a GenBank file.

match_orf_to_genbank_cds(orf_aa_sequence, ...)

Check if an ORF matches any GenBank CDS by C-terminal sequence.

read_genbank(genbank_path)

Read a GenBank file and return a dictionary of sequence_id -> DNA sequence.

Classes

GenBankCDS(parent_id, start, end, strand, ...)

Represents a CDS (Coding Sequence) feature from GenBank.

class genome_entropy.io.genbank.GenBankCDS(parent_id, start, end, strand, protein_sequence)

Represents a CDS (Coding Sequence) feature from GenBank.

Parameters:
  • parent_id (str)

  • start (int)

  • end (int)

  • strand (str)

  • protein_sequence (str)

parent_id

ID of the parent sequence

Type:

str

start

0-based start position (inclusive)

Type:

int

end

0-based end position (exclusive)

Type:

int

strand

Strand orientation ('+' or '-')

Type:

str

protein_sequence

Translated protein sequence

Type:

str

parent_id: str
start: int
end: int
strand: str
protein_sequence: str
__init__(parent_id, start, end, strand, protein_sequence)
Parameters:
  • parent_id (str)

  • start (int)

  • end (int)

  • strand (str)

  • protein_sequence (str)

Return type:

None
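The coordinate convention above (0-based, inclusive start, exclusive end) maps directly onto a Python slice. The sketch below uses a stand-in dataclass, GenBankCDSStandIn, that mirrors the documented fields; it is not the library class, and the helper cds_nucleotides, the contig and the record values are hypothetical. Only a '+' strand record is shown, because how the library treats '-' strand sequence is not stated here.

```python
from dataclasses import dataclass


@dataclass
class GenBankCDSStandIn:
    """Stand-in mirroring the documented fields; NOT the library class."""
    parent_id: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # '+' or '-'
    protein_sequence: str


def cds_nucleotides(cds: GenBankCDSStandIn, parent_sequence: str) -> str:
    """Hypothetical helper: slice the forward-strand bases covered by a CDS."""
    # Half-open coordinates need no +1/-1 adjustment when slicing.
    return parent_sequence[cds.start:cds.end]


# Made-up contig containing ATG AAA TAA at positions 2..10.
contig = "GGATGAAATAACC"
cds = GenBankCDSStandIn("contig_1", 2, 11, "+", "MK")

print(cds_nucleotides(cds, contig))  # bases covered by the feature
print(cds.end - cds.start)           # feature length in nucleotides
```

With half-open coordinates, end - start is the feature length, and adjacent features can share a boundary value without overlapping.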

genome_entropy.io.genbank.read_genbank(genbank_path)

Read a GenBank file and return a dictionary of sequence_id -> DNA sequence.

Automatically detects and handles gzipped files (ending in .gz).

Parameters:

genbank_path (str | Path) – Path to GenBank file (plain text or gzipped)

Returns:

Dictionary mapping sequence IDs to DNA sequences

Return type:

Dict[str, str]
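Detecting gzip by the .gz suffix can be sketched with the standard library alone. This is a minimal illustration of the idea, not the library's implementation: the helper name open_text_maybe_gzip and the demo file names are assumptions, and the GenBank parsing step is left out.

```python
import gzip
import tempfile
from pathlib import Path
from typing import IO, Union


def open_text_maybe_gzip(path: Union[str, Path]) -> IO[str]:
    """Hypothetical helper: open a text file, decompressing if it ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        # "rt" makes gzip return decoded text instead of bytes.
        return gzip.open(path, "rt")
    return open(path, "r")


def roundtrip_demo() -> bool:
    """Write the same text plain and gzipped, then read both back."""
    text = "LOCUS       demo\n//\n"
    with tempfile.TemporaryDirectory() as tmp:
        plain = Path(tmp) / "demo.gbk"
        zipped = Path(tmp) / "demo.gbk.gz"
        plain.write_text(text)
        with gzip.open(zipped, "wt") as handle:
            handle.write(text)
        results = []
        for candidate in (plain, zipped):
            with open_text_maybe_gzip(candidate) as handle:
                results.append(handle.read())
    return results == [text, text]


print(roundtrip_demo())
```

Because the check looks only at the file name, a gzipped file without the .gz suffix would be opened as plain text in this sketch.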

genome_entropy.io.genbank.extract_cds_features(genbank_path)

Extract CDS features from a GenBank file.

Automatically detects and handles gzipped files (ending in .gz).

Parameters:

genbank_path (str | Path) – Path to GenBank file (plain text or gzipped)

Returns:

List of GenBankCDS objects

Return type:

List[GenBankCDS]

genome_entropy.io.genbank.match_orf_to_genbank_cds(orf_aa_sequence, genbank_cds_list, min_c_terminal_match=10)

Check if an ORF matches any GenBank CDS by C-terminal sequence.

A match is determined by comparing the C-terminal (end) residues of the two protein sequences. Because only the C-terminus is compared, an ORF can still match when its predicted start position differs from the annotated CDS start.

Parameters:
  • orf_aa_sequence (str) – Amino acid sequence of the ORF

  • genbank_cds_list (List[GenBankCDS]) – List of CDS features from GenBank

  • min_c_terminal_match (int) – Minimum length of C-terminal sequence to match (default: 10)

Returns:

True if the ORF C-terminal matches any GenBank CDS, False otherwise

Return type:

bool
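One plausible reading of C-terminal matching is a suffix comparison over the last min_c_terminal_match residues. The sketch below follows that reading. It is an assumption about the behavior, not the library's code: the names CDSStandIn and c_terminal_match_sketch are hypothetical, the sequences are made up, and the rule that sequences shorter than the window never match is a choice made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CDSStandIn:
    """Hypothetical stand-in carrying only the field this sketch needs."""
    protein_sequence: str


def c_terminal_match_sketch(
    orf_aa_sequence: str,
    cds_list: List[CDSStandIn],
    min_c_terminal_match: int = 10,
) -> bool:
    """Return True if the ORF tail equals the tail of any CDS protein."""
    if min_c_terminal_match <= 0:
        raise ValueError("min_c_terminal_match must be positive")
    # Assumption: an ORF shorter than the window cannot match.
    if len(orf_aa_sequence) < min_c_terminal_match:
        return False
    orf_tail = orf_aa_sequence[-min_c_terminal_match:]
    for cds in cds_list:
        protein = cds.protein_sequence
        # Assumption: a CDS protein shorter than the window is skipped.
        if len(protein) >= min_c_terminal_match and protein.endswith(orf_tail):
            return True
    return False


annotated = [CDSStandIn("MKTAYIAKQRQISFVKSHFSRQ")]
truncated_orf = "IAKQRQISFVKSHFSRQ"     # later start, same C-terminus
unrelated_orf = "MSDNELLKKAWWGGHHPPLL"  # different C-terminus

print(c_terminal_match_sketch(truncated_orf, annotated))
print(c_terminal_match_sketch(unrelated_orf, annotated))
```

The truncated ORF starts five residues later than the annotated protein but shares its last ten residues, which is the start-position disagreement the description refers to.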