genome_entropy.io.genbank
GenBank file reading and parsing utilities.
Functions
extract_cds_features | Extract CDS features from a GenBank file.
match_orf_to_genbank_cds | Check if an ORF matches any GenBank CDS by C-terminal sequence.
read_genbank | Read a GenBank file and return a dictionary of sequence_id -> DNA sequence.
Classes
GenBankCDS | Represents a CDS (Coding Sequence) feature from GenBank.
- class genome_entropy.io.genbank.GenBankCDS(parent_id, start, end, strand, protein_sequence)
Represents a CDS (Coding Sequence) feature from GenBank.
- genome_entropy.io.genbank.read_genbank(genbank_path)
Read a GenBank file and return a dictionary of sequence_id -> DNA sequence.
Automatically detects and handles gzipped files (ending in .gz).
- Parameters:
genbank_path (str | Path) – Path to GenBank file (plain text or gzipped)
- Returns:
Dictionary mapping sequence IDs to DNA sequences
- Raises:
FileNotFoundError – If the GenBank file doesn’t exist
ValueError – If the GenBank file is malformed
- Return type:
Dict[str, str]
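The gzip handling described above can be illustrated with a minimal sketch. This is not the library's implementation: `open_text_maybe_gzip` is a hypothetical helper, and checking only the `.gz` suffix (rather than the file's magic bytes) is an assumption based on the phrase "ending in .gz".

```python
import gzip
from pathlib import Path
from typing import TextIO, Union


def open_text_maybe_gzip(path: Union[str, Path]) -> TextIO:
    """Open a plain-text or gzipped file for reading as text.

    Hypothetical helper, not part of genome_entropy. Detection is by
    file extension only, which is an assumption.
    """
    p = Path(path)
    if not p.exists():
        # Mirrors the documented FileNotFoundError for a missing file.
        raise FileNotFoundError(f"GenBank file not found: {p}")
    if p.suffix == ".gz":
        # "rt" makes gzip decompress and decode to text on the fly.
        return gzip.open(p, "rt")
    return open(p, "r")
```

With a helper like this, the same parsing code can consume `genome.gbk` and `genome.gbk.gz` without branching at the call site.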
- genome_entropy.io.genbank.extract_cds_features(genbank_path)
Extract CDS features from a GenBank file.
Automatically detects and handles gzipped files (ending in .gz).
- Parameters:
genbank_path (str | Path) – Path to GenBank file (plain text or gzipped)
- Returns:
List of GenBankCDS objects
- Raises:
FileNotFoundError – If the GenBank file doesn’t exist
ValueError – If the GenBank file is malformed
- Return type:
List[GenBankCDS]
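As an illustration of working with the returned list, the sketch below groups CDS records by their parent sequence. `CDSRecord` is a hypothetical stand-in that borrows its field names from the `GenBankCDS` constructor signature; the field types (in particular a string strand) are assumptions, and `group_by_parent` is not part of the library.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CDSRecord:
    """Hypothetical stand-in for GenBankCDS; field types are assumed."""
    parent_id: str
    start: int
    end: int
    strand: str  # assumed "+" or "-"; the real class may use another encoding
    protein_sequence: str


def group_by_parent(cds_list: List[CDSRecord]) -> Dict[str, List[CDSRecord]]:
    """Group CDS records by the ID of the sequence they were annotated on."""
    grouped: Dict[str, List[CDSRecord]] = defaultdict(list)
    for cds in cds_list:
        grouped[cds.parent_id].append(cds)
    return dict(grouped)
```

Grouping this way makes it straightforward to pair each record's CDS list with the matching entry of the sequence_id -> DNA sequence dictionary returned by `read_genbank`.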
- genome_entropy.io.genbank.match_orf_to_genbank_cds(orf_aa_sequence, genbank_cds_list, min_c_terminal_match=10)
Check if an ORF matches any GenBank CDS by C-terminal sequence.
A match is determined by comparing the C-terminal ends of the ORF and CDS protein sequences. This tolerates cases where the predicted ORF starts at a different position than the annotated CDS.
- Parameters:
orf_aa_sequence (str) – Amino acid sequence of the ORF
genbank_cds_list (List[GenBankCDS]) – List of CDS features from GenBank
min_c_terminal_match (int) – Minimum length of C-terminal sequence to match (default: 10)
- Returns:
True if the ORF C-terminal matches any GenBank CDS, False otherwise
- Return type:
bool
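One plausible reading of the C-terminal comparison is sketched below. It is not the library's implementation: `c_terminal_match` and `CDSRecord` are hypothetical names, and the exact rule (compare the last `min_len` residues, and treat sequences shorter than `min_len` as non-matching) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CDSRecord:
    """Hypothetical stand-in for GenBankCDS; only the protein is needed here."""
    protein_sequence: str


def c_terminal_match(orf_aa: str, cds_list: List[CDSRecord], min_len: int = 10) -> bool:
    """Return True if the last `min_len` residues of the ORF end any CDS protein.

    Assumed rule: an ORF shorter than `min_len` never matches.
    """
    if min_len <= 0 or len(orf_aa) < min_len:
        return False
    tail = orf_aa[-min_len:]
    # endswith() is False for any CDS protein shorter than the tail.
    return any(cds.protein_sequence.endswith(tail) for cds in cds_list)
```

Under this rule, an ORF that begins upstream or downstream of the annotated start still matches as long as both proteins share the same final residues.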