genome_entropy.io.jsonio
JSON serialization for data models.
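Functions
- convert_pipeline_result_to_unified(pipeline_result): Convert PipelineResult to UnifiedPipelineResult format.
- read_json(input_path): Read JSON data from a file.
- to_json_dict(obj): Convert a dataclass object to a JSON-serializable dictionary.
- write_json(data, output_path, indent=2): Write data to a JSON file.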
- genome_entropy.io.jsonio.to_json_dict(obj)[source]
Convert a dataclass object to a JSON-serializable dictionary.
Recursively handles nested dataclasses, lists, and dictionaries.
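The recursion described above can be illustrated with a minimal sketch. This is not the library's implementation: `to_json_dict_sketch`, `Location`, and `Feature` are hypothetical names introduced only for this example, and the real function may treat additional types differently.

```python
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


def to_json_dict_sketch(obj: Any) -> Any:
    """Hypothetical stand-in for to_json_dict; an assumption, not library code."""
    # Dataclass instances become dicts, converting each field recursively.
    if is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: to_json_dict_sketch(getattr(obj, f.name)) for f in fields(obj)}
    # Lists and tuples become lists of converted items.
    if isinstance(obj, (list, tuple)):
        return [to_json_dict_sketch(item) for item in obj]
    # Dictionaries keep their keys (as strings) and convert their values.
    if isinstance(obj, dict):
        return {str(key): to_json_dict_sketch(value) for key, value in obj.items()}
    # Anything else is assumed to be JSON-serializable already.
    return obj


@dataclass
class Location:  # hypothetical model for the example
    start: int
    end: int
    strand: str


@dataclass
class Feature:  # hypothetical model for the example
    orf_id: str
    location: Location
    tags: list


feature = Feature("orf_1", Location(0, 90, "+"), ["demo"])
result = to_json_dict_sketch(feature)
print(json.dumps(result, indent=2))
```

The nested `Location` dataclass ends up as a plain nested dictionary, so the result can be passed directly to `json.dumps`.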
- genome_entropy.io.jsonio.convert_pipeline_result_to_unified(pipeline_result)[source]
Convert PipelineResult to UnifiedPipelineResult format.
This function transforms the old redundant format (separate orfs, proteins, three_dis lists) into the new unified format where each feature appears exactly once with all its related data organized hierarchically.
OLD FORMAT PROBLEM:
The old format had three parallel lists:
- orfs: [ORF1, ORF2, …]
- proteins: [{orf: ORF1, aa_seq: …}, {orf: ORF2, aa_seq: …}, …]
- three_dis: [{protein: {orf: ORF1, …}, 3di: …}, …]
This caused:
1. ORF data duplicated 3 times (in orfs, inside proteins, inside three_dis)
2. Protein data duplicated 2 times (in proteins, inside three_dis)
3. ~2-3x larger files due to redundancy
4. Risk of inconsistency if data differs between copies
NEW UNIFIED FORMAT:
Single features dictionary with hierarchical organization:
features: {
  "orf_1": {
    location: {start, end, strand, frame},
    dna: {sequence, length},
    protein: {sequence, length},
    three_di: {encoding, length, method, model, device},
    metadata: {parent_id, table_id, has_start, has_stop, in_genbank},
    entropy: {dna_entropy, protein_entropy, three_di_entropy}
  }
}
Benefits:
1. Each piece of information stored exactly once
2. 40-50% smaller file sizes
3. Direct O(1) access by orf_id
4. Clear hierarchical organization matching biological concepts
5. Single source of truth - no inconsistency possible
- Parameters:
pipeline_result – PipelineResult object or list of PipelineResult objects
- Returns:
UnifiedPipelineResult object or list of UnifiedPipelineResult objects
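To make the old-to-new transformation concrete, here is a rough sketch that works on plain dictionaries rather than the real PipelineResult and UnifiedPipelineResult classes. The function `unify_sketch` and the sample record are hypothetical. The key names follow the old-to-new key mapping documented for write_json, and only the location, dna, protein, and three_di groups are filled in.

```python
import json

# Hypothetical old-format record: the same ORF dict is embedded three times.
orf = {"orf_id": "orf_1", "start": 0, "end": 9, "strand": "+", "frame": 0,
       "nt_sequence": "ATGAAATAA"}
protein = {"orf": orf, "aa_sequence": "MK"}
old_result = {
    "orfs": [orf],
    "proteins": [protein],
    "three_dis": [{"protein": protein, "three_di": "DV"}],
}


def unify_sketch(old):
    """Hypothetical illustration of the conversion; not the library's code."""
    features = {}
    # Each ORF becomes one feature entry keyed by its orf_id.
    for item in old["orfs"]:
        features[item["orf_id"]] = {
            "orf_id": item["orf_id"],
            "location": {key: item[key] for key in ("start", "end", "strand", "frame")},
            "dna": {"nt_sequence": item["nt_sequence"],
                    "length": len(item["nt_sequence"])},
        }
    # Protein data is attached to the feature of the ORF it was translated from.
    for item in old["proteins"]:
        entry = features[item["orf"]["orf_id"]]
        entry["protein"] = {"aa_sequence": item["aa_sequence"],
                            "length": len(item["aa_sequence"])}
    # 3Di data is attached the same way, via the nested protein's ORF.
    for item in old["three_dis"]:
        entry = features[item["protein"]["orf"]["orf_id"]]
        entry["three_di"] = {"encoding": item["three_di"],
                             "length": len(item["three_di"])}
    return {"schema_version": "2.0.0", "features": features}


unified = unify_sketch(old_result)
print(json.dumps(unified, indent=2))
```

In this toy record the nucleotide sequence is serialized three times in the old layout and once in the unified layout, which is the redundancy the conversion is meant to remove.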
- genome_entropy.io.jsonio.write_json(data, output_path, indent=2)[source]
Write data to a JSON file.
Dataclass objects are converted to dictionaries before serialization. If data contains PipelineResult objects, they are converted to the new unified format to eliminate redundancy. Output is compressed when the filename ends with .gz.
AUTOMATIC CONVERSION:
This function transparently converts old-format PipelineResult objects to the new unified format. This means:
- Users don’t need to manually call convert_pipeline_result_to_unified()
- All JSON output from the pipeline automatically uses the new format
- The conversion happens only once during serialization
- No changes needed to pipeline code or user scripts
MAPPING: Old Keys → New Structure
- OLD FORMAT:
orfs[i].orf_id → features[orf_id].orf_id
orfs[i].start → features[orf_id].location.start
orfs[i].nt_sequence → features[orf_id].dna.nt_sequence
proteins[i].aa_sequence → features[orf_id].protein.aa_sequence
three_dis[i].three_di → features[orf_id].three_di.encoding
entropy.orf_nt_entropy[id] → features[id].entropy.dna_entropy
- NEW FORMAT adds:
schema_version: "2.0.0" (for compatibility tracking)
features: dict (replaces orfs, proteins, three_dis lists)
Hierarchical organization (location, dna, protein, three_di, metadata, entropy)
- Parameters:
data – Data to write (dataclass, dict, list, etc.)
output_path – Path to output JSON file (plain text or .gz for compressed)
indent – Indentation level for pretty printing (default: 2)
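The suffix-based compression described above could be implemented roughly as follows. This sketch assumes gzip is the compression format (suggested by the .gz suffix) and omits the dataclass and PipelineResult conversion steps. `write_json_sketch` is a hypothetical name, not the library function.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_json_sketch(data, output_path, indent=2):
    """Hypothetical stand-in for write_json; handles plain JSON data only."""
    path = Path(output_path)
    # Choose the opener from the file suffix: gzip for .gz, plain text otherwise.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "wt", encoding="utf-8") as handle:
        json.dump(data, handle, indent=indent)


payload = {"schema_version": "2.0.0", "features": {}}

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "result.json"
    packed_path = Path(tmp) / "result.json.gz"
    write_json_sketch(payload, plain_path)
    write_json_sketch(payload, packed_path)
    plain_text = plain_path.read_text(encoding="utf-8")
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    packed_magic = packed_path.read_bytes()[:2]
```

Keeping the choice of opener in one place means the serialization call itself is identical for compressed and uncompressed output.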
- genome_entropy.io.jsonio.read_json(input_path)[source]
Read JSON data from a file.
Automatically detects and handles gzipped files (ending in .gz).
- Parameters:
input_path (str | Path) – Path to input JSON file (plain text or gzipped)
- Returns:
Parsed JSON data (dict, list, etc.)
- Raises:
FileNotFoundError – If the JSON file doesn’t exist
json.JSONDecodeError – If the file contains invalid JSON
- Return type:
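The reading side can be sketched in the same way. `read_json_sketch` is a hypothetical name, and the suffix check plus gzip are assumptions about how detection might work. The two exceptions demonstrated are the ones listed under Raises.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_json_sketch(input_path):
    """Hypothetical stand-in for read_json; an assumption, not library code."""
    path = Path(input_path)
    # Files ending in .gz are opened through gzip, everything else as plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    # Round trip through a gzipped file.
    gz_path = Path(tmp) / "data.json.gz"
    with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
        json.dump({"schema_version": "2.0.0"}, handle)
    loaded = read_json_sketch(gz_path)

    # Invalid JSON surfaces as json.JSONDecodeError.
    bad_path = Path(tmp) / "bad.json"
    bad_path.write_text("{not json", encoding="utf-8")
    try:
        read_json_sketch(bad_path)
        decode_failed = False
    except json.JSONDecodeError:
        decode_failed = True

    # A missing file surfaces as FileNotFoundError.
    try:
        read_json_sketch(Path(tmp) / "missing.json")
        missing_failed = False
    except FileNotFoundError:
        missing_failed = True
```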