Skip to content

Convert a Genbank file with many extraneous Geneious annotations to EMBL format (even with of strict requirements as IPD)

License

Notifications You must be signed in to change notification settings

dholab/embl-my-gbk

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EMBL-my-Genbank: Convert a complicated Genbank file to a barebones EMBL file

Usage

This repo contains a minimal-dependency, Ruff-formatted, pure Python module that can be accessed directly through a script, e.g.,

python3 embl_my_genbank/embl_my_genbank.py -g file.gb -s "Homo sapiens"

Or, after running poetry install, as a command line command:

emb_my_gbk -g file.gb -s "Homo sapiens"

The repo comes with a pyproject.toml and requirements.txt that can be used with pip, pipx, uv, Poetry, or whatever declarative Python manager you prefer, to reproduce our dev environemnt--though really, the only non-stdlib dependencies are BioPython and Polars.

For complete documentation of the API, see the HTML in docs/. Recommended usage looks like this:

usage: emb_my_gbk [-h] --gb_path GB_PATH --species SPECIES [--out_fmt OUT_FMT] [--view_intermediate VIEW_INTERMEDIATE]

options:
  -h, --help            show this help message and exit
  --gb_path GB_PATH, -g GB_PATH
                        Genbank file to be converted.
  --metadata METADATA, -m METADATA
                        Metadata file with, at minimum, a column of allele names and a column of representative animals.
  --species SPECIES, -s SPECIES
                        Scientific name for the species under examination.
  --out_fmt OUT_FMT, -o OUT_FMT
                        Format to convert to. Can convert to EMBL, IPD_EMBL, and FASTA.
  --view_intermediate, -v
                        Whether to write out the intermediate cleaned Genbank file for inspection.

Purpose

With some regularity, the Genomic Services Unit at the Wisconsin National Primate Research Center run an allele discovery pipeline on primate MHC amplicons, the goal of which is to submit previously undocumented MHC alleles to the Immuno-Polymorphism Database (IPD). One of the primary pain points in this process is converting between Genbank format, which we use internally to annotate and review allele candidates alongside exon annotations, and the equally complicated EMBL format, which IPD requires. Additionally, IPD imposes non-standard EMBL format requirements, which we have to adhere to in as many as hundreds of new allele candidates. This python module implements many cleaning and conversion steps to take our internal review Genbank files and convert them into IPD-compatible EMBL files--a surprisingly tricky process!

About

Convert a Genbank file with many extraneous Geneious annotations to EMBL format (even with of strict requirements as IPD)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Languages

  • Python 100.0%