Skip to content

3. Linking MitoFish and NCBI datasets

shenjean edited this page Mar 16, 2022 · 28 revisions

Match accession numbers to primary accession numbers in NCBI records

Create pickle

  • createpickle.py (available in db.scripts folder in mitohelper repository) creates a dictionary from nt.list.
  • The key is the NCBI accession number and the value is the gene description.
  • This dictionary is then saved as a pickle named nt.pickle:
#!/usr/bin/env python
import pickle
ntdict = {'Accession':'Gene description'}

with open("nt.list",'r') as f:
    for line in f:
        line = line.rstrip()
        entry = line.split("\t")
        fullacc=entry[0].split(".")
        newentry={fullacc[0]:entry[1]} 
        ntdict.update(newentry)

with open("nt.pickle", 'wb') as handle:
    pickle.dump(ntdict, handle, protocol=pickle.HIGHEST_PROTOCOL)

Search pickle

  • getpickle.py (available in db.scripts folder in mitohelper repository) loads nt.pickle, then matches MitoFish entries with keys (NCBI accession numbers) in nt.pickle.
  • It took ~24 minutes to match 616,999 MitoFish records to 57,377,397 NCBI GenBank records.
  • Computing resources used: cpupercent=101,cput=00:22:53,mem=34500088kb,ncpus=24,vmem=34845840kb,walltime=00:23:34
#!/usr/bin/env python

import pickle

import sys
import os.path
from os import path

with open("nt.pickle", 'rb') as handle:
    NCBI = pickle.load(handle)

outfile="mitofish.genes"
nohit=0
count=0
length=(len(NCBI))

if path.exists(outfile) :
    sys.exit("Error: Output file exists! Please rename output file and try again!")

output=open(outfile,'a')

print("==== Searching... ====")

with open("mitofish.accession",'r') as infile:
    for inline in infile:
        inline = inline.rstrip()
        count +=1
        if inline in NCBI:
            output.write("%s\t%s\n" % (inline,NCBI[inline]))
        elif inline not in NCBI:
            output.write("%s\tNo hit found!\n" % inline)
            print("%s\tNo hit found!" % inline)
            nohit +=1
            
output.close()

print ("==== Run complete! ===")
print ("Total: %d accession numbers" % count)
print ("No hits for %d input accession numbers!" % nohit)
  • Extract hits from getpickle.py. The output file is genedef.1.tsv:
grep -v "No hit" mitofish.genes >genedef.1.tsv

Match remaining records

  • In NCBI fasta files, records with duplicate sequences will be concatenated in the header line, and will not be picked up by getpickle.py:
>LN610233.1 Chiloglanis anoterus mitochondrial partial D-loop, specimen voucher 68321_34LN610234.1 Chiloglanis anoterus mitochondrial partial D-loop, specimen voucher 68330_55
  • getpickle.py also does not pick up mostly UNVERIFIED entries not in nt fasta file

  • So, let's extract the accession numbers with no hits into an output file named nohit.accession:

grep "No hit" mitofish.genes | cut -f1 >nohit.accession
  • Get gene definitions for these accession numbers with the efetch Unix tool in NCBI's Entrez Direct package:
for i in `cat nohit.accession`
do
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>genedef.2.tsv
done

Combine hits

cat genedef.1.tsv genedef.2.tsv | sort -k 1 >complete.partial.gene.tsv