3. Linking MitoFish and NCBI datasets

Match accession numbers to primary accession numbers in NCBI records

Create pickle

createpickle.py (available in db.scripts folder in mitohelper repository) creates a dictionary from nt.list.

The key is the NCBI accession number and the value is the gene description.

This dictionary is then saved as a pickle named nt.pickle:

#!/usr/bin/env python
import pickle
ntdict = {'Accession':'Gene description'}

with open("nt.list",'r') as f:
    for line in f:
        line = line.rstrip()
        entry = line.split("\t")
        fullacc=entry[0].split(".")
        newentry={fullacc[0]:entry[1]} 
        ntdict.update(newentry)

with open("nt.pickle", 'wb') as handle:
    pickle.dump(ntdict, handle, protocol=pickle.HIGHEST_PROTOCOL)

Search pickle

getpickle.py (available in db.scripts folder in mitohelper repository) loads nt.pickle, then matches MitoFish entries with keys (NCBI accession numbers) in nt.pickle.

It took ~24 minutes to match 616,999 MitoFish records to 57,377,397 NCBI GenBank records.

Computing resources used: cpupercent=101,cput=00:22:53,mem=34500088kb,ncpus=24,vmem=34845840kb,walltime=00:23:34

#!/usr/bin/env python

import pickle

import sys
import os.path
from os import path

with open("nt.pickle", 'rb') as handle:
    NCBI = pickle.load(handle)

outfile="mitofish.genes"
nohit=0
count=0
length=(len(NCBI))

if path.exists(outfile) :
    sys.exit("Error: Output file exists! Please rename output file and try again!")

output=open(outfile,'a')

print("==== Searching... ====")

with open("mitofish.accession",'r') as infile:
    for inline in infile:
        inline = inline.rstrip()
        count +=1
        if inline in NCBI:
            output.write("%s\t%s\n" % (inline,NCBI[inline]))
        elif inline not in NCBI:
            output.write("%s\tNo hit found!\n" % inline)
            print("%s\tNo hit found!" % inline)
            nohit +=1
            
output.close()

print ("==== Run complete! ===")
print ("Total: %d accession numbers" % count)
print ("No hits for %d input accession numbers!" % nohit)

Extract hits from getpickle.py. The output file is genedef.1.tsv:

grep -v "No hit" mitofish.genes >genedef.1.tsv

Match remaining records

In NCBI fasta files, records with duplicate sequences will be concatenated in the header line, and will not be picked up by getpickle.py:

>LN610233.1 Chiloglanis anoterus mitochondrial partial D-loop, specimen voucher 68321_34LN610234.1 Chiloglanis anoterus mitochondrial partial D-loop, specimen voucher 68330_55

getpickle.py also does not pick up mostly UNVERIFIED entries not in nt fasta file
So, let's extract the accession numbers with no hits into an output file named nohit.accession:

grep "No hit" mitofish.genes | cut -f1 >nohit.accession

Get gene definitions for these accession numbers with the efetch Unix tool in NCBI's Entrez Direct package:

for i in `cat nohit.accession`
do
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>genedef.2.tsv
done

Combine hits

cat genedef.1.tsv genedef.2.tsv | sort -k 1 >complete.partial.gene.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3. Linking MitoFish and NCBI datasets

Match accession numbers to primary accession numbers in NCBI records

Create pickle

Search pickle

Match remaining records

Combine hits

Clone this wiki locally