-
Notifications
You must be signed in to change notification settings - Fork 8
3. Linking MitoFish and NCBI datasets
shenjean edited this page Mar 16, 2022
·
28 revisions
createpickle.py
(available in db.scripts folder in mitohelper repository) creates a dictionary fromnt.list
.- The key is the NCBI accession number and the value is the gene description.
- This dictionary is then saved as a pickle named
nt.pickle
:
#!/usr/bin/env python
import pickle
ntdict = {'Accession':'Gene description'}
with open("nt.list",'r') as f:
for line in f:
line = line.rstrip()
entry = line.split("\t")
fullacc=entry[0].split(".")
newentry={fullacc[0]:entry[1]}
ntdict.update(newentry)
with open("nt.pickle", 'wb') as handle:
pickle.dump(ntdict, handle, protocol=pickle.HIGHEST_PROTOCOL)
getpickle.py
(available in db.scripts folder in mitohelper repository) loadsnt.pickle
, then matches MitoFish entries with keys (NCBI accession numbers) innt.pickle
.
- It took ~24 minutes to match 616,999 MitoFish records to 57,377,397 NCBI GenBank records.
- Computing resources used:
cpupercent=101,cput=00:22:53,mem=34500088kb,ncpus=24,vmem=34845840kb,walltime=00:23:34
#!/usr/bin/env python
import pickle
import sys
import os.path
from os import path
with open("nt.pickle", 'rb') as handle:
NCBI = pickle.load(handle)
outfile="mitofish.genes"
nohit=0
count=0
length=(len(NCBI))
if path.exists(outfile) :
sys.exit("Error: Output file exists! Please rename output file and try again!")
output=open(outfile,'a')
print("==== Searching... ====")
with open("mitofish.accession",'r') as infile:
for inline in infile:
inline = inline.rstrip()
count +=1
if inline in NCBI:
output.write("%s\t%s\n" % (inline,NCBI[inline]))
elif inline not in NCBI:
output.write("%s\tNo hit found!\n" % inline)
print("%s\tNo hit found!" % inline)
nohit +=1
output.close()
print ("==== Run complete! ===")
print ("Total: %d accession numbers" % count)
print ("No hits for %d input accession numbers!" % nohit)
- Extract hits from
getpickle.py
. The output file isgenedef.1.tsv
:
grep -v "No hit" mitofish.genes >genedef.1.tsv
- In NCBI fasta files, records with duplicate sequences will be concatenated in the header line, and will not be picked up by
getpickle.py
:
>LN610233.1 Chiloglanis anoterus mitochondrial partial D-loop, specimen voucher 68321_34LN610234.1 Chiloglanis anoterus mitochondrial partial D-loop, specimen voucher 68330_55
-
getpickle.py
also does not pick up mostly UNVERIFIED entries not in nt fasta file -
So, let's extract the accession numbers with no hits into an output file named
nohit.accession
:
grep "No hit" mitofish.genes | cut -f1 >nohit.accession
- Get gene definitions for these accession numbers with the
efetch
Unix tool in NCBI's Entrez Direct package:
for i in `cat nohit.accession`
do
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>genedef.2.tsv
done
cat genedef.1.tsv genedef.2.tsv | sort -k 1 >complete.partial.gene.tsv