ipyrad (reference version, pairddrad) step 7 crash -- Message: KeyError: 68 #508

Open
imaa9 opened this issue May 21, 2023 · 2 comments

imaa9 commented May 21, 2023

Hi Isaac,

I'm running ipyrad [v.0.9.90] with maximum memory allocation (184G), 48 threads, and with the following (relevant) params:

~/all_trimmed_reads/*.fq       ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted/trimmed/unzipped fastq files
reference                      ## [5] [assembly_method]: Assembly method
~/reference1.1.fa              ## [6] [reference_sequence]: Location of reference sequence file
pairddrad                      ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
AATTC, GCATG                   ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2) [EcoRI, SphI]
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
5                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
5                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples [default = 10,000]
0.86                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
0.1                            ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus
0.1                            ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus
5                              ## [21] [min_samples_locus]: GLOBAL Min # samples per locus
0.25                           ## [22] [max_SNPs_locus]: Max % SNPs per locus
8                              ## [23] [max_Indels_locus]: Max # of indels per locus
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus
*                              ## [27] [output_formats]: Output formats (see docs) [* = all of them]

This runs fine through part 2 of step 7:

  Step 7: Filtering and formatting output files
  [####################] 100% 0:05:58 | applying filters
  [####################] 100% 1:02:59 | building arrays

  Encountered an Error.
  Message: KeyError: 68
  Parallel connection closed.

Here is the traceback info:

KeyError                                  Traceback (most recent call last)
File <string>:1, in <module>

File ~/.conda/envs/ipyrad/lib/python3.10/site-packages/ipyrad/assemble/write_outputs.py:2158, in fill_snp_array(data, ntaxa, nsnps)
   2156 # fill for each taxon
   2157 for sidx in range(ntaxa):
-> 2158     resos = [DCONS[i] for i in snparr[sidx, :]]
   2160     # pseudoref version
   2161     io5['genos'][:, sidx, :] = get_genos(
   2162         np.array([i[0] for i in resos]),
   2163         np.array([i[1] for i in resos]),
   2164         io5['pseudoref'][:]
   2165     )

File ~/.conda/envs/ipyrad/lib/python3.10/site-packages/ipyrad/assemble/write_outputs.py:2158, in <listcomp>(.0)
   2156 # fill for each taxon
   2157 for sidx in range(ntaxa):
-> 2158     resos = [DCONS[i] for i in snparr[sidx, :]]
   2160     # pseudoref version
   2161     io5['genos'][:, sidx, :] = get_genos(
   2162         np.array([i[0] for i in resos]),
   2163         np.array([i[1] for i in resos]),
   2164         io5['pseudoref'][:]
   2165     )

KeyError: 68

Following the suggestion in a previous issue about ambiguous bases in the reference genome, I replaced each ambiguous base in my reference with one of its possible resolutions and re-ran step 7 with that reference, but it failed in the same way as above. Do I need to re-run the entire pipeline from the beginning with the disambiguated reference, or is something else causing this error in step 7? Any insights would be much appreciated!

Thanks, Inbar
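
A note on the error value: the key in `KeyError: 68` is an integer byte value, and `chr(68)` is `D`, the IUPAC code for "A, G or T". The traceback shows `fill_snp_array` looking up each byte of the SNP array in `DCONS`. A minimal sketch, assuming a lookup table that covers only plain bases, two-base ambiguity codes, N and the gap character, reproduces the same kind of failure. The table and function names below are hypothetical stand-ins and are not copied from ipyrad.

```python
import numpy as np

# Hypothetical stand-in for a byte -> (allele1, allele2) lookup table.
# Assumption: only codes that resolve to at most two alleles are present.
TWO_ALLELE_CODES = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "GA", "K": "GT", "S": "GC", "Y": "TC", "W": "TA", "M": "CA",
    "N": "NN", "-": "--",
}
LOOKUP = {
    ord(code): (ord(pair[0]), ord(pair[1]))
    for code, pair in TWO_ALLELE_CODES.items()
}


def resolve_row(row):
    """Resolve each byte of a uint8 row into its two allele bytes."""
    return [LOOKUP[int(i)] for i in row]


def unresolvable_codes(row):
    """Return the characters in the row that the table cannot resolve."""
    return sorted({chr(int(i)) for i in row if int(i) not in LOOKUP})


ok_row = np.frombuffer(b"ACGTRYN-", dtype=np.uint8)
bad_row = np.frombuffer(b"ACGDT", dtype=np.uint8)  # contains a 'D'

print(resolve_row(ok_row)[4])       # (71, 65): R resolves to G and A
print(unresolvable_codes(bad_row))  # ['D']
try:
    resolve_row(bad_row)
except KeyError as err:
    print("KeyError:", err)         # KeyError: 68
```

Under that assumption, any three-base code (B, D, H, V) that reaches the SNP array would fail the lookup in the same way.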

@isaacovercast
Collaborator

Yes, ambiguous bases in the reference will cause problems, so it's good you found that and fixed it. By step 7 all of the formal assembly has been completed, so fixing the reference sequence will require rolling back and re-running from at least step 3 (including the -f flag) for the change in reference to fix this error at step 7. Let me know how it goes....
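
Before resubmitting a long job, it may be worth confirming that the edited reference really contains no remaining ambiguity codes. A minimal sketch, assuming an uncompressed plain-text FASTA file and treating only A, C, G, T and N (upper or lower case) as acceptable. The function name, the allowed set, and the small demo file are illustrative choices and not part of ipyrad.

```python
import os
import tempfile
from collections import Counter

# Assumption: anything other than these characters counts as unexpected.
ALLOWED = set("ACGTNacgtn")


def count_unexpected_bases(fasta_path):
    """Count sequence characters outside ALLOWED, skipping header lines."""
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # FASTA header, not sequence
            counts.update(ch for ch in line.strip() if ch not in ALLOWED)
    return counts


# Demo on a tiny made-up FASTA file written to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">chr1\nACGTNNACGD\nacgtRY\n>chr2\nACGT\n")

example_counts = count_unexpected_bases(tmp.name)
os.remove(tmp.name)

for base, n in sorted(example_counts.items()):
    print(base, n)
```

To check a real file, call `count_unexpected_bases` with the path to the reference; an empty result means no characters outside the allowed set were found.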


imaa9 commented May 21, 2023

Cool, many thanks for the quick reply! I'll run it again from the start; I think that should fix it. I just wanted to make sure this was the issue before submitting this big job again.
