Discrepancy between metadata search results & piped fetch results #125

Open
nvpatin opened this issue Feb 13, 2023 · 4 comments

nvpatin commented Feb 13, 2023

I am trying to download a set of samples based on metadata. A metadata search with my parameters returns a certain number of samples, but when I pipe those results into 'redbiom fetch' (with a particular context), a different number of samples is downloaded. I see a similar problem when I pipe the search results into 'redbiom summarize contexts': it lists contexts, only some of which are associated with my samples, so I have to guess which one to use for fetching. So I have two questions: 1) How can I see only the contexts associated with my searched samples? 2) How can I fetch only the samples associated with my metadata search? See below for the problems associated with question 2.

Looking for marine water samples within the EMP

% redbiom search metadata "where qiita_study_id == 13114 and empo_4 == 'Water (saline)'" | wc -l
39

Defining a context based on previous search results (it took several attempts to find one that worked)

% echo $CTX
Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b

Fetching samples based on metadata and context

% redbiom search metadata "where qiita_study_id == 13114 and empo_4 == 'Water (saline)'" | redbiom fetch samples --context $CTX --output EMP_marine_samples.biom
38 sample ambiguities observed. Writing ambiguity mappings to: EMP_marine_samples.biom.ambiguities

Data summary shows many more samples than metadata search originally found

% biom summarize-table -i EMP_marine_samples.biom | head
Num samples: 97
Num observations: 16,547
Total count: 1,354,853
Table density (fraction of non-zero values): 0.030

Counts/sample summary:
Min: 4,111.000
Max: 38,769.000
Median: 12,268.000
Mean: 13,967.557

Author

nvpatin commented Feb 13, 2023

Update: I see that the samples found in the metadata search and the samples in the downloaded biom table do match, but the biom table seems to have split each sample into several entries. For example, "13114.palenik.42.s001" in the sample list corresponds to the sample IDs "13114.palenik.42.s001.134469" and "13114.palenik.42.s001.134523" in the biom table. The sample IDs in the metadata table match those in the biom table, but the metadata values are identical within each sample "grouping"; e.g., "13114.palenik.42.s001.134469" and "13114.palenik.42.s001.134523" have exactly the same metadata.

Is there documentation about how and why the samples were split this way? I guess I can combine sample replicates (if that's what they are).
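
To check how the IDs in the table group back onto the original sample names, something like the following may help. It is only a sketch: `strip_run_tag` is a hypothetical helper name, and it assumes the extra suffix is always the final dot-separated field, as in the two IDs quoted above.

```shell
# Sketch: collapse suffixed IDs back to the sample names from the metadata search.
# Assumption: the suffix is the last dot-separated field of each ID.
strip_run_tag() {
  sed 's/\.[^.]*$//'
}

# The two IDs quoted above; each output line is a count followed by a sample name.
printf '%s\n' \
  13114.palenik.42.s001.134469 \
  13114.palenik.42.s001.134523 |
  strip_run_tag | sort | uniq -c
```

For the two IDs shown, this reports a count of 2 for "13114.palenik.42.s001". To run it on the whole table, the `printf` could be swapped for a command that lists the table's sample IDs one per line (for example `biom table-ids -i EMP_marine_samples.biom`, assuming that subcommand is available in your biom-format install).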

@antgonza
Collaborator

@nvpatin, thank you for the question and update. I think @justinshaffer might be able to answer your question.

Member

wasade commented Feb 15, 2023

Hi @nvpatin, sorry for the brief delay, I was OOO the last few days.

For (1), that is an excellent idea. It is not currently exposed to the user, but it would be a great addition. I would be happy to suggest a way to do this via a bash script or Python as a stopgap.

For (2), the issue is that the same physical sample has been sequenced multiple times. The command shown is correct, but each individual sequencing run is differentiated. These "ambiguities" are expressed in the resulting ambiguity map. You can get around this by specifying --resolve-ambiguities with the call to fetch. For redbiom fetch samples, I usually do --resolve-ambiguities merge which combines the sample data from multiple runs together.
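
Applied to the command from the original post, that might look roughly like the sketch below. `fetch_merged` is a hypothetical wrapper name used only for illustration (it is not part of redbiom); the query, context, and output path are the ones already shown in this thread.

```shell
# Sketch only: fetch_merged is a hypothetical helper, not a redbiom command.
# $1 = metadata query, $2 = context name, $3 = output path
fetch_merged() {
  redbiom search metadata "$1" |
    redbiom fetch samples \
      --context "$2" \
      --output "$3" \
      --resolve-ambiguities merge
}

# Example call, reusing the query and $CTX from the original post:
# fetch_merged "where qiita_study_id == 13114 and empo_4 == 'Water (saline)'" \
#   "$CTX" EMP_marine_samples.biom
```

With merge, the data from the separate runs of one physical sample are combined, so the resulting table should no longer carry a separate entry per sequencing run.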

If you haven't seen it, there is a longer usage tutorial on the QIIME 2 forum.

Author

nvpatin commented Feb 16, 2023

Thank you @wasade, that's very helpful! I will check back for future functionality that provides the contexts associated with samples in the metadata search results.


3 participants