Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a QC condition for phenotype_imputation.ipynb #1061

Merged
merged 4 commits into from
Oct 1, 2024
Merged

Conversation

rl3328
Copy link
Contributor

@rl3328 rl3328 commented Sep 17, 2024

  1. create a rm index showing less than 75% of PSI values >=0.05, combine with the other two conditions
  2. Keep a cleaner version that also works

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@gaow
Copy link
Contributor

gaow commented Sep 17, 2024

create a rm index showing less than 75% of PSI values >=0.05, combine with the other two conditions

I dont think this would work. 0.05 is a number that is very arbitrary and is scale dependent. In this case you are basically removing an event if >25% have small values such as 0.05 and less. That is too much data removed.

We can talk about it when we meet how you should approach this kind of tasks. But for now, just remove anything that has >95% data equal to zero

@rfeng2023
Copy link
Contributor

@rl3328 agree with @gaow 0.75 could be too strict to lose so many events

@gaow
Copy link
Contributor

gaow commented Sep 18, 2024

@rfeng2023 @rl3328 i just realized that the step to remove rare events ended up in the imputation code ... do we have the normalization step? I think it is conceptually clearer if this QC goes to that step. What do you think?

@rfeng2023
Copy link
Contributor

@gaow I agree, in principle we should add those qc to normalization section, which would be clearer. While we have different molecular phenotype, which could have different normalization pipeline, so we need to that qc part to each normalization pipeline. And we are aiming to do imputation first then do normalization as we discussed. so I think it might be better to still put it here?

@gaow
Copy link
Contributor

gaow commented Sep 19, 2024

then it is indeed more convenient to do it here but conceptually it is somewhat confusing. I would argue that conceptual clarity precedes that of engineering convenience. So I am still in favor of throwing in this line for every molecular phenotype at their own QC part.

@rfeng2023
Copy link
Contributor

rfeng2023 commented Sep 24, 2024

As we discussed today, we can keep it in imputation for now and move it in future, which is also posted as a ticket #1073 (comment)

Another thing is, this filtering should not only be set in EBMF, but should also in gEBMF etc. @rl3328

@rl3328
Copy link
Contributor Author

rl3328 commented Sep 30, 2024

  1. I added the QC requirement that filter out features with more than 95% missing or 0.
  2. I added the whole QC before imputation part(remove features 1. >95% missing or 0 | 40% missing) to [EBMF], [gEBMF], [missforest], [missxgboost], [knn], [soft], [mean], [lod], [bed_filter_na].
  3. I resolved the errors caused inconsistenncies between number of rows and some small typos.
  4. I add drop = FALSE to make df keep as dataframe even if there's only one line of missX or missY(although for most large datasets , they are unlikely to have only one line to impute)
  5. removed samples(columns) that have more than 80% missingness(default requirement to use knn)
  6. gEBMF not resolved. I got error error in evaluating the argument 'x' in selecting a method for function 'as.matrix': infinite or missing values in 'x' Calls: flash_init_cluster_for_grouped_data ... softImpute -> softImpute.x -> softImpute.x -> simpute.als -> svd Execution halted, but I didn't know this function in flashier package(not publicly online).

@rl3328 rl3328 changed the title Add a QC condition for phenotype_imputation.ipynb + Change the [cis_results_export] for fine_mapping_post_processing.ipynb Add a QC condition for phenotype_imputation.ipynb Oct 1, 2024
@gaow gaow merged commit 148cdc3 into cumc:main Oct 1, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants