Sample size for LD estimation (EUR) #106

Open
Shicheng-Guo opened this issue Sep 24, 2021 · 5 comments

Comments

@Shicheng-Guo

I notice you selected a random subset of unrelated samples. Two questions:

  1. For the EUR population, which dataset did you use: 1000G_CEU, hapmap_CEU_r23a_filtered, UK10K, or the HRC reference panel?
  2. For the EUR population, did you estimate the minimum sample size needed to obtain stable LD estimates for lead SNP identification?

Thanks.

Shicheng

@Shicheng-Guo
Author

BTW: Are there any pre-calculated clumped SNPs for GTEx V8 that can be downloaded directly?

@gaow
Contributor

gaow commented Sep 24, 2021

@Shicheng-Guo which workflow are you referring to? In our applications we mostly have matching genotypes, so as far as I can recall we don't really use reference panels for most workflows in this repo.

@Shicheng-Guo
Author

Thanks Gao for your response. I mean the workflow below:

https://github.com/cumc/bioworkflows/blob/master/GWAS/LD_Clumping.ipynb

Thanks

Shicheng

@Shicheng-Guo
Author

I notice lots of papers use 1000 Genomes EUR as the reference; however, I prefer to use UKB WGS individual-level data as the reference. My question is: what is the best sample size to use? Using the full 150K WGS samples would make the process very time-consuming, while a small sample size may cause biased LD clumping.


@gaow
Contributor

gaow commented Sep 27, 2021

@Shicheng-Guo our LD clumping application was for association analysis with UK Biobank data, which is why we selected subsets of UKB genotypes and used them as the reference panel. We used 2000 samples, I believe.

I don't think LD clumping is as picky about the LD panel as, e.g., fine-mapping applications. Since our application was on UKB data itself, we believe 2000 samples is a good enough approximation. We don't have a reference for GTEx V8 data. I have not formally assessed it, but if you are concerned, perhaps you can take a few regions of UKB data, compute the LD panel at sample sizes from 500 to 10K, and see how robust your estimates are?
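
A minimal sketch of that kind of robustness check, assuming simulated genotype dosages stand in for real UKB data. Everything here is illustrative and not part of the bioworkflows repo: the toy two-SNP simulation, the function names, and the parameters (`maf`, `copy_prob`, `n_repeats`) are assumptions. It repeatedly subsamples individuals at each sample size, estimates r² between a pair of SNPs, and reports the mean and spread of the estimates.

```python
import random


def pearson_r2(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A SNP is monomorphic in this subsample; r2 is undefined, report 0.
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def simulate_pair(n_samples, maf=0.3, copy_prob=0.8, rng=random):
    """Toy data (an assumption, not real genotypes): two SNPs in LD.

    On each haplotype, the allele at SNP B copies SNP A with probability
    copy_prob, and is otherwise drawn independently with frequency maf.
    """
    snp_a, snp_b = [], []
    for _ in range(n_samples):
        a = b = 0
        for _hap in range(2):
            ha = 1 if rng.random() < maf else 0
            if rng.random() < copy_prob:
                hb = ha
            else:
                hb = 1 if rng.random() < maf else 0
            a += ha
            b += hb
        snp_a.append(a)
        snp_b.append(b)
    return snp_a, snp_b


def r2_by_sample_size(snp_a, snp_b, sizes, n_repeats=20, seed=0):
    """Return {size: (mean r2, sd of r2)} over random subsamples of individuals."""
    rng = random.Random(seed)
    n = len(snp_a)
    out = {}
    for size in sizes:
        if size > n:
            raise ValueError("subsample size exceeds number of individuals")
        estimates = []
        for _ in range(n_repeats):
            idx = rng.sample(range(n), size)  # sample without replacement
            estimates.append(
                pearson_r2([snp_a[i] for i in idx], [snp_b[i] for i in idx])
            )
        mean = sum(estimates) / len(estimates)
        sd = (sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)) ** 0.5
        out[size] = (mean, sd)
    return out


# Example run on 10,000 simulated individuals.
geno_a, geno_b = simulate_pair(10_000, rng=random.Random(42))
results = r2_by_sample_size(geno_a, geno_b, [500, 1000, 2000, 5000, 10_000])
for size, (mean, sd) in results.items():
    print(f"n={size:>6}  mean r2={mean:.4f}  sd={sd:.4f}")
```

With real data, the same comparison would be run over many SNP pairs in a few regions, and what matters for clumping is how often an r² estimate crosses the chosen clumping threshold as the panel size changes.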
