Improve narratives of notebooks
gaow committed May 5, 2022
1 parent 8f6623f commit a69f2e7
Showing 5 changed files with 234 additions and 181 deletions.
37 changes: 6 additions & 31 deletions code/association_scan/APEX/APEX.ipynb
@@ -7,9 +7,8 @@
},
"source": [
"# APEX QTL association testing\n",
"This notebook implements a workflow for using the APEX to conduct analysis. APEX implements a linear mixed model for association testing. This is potentially useful for analysis with related individuals.\n",
"\n",
"**FIXME: cite APEX paper**"
"This notebook implements a workflow for using the APEX to conduct analysis. APEX implements a linear mixed model for association testing. This is potentially useful for analysis with related individuals."
]
},
{
@@ -19,20 +18,17 @@
},
"source": [
"## Input\n",
"- `--molecular-pheno`, The `bed.gz` file containing the table describing the molecular phenotype. It should have a companion index file in `tbi` format.\n",
"\n",
"- `--molecular-pheno`, The `bed.gz` file containing the table describing the molecular phenotype. It should have a companion index file in `tbi` format.\n",
"- `genotype_list` a list of whole genome vcf file for each chromosome, those vcf are converted beforehand from plink trio in the data_processing sections\n",
"\n",
"- `grm_list` is a file containing list of grm matrixs that generated by the GRM module of this pipeline.\n",
"\n",
"- `covariate` is a file with #id + samples name as colnames and each row a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.\n",
"\n",
"\n",
"## Output\n",
"\n",
"For each chromosome, a sets of summary statistics files , including both nomial test statistics for each test, as well as region (gene) level association evidence.\n",
"\n",
"In addition to chr,pos,ref,alt, the column specification of nomial result are as followed:\n",
"In addition to chr,pos,ref,alt, the column specification of nomial result are as follows:\n",
"\n",
"- #chrom : Variant chromosome.\n",
"- pos : Variant chromosomal position (basepairs).\n",
@@ -44,8 +40,7 @@
"- pval : Single-variant association nominal p-value.\n",
"- variant_id: ID of the variant (rsid or chr:position:ref:alt)\n",
"\n",
"The column specification of region (gene) level association evidence are as followed:\n",
"\n",
"The column specification of region (gene) level association evidence are as follows:\n",
"\n",
"- #chrom : Molecular trait chromosome.\n",
"- start : Molecular trait start position.\n",
@@ -148,7 +143,7 @@
},
"outputs": [],
"source": [
"sos run /home/hs3163/GIT/xqtl-pipeline/pipeline/APEX.ipynb APEX_cis \\\n",
"sos run APEX.ipynb APEX_cis \\\n",
"--genotype_file_list GRCh38_liftedover_sorted_all.add_chr.leftnorm.filtered.renamed.filtered.renamed.filtered.filtered.vcf_files_list.txt \\\n",
"--molecular_pheno_list /mnt/mfs/statgen/snuc_pseudo_bulk/eight_tissue_analysis/data_preprocessing/ALL/phenotype_data/ALL.log2cpm.bed.processed_phenotype.per_chrom.recipe \\\n",
"--grm_list data_preprocessing/genotype/grm/plink_files_list.loco_grm_list.txt \\\n",
@@ -167,24 +162,6 @@
"The section outlined the parameters that can be set in the command interface."
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "Bash"
},
"source": [
"**FIXME: please revisit this input list for various default parameter setting, in particular the list of genotypes to load. You can copy and paste the genotype load code from genotype_reformatting for now. I'll repurpose the LDtools package to include those codes in the future so we dont have to copy them over and over again.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "Bash"
},
"source": [
"**FIXME: can we combine APEX_3 and APEX_4**?"
]
},
{
"cell_type": "code",
"execution_count": 5,
@@ -221,8 +198,6 @@
"parameter: walltime = '5h'\n",
"parameter: mem = '80G'\n",
"\n",
"\n",
"\n",
"import pandas as pd\n",
"molecular_pheno_chr_inv = pd.read_csv(molecular_pheno_list,sep = \"\\t\")\n",
"geno_chr_inv = pd.read_csv(genotype_file_list,sep = \"\\t\")\n",
@@ -413,7 +388,7 @@
"sos"
]
],
"version": "0.22.6"
"version": "0.22.4"
}
},
"nbformat": 4,
91 changes: 10 additions & 81 deletions code/association_scan/TensorQTL/TensorQTL.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -40,13 +40,8 @@
"source": [
"## Input\n",
"- `--molecular-pheno`, The bed.gz file containing the table describing the molecular phenotype. It shall also have a tbi index accompaning it.\n",
"\n",
"\n",
"- `genotype_list` a list of whole genome plink file for each chromosome\n",
"\n",
"\n",
"- `genotype_list` a list of whole genome plink file for each chromosome.\n",
"- `grm_list` is a file containing list of grm matrixs that generated by the GRM module of this pipeline.\n",
"\n",
"- `covariate` is a file with #id + samples name as colnames and each row a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.\n",
"\n",
"## Output\n",
@@ -70,8 +65,7 @@
"- alt : Variant alternate allele.\n",
"\n",
"\n",
"\n",
"The column specification of region (gene) level association evidence are as followed:\n",
"The column specification of region (gene) level association evidence are as follows:\n",
"\n",
"- phenotype_id - Molecular trait identifier. (gene)\n",
"- num_var - Total number of variants tested in cis\n",
@@ -89,7 +83,7 @@
"- slope - Slope of the linear regression\n",
"- slope_se - Standard error of the slope\n",
"- pval_perm - First permutation P-value directly obtained from the permutations with the direct method\n",
"- pval_beta - Second permutation P-value obtained via beta approximation. This is the one to use for downstream analysis\n"
"- pval_beta - Second permutation P-value obtained via beta approximation. This is the one to use for downstream analysis"
]
},
{
@@ -112,7 +106,6 @@
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[91mERROR\u001b[0m: \u001b[91mNotebook JSON is invalid: %s\u001b[0m\n",
"usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]\n",
" workflow_name: Single or combined workflows defined in this script\n",
" targets: One or more targets to generate\n",
@@ -134,16 +127,14 @@
" --region-list . (as path)\n",
" An optional subset of region list containing a column of\n",
" ENSG gene_id to limit the analysis\n",
" --wd . (as path)\n",
" --cwd . (as path)\n",
" Path to the work directory of the analysis.\n",
" --job-size 2 (as int)\n",
" Specify the number of jobs per run.\n",
" --container ''\n",
" Container option for software to run the analysis:\n",
" docker or singularity\n",
" --name ROSMAP\n",
" Prefix for the analysis output\n",
" --window 1000000 (as list)\n",
" --window 1000000 (as int)\n",
" Specify the scanning window for the up and downstream\n",
" radius to analyze around the region of interest, in\n",
" units of bp\n",
@@ -164,28 +155,14 @@
"sos run TensorQTL.ipynb -h"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Example\n",
"**FIXME: add it**"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS",
"tags": []
},
"source": [
"## Global parameter settings\n",
"\n",
"The section outlined the parameters that can be set in the command interface.\n",
"\n",
"**FIXME: same comments as in APEX.ipynb**"
"## Global parameter settings"
]
},
{
@@ -212,8 +189,6 @@
"# Container option for software to run the analysis: docker or singularity\n",
"parameter: container = ''\n",
"\n",
"\n",
"\n",
"# Specify the scanning window for the up and downstream radius to analyze around the region of interest, in units of bp\n",
"parameter: window = 1000000\n",
"\n",
@@ -224,24 +199,13 @@
"input_inv = input_inv.values.tolist()"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## QTL Sumstat generation \n",
"This step generate the cis-QTL summary statistics and vcov (covariate-adjusted LD) files for downstream analysis from summary statistics. The analysis is done per chromosome to reduce running time."
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "Bash"
},
"source": [
"## Cis QTL Sumstat generation via tensorQTL\n",
"\n"
"## cisQTL association testing"
]
},
{
@@ -320,7 +284,7 @@
"kernel": "SoS"
},
"source": [
"## Trans QTL Sumstat generation via tensorQTL\n"
"## TransQTL association testing"
]
},
{
@@ -392,7 +356,7 @@
"kernel": "SoS"
},
"source": [
"**FIXME: we can consolidate these steps. I'll take a look myself after we have the MWE test**"
"## Association results processing"
]
},
{
@@ -468,41 +432,6 @@
" data_tempt.to_csv(\"$[_output[0]]\",index = False,sep = \"\\t\" )\n",
" column_info_df.to_csv(\"$[_output[1]]\",index = True,sep = \"\\t\" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": [
"tss_distance af ma_samples ma_count"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": [
"Info TensorQTL\n",
" GENE,CHR,POS,A0,A1\n",
" chrom\n",
" pos\n",
" ref\n",
" alt\n",
" variant_id\n",
" beta\n",
" se\n",
" tss_distance\n",
" af\n",
" ma_samples\n",
" ma_count\n",
" phenotype_id"
]
}
],
"metadata": {
@@ -535,7 +464,7 @@
"sos"
]
],
"version": "0.22.6"
"version": "0.22.4"
}
},
"nbformat": 4,
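
Both notebooks in this diff read tab-separated list files with `pd.read_csv(..., sep = "\t")`, and the TensorQTL notebook then flattens the result with `input_inv.values.tolist()`. The following is only a minimal sketch of how such per-chromosome lists might be paired before that step. The join key `#id`, the column names `pheno_path` and `geno_path`, the file names, and the helper `pair_inputs` are all assumptions made for illustration, not taken from the pipeline.

```python
import io

import pandas as pd


def pair_inputs(pheno_list, geno_list, key="#id"):
    """Pair phenotype and genotype entries that share the same key value.

    `key` is a hypothetical join column; the real list files may use a
    different header.
    """
    pheno = pd.read_csv(pheno_list, sep="\t")
    geno = pd.read_csv(geno_list, sep="\t")
    # Inner join keeps only chromosomes present in both lists.
    merged = pheno.merge(geno, on=key)
    # One [key, phenotype file, genotype file] row per chromosome.
    return merged.values.tolist()


# Toy in-memory list files standing in for the real recipe files.
pheno_txt = "#id\tpheno_path\nchr1\tpheno.chr1.bed.gz\nchr2\tpheno.chr2.bed.gz\n"
geno_txt = "#id\tgeno_path\nchr1\tgeno.chr1.vcf.gz\nchr2\tgeno.chr2.vcf.gz\n"

pairs = pair_inputs(io.StringIO(pheno_txt), io.StringIO(geno_txt))
for row in pairs:
    print(row)
```

An inner join is used here so that a chromosome missing from either list is silently dropped; a stricter workflow might instead fail when the two lists disagree.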