Improve narratives of notebooks

cumc · May 5, 2022 · a69f2e7
1 parent 8f6623f
commit a69f2e7
Showing 5 changed files with 234 additions and 181 deletions.
diff --git a/code/association_scan/APEX/APEX.ipynb b/code/association_scan/APEX/APEX.ipynb
@@ -7,9 +7,8 @@
  },
  "source": [
  "# APEX QTL association testing\n",
- "This notebook implements a workflow for using the APEX to conduct analysis. APEX implements a linear mixed model for association testing. This is potentially useful for analysis with related individuals.\n",
  "\n",
- "**FIXME: cite APEX paper**"
+ "This notebook implements a workflow that uses APEX to conduct QTL association analysis. APEX implements a linear mixed model for association testing, which is potentially useful for analyses that include related individuals."
  ]
  },
  {
@@ -19,20 +18,17 @@
  },
  "source": [
  "## Input\n",
- "- `--molecular-pheno`, The `bed.gz` file containing the table describing the molecular phenotype. It should have a companion index file in `tbi` format.\n",
  "\n",
+ "- `--molecular-pheno`: the `bed.gz` file containing the molecular phenotype table. It should have a companion index file in `tbi` format.\n",
  "- `genotype_list`: a list of whole-genome VCF files, one per chromosome. These VCF files are converted beforehand from PLINK trio files in the data_processing sections.\n",
- "\n",
  "- `grm_list` is a file containing a list of GRM matrices generated by the GRM module of this pipeline.\n",
- "\n",
  "- `covariate` is a file with `#id` plus sample names as column names, where each row is a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.\n",
  "\n",
- "\n",
  "## Output\n",
  "\n",
  "For each chromosome, a set of summary statistics files, including both nominal test statistics for each test and region (gene) level association evidence.\n",
  "\n",
- "In addition to chr,pos,ref,alt, the column specification of nomial result are as followed:\n",
+ "In addition to chr, pos, ref, alt, the column specification of the nominal results is as follows:\n",
  "\n",
  "- #chrom : Variant chromosome.\n",
  "- pos : Variant chromosomal position (basepairs).\n",
@@ -44,8 +40,7 @@
  "- pval : Single-variant association nominal p-value.\n",
  "- variant_id: ID of the variant (rsid or chr:position:ref:alt)\n",
  "\n",
- "The column specification of region (gene) level association evidence are as followed:\n",
- "\n",
+ "The column specification of the region (gene) level association evidence is as follows:\n",
  "\n",
  "- #chrom : Molecular trait chromosome.\n",
  "- start : Molecular trait start position.\n",
@@ -148,7 +143,7 @@
  },
  "outputs": [],
  "source": [
- "sos run /home/hs3163/GIT/xqtl-pipeline/pipeline/APEX.ipynb APEX_cis \\\n",
+ "sos run APEX.ipynb APEX_cis \\\n",
  "--genotype_file_list GRCh38_liftedover_sorted_all.add_chr.leftnorm.filtered.renamed.filtered.renamed.filtered.filtered.vcf_files_list.txt \\\n",
  "--molecular_pheno_list /mnt/mfs/statgen/snuc_pseudo_bulk/eight_tissue_analysis/data_preprocessing/ALL/phenotype_data/ALL.log2cpm.bed.processed_phenotype.per_chrom.recipe \\\n",
  "--grm_list data_preprocessing/genotype/grm/plink_files_list.loco_grm_list.txt \\\n",
@@ -167,24 +162,6 @@
  "This section outlines the parameters that can be set on the command line."
  ]
  },
- {
- "cell_type": "markdown",
- "metadata": {
- "kernel": "Bash"
- },
- "source": [
- "**FIXME: please revisit this input list for various default parameter setting, in particular the list of genotypes to load. You can copy and paste the genotype load code from genotype_reformatting for now. I'll repurpose the LDtools package to include those codes in the future so we dont have to copy them over and over again.**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "kernel": "Bash"
- },
- "source": [
- "**FIXME: can we combine APEX_3 and APEX_4**?"
- ]
- },
  {
  "cell_type": "code",
  "execution_count": 5,
@@ -221,8 +198,6 @@
  "parameter: walltime = '5h'\n",
  "parameter: mem = '80G'\n",
  "\n",
- "\n",
- "\n",
  "import pandas as pd\n",
  "molecular_pheno_chr_inv = pd.read_csv(molecular_pheno_list,sep = \"\\t\")\n",
  "geno_chr_inv = pd.read_csv(genotype_file_list,sep = \"\\t\")\n",
@@ -413,7 +388,7 @@
  "sos"
  ]
  ],
- "version": "0.22.6"
+ "version": "0.22.4"
  }
  },
  "nbformat": 4,
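
Review note: the parameter cell in the `@@ -221,8 +198,6 @@` hunk above loads the phenotype and genotype list files with `pd.read_csv(..., sep = "\t")`. The sketch below shows one way two such per-chromosome lists could be paired. The column names (`#chr`, `pheno_path`, `geno_path`), the file names, and the merge step are assumptions for illustration only; the real recipe columns are not visible in this diff.

```python
import io

import pandas as pd

# Hypothetical list-file contents (tab-separated); real column names may differ.
pheno_txt = "#chr\tpheno_path\nchr1\tpheno.chr1.bed.gz\nchr2\tpheno.chr2.bed.gz\n"
geno_txt = "#chr\tgeno_path\nchr1\tgeno.chr1.vcf.gz\nchr2\tgeno.chr2.vcf.gz\n"

# Same read_csv call shape as in the notebook, applied to in-memory text.
molecular_pheno_chr_inv = pd.read_csv(io.StringIO(pheno_txt), sep="\t")
geno_chr_inv = pd.read_csv(io.StringIO(geno_txt), sep="\t")

# Pair each phenotype file with the genotype file of the same chromosome
# (assumed step: inner join on the shared chromosome column).
input_inv = molecular_pheno_chr_inv.merge(geno_chr_inv, on="#chr")
input_inv = input_inv.values.tolist()

for chrom, pheno_path, geno_path in input_inv:
    print(chrom, pheno_path, geno_path)
```

An inner join drops any chromosome that is missing from either list, so a per-chromosome job is only created when both inputs exist.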

diff --git a/code/association_scan/TensorQTL/TensorQTL.ipynb b/code/association_scan/TensorQTL/TensorQTL.ipynb
@@ -40,13 +40,8 @@
  "source": [
  "## Input\n",
  "- `--molecular-pheno`: the `bed.gz` file containing the molecular phenotype table. It should also have an accompanying `tbi` index.\n",
- "\n",
- "\n",
- "- `genotype_list` a list of whole genome plink file for each chromosome\n",
- "\n",
- "\n",
+ "- `genotype_list`: a list of whole-genome PLINK files, one per chromosome.\n",
  "- `grm_list` is a file containing a list of GRM matrices generated by the GRM module of this pipeline.\n",
- "\n",
  "- `covariate` is a file with `#id` plus sample names as column names, where each row is a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.\n",
  "\n",
  "## Output\n",
@@ -70,8 +65,7 @@
  "- alt : Variant alternate allele.\n",
  "\n",
  "\n",
- "\n",
- "The column specification of region (gene) level association evidence are as followed:\n",
+ "The column specification of the region (gene) level association evidence is as follows:\n",
  "\n",
  "- phenotype_id - Molecular trait identifier. (gene)\n",
  "- num_var - Total number of variants tested in cis\n",
@@ -89,7 +83,7 @@
  "- slope - Slope of the linear regression\n",
  "- slope_se - Standard error of the slope\n",
  "- pval_perm - First permutation P-value directly obtained from the permutations with the direct method\n",
- "- pval_beta - Second permutation P-value obtained via beta approximation. This is the one to use for downstream analysis\n"
+ "- pval_beta - Second permutation P-value obtained via beta approximation. This is the one to use for downstream analysis"
  ]
  },
  {
@@ -112,7 +106,6 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "\u001b[91mERROR\u001b[0m: \u001b[91mNotebook JSON is invalid: %s\u001b[0m\n",
  "usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]\n",
  " workflow_name: Single or combined workflows defined in this script\n",
  " targets: One or more targets to generate\n",
@@ -134,16 +127,14 @@
  " --region-list . (as path)\n",
  " An optional subset of region list containing a column of\n",
  " ENSG gene_id to limit the analysis\n",
- " --wd . (as path)\n",
+ " --cwd . (as path)\n",
  " Path to the work directory of the analysis.\n",
  " --job-size 2 (as int)\n",
  " Specify the number of jobs per run.\n",
  " --container ''\n",
  " Container option for software to run the analysis:\n",
  " docker or singularity\n",
- " --name ROSMAP\n",
- " Prefix for the analysis output\n",
- " --window 1000000 (as list)\n",
+ " --window 1000000 (as int)\n",
  " Specify the scanning window for the up and downstream\n",
  " radius to analyze around the region of interest, in\n",
  " units of bp\n",
@@ -164,28 +155,14 @@
  "sos run TensorQTL.ipynb -h"
  ]
  },
- {
- "cell_type": "markdown",
- "metadata": {
- "kernel": "SoS"
- },
- "source": [
- "## Example\n",
- "**FIXME: add it**"
- ]
- },
  {
  "cell_type": "markdown",
  "metadata": {
  "kernel": "SoS",
  "tags": []
  },
  "source": [
- "## Global parameter settings\n",
- "\n",
- "The section outlined the parameters that can be set in the command interface.\n",
- "\n",
- "**FIXME: same comments as in APEX.ipynb**"
+ "## Global parameter settings"
  ]
  },
  {
@@ -212,8 +189,6 @@
  "# Container option for software to run the analysis: docker or singularity\n",
  "parameter: container = ''\n",
  "\n",
- "\n",
- "\n",
  "# Specify the scanning window for the up and downstream radius to analyze around the region of interest, in units of bp\n",
  "parameter: window = 1000000\n",
  "\n",
@@ -224,24 +199,13 @@
  "input_inv = input_inv.values.tolist()"
  ]
  },
- {
- "cell_type": "markdown",
- "metadata": {
- "kernel": "SoS"
- },
- "source": [
- "## QTL Sumstat generation \n",
- "This step generate the cis-QTL summary statistics and vcov (covariate-adjusted LD) files for downstream analysis from summary statistics. The analysis is done per chromosome to reduce running time."
- ]
- },
  {
  "cell_type": "markdown",
  "metadata": {
  "kernel": "Bash"
  },
  "source": [
- "## Cis QTL Sumstat generation via tensorQTL\n",
- "\n"
+ "## cisQTL association testing"
  ]
  },
  {
@@ -320,7 +284,7 @@
  "kernel": "SoS"
  },
  "source": [
- "## Trans QTL Sumstat generation via tensorQTL\n"
+ "## transQTL association testing"
  ]
  },
  {
@@ -392,7 +356,7 @@
  "kernel": "SoS"
  },
  "source": [
- "**FIXME: we can consolidate these steps. I'll take a look myself after we have the MWE test**"
+ "## Association results processing"
  ]
  },
  {
@@ -468,41 +432,6 @@
  " data_tempt.to_csv(\"$[_output[0]]\",index = False,sep = \"\\t\" )\n",
  " column_info_df.to_csv(\"$[_output[1]]\",index = True,sep = \"\\t\" )"
  ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "kernel": "SoS"
- },
- "outputs": [],
- "source": [
- "tss_distance af ma_samples ma_count"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "kernel": "SoS"
- },
- "outputs": [],
- "source": [
- "Info TensorQTL\n",
- " GENE,CHR,POS,A0,A1\n",
- " chrom\n",
- " pos\n",
- " ref\n",
- " alt\n",
- " variant_id\n",
- " beta\n",
- " se\n",
- " tss_distance\n",
- " af\n",
- " ma_samples\n",
- " ma_count\n",
- " phenotype_id"
- ]
  }
  ],
  "metadata": {
@@ -535,7 +464,7 @@
  "sos"
  ]
  ],
- "version": "0.22.6"
+ "version": "0.22.4"
  }
  },
  "nbformat": 4,
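
Review note on `pval_beta`: the TensorQTL notebook text says this is the p-value to use for downstream analysis. One common downstream step is a multiple-testing correction across genes. The sketch below applies a Benjamini-Hochberg adjustment to a region-level table; the choice of Benjamini-Hochberg, the `bh_adjust` helper, and the example values are assumptions for illustration, since the notebook does not say which correction it uses.

```python
import numpy as np
import pandas as pd


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value down.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


# Made-up region-level results using the documented column names.
region = pd.DataFrame(
    {
        "phenotype_id": ["geneA", "geneB", "geneC", "geneD"],
        "pval_beta": [0.01, 0.04, 0.03, 0.5],
    }
)
region["qval"] = bh_adjust(region["pval_beta"])
print(region)
```

Genes can then be filtered on the adjusted column at whatever threshold the analysis calls for.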