Update reference data page
gaow committed May 5, 2022
1 parent a69f2e7 commit f4195c2
Showing 3 changed files with 64 additions and 82 deletions.
79 changes: 6 additions & 73 deletions code/data_preprocessing/phenotype/gene_annotation.ipynb
@@ -11,7 +11,7 @@
"# Gene coordinate annotation\n",
"\n",
"\n",
"This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated, eg in GCT format, converting them to `bed` format. "
"This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated in `gct` format and convert them to `bed` format for downstreams analysis."
]
},
{
@@ -25,6 +25,8 @@
"\n",
"This pipeline is based on [`pyqtl`, as demonstrated here](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py).\n",
"\n",
"**FIXME: please explain here what we do with gene symbol vs gene ID**\n",
"\n",
"### Alternative implementation\n",
"\n",
"Previously we use `biomaRt` package in R instead of code from `pyqtl`. The core function calls are:\n",
@@ -47,9 +49,9 @@
"source": [
"## Input\n",
"\n",
"1. Molecular phenotype data with the first column being ENSEMBL ID and other columns being sample names. \n",
"1. Molecular phenotype data in `gct` format, with the first column being ENSEMBL ID and other columns being sample names. \n",
"2. GTF for collapsed gene model\n",
" - the gene names must be consistent with the GCT matrices (eg ENSG00000000003 vs. ENSG00000000003.1 will not work) \n",
" - the gene names must be consistent with the molecular phenotype data matrices (eg ENSG00000000003 vs. ENSG00000000003.1 will not work) \n",
"3. (Optional) Meta-data to match between sample names in expression data and genotype files\n",
" - Required input\n",
" - Tab delimited with header\n",
@@ -172,19 +174,6 @@
"sos run gene_annotation.ipynb -h"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "arabic-enemy",
"metadata": {
"kernel": "Bash"
},
"outputs": [],
"source": [
"cd /mnt/mfs/statgen/xqtl_workflow_testing/normalization_r2\n",
"sos run gene_annotation.ipynb -h"
]
},
{
"cell_type": "code",
"execution_count": 2,
@@ -214,52 +203,6 @@
"parameter: container = \"\""
]
},
{
"cell_type": "markdown",
"id": "2f75689b-7ddf-4f3a-9e4a-e47ebdd7536a",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Region List generation\n",
"To partitioning the data by genes require a region list file which:\n",
" 1. have 5 columns: chr,start,end,gene_id,gene_name\n",
" 2. have the same gene as or less gene than that of the bed file\n",
"Input:\n",
" 1. A gtf file used to generated the bed\n",
" 2. A phenotype bed file, must have a gene_id column indicating the name of genes.\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b06cd895-342b-4206-8692-795cc39f4c20",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": [
"[region_list_generation]\n",
"input: phenoFile, annotation_gtf\n",
"output: f'{cwd:a}/{_input[0]:bnn}.region_list'\n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}' \n",
"python: expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container\n",
" import pandas as pd\n",
" import qtl.io\n",
" # get the five column data\n",
" bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_id\" )\n",
" bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_name\" )\n",
" bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = [\"chr\",\"start\",\"end\"])\n",
" # retaining only somatic chromosome\n",
" bed_template_df = bed_template_df[bed_template_df.chr.isin([\"chr\" + str(x) for x in (range(1,23))])]\n",
" bed_template_df.columns = [\"#chr\",\"start\",\"end\",\"gene_id\",\"gene_name\"]\n",
" pheno = pd.read_csv(${_input[0]:r}, sep = \"\\t\")\n",
" # Retaining only the genes in the data\n",
" region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]\n",
" region_list.to_csv(\"${_output}\", sep = \"\\t\",index = 0)"
]
},
{
"cell_type": "markdown",
"id": "cheap-credits",
@@ -272,16 +215,6 @@
"Implementation based on [GTEx pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9fda214-3ef6-41c8-9ae2-fcb10a8bae63",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
@@ -421,7 +354,7 @@
"sos"
]
],
"version": "0.22.6"
"version": "0.22.9"
}
},
"nbformat": 4,
56 changes: 49 additions & 7 deletions code/data_preprocessing/phenotype/phenotype_formatting.ipynb
@@ -11,16 +11,17 @@
"# Phenotype data formatting\n",
"\n",
"\n",
"This is the region extraction step for data processing pipeline for xqtl workflow, containing the generation of:\n",
"1. Molecular_phenotype per chrom within selected regions in the format APEX and tensorQTL takes\n",
"This module implements a collection of workflows used to format molecular phenotype data.\n",
"\n",
"### Input\n",
"\n",
"\n",
"## Input\n",
"The input for this workflow is the collection of data for 1 conditions as described in the readme of this git repo\n",
"1. 1 complete residual molecular_phenotype data\n",
"2. 1 region_list\n",
"Both of these input can be generated by the annotation module of this pipeline\n",
"\n",
"### Output\n",
"## Output\n",
"For each collection, the output is \n",
"1. 1 lists of phenotype file (bed+index) for each chrom, suitable to be fed into both apex and tensorQTL, annotated with chrom and pos\n",
"2. 1 lists of phenotype file (bed+index) for each gene, annotated with chrom and tss"
@@ -81,15 +82,56 @@
"parameter: gene_name_as_phenotype_id = False"
]
},
{
"cell_type": "markdown",
"id": "1df0a8a6-a874-4c02-b007-142d96b860fe",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Region List generation\n",
"\n",
"To partitioning the data by genes require a region list file which:\n",
"\n",
" 1. have 5 columns: chr,start,end,gene_id,gene_name\n",
" 2. have the same gene as or less gene than that of the bed file\n",
" \n",
"Input:\n",
"\n",
" 1. A gtf file used to generated the bed\n",
" 2. A phenotype bed file, must have a gene_id column indicating the name of genes. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "coated-maintenance",
"id": "5913f0bd-a40e-45ff-a5d1-77b99b5752dd",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": []
"source": [
"[generate_region_list]\n",
"# gene gtf annotation table\n",
"parameter: annotation_gtf = path\n",
"input: phenoFile, annotation_gtf\n",
"output: f'{cwd:a}/{_input[0]:bnn}.region_list'\n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}' \n",
"python: expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container\n",
" import pandas as pd\n",
" import qtl.io\n",
" # get the five column data\n",
" bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_id\" )\n",
" bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_name\" )\n",
" bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = [\"chr\",\"start\",\"end\"])\n",
" # retaining only somatic chromosome\n",
" bed_template_df = bed_template_df[bed_template_df.chr.isin([\"chr\" + str(x) for x in (range(1,23))])]\n",
" bed_template_df.columns = [\"#chr\",\"start\",\"end\",\"gene_id\",\"gene_name\"]\n",
" pheno = pd.read_csv(${_input[0]:r}, sep = \"\\t\")\n",
" # Retaining only the genes in the data\n",
" region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]\n",
" region_list.to_csv(\"${_output}\", sep = \"\\t\",index = 0)"
]
},
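{
"cell_type": "markdown",
"id": "region-list-usage-note",
"metadata": {
"kernel": "SoS"
},
"source": [
"For example, assuming the default global parameters, the region list could be generated with a command along these lines (file names below are placeholders):\n",
"\n",
"```\n",
"sos run phenotype_formatting.ipynb generate_region_list \\\n",
"    --phenoFile phenotype.bed.gz \\\n",
"    --annotation-gtf Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf\n",
"```"
]
},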
{
"cell_type": "markdown",
@@ -214,7 +256,7 @@
""
]
],
"version": "0.22.6"
"version": "0.22.9"
}
},
"nbformat": 4,
11 changes: 9 additions & 2 deletions code/data_preprocessing/reference_data.ipynb
@@ -12,7 +12,7 @@
"\n",
"This module provides reference data download, indexing and preprocessing (if necessary), in preparation for use throughout the pipeline.\n",
"\n",
"We have included the PDF document compiled by data standardization subgroup in the [minimal working example folder on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing). It contains the reference data to use for the project."
"We have included the PDF document compiled by data standardization subgroup in the [on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing) as well as on [ADSP Dashboard](https://www.niagads.org/adsp/content/adspgcadgenomeresources-v2pdf). It contains the reference data to use for the project."
]
},
{
@@ -24,7 +24,7 @@
"source": [
"## Overview\n",
"\n",
"This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md).\n",
"This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). The reference data after we process it (details see Methods section and the rest of the analysis) can be found [in this folder on Google Drive](https://drive.google.com/drive/folders/19fmoII8yS7XE7HFcMU4OfvC2bL1zMD_P). Specifically, the list of reference files to be used are:\n",
"\n",
"1. `GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}`\n",
"2. `Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf` for stranded protocol, and `Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf` for unstranded protocol.\n",
"3. Everything under `STAR_Index` folder\n",
"4. Everything under `RSEM_Index` folder\n",
"\n",
"## Methods\n",
"\n",
"Workflows implemented include:\n",
"\n",
