diff --git a/code/data_preprocessing/phenotype/gene_annotation.ipynb b/code/data_preprocessing/phenotype/gene_annotation.ipynb
index e4cbae2ff..da7127f4a 100644
--- a/code/data_preprocessing/phenotype/gene_annotation.ipynb
+++ b/code/data_preprocessing/phenotype/gene_annotation.ipynb
@@ -11,7 +11,7 @@
 "# Gene coordinate annotation\n",
 "\n",
 "\n",
- "This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated, eg in GCT format, converting them to `bed` format. "
+ "This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated in `gct` format and converts them to `bed` format for downstream analysis."
 ]
 },
 {
@@ -25,6 +25,8 @@
 "\n",
 "This pipeline is based on [`pyqtl`, as demonstrated here](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py).\n",
 "\n",
+ "**FIXME: please explain here what we do with gene symbol vs gene ID**\n",
+ "\n",
 "### Alternative implementation\n",
 "\n",
 "Previously we use `biomaRt` package in R instead of code from `pyqtl`. The core function calls are:\n",
@@ -47,9 +49,9 @@
 "source": [
 "## Input\n",
 "\n",
- "1. Molecular phenotype data with the first column being ENSEMBL ID and other columns being sample names. \n",
+ "1. Molecular phenotype data in `gct` format, with the first column being ENSEMBL ID and other columns being sample names. \n",
 "2. GTF for collapsed gene model\n",
- "    - the gene names must be consistent with the GCT matrices (eg ENSG00000000003 vs. ENSG00000000003.1 will not work) \n",
+ "    - the gene names must be consistent with the molecular phenotype data matrices (e.g. ENSG00000000003 vs. ENSG00000000003.1 will not work) \n",
 "3. (Optional) Meta-data to match between sample names in expression data and genotype files\n",
 "    - Required input\n",
 "    - Tab delimited with header\n",
@@ -172,19 +174,6 @@
 "sos run gene_annotation.ipynb -h"
 ]
 },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "arabic-enemy",
- "metadata": {
- "kernel": "Bash"
- },
- "outputs": [],
- "source": [
- "cd /mnt/mfs/statgen/xqtl_workflow_testing/normalization_r2\n",
- "sos run gene_annotation.ipynb -h"
- ]
- },
 {
 "cell_type": "code",
 "execution_count": 2,
@@ -214,52 +203,6 @@
 "parameter: container = \"\""
 ]
 },
- {
- "cell_type": "markdown",
- "id": "2f75689b-7ddf-4f3a-9e4a-e47ebdd7536a",
- "metadata": {
- "kernel": "SoS"
- },
- "source": [
- "## Region List generation\n",
- "To partitioning the data by genes require a region list file which:\n",
- "    1. have 5 columns: chr,start,end,gene_id,gene_name\n",
- "    2. have the same gene as or less gene than that of the bed file\n",
- "Input:\n",
- "    1. A gtf file used to generated the bed\n",
- "    2. A phenotype bed file, must have a gene_id column indicating the name of genes.\n",
- "    "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b06cd895-342b-4206-8692-795cc39f4c20",
- "metadata": {
- "kernel": "SoS"
- },
- "outputs": [],
- "source": [
- "[region_list_generation]\n",
- "input: phenoFile, annotation_gtf\n",
- "output: f'{cwd:a}/{_input[0]:bnn}.region_list'\n",
- "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}' \n",
- "python: expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container\n",
- "    import pandas as pd\n",
- "    import qtl.io\n",
- "    # get the five column data\n",
- "    bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_id\" )\n",
- "    bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_name\" )\n",
- "    bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = [\"chr\",\"start\",\"end\"])\n",
- "    # retaining only somatic chromosome\n",
- "    bed_template_df = bed_template_df[bed_template_df.chr.isin([\"chr\" + str(x) for x in (range(1,23))])]\n",
- "    bed_template_df.columns = [\"#chr\",\"start\",\"end\",\"gene_id\",\"gene_name\"]\n",
- "    pheno = pd.read_csv(${_input[0]:r}, sep = \"\\t\")\n",
- "    # Retaining only the genes in the data\n",
- "    region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]\n",
- "    region_list.to_csv(\"${_output}\", sep = \"\\t\",index = 0)"
- ]
- },
 {
 "cell_type": "markdown",
 "id": "cheap-credits",
@@ -272,16 +215,6 @@
 "Implementation based on [GTEx pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py)."
 ]
 },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b9fda214-3ef6-41c8-9ae2-fcb10a8bae63",
- "metadata": {
- "kernel": "SoS"
- },
- "outputs": [],
- "source": []
- },
 {
 "cell_type": "code",
 "execution_count": null,
@@ -421,7 +354,7 @@
 "sos"
 ]
 ],
- "version": "0.22.6"
+ "version": "0.22.9"
 }
 },
 "nbformat": 4,
diff --git a/code/data_preprocessing/phenotype/phenotype_formatting.ipynb b/code/data_preprocessing/phenotype/phenotype_formatting.ipynb
index 5a8583ed2..a579f1e9b 100644
--- a/code/data_preprocessing/phenotype/phenotype_formatting.ipynb
+++ b/code/data_preprocessing/phenotype/phenotype_formatting.ipynb
@@ -11,16 +11,17 @@
 "# Phenotype data formatting\n",
 "\n",
 "\n",
- "This is the region extraction step for data processing pipeline for xqtl workflow, containing the generation of:\n",
- "1. Molecular_phenotype per chrom within selected regions in the format APEX and tensorQTL takes\n",
+ "This module implements a collection of workflows used to format molecular phenotype data.\n",
 "\n",
- "### Input\n",
+ "\n",
+ "\n",
+ "## Input\n",
 "The input for this workflow is the collection of data for 1 conditions as described in the readme of this git repo\n",
 "1. 1 complete residual molecular_phenotype data\n",
 "2. 1 region_list\n",
 "Both of these input can be generated by the annotation module of this pipeline\n",
 "\n",
- "### Output\n",
+ "## Output\n",
 "For each collection, the output is \n",
 "1. 1 lists of phenotype file (bed+index) for each chrom, suitable to be fed into both apex and tensorQTL, annotated with chrom and pos\n",
 "2. 1 lists of phenotype file (bed+index) for each gene, annotated with chrom and tss"
@@ -81,15 +82,56 @@
 "parameter: gene_name_as_phenotype_id = False"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "1df0a8a6-a874-4c02-b007-142d96b860fe",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## Region List generation\n",
+ "\n",
+ "Partitioning the data by genes requires a region list file which:\n",
+ "\n",
+ "    1. has 5 columns: chr, start, end, gene_id, gene_name\n",
+ "    2. contains the same genes as, or fewer genes than, the bed file\n",
+ "    \n",
+ "Input:\n",
+ "\n",
+ "    1. A GTF file used to generate the bed file\n",
+ "    2. A phenotype bed file, which must have a gene_id column indicating the gene names. "
+ ]
+ },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "coated-maintenance",
+ "id": "5913f0bd-a40e-45ff-a5d1-77b99b5752dd",
 "metadata": {
 "kernel": "SoS"
 },
 "outputs": [],
- "source": []
+ "source": [
+ "[generate_region_list]\n",
+ "# gene gtf annotation table\n",
+ "parameter: annotation_gtf = path\n",
+ "input: phenoFile, annotation_gtf\n",
+ "output: f'{cwd:a}/{_input[0]:bnn}.region_list'\n",
+ "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}' \n",
+ "python: expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container\n",
+ "    import pandas as pd\n",
+ "    import qtl.io\n",
+ "    # get the five column data\n",
+ "    bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_id\" )\n",
+ "    bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_name\" )\n",
+ "    bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = [\"chr\",\"start\",\"end\"])\n",
+ "    # retain only autosomes (chr1-22)\n",
+ "    bed_template_df = bed_template_df[bed_template_df.chr.isin([\"chr\" + str(x) for x in (range(1,23))])]\n",
+ "    bed_template_df.columns = [\"#chr\",\"start\",\"end\",\"gene_id\",\"gene_name\"]\n",
+ "    pheno = pd.read_csv(${_input[0]:r}, sep = \"\\t\")\n",
+ "    # Retaining only the genes in the data\n",
+ "    region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]\n",
+ "    region_list.to_csv(\"${_output}\", sep = \"\\t\",index = 0)"
+ ]
 },
 {
 "cell_type": "markdown",
@@ -214,7 +256,7 @@
 ""
 ]
 ],
- "version": "0.22.6"
+ "version": "0.22.9"
 }
 },
 "nbformat": 4,
diff --git a/code/data_preprocessing/reference_data.ipynb b/code/data_preprocessing/reference_data.ipynb
index cf8f1084c..48c29c57e 100644
--- a/code/data_preprocessing/reference_data.ipynb
+++ b/code/data_preprocessing/reference_data.ipynb
@@ -12,7 +12,7 @@
 "\n",
 "This module provides reference data download, indexing and preprocessing (if necessary), in preparation for use throughout the pipeline.\n",
 "\n",
- "We have included the PDF document compiled by data standardization subgroup in the [minimal working example folder on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing). It contains the reference data to use for the project."
+ "We have included the PDF document compiled by the data standardization subgroup [on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing) as well as on the [ADSP Dashboard](https://www.niagads.org/adsp/content/adspgcadgenomeresources-v2pdf). It contains the reference data to use for the project."
 ]
 },
 {
@@ -24,7 +24,14 @@
 "source": [
 "## Overview\n",
 "\n",
- "This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md).\n",
+ "This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). The reference data, after we process it (see the Methods section and the rest of the analysis for details), can be found [in this folder on Google Drive](https://drive.google.com/drive/folders/19fmoII8yS7XE7HFcMU4OfvC2bL1zMD_P). Specifically, the reference files to be used are:\n",
+ "\n",
+ "1. `GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}`\n",
+ "2. `Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf` for the stranded protocol, and `Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf` for the unstranded protocol.\n",
+ "3. Everything under the `STAR_Index` folder\n",
+ "4. Everything under the `RSEM_Index` folder\n",
+ "\n",
+ "## Methods\n",
 "\n",
 "Workflows implemented include:\n",
 "\n",
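For reference, once the SoS template variables are resolved, the relocated `generate_region_list` step reduces to the short standalone sketch below. This is an illustration only, not part of the patch: it assumes `pandas` and `pyqtl` (for `qtl.io.gtf_to_tss_bed`) are installed, fixes `${phenotype_id_type}` to `gene_id`, and uses hypothetical input and output paths.

```python
# Standalone sketch of the generate_region_list step, with SoS templating resolved.
# Assumes pyqtl and pandas are installed; all file paths below are hypothetical placeholders.
import pandas as pd
import qtl.io

annotation_gtf = "Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf"  # hypothetical
pheno_file = "phenotype.bed.gz"                                           # hypothetical
output_file = "phenotype.region_list"                                     # hypothetical

# Build TSS tables keyed by gene_id and by gene_name, then merge on coordinates
# to obtain the five-column template: chr, start, end, gene_id, gene_name.
bed_id = qtl.io.gtf_to_tss_bed(annotation_gtf, feature='transcript', phenotype_id="gene_id")
bed_name = qtl.io.gtf_to_tss_bed(annotation_gtf, feature='transcript', phenotype_id="gene_name")
bed = bed_id.merge(bed_name, on=["chr", "start", "end"])

# Retain only autosomes (chr1-22), then rename to the expected header.
bed = bed[bed.chr.isin(["chr" + str(x) for x in range(1, 23)])]
bed.columns = ["#chr", "start", "end", "gene_id", "gene_name"]

# Keep only the genes that are present in the phenotype bed file.
pheno = pd.read_csv(pheno_file, sep="\t")
region_list = bed[bed.gene_id.isin(pheno.gene_id)]
region_list.to_csv(output_file, sep="\t", index=False)
```

Merging the `gene_id`-keyed and `gene_name`-keyed TSS tables on their coordinates is what yields the five-column region list (#chr, start, end, gene_id, gene_name) that the partitioning workflows expect.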