Update reference data page
gaow committed May 5, 2022
1 parent a69f2e7 commit f4195c2
Showing 3 changed files with 64 additions and 82 deletions.
79 changes: 6 additions & 73 deletions code/data_preprocessing/phenotype/gene_annotation.ipynb
@@ -11,7 +11,7 @@
"# Gene coordinate annotation\n",
"\n",
"\n",
"This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated, eg in GCT format, converting them to `bed` format. "
"This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated in `gct` format and convert them to `bed` format for downstreams analysis."
]
},
{
@@ -25,6 +25,8 @@
"\n",
"This pipeline is based on [`pyqtl`, as demonstrated here](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py).\n",
"\n",
"**FIXME: please explain here what we do with gene symbol vs gene ID**\n",
"\n",
"### Alternative implementation\n",
"\n",
"Previously we use `biomaRt` package in R instead of code from `pyqtl`. The core function calls are:\n",
@@ -47,9 +49,9 @@
"source": [
"## Input\n",
"\n",
"1. Molecular phenotype data with the first column being ENSEMBL ID and other columns being sample names. \n",
"1. Molecular phenotype data in `gct` format, with the first column being ENSEMBL ID and other columns being sample names. \n",
"2. GTF for collapsed gene model\n",
" - the gene names must be consistent with the GCT matrices (eg ENSG00000000003 vs. ENSG00000000003.1 will not work) \n",
" - the gene names must be consistent with the molecular phenotype data matrices (eg ENSG00000000003 vs. ENSG00000000003.1 will not work) \n",
"3. (Optional) Meta-data to match between sample names in expression data and genotype files\n",
" - Required input\n",
" - Tab delimited with header\n",
@@ -172,19 +174,6 @@
"sos run gene_annotation.ipynb -h"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "arabic-enemy",
"metadata": {
"kernel": "Bash"
},
"outputs": [],
"source": [
"cd /mnt/mfs/statgen/xqtl_workflow_testing/normalization_r2\n",
"sos run gene_annotation.ipynb -h"
]
},
{
"cell_type": "code",
"execution_count": 2,
@@ -214,52 +203,6 @@
"parameter: container = \"\""
]
},
{
"cell_type": "markdown",
"id": "2f75689b-7ddf-4f3a-9e4a-e47ebdd7536a",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Region List generation\n",
"To partitioning the data by genes require a region list file which:\n",
" 1. have 5 columns: chr,start,end,gene_id,gene_name\n",
" 2. have the same gene as or less gene than that of the bed file\n",
"Input:\n",
" 1. A gtf file used to generated the bed\n",
" 2. A phenotype bed file, must have a gene_id column indicating the name of genes.\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b06cd895-342b-4206-8692-795cc39f4c20",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": [
"[region_list_generation]\n",
"input: phenoFile, annotation_gtf\n",
"output: f'{cwd:a}/{_input[0]:bnn}.region_list'\n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}' \n",
"python: expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container\n",
" import pandas as pd\n",
" import qtl.io\n",
" # get the five column data\n",
" bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_id\" )\n",
" bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_name\" )\n",
" bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = [\"chr\",\"start\",\"end\"])\n",
" # retaining only somatic chromosome\n",
" bed_template_df = bed_template_df[bed_template_df.chr.isin([\"chr\" + str(x) for x in (range(1,23))])]\n",
" bed_template_df.columns = [\"#chr\",\"start\",\"end\",\"gene_id\",\"gene_name\"]\n",
" pheno = pd.read_csv(${_input[0]:r}, sep = \"\\t\")\n",
" # Retaining only the genes in the data\n",
" region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]\n",
" region_list.to_csv(\"${_output}\", sep = \"\\t\",index = 0)"
]
},
{
"cell_type": "markdown",
"id": "cheap-credits",
@@ -272,16 +215,6 @@
"Implementation based on [GTEx pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9fda214-3ef6-41c8-9ae2-fcb10a8bae63",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
@@ -421,7 +354,7 @@
"sos"
]
],
"version": "0.22.6"
"version": "0.22.9"
}
},
"nbformat": 4,
56 changes: 49 additions & 7 deletions code/data_preprocessing/phenotype/phenotype_formatting.ipynb
@@ -11,16 +11,17 @@
"# Phenotype data formatting\n",
"\n",
"\n",
"This is the region extraction step for data processing pipeline for xqtl workflow, containing the generation of:\n",
"1. Molecular_phenotype per chrom within selected regions in the format APEX and tensorQTL takes\n",
"This module implements a collection of workflows used to format molecular phenotype data.\n",
"\n",
"### Input\n",
"\n",
"\n",
"## Input\n",
"The input for this workflow is the collection of data for 1 conditions as described in the readme of this git repo\n",
"1. 1 complete residual molecular_phenotype data\n",
"2. 1 region_list\n",
"Both of these input can be generated by the annotation module of this pipeline\n",
"\n",
"### Output\n",
"## Output\n",
"For each collection, the output is \n",
"1. 1 lists of phenotype file (bed+index) for each chrom, suitable to be fed into both apex and tensorQTL, annotated with chrom and pos\n",
"2. 1 lists of phenotype file (bed+index) for each gene, annotated with chrom and tss"
@@ -81,15 +82,56 @@
"parameter: gene_name_as_phenotype_id = False"
]
},
{
"cell_type": "markdown",
"id": "1df0a8a6-a874-4c02-b007-142d96b860fe",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Region List generation\n",
"\n",
"To partitioning the data by genes require a region list file which:\n",
"\n",
" 1. have 5 columns: chr,start,end,gene_id,gene_name\n",
" 2. have the same gene as or less gene than that of the bed file\n",
" \n",
"Input:\n",
"\n",
" 1. A gtf file used to generated the bed\n",
" 2. A phenotype bed file, must have a gene_id column indicating the name of genes. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "coated-maintenance",
"id": "5913f0bd-a40e-45ff-a5d1-77b99b5752dd",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": []
"source": [
"[generate_region_list]\n",
"# gene gtf annotation table\n",
"parameter: annotation_gtf = path\n",
"input: phenoFile, annotation_gtf\n",
"output: f'{cwd:a}/{_input[0]:bnn}.region_list'\n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}' \n",
"python: expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container\n",
" import pandas as pd\n",
" import qtl.io\n",
" # get the five column data\n",
" bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_id\" )\n",
" bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_name\" )\n",
" bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = [\"chr\",\"start\",\"end\"])\n",
" # retaining only somatic chromosome\n",
" bed_template_df = bed_template_df[bed_template_df.chr.isin([\"chr\" + str(x) for x in (range(1,23))])]\n",
" bed_template_df.columns = [\"#chr\",\"start\",\"end\",\"gene_id\",\"gene_name\"]\n",
" pheno = pd.read_csv(${_input[0]:r}, sep = \"\\t\")\n",
" # Retaining only the genes in the data\n",
" region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]\n",
" region_list.to_csv(\"${_output}\", sep = \"\\t\",index = 0)"
]
},
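{
"cell_type": "markdown",
"id": "region-list-usage-note",
"metadata": {
"kernel": "SoS"
},
"source": [
"For example, assuming the default global parameters, the region list could be generated with a command along these lines (file names below are placeholders):\n",
"\n",
"```\n",
"sos run phenotype_formatting.ipynb generate_region_list \\\n",
"    --phenoFile phenotype.bed.gz \\\n",
"    --annotation-gtf Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf\n",
"```"
]
},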
{
"cell_type": "markdown",
@@ -214,7 +256,7 @@
""
]
],
"version": "0.22.6"
"version": "0.22.9"
}
},
"nbformat": 4,
11 changes: 9 additions & 2 deletions code/data_preprocessing/reference_data.ipynb
@@ -12,7 +12,7 @@
"\n",
"This module provides reference data download, indexing and preprocessing (if necessary), in preparation for use throughout the pipeline.\n",
"\n",
"We have included the PDF document compiled by data standardization subgroup in the [minimal working example folder on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing). It contains the reference data to use for the project."
"We have included the PDF document compiled by data standardization subgroup in the [on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing) as well as on [ADSP Dashboard](https://www.niagads.org/adsp/content/adspgcadgenomeresources-v2pdf). It contains the reference data to use for the project."
]
},
{
@@ -24,7 +24,7 @@
"source": [
"## Overview\n",
"\n",
"This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md).\n",
"This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). The reference data after we process it (details see Methods section and the rest of the analysis) can be found [in this folder on Google Drive](https://drive.google.com/drive/folders/19fmoII8yS7XE7HFcMU4OfvC2bL1zMD_P). Specifically, the list of reference files to be used are:\n",
"\n",
"1. `GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}`\n",
"2. `Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf` for stranded protocol, and `Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf` for unstranded protocol.\n",
"3. Everything under `STAR_Index` folder\n",
"4. Everything under `RSEM_Index` folder\n",
"\n",
"## Methods\n",
"\n",
"Workflows implemented include:\n",
"\n",
