diff --git a/code/data_preprocessing/phenotype/gene_annotation.ipynb b/code/data_preprocessing/phenotype/gene_annotation.ipynb
index e4cbae2ff..da7127f4a 100644
--- a/code/data_preprocessing/phenotype/gene_annotation.ipynb
+++ b/code/data_preprocessing/phenotype/gene_annotation.ipynb
@@ -11,7 +11,7 @@
 "# Gene coordinate annotation\n",
 "\n",
 "\n",
- "This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated, eg in GCT format, converting them to `bed` format. "
+ "This workflow adds genomic coordinate annotation to gene-level molecular phenotype files generated in `gct` format and converts them to `bed` format for downstream analysis."
 ]
 },
 {
@@ -25,6 +25,8 @@
 "\n",
 "This pipeline is based on [`pyqtl`, as demonstrated here](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py).\n",
 "\n",
+ "**FIXME: please explain here what we do with gene symbol vs gene ID**\n",
+ "\n",
 "### Alternative implementation\n",
 "\n",
 "Previously we use `biomaRt` package in R instead of code from `pyqtl`. The core function calls are:\n",
@@ -47,9 +49,9 @@
 "source": [
 "## Input\n",
 "\n",
- "1. Molecular phenotype data with the first column being ENSEMBL ID and other columns being sample names. \n",
+ "1. Molecular phenotype data in `gct` format, with the first column being ENSEMBL ID and other columns being sample names. \n",
 "2. GTF for collapsed gene model\n",
- "    - the gene names must be consistent with the GCT matrices (eg ENSG00000000003 vs. ENSG00000000003.1 will not work) \n",
+ "    - the gene names must be consistent with the molecular phenotype data matrices (e.g. ENSG00000000003 vs. ENSG00000000003.1 will not work) \n",
 "3. (Optional) Meta-data to match between sample names in expression data and genotype files\n",
 "    - Required input\n",
 "    - Tab delimited with header\n",
@@ -172,19 +174,6 @@
 "sos run gene_annotation.ipynb -h"
 ]
 },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "arabic-enemy",
- "metadata": {
- "kernel": "Bash"
- },
- "outputs": [],
- "source": [
- "cd /mnt/mfs/statgen/xqtl_workflow_testing/normalization_r2\n",
- "sos run gene_annotation.ipynb -h"
- ]
- },
 {
 "cell_type": "code",
 "execution_count": 2,
@@ -214,52 +203,6 @@
 "parameter: container = \"\""
 ]
 },
- {
- "cell_type": "markdown",
- "id": "2f75689b-7ddf-4f3a-9e4a-e47ebdd7536a",
- "metadata": {
- "kernel": "SoS"
- },
- "source": [
- "## Region List generation\n",
- "To partitioning the data by genes require a region list file which:\n",
- "    1. have 5 columns: chr,start,end,gene_id,gene_name\n",
- "    2. have the same gene as or less gene than that of the bed file\n",
- "Input:\n",
- "    1. A gtf file used to generated the bed\n",
- "    2. A phenotype bed file, must have a gene_id column indicating the name of genes.\n",
- "    "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b06cd895-342b-4206-8692-795cc39f4c20",
- "metadata": {
- "kernel": "SoS"
- },
- "outputs": [],
- "source": [
- "[region_list_generation]\n",
- "input: phenoFile, annotation_gtf\n",
- "output: f'{cwd:a}/{_input[0]:bnn}.region_list'\n",
- "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}' \n",
- "python: expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container\n",
- "    import pandas as pd\n",
- "    import qtl.io\n",
- "    # get the five column data\n",
- "    bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_id\" )\n",
- "    bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_name\" )\n",
- "    bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = [\"chr\",\"start\",\"end\"])\n",
- "    # retaining only somatic chromosome\n",
- "    bed_template_df = bed_template_df[bed_template_df.chr.isin([\"chr\" + str(x) for x in (range(1,23))])]\n",
- "    bed_template_df.columns = [\"#chr\",\"start\",\"end\",\"gene_id\",\"gene_name\"]\n",
- "    pheno = pd.read_csv(${_input[0]:r}, sep = \"\\t\")\n",
- "    # Retaining only the genes in the data\n",
- "    region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]\n",
- "    region_list.to_csv(\"${_output}\", sep = \"\\t\",index = 0)"
- ]
- },
 {
 "cell_type": "markdown",
 "id": "cheap-credits",
@@ -272,16 +215,6 @@
 "Implementation based on [GTEx pipeline](https://github.com/broadinstitute/gtex-pipeline/blob/master/qtl/src/eqtl_prepare_expression.py)."
 ]
 },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b9fda214-3ef6-41c8-9ae2-fcb10a8bae63",
- "metadata": {
- "kernel": "SoS"
- },
- "outputs": [],
- "source": []
- },
 {
 "cell_type": "code",
 "execution_count": null,
@@ -421,7 +354,7 @@
 "sos"
 ]
 ],
- "version": "0.22.6"
+ "version": "0.22.9"
 }
 },
 "nbformat": 4,
diff --git a/code/data_preprocessing/phenotype/phenotype_formatting.ipynb b/code/data_preprocessing/phenotype/phenotype_formatting.ipynb
index 5a8583ed2..a579f1e9b 100644
--- a/code/data_preprocessing/phenotype/phenotype_formatting.ipynb
+++ b/code/data_preprocessing/phenotype/phenotype_formatting.ipynb
@@ -11,16 +11,17 @@
 "# Phenotype data formatting\n",
 "\n",
 "\n",
- "This is the region extraction step for data processing pipeline for xqtl workflow, containing the generation of:\n",
- "1. Molecular_phenotype per chrom within selected regions in the format APEX and tensorQTL takes\n",
+ "This module implements a collection of workflows used to format molecular phenotype data.\n",
 "\n",
- "### Input\n",
+ "\n",
+ "\n",
+ "## Input\n",
 "The input for this workflow is the collection of data for 1 conditions as described in the readme of this git repo\n",
 "1. 1 complete residual molecular_phenotype data\n",
 "2. 1 region_list\n",
 "Both of these input can be generated by the annotation module of this pipeline\n",
 "\n",
- "### Output\n",
+ "## Output\n",
 "For each collection, the output is \n",
 "1. 1 lists of phenotype file (bed+index) for each chrom, suitable to be fed into both apex and tensorQTL, annotated with chrom and pos\n",
 "2. 1 lists of phenotype file (bed+index) for each gene, annotated with chrom and tss"
@@ -81,15 +82,56 @@
 "parameter: gene_name_as_phenotype_id = False"
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "1df0a8a6-a874-4c02-b007-142d96b860fe",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## Region List generation\n",
+ "\n",
+ "Partitioning the data by genes requires a region list file which:\n",
+ "\n",
+ "    1. has 5 columns: chr, start, end, gene_id, gene_name\n",
+ "    2. contains the same genes as, or fewer genes than, the bed file\n",
+ "    \n",
+ "Input:\n",
+ "\n",
+ "    1. A GTF file used to generate the bed file\n",
+ "    2. A phenotype bed file, which must have a gene_id column indicating the gene names. "
+ ]
+ },
 {
 "cell_type": "code",
 "execution_count": null,
- "id": "coated-maintenance",
+ "id": "5913f0bd-a40e-45ff-a5d1-77b99b5752dd",
 "metadata": {
 "kernel": "SoS"
 },
 "outputs": [],
- "source": []
+ "source": [
+ "[generate_region_list]\n",
+ "# gene gtf annotation table\n",
+ "parameter: annotation_gtf = path\n",
+ "input: phenoFile, annotation_gtf\n",
+ "output: f'{cwd:a}/{_input[0]:bnn}.region_list'\n",
+ "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}' \n",
+ "python: expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container\n",
+ "    import pandas as pd\n",
+ "    import qtl.io\n",
+ "    # get the five column data\n",
+ "    bed_template_df_id = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_id\" )\n",
+ "    bed_template_df_name = qtl.io.gtf_to_tss_bed(${_input[1]:ar}, feature='transcript',phenotype_id = \"gene_name\" )\n",
+ "    bed_template_df = bed_template_df_id.merge(bed_template_df_name, on = [\"chr\",\"start\",\"end\"])\n",
+ "    # retain only autosomes (chr1-22)\n",
+ "    bed_template_df = bed_template_df[bed_template_df.chr.isin([\"chr\" + str(x) for x in (range(1,23))])]\n",
+ "    bed_template_df.columns = [\"#chr\",\"start\",\"end\",\"gene_id\",\"gene_name\"]\n",
+ "    pheno = pd.read_csv(${_input[0]:r}, sep = \"\\t\")\n",
+ "    # Retaining only the genes in the data\n",
+ "    region_list = bed_template_df[bed_template_df.${phenotype_id_type}.isin(pheno.gene_id)]\n",
+ "    region_list.to_csv(\"${_output}\", sep = \"\\t\",index = 0)"
+ ]
 },
 {
 "cell_type": "markdown",
@@ -214,7 +256,7 @@
 ""
 ]
 ],
- "version": "0.22.6"
+ "version": "0.22.9"
 }
 },
 "nbformat": 4,
diff --git a/code/data_preprocessing/reference_data.ipynb b/code/data_preprocessing/reference_data.ipynb
index cf8f1084c..48c29c57e 100644
--- a/code/data_preprocessing/reference_data.ipynb
+++ b/code/data_preprocessing/reference_data.ipynb
@@ -12,7 +12,7 @@
 "\n",
 "This module provides reference data download, indexing and preprocessing (if necessary), in preparation for use throughout the pipeline.\n",
 "\n",
- "We have included the PDF document compiled by data standardization subgroup in the [minimal working example folder on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing). It contains the reference data to use for the project."
+ "We have included the PDF document compiled by the data standardization subgroup [on Google Drive](https://drive.google.com/file/d/1R5sw5o8vqk_mbQQb4CGmtH3ldu1T3Vu0/view?usp=sharing) as well as on the [ADSP Dashboard](https://www.niagads.org/adsp/content/adspgcadgenomeresources-v2pdf). It contains the reference data to use for the project."
 ]
 },
 {
@@ -24,7 +24,14 @@
 "source": [
 "## Overview\n",
 "\n",
- "This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md).\n",
+ "This module is based on the [TOPMed workflow from Broad](https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md). The reference data, after we process it (see the Methods section and the rest of the analysis for details), can be found [in this folder on Google Drive](https://drive.google.com/drive/folders/19fmoII8yS7XE7HFcMU4OfvC2bL1zMD_P). Specifically, the reference files to be used are:\n",
+ "\n",
+ "1. `GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}`\n",
+ "2. `Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf` for the stranded protocol, and `Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf` for the unstranded protocol.\n",
+ "3. Everything under the `STAR_Index` folder\n",
+ "4. Everything under the `RSEM_Index` folder\n",
+ "\n",
+ "## Methods\n",
 "\n",
 "Workflows implemented include:\n",
 "\n",
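For reference, once the SoS template variables are resolved, the relocated `generate_region_list` step reduces to the short standalone sketch below. This is an illustration only, not part of the patch: it assumes `pandas` and `pyqtl` (for `qtl.io.gtf_to_tss_bed`) are installed, fixes `${phenotype_id_type}` to `gene_id`, and uses hypothetical input and output paths.

```python
# Standalone sketch of the generate_region_list step, with SoS templating resolved.
# Assumes pyqtl and pandas are installed; all file paths below are hypothetical placeholders.
import pandas as pd
import qtl.io

annotation_gtf = "Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf"  # hypothetical
pheno_file = "phenotype.bed.gz"                                           # hypothetical
output_file = "phenotype.region_list"                                     # hypothetical

# Build TSS tables keyed by gene_id and by gene_name, then merge on coordinates
# to obtain the five-column template: chr, start, end, gene_id, gene_name.
bed_id = qtl.io.gtf_to_tss_bed(annotation_gtf, feature='transcript', phenotype_id="gene_id")
bed_name = qtl.io.gtf_to_tss_bed(annotation_gtf, feature='transcript', phenotype_id="gene_name")
bed = bed_id.merge(bed_name, on=["chr", "start", "end"])

# Retain only autosomes (chr1-22), then rename to the expected header.
bed = bed[bed.chr.isin(["chr" + str(x) for x in range(1, 23)])]
bed.columns = ["#chr", "start", "end", "gene_id", "gene_name"]

# Keep only the genes that are present in the phenotype bed file.
pheno = pd.read_csv(pheno_file, sep="\t")
region_list = bed[bed.gene_id.isin(pheno.gene_id)]
region_list.to_csv(output_file, sep="\t", index=False)
```

Merging the `gene_id`-keyed and `gene_name`-keyed TSS tables on their coordinates is what yields the five-column region list (#chr, start, end, gene_id, gene_name) that the partitioning workflows expect.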