Update RNA_calling.ipynb #478

Closed
wants to merge 1 commit into from
86 changes: 64 additions & 22 deletions code/molecular_phenotypes/calling/RNA_calling.ipynb
@@ -100,6 +100,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "rough-robinson",
"metadata": {
@@ -116,6 +117,7 @@
"sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150\n",
"sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75\n",
"```\n",
"**Warning**: For single-ended (SE) data, your input data should exclude the `fq2` column. Of the remaining two columns, the `strand` column is optional (the `strand` information can be detected by the `STAR_output` pipeline), whereas the `read_length` column (which can be validated/inferred from the `fastqc` report) is **required**. If no prior information is available for `read_length`, leave the column empty.\n",
"\n",
"All the fastq files should be available under the specified folder (by default, the same folder as the meta-data file)."
]
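For illustration, a hypothetical `sample_fastq.list` for SE data could look like the sketch below. The column header is an assumption inferred from the warning above (confirm it against the full PE example earlier in this notebook); note the empty `read_length` field for `sample_3`, which is allowed when no prior information is available:

```
#ID fq1 strand read_length
sample_1 samp1.fq.gz fr 150
sample_2 samp2.fq.gz fr 100
sample_3 samp3.fq.gz strand_missing
```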
@@ -195,11 +197,12 @@
" --data-dir data \\\n",
" --STAR-index reference_data/STAR_Index/ \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f04410f9-6721-4bd5-8aea-ef9150103347",
"metadata": {
@@ -208,7 +211,9 @@
"source": [
"This step aligns the reads with STAR and generates the bam_list recipe for the downstream molecular phenotype count matrices. The `-J 3 -c csg.yml -q csg` part is crucial because it requests the memory required to run the STAR alignment.\n",
"\n",
"***Warning***: It is of crucial importance that, the STAR Alignment shall be ran with the gtf file that were generated `before collapsing to gene`, i.e. the one that are used to generated RSEM index. Other wise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390)"
"***Warning1***: It is crucially important that the STAR alignment be run with the gtf file generated `before collapsing to gene`, i.e. the one used to generate the RSEM index. Otherwise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390).\n",
"\n",
"***Warning2***: If the `read_length` column is empty in your original meta-data file `sample_fastq.list`, please add the following argument to the `STAR_output` pipeline: `--sjdbOverhang 100` (the default value for `STAR` alignment) or `--sjdbOverhang {read_length}` (if you can deduce the correct read length from the QC report). The default value of 100 will be sufficient to give correct results."
]
},
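To make ***Warning2*** concrete, below is a hedged sketch of appending the overhang setting to the `STAR_output` invocation shown in the next cell. The `sos run` entry point and the assumption that the pipeline forwards `--sjdbOverhang` to STAR are not confirmed by this diff; adjust it to your actual command line (or set the value inside the `STAR_output` step) as appropriate.

```bash
# Sketch only: same STAR_output arguments as in the next cell, with the overhang
# flag appended because read_length was left empty in sample_fastq.list.
# 100 matches STAR's default --sjdbOverhang value; the entry point and any
# sample/fastq arguments not shown in this hunk are assumptions.
sos run code/molecular_phenotypes/calling/RNA_calling.ipynb STAR_output \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \
    --bam_list output/rnaseq/sample_fastq_bam_list \
    --sjdbOverhang 100 \
    -J 3 -c csg.yml -q csg
```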
{
@@ -259,7 +264,7 @@
" --STAR-index reference_data/STAR_Index/ \\\n",
" --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n",
" --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg"
]
@@ -291,21 +296,11 @@
" --RSEM-index reference_data/RSEM_Index/ \\\n",
" --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n",
" --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5689265-c9d3-40a6-adf9-21349a342e68",
"metadata": {
"kernel": "Bash"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "temporal-worse",
@@ -578,11 +573,10 @@
"source": [
"[fastqc]\n",
"input: fastq, group_by = 1\n",
"output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc/fastqc_data.txt' \n",
"output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc.zip' \n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
"bash: expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container\n",
" fastqc ${_input} -o ${_output[0]:d}\n",
" unzip -o ${_output[0]:n}.zip -d ${cwd}"
" fastqc ${_input} -o ${_output[0]:d}"
]
},
{
@@ -676,6 +670,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "registered-shooting",
"metadata": {
@@ -688,7 +683,7 @@
"\n",
"Documentation: [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)\n",
"\n",
"We have replaced this with `fastp`, see above, which performs better than `Trimmomatic` in terms of removing adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to `fastp`.\n",
"We have replaced this with `fastp` (see above), which performs better than `Trimmomatic` in that it removes adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically infer the adapter sequences from the data, so by default no adapter sequence needs to be supplied to `fastp`. **PE data and SE data use different pipelines** (see below).\n",
"\n",
"### Step Inputs\n",
"\n",
@@ -707,7 +702,9 @@
">For single-ended data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files, 2 for the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired' output where a read survived, but the partner read did not.\n",
"\n",
"\n",
"**You need to figure out from fastqc results what adapter reference sequence to use.**, eg `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from `Trimmomatic` github repo."
"**You need to figure out from the fastqc results which adapter reference sequence to use**, e.g. `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from the `Trimmomatic` GitHub repo (see the download sketch after the PE step below).\n",
"\n",
"* For paired-end (PE) data:"
]
},
{
Expand All @@ -720,7 +717,7 @@
},
"outputs": [],
"source": [
"[trimmomatic_trim_adaptor]\n",
"[trimmomatic_trim_adaptor_PE]\n",
"# Path to the software. Default set to using our rna_quantification.sif image\n",
"parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n",
"# Illumina clip setting\n",
@@ -754,6 +751,52 @@
" LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}"
]
},
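As a convenience for the `--fasta_with_adapters_etc` parameter above, here is a hedged sketch for fetching the adapter FASTA files that ship with `Trimmomatic`. The repository path and branch are assumptions; if they differ, copy the files from a local `Trimmomatic` installation (the `adapters/` folder next to the jar) instead.

```bash
# Sketch: download Illumina adapter references from the Trimmomatic GitHub repo.
# Adjust the branch/path if the repository layout differs from this assumption.
wget -O reference_data/TruSeq3-PE.fa \
    https://raw.githubusercontent.com/usadellab/Trimmomatic/main/adapters/TruSeq3-PE.fa
wget -O reference_data/TruSeq3-SE.fa \
    https://raw.githubusercontent.com/usadellab/Trimmomatic/main/adapters/TruSeq3-SE.fa
```

The chosen file is then supplied as `--fasta_with_adapters_etc reference_data/TruSeq3-PE.fa` (PE) or `--fasta_with_adapters_etc reference_data/TruSeq3-SE.fa` (SE).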
{
"attachments": {},
"cell_type": "markdown",
"id": "21b1c985",
"metadata": {},
"source": [
"* For single-ended (SE) data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "turkish-alfred",
"metadata": {
"kernel": "SoS",
"tags": []
},
"outputs": [],
"source": [
"[trimmomatic_trim_adaptor_SE]\n",
"# Path to the software. Default set to using our rna_quantification.sif image\n",
"parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n",
"# Illumina clip setting\n",
"# Path to the reference adaptors\n",
"parameter: fasta_with_adapters_etc = path(\".\")\n",
"parameter: seed_mismatches = 2\n",
"parameter: palindrome_clip_threshold = 30\n",
"parameter: simple_clip_threshold = 10\n",
"# sliding window setting\n",
"parameter: window_size = 4\n",
"parameter: required_quality = 20\n",
"# Other settings\n",
"parameter: leading = 3\n",
"parameter: trailing = 3\n",
"parameter: min_len = 50\n",
"input: fastq, group_by = is_paired_end + 1, group_with = \"sample_id\"\n",
"output: f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz'\n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
"bash: container=container, expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
" java -jar -Xmx${java_mem} ${trimmomatic_jar} SE -threads ${numThreads} \\\n",
" ${_input} \\\n",
" ${_output} \\\n",
" ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \\\n",
" LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}"
]
},
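A hypothetical direct invocation of the new SE step is sketched below. The `sos run` entry point, the `--sample-list`/`--data-dir` parameter names, and the idea that this step can be run stand-alone are assumptions based on the other commands in this notebook; verify them against the pipeline's global parameters before use.

```bash
# Hypothetical sketch: run only the SE adaptor-trimming step.
# Parameter names other than --fasta_with_adapters_etc and --container are assumed.
sos run code/molecular_phenotypes/calling/RNA_calling.ipynb trimmomatic_trim_adaptor_SE \
    --sample-list data/sample_fastq.list \
    --data-dir data \
    --fasta_with_adapters_etc reference_data/TruSeq3-SE.fa \
    --container containers/rna_quantification.sif
```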
{
"cell_type": "markdown",
"id": "unnecessary-necessity",
@@ -876,7 +919,6 @@
" --sjdbGTFfile ${gtf} \\\n",
"\n",
" rm -r ${_output[0]:nnnn}._STARgenome\n",
" rm -r ${_output[0]:nnnn}._STARtmp\n",
" touch ${_output[0]:n}.sort_start.timestamp\n",
" samtools sort --threads ${numThreads} -o ${_output[0]:nnnn}.Aligned.sortedByCoord.out.bam ${_output[0]:nnnn}.Aligned.out.bam # According to GTEX, this can help reducing the amount of memory consumption.\n",
" rm ${_output[0]:nnnn}.Aligned.out.bam \n",
@@ -1504,7 +1546,7 @@
" \n",
" metrics <- list()\n",
" for (i in 1:length(files)) {\n",
" m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=2)\n",
" m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=1)\n",
" metrics[[i]] <- data.frame(Sample=samples[i],\n",
" File=files[i],\n",
" PF_READS=sum(m$PF_READS[1:2]),\n",