diff --git a/code/molecular_phenotypes/calling/RNA_calling.ipynb b/code/molecular_phenotypes/calling/RNA_calling.ipynb index bd0b17fad..02d222b95 100644 --- a/code/molecular_phenotypes/calling/RNA_calling.ipynb +++ b/code/molecular_phenotypes/calling/RNA_calling.ipynb @@ -100,6 +100,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "rough-robinson", "metadata": { @@ -116,6 +117,7 @@ "sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150\n", "sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75\n", "```\n", + "**Warning**: For single-ended (SE) data, your input data should exclude the `fq2` column. For other 2 columns, the `strand` column is optional (the `strand` information can be detected in `STAR_output` pipeline) and the `read_length` colunm (can be validated/infered from `fastqc` report) is **required**. If no prior information is avaiable for `read_length`, leave the column empty\n", "\n", "All the fastq files should be available under specified folder (default assumes the same folder as where the meta-data file is)." ] @@ -195,11 +197,12 @@ " --data-dir data \\\n", " --STAR-index reference_data/STAR_Index/ \\\n", " --container containers/rna_quantification.sif \\\n", - " --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n", + " --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n", " --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat " ] }, { + "attachments": {}, "cell_type": "markdown", "id": "f04410f9-6721-4bd5-8aea-ef9150103347", "metadata": { @@ -208,7 +211,9 @@ "source": [ "To align the reads with STAR and generate the bam_list recipe for downstream molecular phenotype count matrixes. The `-J 3 -c csg.yml -q csg` part is crucial for it ask for the required memory to conduct the STAR alignment.\n", "\n", - "***Warning***: It is of crucial importance that, the STAR Alignment shall be ran with the gtf file that were generated `before collapsing to gene`, i.e. the one that are used to generated RSEM index. Other wise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390)" + "***Warning1***: It is of crucial importance that, the STAR Alignment shall be ran with the gtf file that were generated `before collapsing to gene`, i.e. the one that are used to generated RSEM index. Other wise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390)\n", + "\n", + "***Warning2***: If the `read_length` colunm is empty in your original meta-data file `sample_fastq.list`, please add following arg: `--sjdbOverhang 100` (default value of the `STAR` alignment) or `--sjdbOverhang {read_length}` (if you can deduce the correct read length from the QC report) in `STAR_output` pipeline. The default value 100 will be sufficient to give correct results. " ] }, { @@ -259,7 +264,7 @@ " --STAR-index reference_data/STAR_Index/ \\\n", " --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \\\n", " --container containers/rna_quantification.sif \\\n", - " --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n", + " --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n", " --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n", " --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg" ] @@ -291,21 +296,11 @@ " --RSEM-index reference_data/RSEM_Index/ \\\n", " --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf \\\n", " --container containers/rna_quantification.sif \\\n", - " --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n", + " --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n", " --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n", " --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5689265-c9d3-40a6-adf9-21349a342e68", - "metadata": { - "kernel": "Bash" - }, - "outputs": [], - "source": [] - }, { "cell_type": "markdown", "id": "temporal-worse", @@ -578,11 +573,10 @@ "source": [ "[fastqc]\n", "input: fastq, group_by = 1\n", - "output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc/fastqc_data.txt' \n", + "output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc.zip' \n", "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", "bash: expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container\n", - " fastqc ${_input} -o ${_output[0]:d}\n", - " unzip -o ${_output[0]:n}.zip -d ${cwd}" + " fastqc ${_input} -o ${_output[0]:d}" ] }, { @@ -676,6 +670,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "registered-shooting", "metadata": { @@ -688,7 +683,7 @@ "\n", "Documentation: [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)\n", "\n", - "We have replaced this with `fastp`, see above, which performs better than `Trimmomatic` in terms of removing adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to `fastp`.\n", + "We have replaced this with `fastp`, see above, which performs better than `Trimmomatic` in terms of removing adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to `fastp`. **PE data and SE data have different pipline** (see below)\n", "\n", "### Step Inputs\n", "\n", @@ -707,7 +702,9 @@ ">For single-ended data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files, 2 for the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired' output where a read survived, but the partner read did not.\n", "\n", "\n", - "**You need to figure out from fastqc results what adapter reference sequence to use.**, eg `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from `Trimmomatic` github repo." + "**You need to figure out from fastqc results what adapter reference sequence to use.**, eg `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from `Trimmomatic` github repo.\n", + "\n", + "* For pair-ended (PE) data:" ] }, { @@ -720,7 +717,7 @@ }, "outputs": [], "source": [ - "[trimmomatic_trim_adaptor]\n", + "[trimmomatic_trim_adaptor_PE]\n", "# Path to the software. Default set to using our rna_quantification.sif image\n", "parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n", "# Illumina clip setting\n", @@ -754,6 +751,52 @@ " LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}" ] }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "21b1c985", + "metadata": {}, + "source": [ + "* For single-ended (SE) data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "turkish-alfred", + "metadata": { + "kernel": "SoS", + "tags": [] + }, + "outputs": [], + "source": [ + "[trimmomatic_trim_adaptor_SE]\n", + "# Path to the software. Default set to using our rna_quantification.sif image\n", + "parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n", + "# Illumina clip setting\n", + "# Path to the reference adaptors\n", + "parameter: fasta_with_adapters_etc = path(\".\")\n", + "parameter: seed_mismatches = 2\n", + "parameter: palindrome_clip_threshold = 30\n", + "parameter: simple_clip_threshold = 10\n", + "# sliding window setting\n", + "parameter: window_size = 4\n", + "parameter: required_quality = 20\n", + "# Other settings\n", + "parameter: leading = 3\n", + "parameter: trailing = 3\n", + "parameter: min_len = 50\n", + "input: fastq, group_by = is_paired_end + 1, group_with = \"sample_id\"\n", + "output: f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz'\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n", + "bash: container=container, expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n", + " java -jar -Xmx${java_mem} ${trimmomatic_jar} SE -threads ${numThreads} \\\n", + " ${_input} \\\n", + " ${_output} \\\n", + " ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \\\n", + " LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}" + ] + }, { "cell_type": "markdown", "id": "unnecessary-necessity", @@ -876,7 +919,6 @@ " --sjdbGTFfile ${gtf} \\\n", "\n", " rm -r ${_output[0]:nnnn}._STARgenome\n", - " rm -r ${_output[0]:nnnn}._STARtmp\n", " touch ${_output[0]:n}.sort_start.timestamp\n", " samtools sort --threads ${numThreads} -o ${_output[0]:nnnn}.Aligned.sortedByCoord.out.bam ${_output[0]:nnnn}.Aligned.out.bam # According to GTEX, this can help reducing the amount of memory consumption.\n", " rm ${_output[0]:nnnn}.Aligned.out.bam \n", @@ -1504,7 +1546,7 @@ " \n", " metrics <- list()\n", " for (i in 1:length(files)) {\n", - " m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=2)\n", + " m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=1)\n", " metrics[[i]] <- data.frame(Sample=samples[i],\n", " File=files[i],\n", " PF_READS=sum(m$PF_READS[1:2]),\n",