Update RNA_calling.ipynb #478

Closed
wants to merge 1 commit into from
86 changes: 64 additions & 22 deletions code/molecular_phenotypes/calling/RNA_calling.ipynb
@@ -100,6 +100,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "rough-robinson",
"metadata": {
@@ -116,6 +117,7 @@
"sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150\n",
"sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75\n",
"```\n",
"**Warning**: For single-ended (SE) data, your input data should exclude the `fq2` column. Of the remaining two columns, the `strand` column is optional (the `strand` information can be detected by the `STAR_output` pipeline), whereas the `read_length` column (which can be validated/inferred from the `fastqc` report) is **required**. If no prior information is available for `read_length`, leave the column empty.\n",
"\n",
"All the fastq files should be available under the specified folder (by default, the same folder as the meta-data file)."
]
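For illustration, a hypothetical `sample_fastq.list` for SE data could look like the sketch below. The column header is an assumption inferred from the warning above (confirm it against the full PE example earlier in this notebook); note the empty `read_length` field for `sample_3`, which is allowed when no prior information is available:

```
#ID fq1 strand read_length
sample_1 samp1.fq.gz fr 150
sample_2 samp2.fq.gz fr 100
sample_3 samp3.fq.gz strand_missing
```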
@@ -195,11 +197,12 @@
" --data-dir data \\\n",
" --STAR-index reference_data/STAR_Index/ \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f04410f9-6721-4bd5-8aea-ef9150103347",
"metadata": {
@@ -208,7 +211,9 @@
"source": [
"This step aligns the reads with STAR and generates the bam_list recipe for the downstream molecular phenotype count matrices. The `-J 3 -c csg.yml -q csg` part is crucial because it requests the memory required to run the STAR alignment.\n",
"\n",
"***Warning***: It is of crucial importance that, the STAR Alignment shall be ran with the gtf file that were generated `before collapsing to gene`, i.e. the one that are used to generated RSEM index. Other wise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390)"
"***Warning1***: It is crucially important that the STAR alignment be run with the gtf file generated `before collapsing to gene`, i.e. the one used to generate the RSEM index. Otherwise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390).\n",
"\n",
"***Warning2***: If the `read_length` column is empty in your original meta-data file `sample_fastq.list`, please add the following argument to the `STAR_output` pipeline: `--sjdbOverhang 100` (the default value for `STAR` alignment) or `--sjdbOverhang {read_length}` (if you can deduce the correct read length from the QC report). The default value of 100 will be sufficient to give correct results."
]
},
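To make ***Warning2*** concrete, below is a hedged sketch of appending the overhang setting to the `STAR_output` invocation shown in the next cell. The `sos run` entry point and the assumption that the pipeline forwards `--sjdbOverhang` to STAR are not confirmed by this diff; adjust it to your actual command line (or set the value inside the `STAR_output` step) as appropriate.

```bash
# Sketch only: same STAR_output arguments as in the next cell, with the overhang
# flag appended because read_length was left empty in sample_fastq.list.
# 100 matches STAR's default --sjdbOverhang value; the entry point and any
# sample/fastq arguments not shown in this hunk are assumptions.
sos run code/molecular_phenotypes/calling/RNA_calling.ipynb STAR_output \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \
    --container containers/rna_quantification.sif \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \
    --bam_list output/rnaseq/sample_fastq_bam_list \
    --sjdbOverhang 100 \
    -J 3 -c csg.yml -q csg
```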
{
@@ -259,7 +264,7 @@
" --STAR-index reference_data/STAR_Index/ \\\n",
" --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n",
" --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg"
]
@@ -291,21 +296,11 @@
" --RSEM-index reference_data/RSEM_Index/ \\\n",
" --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n",
" --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5689265-c9d3-40a6-adf9-21349a342e68",
"metadata": {
"kernel": "Bash"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "temporal-worse",
@@ -578,11 +573,10 @@
"source": [
"[fastqc]\n",
"input: fastq, group_by = 1\n",
"output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc/fastqc_data.txt' \n",
"output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc.zip' \n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
"bash: expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container\n",
" fastqc ${_input} -o ${_output[0]:d}\n",
" unzip -o ${_output[0]:n}.zip -d ${cwd}"
" fastqc ${_input} -o ${_output[0]:d}"
]
},
{
@@ -676,6 +670,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "registered-shooting",
"metadata": {
@@ -688,7 +683,7 @@
"\n",
"Documentation: [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)\n",
"\n",
"We have replaced this with `fastp`, see above, which performs better than `Trimmomatic` in terms of removing adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to `fastp`.\n",
"We have replaced this with `fastp` (see above), which performs better than `Trimmomatic` in that it removes adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically infer the adapter sequences from the data, so by default no adapter sequence needs to be supplied to `fastp`. **PE data and SE data use different pipelines** (see below).\n",
"\n",
"### Step Inputs\n",
"\n",
@@ -707,7 +702,9 @@
">For single-ended data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files, 2 for the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired' output where a read survived, but the partner read did not.\n",
"\n",
"\n",
"**You need to figure out from fastqc results what adapter reference sequence to use.**, eg `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from `Trimmomatic` github repo."
"**You need to figure out from the fastqc results which adapter reference sequence to use**, e.g. `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from the `Trimmomatic` GitHub repo (see the download sketch after the PE step below).\n",
"\n",
"* For paired-end (PE) data:"
]
},
{
Expand All @@ -720,7 +717,7 @@
},
"outputs": [],
"source": [
"[trimmomatic_trim_adaptor]\n",
"[trimmomatic_trim_adaptor_PE]\n",
"# Path to the software. Default set to using our rna_quantification.sif image\n",
"parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n",
"# Illumina clip setting\n",
@@ -754,6 +751,52 @@
" LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}"
]
},
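As a convenience for the `--fasta_with_adapters_etc` parameter above, here is a hedged sketch for fetching the adapter FASTA files that ship with `Trimmomatic`. The repository path and branch are assumptions; if they differ, copy the files from a local `Trimmomatic` installation (the `adapters/` folder next to the jar) instead.

```bash
# Sketch: download Illumina adapter references from the Trimmomatic GitHub repo.
# Adjust the branch/path if the repository layout differs from this assumption.
wget -O reference_data/TruSeq3-PE.fa \
    https://raw.githubusercontent.com/usadellab/Trimmomatic/main/adapters/TruSeq3-PE.fa
wget -O reference_data/TruSeq3-SE.fa \
    https://raw.githubusercontent.com/usadellab/Trimmomatic/main/adapters/TruSeq3-SE.fa
```

The chosen file is then supplied as `--fasta_with_adapters_etc reference_data/TruSeq3-PE.fa` (PE) or `--fasta_with_adapters_etc reference_data/TruSeq3-SE.fa` (SE).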
{
"attachments": {},
"cell_type": "markdown",
"id": "21b1c985",
"metadata": {},
"source": [
"* For single-ended (SE) data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "turkish-alfred",
"metadata": {
"kernel": "SoS",
"tags": []
},
"outputs": [],
"source": [
"[trimmomatic_trim_adaptor_SE]\n",
"# Path to the software. Default set to using our rna_quantification.sif image\n",
"parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n",
"# Illumina clip setting\n",
"# Path to the reference adaptors\n",
"parameter: fasta_with_adapters_etc = path(\".\")\n",
"parameter: seed_mismatches = 2\n",
"parameter: palindrome_clip_threshold = 30\n",
"parameter: simple_clip_threshold = 10\n",
"# sliding window setting\n",
"parameter: window_size = 4\n",
"parameter: required_quality = 20\n",
"# Other settings\n",
"parameter: leading = 3\n",
"parameter: trailing = 3\n",
"parameter: min_len = 50\n",
"input: fastq, group_by = is_paired_end + 1, group_with = \"sample_id\"\n",
"output: f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz'\n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
"bash: container=container, expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
" java -jar -Xmx${java_mem} ${trimmomatic_jar} SE -threads ${numThreads} \\\n",
" ${_input} \\\n",
" ${_output} \\\n",
" ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \\\n",
" LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}"
]
},
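A hypothetical direct invocation of the new SE step is sketched below. The `sos run` entry point, the `--sample-list`/`--data-dir` parameter names, and the idea that this step can be run stand-alone are assumptions based on the other commands in this notebook; verify them against the pipeline's global parameters before use.

```bash
# Hypothetical sketch: run only the SE adaptor-trimming step.
# Parameter names other than --fasta_with_adapters_etc and --container are assumed.
sos run code/molecular_phenotypes/calling/RNA_calling.ipynb trimmomatic_trim_adaptor_SE \
    --sample-list data/sample_fastq.list \
    --data-dir data \
    --fasta_with_adapters_etc reference_data/TruSeq3-SE.fa \
    --container containers/rna_quantification.sif
```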
{
"cell_type": "markdown",
"id": "unnecessary-necessity",
@@ -876,7 +919,6 @@
" --sjdbGTFfile ${gtf} \\\n",
"\n",
" rm -r ${_output[0]:nnnn}._STARgenome\n",
" rm -r ${_output[0]:nnnn}._STARtmp\n",
" touch ${_output[0]:n}.sort_start.timestamp\n",
" samtools sort --threads ${numThreads} -o ${_output[0]:nnnn}.Aligned.sortedByCoord.out.bam ${_output[0]:nnnn}.Aligned.out.bam # According to GTEX, this can help reducing the amount of memory consumption.\n",
" rm ${_output[0]:nnnn}.Aligned.out.bam \n",
@@ -1504,7 +1546,7 @@
" \n",
" metrics <- list()\n",
" for (i in 1:length(files)) {\n",
" m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=2)\n",
" m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=1)\n",
" metrics[[i]] <- data.frame(Sample=samples[i],\n",
" File=files[i],\n",
" PF_READS=sum(m$PF_READS[1:2]),\n",