This repository contains several individual scripts and an analysis pipeline for whole-transcriptome upstream analysis.
Paired-end 150 bp FASTQ files were used in this study.
Use fastp to trim reads and filter out low-quality reads:
fastp -w 16 -i raw_1.fastq.gz -o clean.R1.fq.gz -I raw_2.fastq.gz -O clean.R2.fq.gz
#remove unpaired reads with seqkit
seqkit pair -1 clean.R1.fq.gz -2 clean.R2.fq.gz
#concatenate reads
cat *.clean.R1.paired.fq.gz > all_R1.fq.gz
cat *.clean.R2.paired.fq.gz > all_R2.fq.gz
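Optionally, a quick sanity check of the merged files with seqkit (not a required step):
seqkit stats all_R1.fq.gz all_R2.fq.gz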
#build the fused silkworm + AcMNPV reference genome and annotation
cat ncbi_silkworm_genome.fa AcMNPV.fa > ncbi_silkworm_Ac_genome.fa
cat ncbi_genome_bm.gtf AcMNPV.gtf > ncbi_silkworm_Ac_genome.gtf
Build the HISAT2 index:
hisat2_extract_splice_sites.py ncbi_silkworm_Ac_genome.gtf > splice.txt
hisat2_extract_exons.py ncbi_silkworm_Ac_genome.gtf > exons.txt
hisat2-build -p 40 ncbi_silkworm_Ac_genome.fa --ss splice.txt --exon exons.txt silkworm_Ac_ht_index
Then map the reads to the fused reference genome:
hisat2 --dta -p 40 -1 all_R1.fq.gz -2 all_R2.fq.gz --rf -x silkworm_Ac_ht_index | samtools view -q 10 -b - > all.bam
samtools sort -@ 40 -o all_sorted.bam all.bam
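Optionally, check the mapping statistics of the sorted BAM (a quick sanity check, not part of the pipeline proper):
samtools flagstat all_sorted.bam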
Use StringTie to identify novel transcripts.
This step is optional and only needed if your GTF is incompatible with StringTie (StringTie cannot accept records with transcript_id == ''):
awk -F '\t' '$3 != "gene" ' ncbi_silkworm_Ac_genome.gtf > ncbi_stringtie_fix.gtf
stringtie all_sorted.bam -m 200 -p 40 -G ncbi_stringtie_fix.gtf -T 1 -o new.gtf
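Optionally, count how many transcripts were assembled, using plain shell tools:
awk -F '\t' '$3 == "transcript"' new.gtf | wc -l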
Run the three FEELnc steps to identify novel long non-coding RNAs. It is recommended to install FEELnc with conda.
FEELnc_filter.pl -i new.gtf -a ncbi_stringtie_fix.gtf -b transcript_biotype=protein_coding > candidate_lncRNA.gtf
FEELnc_codpot.pl -i candidate_lncRNA.gtf -a ncbi_stringtie_fix.gtf -l Bombyx_mori.lncRNA.fa -g ncbi_silkworm_Ac_genome.fa
FEELnc_classifier.pl -i feelnc_codpot_out/candidate_lncRNA.gtf.lncRNA.gtf -a ncbi_stringtie_fix.gtf > lncRNA_classes.txt
stringtie --merge -G ncbi_stringtie_fix.gtf ./feelnc_codpot_out/candidate_lncRNA.gtf.lncRNA.gtf -o ncbi_silkworm_Ac_lnc.gtf
Run the accompanying Snakemake workflow:
snakemake -s run_RNA-seq.smk -c 40
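To preview which steps will be executed without running them, Snakemake's dry-run flag can be used first:
snakemake -s run_RNA-seq.smk -n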
Single-end 75 bp FASTQ files were used for the small RNA analysis in this study.
For example:
fastp -w 10 -i sample.fq.gz -o sample_clean.fq.gz
In this study, the sequencing adapter in the FASTQ files could not be auto-detected by fastp, so the reads need to be trimmed manually. Here the adapter is TGGAATTCTCGGGTGCCAAGGAACTC:
python process_fq.py sample_clean.fq.gz sample_trim.fa
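To confirm the adapter really is present before trimming, a quick spot check on the first ~1000 reads can help (optional; plain shell, separate from process_fq.py):
zcat sample_clean.fq.gz | head -n 4000 | grep -c TGGAATTCTCGGGTGCCAAGGAACTC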
Here we use bowtie to map the short reads to the fused genome generated above:
bowtie-build ncbi_silkworm_Ac_genome.fa bmor_bt1
mapper.pl sample_trim.fa -o 40 -q -c -j -m -l 18 -p bmor_bt1 -s sample.collapse.fa -t sample.genome.arf -v > sample.log
Select mapped reads:
grep -f <(awk -F '\t' '{ print $1 }' sample.genome.arf) -A 1 sample.collapse.fa | grep -v '-' > sample_mapped.fa
We used cmscan to annotate the RNA types of the mapped reads, after downloading and building the Rfam database:
for i in *.fa
do
#run cmscan against Rfam for each sample in the background
cmscan -E 0.01 --cpu 8 --tblout ../rfam/$i.tab ../rfam_database/Rfam.cm $i &
done
#wait for all background cmscan jobs to finish before summarizing
wait
python statitcs.py >rfam_stat.txt
perl exclude_rfam.pl
#this script requires the mapping results from step 2 (sample.collapse.fa and sample.genome.arf from mapper.pl) as input and generates sample.collapse.-rfam.fa and sample.genome.-rfam.arf; put sample.collapse.fa, sample.genome.arf, and the script in the same folder.
miRDeep2 only accepts FASTA files whose headers contain no spaces; if your FASTA headers contain spaces, use fix_fa_header.py to fix them.
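For reference, a minimal sed equivalent of that fix, assuming it simply truncates each header at the first space (input.fa is a placeholder name):
sed '/^>/ s/ .*//' input.fa > input.nospaces.fa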
miRDeep2.pl sample.collapse.-rfam.fa ncbi_silkworm_Ac_genome.fasta sample.genome.-rfam.arf bmo.fa mature.fa hairpin.fa 2> sample.log
The bmo.fa, mature.fa, and hairpin.fa files were downloaded from miRBase and are also included in data.tar.gz.
See generate_count_mat.R for generating the count matrix.
Extract UTR sequences from the transcripts for target prediction:
python extract_utr.py all_transcripts.fa all_utr.fa
We use miRanda to find microRNA targets in the extracted UTRs:
miranda bmo.fa all_utr.fa -en -15 -strict -out bmo.mir-tar.miranda.res -quiet
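The plain-text report can then be reduced to its per-pair summary lines, assuming the default miRanda output format in which those lines start with '>>' (check your miRanda version):
grep '^>>' bmo.mir-tar.miranda.res > bmo.mir-tar.hits.txt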