This repository contains several individual scripts and an analysis pipeline for whole-transcriptome upstream analysis.
Paired-end 150 bp FASTQ files were used in this study.
Use fastp to trim reads and filter out low-quality reads:
fastp -w 16 -i raw_1.fastq.gz -o clean.R1.fq.gz -I raw_2.fastq.gz -O clean.R2.fq.gz
#remove unpaired reads with seqkit
seqkit pair -1 clean.R1.fq.gz -2 clean.R2.fq.gz
#concatenate reads
cat *.clean.R1.paired.fq.gz > all_R1.fq.gz
cat *.clean.R2.paired.fq.gz > all_R2.fq.gz
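Optionally, a quick sanity check of the merged files with seqkit (not a required step):
seqkit stats all_R1.fq.gz all_R2.fq.gz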
#build the fused silkworm + AcMNPV reference genome and annotation
cat ncbi_silkworm_genome.fa AcMNPV.fa > ncbi_silkworm_Ac_genome.fa
cat ncbi_genome_bm.gtf AcMNPV.gtf > ncbi_silkworm_Ac_genome.gtf
Build the HISAT2 index:
hisat2_extract_splice_sites.py ncbi_silkworm_Ac_genome.gtf > splice.txt
hisat2_extract_exons.py ncbi_silkworm_Ac_genome.gtf > exons.txt
hisat2-build -p 40 ncbi_silkworm_Ac_genome.fa --ss splice.txt --exon exons.txt silkworm_Ac_ht_index
Then map the reads to the fused reference genome:
hisat2 --dta -p 40 -1 all_R1.fq.gz -2 all_R2.fq.gz --rf -x silkworm_Ac_ht_index | samtools view -q 10 -b - > all.bam
samtools sort -@ 40 -o all_sorted.bam all.bam
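Optionally, check the mapping statistics of the sorted BAM (a quick sanity check, not part of the pipeline proper):
samtools flagstat all_sorted.bam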
Use StringTie to identify novel transcripts.
This step is optional and only needed if your GTF is incompatible with StringTie (StringTie cannot accept records with transcript_id == ''):
awk -F '\t' '$3 != "gene" ' ncbi_silkworm_Ac_genome.gtf > ncbi_stringtie_fix.gtf
stringtie all_sorted.bam -m 200 -p 40 -G ncbi_stringtie_fix.gtf -T 1 -o new.gtf
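Optionally, count how many transcripts were assembled, using plain shell tools:
awk -F '\t' '$3 == "transcript"' new.gtf | wc -l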
Run the three FEELnc steps to identify novel long non-coding RNAs. It is recommended to install FEELnc with conda.
FEELnc_filter.pl -i new.gtf -a ncbi_stringtie_fix.gtf -b transcript_biotype=protein_coding > candidate_lncRNA.gtf
FEELnc_codpot.pl -i candidate_lncRNA.gtf -a ncbi_stringtie_fix.gtf -l Bombyx_mori.lncRNA.fa -g ncbi_silkworm_Ac_genome.fa
FEELnc_classifier.pl -i feelnc_codpot_out/candidate_lncRNA.gtf.lncRNA.gtf -a ncbi_stringtie_fix.gtf > lncRNA_classes.txt
stringtie --merge -G ncbi_stringtie_fix.gtf ./feelnc_codpot_out/candidate_lncRNA.gtf.lncRNA.gtf -o ncbi_silkworm_Ac_lnc.gtf
Run the accompanying Snakemake workflow:
snakemake -s run_RNA-seq.smk -c 40
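To preview which steps will be executed without running them, Snakemake's dry-run flag can be used first:
snakemake -s run_RNA-seq.smk -n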
Single-end 75 bp FASTQ files were used for the small RNA analysis in this study.
For example:
fastp -w 10 -i sample.fq.gz -o sample_clean.fq.gz
In this study, the sequencing adapter in the FASTQ files could not be auto-detected by fastp, so the reads need to be trimmed manually. Here the adapter is TGGAATTCTCGGGTGCCAAGGAACTC:
python process_fq.py sample_clean.fq.gz sample_trim.fa
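To confirm the adapter really is present before trimming, a quick spot check on the first ~1000 reads can help (optional; plain shell, separate from process_fq.py):
zcat sample_clean.fq.gz | head -n 4000 | grep -c TGGAATTCTCGGGTGCCAAGGAACTC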
Here we use bowtie to map the short reads to the fused genome generated above:
bowtie-build ncbi_silkworm_Ac_genome.fa bmor_bt1
mapper.pl sample_trim.fa -o 40 -q -c -j -m -l 18 -p bmor_bt1 -s sample.collapse.fa -t sample.genome.arf -v > sample.log
Select mapped reads:
grep -f <(awk -F '\t' '{ print $1 }' sample.genome.arf) -A 1 sample.collapse.fa | grep -v '-' > sample_mapped.fa
We used cmscan to annotate the RNA types of the mapped reads, after downloading and building the Rfam database:
for i in *.fa
do
#run cmscan against Rfam for each sample in the background
cmscan -E 0.01 --cpu 8 --tblout ../rfam/$i.tab ../rfam_database/Rfam.cm $i &
done
#wait for all background cmscan jobs to finish before summarizing
wait
python statitcs.py >rfam_stat.txt
perl exclude_rfam.pl
#this script requires the mapping results from step 2 (sample.collapse.fa and sample.genome.arf from mapper.pl) as input and generates sample.collapse.-rfam.fa and sample.genome.-rfam.arf; put sample.collapse.fa, sample.genome.arf, and the script in the same folder.
miRDeep2 only accepts FASTA files whose headers contain no spaces; if your FASTA headers contain spaces, use fix_fa_header.py to fix them.
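For reference, a minimal sed equivalent of that fix, assuming it simply truncates each header at the first space (input.fa is a placeholder name):
sed '/^>/ s/ .*//' input.fa > input.nospaces.fa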
miRDeep2.pl sample.collapse.-rfam.fa ncbi_silkworm_Ac_genome.fasta sample.genome.-rfam.arf bmo.fa mature.fa hairpin.fa 2> sample.log
The bmo.fa, mature.fa, and hairpin.fa files were downloaded from miRBase and are also included in data.tar.gz.
See generate_count_mat.R for generating the count matrix.
Extract UTR sequences from the transcripts for target prediction:
python extract_utr.py all_transcripts.fa all_utr.fa
We use miRanda to find microRNA targets in the extracted UTRs:
miranda bmo.fa all_utr.fa -en -15 -strict -out bmo.mir-tar.miranda.res -quiet
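The plain-text report can then be reduced to its per-pair summary lines, assuming the default miRanda output format in which those lines start with '>>' (check your miRanda version):
grep '^>>' bmo.mir-tar.miranda.res > bmo.mir-tar.hits.txt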