Alignment of RNA Reads to Reference Genome

In this section, we will map the cleaned-up sequencing reads (from fastq files that have already undergone quality control) to appropriate regions on the reference genome. From this point forward, we will be using ...

Overview of available methods
Setup
Preparing an indexed reference genome
- Installing a pre-built index
- Building a new index from scratch
Read alignment
[Handling and manipulating SAM/BAM files](Handling and manipulating SAM/BAM files)
- Recovering the SAM file
- Sorting, indexing, and storing alignments
  - Sorting BAM files by read names
  - Sorting and indexing BAM files by chromosome position

Overview of Available Methods

There are various methods for aligning short reads to a reference genome, depending on the experimental setting, speed, and required sensitivity. The following table summarizes the most widely used tools and their pros/cons.

Method	Read Type	Pros	Cons
BWA	DNA	Sensitivity	Slower than bowtie, no big INDELS
Bowtie & Bowtie2	DNA	Speed	Less sensitive than BWA, no big INDELS
TopHat	RNA	Standard method	1. Uses Bowtie2 to align reads so not very sensitive, usually maps 75% of reads. 2. Not ready for long reads (>150bp), mapping decrease to below 50%; poor performance, can take several hours to map. 3. Big memory footprint and a lot of disk used. 4. Mapping is difficult with mismatches, INDELS and longer reads
HPG-Aligner	RNA/DNA	Better sensitivity, longer reads, various HPC implementations (multicore, SSE, GPUs), only one execution is needed to generate the BAM output file (saves disk)	Still under development (unstable code)

Among these tools, TopHat/Bowtie has been used extensively and been chosen for the rest of this workshop. It is worth mentioning that HPG-Aligner, which is part of the Open-source software for Computational Biology (OpenCB) movement shows promising results, both in terms of speed and alignment quality; however, it is still in the development phase and thus is not demonstrated here.

Setup

Before we run other experiments, we need to setup the environment. The following lines ensure that everything is in the right place:

# Change to dataset folder
cd dataset/workshop

# To store bowtie index
if [ ! -d index ]; then
   mkdir index
fi

# To store TopHat alignments
if [ ! -d tophat ]; then
   mkdir tophat
fi

# To store sam files
if [ ! -d sam ]; then
   mkdir sam
fi

# To store bam files sorted by name
if [ ! -d bam_name_sorted ]; then
   mkdir bam_name_sorted
fi

# To store bam files sorted by position and indexed
if [ ! -d bam_sorted_indexed ]; then
   mkdir bam_sorted_indexed
fi

# To store bam files sorted by position and indexed
if [ ! -d cufflinks ]; then
   mkdir cufflinks
fi

Preparing an Indexed Reference Genome

The first step in using any alignment method is to prepare an index for the reference genome under study.

While traditional sequence alignment methods such as the Smith-Waterman (SW) algorithm and BLAST have been traditionally used to align DNA/protein sequences to find functional orthologs, they are poor choices for aligning large number of short reads to a reference genome, both due to memory requirements as well as poor alignment speed.

As such, new data structures have been specifically developed for this task. Burrows-Wheeler Transform (BWT) algorithm is one such alternative which has much faster speed than BLAST, but has lower sensitivity. Suffix Arrays (SA) algorithm is another alternative which has higher performance than BWT, but has higher memory requirements.

When we index a reference genome, the aligner typically constructs one of these two data structures and stores it in a format that is proper for its internal use. Here, we focus on preparing an indexed genome to be used by TopHat/Bowtie. There are two main options in this case:

If we are studying a widely used species, we can directly download a pre-built index
If we are studying a new species, or if we are only interested in a subset of its genome, we can build an indexed genome from scratch.

Both of these methods are described below.

Installing a pre-built index

All packages for pre-built indexes can be installed directly from the Bowtie FTP website. These packages are compressed and available in *.zip format. You can choose whichever reference genome (Ensembl, NCBI, or UCSC) is of interest. For example, you can download the indexed human genome based on Ensembl from ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/GRCh38_no_alt.zip, while the index using USCS genome is available for download from ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip.

Alternatively, you can download the set of genome sequence indexes provided by Illumina for the RNA-Seq user community. A short list of these indices for Bowtie is summarized in iGenomes.

Building a new index from scratch

Building an indexed reference genome from scratch requires two steps:

Download the genome in FASTA format
Use the bowtie-build command to index the genome. More details as follows:

Downloading the reference genome

There are three main sources from which you can download reference genomes: Ensembl, NCBI, and UCSC. Here we focus on the genomes downloaded from Ensembl. All different releases for various species are accessible from the Ensembl FTP website. The latest FASTA version of the human genome (DNA) may be directly accessed from here.

The whole reference genome based on the GRCh38 may be downloaded here. However, the datasets we will analyze in this workshop are sampled from chromosome 21, and we can directly download that subset of the reference genome here.

For the purposes of the workshop, we have provided the ready-for-use FASTA file (f000_chr21_ref_genome_sequence.fa) in the dataset/workshop directory of this repository.

Indexing the reference genome (bowtie2)

To index a reference genome in FASTA format, we will use the bowtie2-build command. Make sure you have bowtie binary files in your search paths before running any of the commands here. Then, go to the folder containing the FASTA file (dataset/workshop) to index the genome as follows:

bowtie2-build f000_chr21_ref_genome_sequence.fa index/f000_chr21_ref_genome_sequence

Alternatively, if you want to directly index the genome from Ensembl, use the following commands in bash:

wget ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
gunzip Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
bowtie2-build Homo_sapiens.GRCh38.dna.chromosome.21.fa index/Homo_sapiens.GRCh37.75.dna.chromosome.21

The bowtie2-build command will read the FASTA file(s) given as the first argument (separated by comma if more than one), create some *.bt2 files and store them in the index folder that we just created, and prefix them according to the parameter passed to it after the output folder path.

NOTE: The index needs to be built only once. After that, the indexed reference genome will be used for all the different alignments with Bowtie2.

Read Alignment

All FastQ files for this dataset are stored in the dataset/workshop/reads sub-folder.

Inspecting FastQ files

Count the number of lines:

echo Count of lines: `wc -l reads/f011_case_read1.fastq`

Output:

Count of lines: 40000 reads/f011_case_read1.fastq

Show the first 3 reads and their quality scores:

head -n 12 reads/f011_case_read1.fastq

Output:

@ENST00000286800_14324_14637_0_1_0_0_1:0:0_1:0:0_0/1 AGGAAAAAAAAAAGGTCCATGCATGATTTTCATGCTATCTGCCACCGCCGATGAAACAAAACAGTCCTCTCGTTCAAAGGGATGGACACGTCTTACGTAG

2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 @ENST00000286800_45430_45069_1_0_0_0_1:0:0_5:0:0_1/1 CATATACTCTCAAGGGGAAGGCCAACTTGCTTCAGACTTGTTCATACACAGTGGGTGGTGGCTGCCCACATTAGCTAGACAAGCCCTGCACTAGGCTCAG + 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 @ENST00000286800_1282_901_1_0_0_0_3:0:0_0:1:0_2/1 CACTTGACAGAAACATAGGTTACACAACAATACAAGAATCACTACGCCAAACTCACCCTGCTCCAAGCTGACTCACCGGGAGCCTTTCTTCCCTAAGCTT + 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222

As you can see, each read contains some additional information attached to it. The overall structure of FastQ file is as follows:

Line 1 contains the sequence identifier, which is unique for each read
Line 2 is DNA sequence
Line 3 is usually unused and either has “+” or a repeat of the sequence ID
Line 4 is the Phred quality score in ASCII format, which is often specially formatted depending on the sequencing platform

Show only the first 3 reads (no meta-data):

head -n 12 reads/f011_case_read1.fastq | awk ' NR % 4 == 2 { print; }'

Output:

AGGAAAAAAAAAAGGTCCATGCATGATTTTCATGCTATCTGCCACCGCCGATGAAACAAAACAGTCCTCTCGTTCAAAGGGATGGACACGTCTTACGTAG CATATACTCTCAAGGGGAAGGCCAACTTGCTTCAGACTTGTTCATACACAGTGGGTGGTGGCTGCCCACATTAGCTAGACAAGCCCTGCACTAGGCTCAG CACTTGACAGAAACATAGGTTACACAACAATACAAGAATCACTACGCCAAACTCACCCTGCTCCAAGCTGACTCACCGGGAGCCTTTCTTCCCTAAGCTT

Compute the average length of the reads:

awk ' NR % 4 == 2 { thislen=length($0); totlen+=thislen } END { printf("average read length: %d bases \n", 4*totlen/NR); }' reads/f011_case_read1.fastq

Output:

awk ' NR % 4 == 2 { thislen=length($0); totlen+=thislen } END { printf("average read length: %d bases \n", 4*totlen/NR); }' reads/f011_case_read1.fastq

Aligning short reads using TopHat

You can align a single FastQ file using:

# Align case reads
tophat2 -r 300 -o tophat/f021_case_tophat_out   index/f000_chr21_ref_genome_sequence   reads/f011_case_read1.fastq reads/f011_case_read2.fastq

or align all of them in one run:

# Align case reads
for i in {1..3}; do tophat2 -r 300 -o tophat/f02$i\_case_tophat_out   index/f000_chr21_ref_genome_sequence   reads/f01$i\_case_read1.fastq reads/f01$i\_case_read2.fastq; done

# Align control reads
for i in {4..6}; do tophat2 -r 300 -o tophat/f02$i\_cont_tophat_out   index/f000_chr21_ref_genome_sequence   reads/f01$i\_cont_read1.fastq reads/f01$i\_cont_read2.fastq; done

REMARK: For paired end samples, the left and the right files have to be separated using a space. If separated by a coma, tophat understands the two files contain reads form the same sample sequenced in a single end protocol.

Among different provided options, -r option defines the mate-inner-dist and is used for pair-end reads. It is set based on the experimental setting provided. -o option provides the output path for the TopHat results. There are additional parameters that could be passed to the TopHat to optimize its performance:

-G: supply GFF with transcript model info (preferred!)

-g: ignore all alginments with >g matches

-p: number of threads to use for alignment step

-i/-I: min/max intron lengths

--segment-length: length of split reads (25 is default)

Check mapping logs and metrics

A first sanity check to ensure that reads are aligned properly is to look at the logs generated by TopHat after aligning reads. These logs are stored on a separate sub-folder under the main output directory passed earlier to the TopHap. Since we had paired-end reads, we want to look at the stats for both of them:

echo Forward stats:
cat tophat/f021_case_tophat_out/logs/bowtie.left_kept_reads.log
echo end Reverse stats:
cat tophat/f021_case_tophat_out/logs/bowtie.left_kept_reads.log

Which looks like this:

Forward stats:
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    627 (6.27%) aligned 0 times
    9261 (92.61%) aligned exactly 1 time
    112 (1.12%) aligned >1 times
93.73% overall alignment rate

Reverse stats:
10000 reads; of these:
  10000 (100.00%) were unpaired; of these:
    627 (6.27%) aligned 0 times
    9261 (92.61%) aligned exactly 1 time
    112 (1.12%) aligned >1 times
93.73% overall alignment rate

In addition, we can use samtools flagstat to have a summary of alignment statistics:

samtools flagstat tophat/f021_case_tophat_out/accepted_hits.bam

This results in:

13294 + 0 in total (QC-passed reads + QC-failed reads)
134 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
13294 + 0 mapped (100.00%:-nan%)
13160 + 0 paired in sequencing
6580 + 0 read1
6580 + 0 read2
5606 + 0 properly paired (42.60%:-nan%)
8524 + 0 with itself and mate mapped
4636 + 0 singletons (35.23%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Handling and manipulating SAM/BAM files

There are two main file formats to store alignments:

Sequence Alignment/Map (SAM): a simple human-readable text format that is produced by many aligners. This format is not ideal for using with other programs, since searching through it is not efficient. Moreover, these files typically are much larger.
Binary SAM (BAM): a binarized version of the SAM file which is smaller and much easier to automatically search through. It can be sorted based on different criteria (read name, genomic position, etc.) and indexed (to create a corresponding *.bai file) for ease of use.

For a detailed description of these formats, you can refer to the Sequence Alignment/Map Format Specification.

Recovering the SAM file

Tophat by default converts alignment output to the BAM format. To recover the original SAM file, we can use samtools view. To recover the SAM files for the first FastQ file, run:

   samtools view tophat/f021_case_tophat_out/accepted_hits.bam  > sam/f031_case.sam

Or to do the same for all BAM files, run:

echo extract SAM files back from BAM files
for i in {1..3}; do 
   samtools view tophat/f02$i\_case_tophat_out/accepted_hits.bam  > sam/f03$i\_case.sam
done

for i in {4..6}; do 
	samtools view tophat/f02$i\_cont_tophat_out/accepted_hits.bam  > sam/f03$i\_cont.sam
done

Sorting, indexing, and storing alignments

Different downstream tools might need different "types" of alignment file formats as input. For example, the htseq-count tool requires BAM files sorted by read names. On the other hand, the IGV visualization tool requires the BAM files sorted by chromosome position.

Sorting BAM files by read names

You can sort a single bam file as follows:

samtools sort -n tophat/f021_case_tophat_out/accepted_hits.bam  bam_name_sorted/g031_case_sorted_n

Or, you can sort all bam files generated by TopHat as follows:

# Sort 'case' bam file by name
for i in {1..3}; do samtools sort -n tophat/f02$i\_case_tophat_out/accepted_hits.bam  bam_name_sorted/g03$i\_case_sorted_n; done

# Sort 'control' bam file by name
for i in {4..6}; do samtools sort -n tophat/f02$i\_cont_tophat_out/accepted_hits.bam  bam_name_sorted/g03$i\_cont_sorted_n; done

Sorting and indexing BAM files by chromosome position

We can use samtools to sort the BAM files by chromosome position (standard/default sorting), and then index them:

samtools sort tophat/f021_case_tophat_out/accepted_hits.bam  bam_sorted_indexed/g031_case_sorted
samtools index bam_sorted_indexed/g031_case_sorted.bam

Or again you can do it all in one step using the following loops:

for i in {1..3}; do 
   samtools sort tophat/f02$i\_case_tophat_out/accepted_hits.bam  bam_sorted_indexed/g03$i\_case_sorted_n;
   samtools index bam_sorted_indexed/g03$i\_case_sorted_n.bam;
done

# Sort 'control' bam file by name
for i in {4..6}; do 
    samtools sort tophat/f02$i\_cont_tophat_out/accepted_hits.bam  bam_sorted_indexed/g03$i\_cont_sorted_n;
    samtools index bam_sorted_indexed/g03$i\_cont_sorted_n.bam;
done

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alignment of RNA Reads to Reference Genome

Overview of Available Methods

Setup

Preparing an Indexed Reference Genome

Installing a pre-built index

Building a new index from scratch

Downloading the reference genome

Indexing the reference genome (bowtie2)

Read Alignment

Inspecting FastQ files

Aligning short reads using TopHat

Check mapping logs and metrics

Handling and manipulating SAM/BAM files

Recovering the SAM file

Sorting, indexing, and storing alignments

Sorting BAM files by read names

Sorting and indexing BAM files by chromosome position

Clone this wiki locally