Expression Quantification

In this section we will use the aligned reads in the sorted bam format from the alignment step and count the number of reads that are mapped to annotated genomic features.

Read Counting

There are three main tools that allow the quantification of reads that overlap with genomic features of interest:

The summarizeOverlaps() function in the GenomicRanges Bioconductor package.
Cuffquant/cuffnorm within the Tuxedo pipeline
Simple bash scripts to use htseq-count in the HTSeq Python package.

Due to time, we will focus on option 1, the summarizeOverlaps() function in the GenomicsRanges R package.

Counting Reads with GenomicsRanges R Package

Counting modes

Three "modes" of counting are provided to resolve reads that overlap multiple features:

"Union"
"IntersectionStrict"
"IntersectionNotEmpty"

The following figure illustrates the effect of these three modes.

http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicAlignments/inst/doc/summarizeOverlaps.pdf Reference

Reads mapped to a genome can be counted with summarizeOverlaps() function in the GenomicRanges R package. Reads are counted only once; there is no double-counting.

Workflow

Load the GenomicFeatures package, which contains genomic locations of desired features (transcripts, exons, genes, etc.). Extract transcript database information from GTF file.

library(GenomicFeatures)
txdb <- makeTxDbFromGFF(file="file.gtf", format="gtf")

Load GenomicRanges package, which contains the summarizeOverlaps() function to be used for read-counting.

library(GenomicRanges)

Define the features, which can be exons, transcripts, genes or any region of interest. In this case, our features of interest are genes.

eByg <- exonsBy(txdb, by="gene")

Create list of .bam files to work with.

bamdir <- "bam_sorted_indexed"
fls <- BamFileList(dir(bamdir, ".bam$", full = TRUE)) #GenomicAlignment
names(fls) <- basename(names(fls))

Count reads mapping to genes.

genehits <- summarizeOverlaps(features=eByg, reads=aln, mode="Union")

Extract the count matrix and label the columns.

counts <- assays(genehits)$counts
counts <- as.data.frame(counts)
names(counts) <- c("g031_case", "g032_case",  "g033_case", "g034_cont", "g035_c\
ont", "g036_cont")

Remarks on Count Normalization

The counts quantified as above represent the raw number of reads per gene or transcript, and may be directly used as input for further studies of differential analysis between groups. However, these raw counts should not be used to report an absolute expression value; this is due to a number of factors that affect the magnitude of raw counts, the most prominent of which are

the total number of reads produced (library size)
the length of genomic features

For these reasons, raw counts must first be normalized by the above factors. This can be done using

RPKM: reads per kilobase per million mapped reads
FPKM: fragments per kilobase per million (for paired-end reads)

Note, however, this normalization does not account for other sources of bias, such as the following:

between-sample normalization issues
variations in transcript length distribution in the sample
coverage dependence on local sequence due to GC content
priming and other experimental biases
variability in mappability of different regions.

Important note: When the final goal is to perform differential analysis, raw counts should not be transformed into RPKM/FPKM values as this will decrease the power of statistical methods.