Skip to content

Differential Expression Analysis

fayezor edited this page May 19, 2015 · 6 revisions

In this section, we will perform differential expression analysis of RNA-seq expression profiles. As input, we will be using the count matrix generated from summarizeOverlaps() in the previous section.

Methods for Testing Differential Expression

There are many methods to choose from for testing genewise differential expression between groups. The most commonly used include:

  • edgeR R package
  • DESeq2 R package
  • Cuffdiff in the Tuxedo pipeline

In addition, there are numerous other R packages such as limmaVoom, baySeq, PoissonSeq - all with different strengths and drawbacks. A comprehensive evaluation of these different methods may be found in (this reference).

For this workshop we will be using edgeR, as it performs well, is highly popular, and easy to implement.

edgeR Statistical Methods

The edgeR package implements exact statistical methods for multigroup experiments developed by Robinson and Smyth 2007 and 2008. edgeR can be applied to differential expression at the gene, exon, transcript or tag level, i.e. the read counts it takes as input may be summarized by any genomic feature.

http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Normalization for RNA composition

One important technical artifact that could affect tests for differential expression is RNA composition. There are cases when a small number of genes are very highly expressed in one sample, but not in another. The highly expressed genes can consume a substantial proportion of the total library size, causing the remaining genes to be under-sampled in that sample. Unless this RNA composition effect is adjusted for, the remaining genes may falsely appear to be down-regulated in that sample

The calcNormFactors() function normalizes for RNA composition by finding a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes. We multiply the scaling factors by the original library size, to get an effective library size. This effective library size replaces the original library size in all downstream analyses.

Negative Binomial models

Since the data are presented as counts (of raw reads mapping to a genomic feature), they may be naively modeled as following a Poisson distribution. However, the Poisson distribution imposes a constraint that the variance of the data should be equal to the mean. This is not the case in many contexts, particularly for RNA-Seq data, in which the variance is often much higher than the mean.

In these cases, it is natural to introduce an overdispersion parameter (often just called the dispersion parameter), a scaling factor on the usual Poisson variance. The Negative Binomial distribution is used as a much better alternative to the Poisson distribution when the data are counts and overdispersion is expected.

Dispersion parameter estimation

The accurate, robust estimation of the dispersion parameter of the Negative Binomial model is a challenging statistical problem, with different methods using different estimation strategies.

In RNA-seq differential expression analysis, a dispersion parameter for each gene must be estimated. edgeR first uses the estimateCommonDisp() function to estimate a common dispersion parameter over all the genes by using a quantile-adjusted conditional maximum likelihood (qCML) method on "pseudo-counts" which assume common library sizes (details may be found in the two references).

Once the common dispersion has been estimated, the genewise (or tagwise, as the edgeR package calls it) dispersions are estimated (via the estimateTagwiseDisp() function) by "shrinking" their estimates towards the common dispersion using an empirical Bayes method.

Testing for differentially expressed genes

Once the Negative Binomial models are fitted and dispersion estimates are obtained for each gene, we can proceed with testing procedures for determining differential expression using an exact test to directly calculate p-values. The exact test for the negative binomial distribution has strong parallels with Fisher’s exact test.


edgeR Workflow

Clean the working space, and load the library.

rm (list = ls ())
library(edgeR)

Read in the count matrix file.

data <- read.table("count_matrix.txt")
head(data)

Specify the groups.

grps <- c(rep("case", 3), rep("cont", 3))

Put the counts and other information into a DGEList object to be passed to the edgeR functions.

data <- DGEList(counts=data, group=grps)
dim(data)

Compute effective library sizes that adjust for RNA composition.

data <-  calcNormFactors(data)
data$samples

An MDS plots shows "distances" between the samples, in terms of biological coefficient of variation (BCV).

plotMDS(data)

The common dispersion estimates the overall dispersion of the dataset, averaged over all genes.

data <- estimateCommonDisp(data, verbose=TRUE)

Now estimate gene-specific dispersions.

data <- estimateTagwiseDisp(data)

Compute exact genewise tests for differential expression between groups.

res <- exactTest (data)

Report the most significant genes.

topTags (res)

Report all the significant genes.

padj <- p.adjust (res$table$PValue, "BH")
touse <- padj < 0.05
table (touse)
res[touse,]$table