Loading count data

In this section of the tutorial we will be using R to analyse count data of the type generated in the previous tutorial. First we will change directories and start up an instance of R:

$ cd ~/scw/scw2015/differential_expression
$ R

We can now read in our table of counts using the following R commands:

counts <- read.delim("~/scw/scw2015/alignment/subset/combined.counts",skip=0,header=T,sep="\t",stringsAsFactors=F,row.names=1)

Examining the dimensionality of this matrix via the command dim(counts), we can see that there are 23,337 genes detected (rows in the matrix), and 4 cells.

In order to analyse a larger dataset, we will load the counts for the ES/MEF mouse dataset of Islam et al., which contains 48 embyronic stem (ES) cells, and 44 mouse embryonic fibroblast (MEF) cells. For efficiency, this table has been saved as a compressed .RData file.

load("ES_MEF_raw.RData")
# clean up column (cell) names
colnames(counts) <- gsub("L139_Sample(\\d+).counts","\\1",colnames(counts))
colnames(counts)[1:10]

##  [1] "MEF_54" "ESC_47" "MEF_61" "MEF_67" "ESC_19" "MEF_91" "MEF_62"
##  [8] "ESC_20" "ESC_13" "MEF_70"

The rows of the counts data frame represent genes, and the columns represent cells. However, the genes in our count file are named according to their Ensembl ID. In order to map these IDs to more informative gene names, we can use the R interface to BioMart.

We will not run this procedure during this session, so as to avoid overloading the servers with multiple simultaneous requests, but this is how it might be done:

library(biomaRt)
mart <- useMart(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')
bm.query <- getBM(values=rownames(counts),attributes=c("ensembl_gene_id", "external_gene_name"),filters=c("ensembl_gene_id"),mart=mart)
genes <- list(ids=rownames(counts),names=bm.query[match(rownames(counts), bm.query$ensembl_gene_id),]$external_gene_name)

Gene identifiers

Due to the reasons mentioned above, we will simply read in the map from Ensembl IDs to gene names from a saved table

load("gene_name_map.RData")

The row names of the count table can then be set to these gene names:

rownames(counts) <- make.unique(as.character(genes$names))

Cell identifiers

The identities of the different cells are known in this case, and we can obtain this information from the first three characters of the column labels:

cell.labels <- substr(colnames(counts),0,3)

The dataset contains some cells that were used as negative controls, so we will remove these before further analysis:

counts <- counts[-which(cell.labels=="EMP")]
cell.labels <- cell.labels[-which(cell.labels=="EMP")]

We will also reorder the cells so that the ES cells (columns) are all in the first half of the matrix, to help with visualisation later:

counts <- cbind(counts[,cell.labels=="ESC"],counts[,cell.labels=="MEF"])
cell.labels <- substr(colnames(counts),0,3)

Preliminary examination of the dataset

We can then examine some features of the dataset:

length(counts[,1]) # number of genes

## [1] 39017

length(cell.labels) # number of cells

## [1] 92

nES <- sum(cell.labels=="ESC") # number of ES cells
nMEF <- sum(cell.labels=="MEF") # number of MEF cells

A factor object can be defined to index the different cell types.

groups <- factor(cell.labels,levels=c("ESC","MEF"))
table(groups)

## groups
## ESC MEF 
##  48  44

Before proceeding, we will restrict our attention to those genes that had non-zero expression:

counts <- counts[rowSums(counts)>0,]
nGenes <- length(counts[,1])

We’ll also save the full cleaned up count matrix for use in subsequent analyses:

save(counts,file="counts.RData")

Let’s examine the effective library size (estimated by the average read count per gene) for the different cells.

coverage <- colSums(counts)/nGenes
ord <- order(groups)
bar.positions <- barplot(coverage[ord],col=groups[ord],xaxt='n',ylab="Counts per gene")
axis(side=1,at=c(bar.positions[nES/2],bar.positions[nES+nMEF/2]),labels=c("ESC","MEF"),tick=FALSE)

plot of chunk unnamed-chunk-14

We can see that on average the expression is higher for the MEF cells. We’ll filter out those cells with very low coverage

counts <- counts[,coverage>1]
cell.labels <- cell.labels[coverage>1]
groups <- factor(cell.labels,levels=c("ESC","MEF"))
coverage <- coverage[coverage>1]
nCells <- length(cell.labels)

We can use a violin plot to visualize the distributions of the normalized counts for the most highly expressed genes.

counts.norm <- t(apply(counts,1,function(x) x/coverage)) # simple normalization method
top.genes <- tail(order(rowSums(counts.norm)),10)
expression <- log2(counts.norm[top.genes,]+1) # add a pseudocount of 1 

library(caroline)

violins(as.data.frame(t(expression)),connect=c(),deciles=FALSE,xlab="",ylab="log2 expression") 

## Loading required package: sm
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:sm':
## 
##     muscle

plot of chunk unnamed-chunk-18

Some of these genes show a clear bimodal pattern of expression, indicating the presence of two subpopulations of cells.

Negative binomial model for counts

Perhaps the simplest statistical model for count data is the Poisson, which has only one parameter. Under a Poisson model, the variance of the expression for a particular gene is equal to its mean expression. However, due to a variety of types of noise (both biological and technical), a better fit for read count data is usually obtained by using a negative binomial model, for which the variance can be written as:

variance = mean + overdispersion x mean^2

Since the overdispersion is a positive number, the variance under the negative binomial model is always higher than for the Poisson.

On our dataset, the overdispersion is clearly greater than zero for almost all genes, suggesting that the negative binomial will indeed be a better fit:

means <- apply(counts.norm,1,mean)
excess.var <- apply(counts,1,var)-means
excess.var[excess.var < 0] <- NA
overdispersion <- excess.var / means^2
hist(log2(overdispersion),main="Variance of read counts is higher than Poisson")

plot of chunk unnamed-chunk-19

Running DESeq

library(DESeq)

Several of the steps in the following analyses takes around 5 minutes to run on the full 92-cell dataset, using 2 cores on Orchestra. In order to save time for the purposes of this tutorial session, we will work with a subset of the data, sampling 20 cells from each of the groups:

es.cell.subset <- which(groups=="ESC")[(1:20)*2]
mef.cell.subset <- which(groups=="MEF")[(1:20)*2]
cell.subset <- c(es.cell.subset,mef.cell.subset)
counts <- counts[,cell.subset]
cell.labels <- cell.labels[cell.subset]
groups <- factor(cell.labels,levels=c("ESC","MEF"))
nCells <- length(cell.labels)

First we will use our count and groups objects to create a DESeq CountDataSet object:

cds <- newCountDataSet(counts, groups)

DESeq estimates size factors to measure the library size for each cell.

cds <- estimateSizeFactors(cds)

Generally the estimated size factors are linearly related to the average read count per gene (coverage), but are estimated using a more robust method:

plot(sizeFactors(cds),colSums(counts)/nGenes)

plot of chunk unnamed-chunk-25

Estimating the overdispersion

In order to compute $p$-values for differential expression, DESeq requires an estimate of the overdispersion for each gene. Since we have $40$ cells in total, we may wish to rely on the empirical estimates of the dispersion:

cds <- estimateDispersions(cds, sharingMode="gene-est-only")

## Warning in .local(object, ...): in estimateDispersions: sharingMode=='gene-
## est-only' will cause inflated numbers of false positives unless you have
## many replicates.

In cases where the empirical estimates may not be so reliable (e.g. when there are very few cells), we can pool information from the other cells in each group in order to obtain a more robust estimator for the dispersion

cds.pooled <- estimateDispersions(cds, method="per-condition",fitType="local")
# check the fit
plotDispEsts(cds.pooled,name="ESC") 

plot of chunk unnamed-chunk-27

plotDispEsts(cds.pooled,name="MEF") 

plot of chunk unnamed-chunk-27

Differential expression analysis

We are now ready to carry out the differential expression analysis.

de.test <- nbinomTest(cds, "ESC", "MEF")

We will also repeat the analysis using the pooled estimates for the dispersion parameters, to see what kind of difference this makes to the results:

de.test.pooled <- nbinomTest(cds.pooled, "ESC", "MEF")

We can then extract the top differentially expressed genes, using a false discovery rate of $0.05$:

find.significant.genes <- function(de.result,alpha=0.05) {
  # filter out significant genes based on FDR adjusted p-values
  filtered <- de.result[(de.result$padj < alpha) & !is.infinite(de.result$log2FoldChange) & !is.nan(de.result$log2FoldChange),]
  # order by p-value, and print out only the gene name, mean count, and log2 fold change
  sorted <- filtered[order(filtered$pval),c(1,2,6)]
}
de.genes <- find.significant.genes(de.test)
de.genes.pooled <- find.significant.genes(de.test.pooled)

This procedure pulls out a lot of genes as differentially expressed:

length(de.genes[,1])

## [1] 2974

length(de.genes.pooled[,1])

## [1] 778

Examining differentially expressed genes

We can examine the top $15$ of these

head(de.genes.pooled,n=15)

##                   id    baseMean log2FoldChange
## 11228         Dppa5a   630.24357      -6.901726
## 11100           Rpl4 24974.59204      -3.094853
## 12725          Brca1    43.80824     -11.325013
## 13612         Gm2373    43.59149     -11.883117
## 11928  D930048N14Rik    35.79928     -11.465529
## 5668           Anxa3   346.09488       7.073363
## 1985            Prnp   330.81187       7.271655
## 10457            Tdh   169.22606      -8.306162
## 7112           Myadm   414.66382       6.417668
## 7775           Fanci    28.98334     -10.730172
## 15969         Pou5f1    73.01968      -7.278936
## 5056         Gm13242    48.38516      -7.926445
## 10507 RP23-103I12.13   271.73314       6.696415
## 6967           Ccnd2   677.09732       6.169517
## 13700            Fst   263.88969       5.866691

In this case, three genes show an obvious relationship to stem-cell differentiation:

Gene Function
—— ——–
Dppa5a developmental pluripotency associated
Pou5f1 self-renewal of undifferentiated ES cells
Tdh mitochondrial; highly expressed in ES cells

Running DESeq2

The previous analysis showed you all the different steps involved in carrying out a differential expression analysis with DESeq. The authors of the package recently released an updated version, which includes some modifications to the models, and functions for simplifying the above pipeline.

library(DESeq2)

With DESeq2, we first create a DESeqDataSet object from the count data, using the group allocation factor object:

dds <- DESeqDataSetFromMatrix(counts,DataFrame(groups), ~groups)

We’ll now allow DESeq2 to use two cores, to speed up the subsequent steps

library("BiocParallel")
register(MulticoreParam(2))

We can then carry out differential expression analysis between the two groups. This will automatically carry out the model fitting steps, which may take a few minutes.

dds <- DESeq(dds,parallel=TRUE)

## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates: 2 workers
## mean-dispersion relationship
## final dispersion estimates, MLE betas: 2 workers
## fitting model and testing: 2 workers
## -- replacing outliers and refitting for 7382 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing

res <- results(dds)

As before, we can then extract the top differentially expressed genes, using a false discovery rate of $0.05$. We’ll first modify the filter function, since the column names are different for DESeq2 output:

find.significant.genes <- function(de.result,alpha=0.05) {
  # filter out significant genes based on FDR adjusted p-values
  filtered <- de.result[(de.result$padj < alpha) & !is.infinite(de.result$log2FoldChange) & !is.nan(de.result$log2FoldChange) & !is.na(de.result$padj),]
  # order by p-value, and print out only the gene name, mean count, and log2 fold change
  sorted <- filtered[order(filtered$padj),c(1,2,6)]
}

de2.genes <- find.significant.genes(res)

We can then examine how different the top genes are between DESeq and DESeq2

intersect(de.genes.pooled[1:20,1],rownames(de2.genes)[1:20])

## [1] "Dppa5a" "Prnp"   "Tdh"    "Ccnd2"

Although there is some agreement, the disparity here shows that the results of these types of analyses can be sensitive to the model specification, hence it is important to assess whether the results are reproducible using different methods.

Running SCDE

An alternative approach to modeling the error noise in single cells and testing for differential expression is implemented by the scde package.

One of the major sources of technical noise in single-cell RNA seq data are dropout events, whereby genes with a non-zero expression are not detected in some cells due to failure to amplify the RNA.

The package SCDE (Single-Cell Differential Expression) explicitly models this type of event, estimating the probability of a dropout event for each gene, in each cell. The probability of differential expression is then computed after accounting for dropouts.

library(scde)

First we will set the number of cores to the number we selected when opening up the interactive queue:

n.cores <- 2

We then fit the SCDE model to the data. Since we fit models separately for each cell in SCDE, we pass in unnormalized counts.

The fitting process relies on a subset of robust genes that are detected in multiple cross-cell comparisons. Since the groups argument is supplied, the error models for the two cell types are fit independently (using two different sets of robust genes). If the groups argument is omitted, the models will be fit using a common set.

For the purposes of this tutorial, we will not run the model fitting step now, since it takes around 10 minutes when only using two cores.

scde.fitted.model <- scde.error.models(counts=counts,groups=groups,n.cores=n.cores,save.model.plots=F)
save(scde.fitted.model,file="scde_fit.RData")

Instead we will load a pre-computed fit

load("/groups/pklab/scw/scw2015/data/scde_fit.RData")

We also need to define a prior distribution for the gene expression magnitude. We will use the default provided by SCDE

scde.prior <- scde.expression.prior(models=scde.fitted.model,counts=counts)

Differential expression analysis

Using the fitted model and prior, we can now compute $p$-values for differential expression for each gene

ediff <- scde.expression.difference(scde.fitted.model,counts,scde.prior,groups=groups,n.cores=n.cores)
p.values <- 2*pnorm(abs(ediff$Z),lower.tail=F) # 2-tailed p-value
p.values.adj <- 2*pnorm(abs(ediff$cZ),lower.tail=F) # Adjusted to control for FDR
significant.genes <- which(p.values.adj<0.05)
length(significant.genes)

## [1] 2124

The adjusted $p$-values are rescaled in order to control for false discovery rate (FDR) rather than the proportion of false positives. This is one way of dealing with the issue of multiple hypothesis testing. As well as correcting for multiple testing, we can also instruct SCDE to correct for any known batch effects. Examples of this procedure can be found in the online tutorial for SCDE.

We can now extract fold differences for the differentially expressed genes, with lower and upper bounds, and FDR-adjusted p-values:

ord <- order(p.values.adj[significant.genes]) # order by p-value
de <- cbind(ediff[significant.genes,1:3],p.values.adj[significant.genes])[ord,]
colnames(de) <- c("Lower bound","log2 fold change","Upper bound","p-value")

Examining differentially expressed genes

Examining the top $15$ most significant differentially expressed genes, there are now additional genes related to stem cell differentiation

de[1:15,]

##        Lower bound log2 fold change Upper bound      p-value
## Thbs1    -4.360279        -3.494635   -2.628992 8.935458e-10
## S100a6   -4.520583        -3.687000   -2.821357 8.935458e-10
## Cald1    -5.129740        -3.815244   -2.821357 8.935458e-10
## Efcc1     3.815244         5.514470    7.758731 8.935458e-10
## Gapdh     3.013722         3.943487    4.841192 8.935458e-10
## Apoe      4.520583         5.835079    7.534305 8.935458e-10
## Zfp42     4.232035         5.803018    9.361775 8.935458e-10
## Hmgb2     4.232035         5.578592    7.021331 8.935458e-10
## Tdh       5.578592         7.053392   10.002993 8.935458e-10
## Tpm1     -4.584705        -3.558757   -2.532809 8.935458e-10
## Dppa5a    5.642714         6.572479    7.406062 8.935458e-10
## Sparc    -4.520583        -3.494635   -2.596931 8.935458e-10
## Col1a1   -5.290044        -3.815244   -2.725174 8.935458e-10
## Timp2    -5.161801        -3.879366   -2.821357 8.935458e-10
## Gm2373    7.117514         8.880862    9.842688 8.935458e-10

Gene Function
—— ——–
Dppa5a developmental pluripotency associated
Pou5f1 self-renewal of undifferentiated ES cells
Tdh mitochondrial; highly expressed in ES cells
Zfp42 used as a marker for undifferentiated pluripotent stem cells
Utf1 undifferentiated ES cell transcription factor

Overlap between genes found by DESeq and SCDE

We can also examine how many of the top $20$ genes found by both methods are the same:

intersect(rownames(de)[1:20],de.genes[1:20,1])

## [1] "Cald1"  "Dppa5a"

intersect(rownames(de)[1:20],de.genes.pooled[1:20,1])

## [1] "Tdh"    "Dppa5a" "Gm2373" "Pou5f1" "Utf1"

For our example, estimating the dispersion using the pooled method in DESeq yields more genes in common with SCDE, and the four that are annotated all have some connection to stem-cell differentiation.

Expression differences for individual genes

SCDE also provides facilities for more closely examining the expression differences for individual genes. Examining two of the genes shown above, we can see clear differences in the posterior distributions for expression magnitude.

scde.test.gene.expression.difference("Tdh",models=scde.fitted.model,counts=counts,prior=scde.prior)

plot of chunk unnamed-chunk-52

##           lb     mle       ub       ce        Z       cZ
## Tdh 5.482409 6.98927 10.22742 5.482409 7.160813 7.160813

scde.test.gene.expression.difference("Pou5f1",models=scde.fitted.model,counts=counts,prior=scde.prior)

plot of chunk unnamed-chunk-52

##              lb      mle       ub       ce        Z       cZ
## Pou5f1 4.969435 9.265592 9.970932 4.969435 7.160813 7.160813

On these plots, the coloured curves in the background show the distributions inferred for individual cells, and the dark curves denote overall distributions. By combining information from all the cells, the model is able to infer the overall distribution with high confidence.

Single Cell Workshop

Loading count data

Gene identifiers

Cell identifiers

Preliminary examination of the dataset

Negative binomial model for counts

Running DESeq

Estimating the overdispersion

Differential expression analysis

Examining differentially expressed genes

Running DESeq2

Running SCDE

Differential expression analysis

Examining differentially expressed genes

Overlap between genes found by DESeq and SCDE

Expression differences for individual genes