Loading count data
In this section of the tutorial we will be using R to analyse count data of the type generated in the previous tutorial. First we will change directories and start up an instance of R:
$ cd ~/scw/scw2015/differential_expression
$ R
We can now read in our table of counts using the following R commands:
counts <- read.delim("~/scw/scw2015/alignment/subset/combined.counts",skip=0,header=T,sep="\t",stringsAsFactors=F,row.names=1)
Examining the dimensionality of this matrix via the command dim(counts)
, we can see that there are 23,337 genes detected (rows in the matrix), and 4 cells.
In order to analyse a larger dataset, we will load the counts for the ES/MEF mouse dataset of Islam et al., which contains 48 embyronic stem (ES) cells, and 44 mouse embryonic fibroblast (MEF) cells. For efficiency, this table has been saved as a compressed .RData
file.
load("ES_MEF_raw.RData")
# clean up column (cell) names
colnames(counts) <- gsub("L139_Sample(\\d+).counts","\\1",colnames(counts))
colnames(counts)[1:10]
## [1] "MEF_54" "ESC_47" "MEF_61" "MEF_67" "ESC_19" "MEF_91" "MEF_62"
## [8] "ESC_20" "ESC_13" "MEF_70"
The rows of the counts
data frame represent genes, and the columns represent cells. However, the genes in our count file are named according to their Ensembl ID. In order to map these IDs to more informative gene names, we can use the R interface to BioMart.
We will not run this procedure during this session, so as to avoid overloading the servers with multiple simultaneous requests, but this is how it might be done:
library(biomaRt)
mart <- useMart(biomart = 'ensembl', dataset = 'mmusculus_gene_ensembl')
bm.query <- getBM(values=rownames(counts),attributes=c("ensembl_gene_id", "external_gene_name"),filters=c("ensembl_gene_id"),mart=mart)
genes <- list(ids=rownames(counts),names=bm.query[match(rownames(counts), bm.query$ensembl_gene_id),]$external_gene_name)
Gene identifiers
Due to the reasons mentioned above, we will simply read in the map from Ensembl IDs to gene names from a saved table
load("gene_name_map.RData")
The row names of the count table can then be set to these gene names:
rownames(counts) <- make.unique(as.character(genes$names))
Cell identifiers
The identities of the different cells are known in this case, and we can obtain this information from the first three characters of the column labels:
cell.labels <- substr(colnames(counts),0,3)
The dataset contains some cells that were used as negative controls, so we will remove these before further analysis:
counts <- counts[-which(cell.labels=="EMP")]
cell.labels <- cell.labels[-which(cell.labels=="EMP")]
We will also reorder the cells so that the ES cells (columns) are all in the first half of the matrix, to help with visualisation later:
counts <- cbind(counts[,cell.labels=="ESC"],counts[,cell.labels=="MEF"])
cell.labels <- substr(colnames(counts),0,3)
Preliminary examination of the dataset
We can then examine some features of the dataset:
length(counts[,1]) # number of genes
## [1] 39017
length(cell.labels) # number of cells
## [1] 92
nES <- sum(cell.labels=="ESC") # number of ES cells
nMEF <- sum(cell.labels=="MEF") # number of MEF cells
A factor object can be defined to index the different cell types.
groups <- factor(cell.labels,levels=c("ESC","MEF"))
table(groups)
## groups
## ESC MEF
## 48 44
Before proceeding, we will restrict our attention to those genes that had non-zero expression:
counts <- counts[rowSums(counts)>0,]
nGenes <- length(counts[,1])
We’ll also save the full cleaned up count matrix for use in subsequent analyses:
save(counts,file="counts.RData")
Let’s examine the effective library size (estimated by the average read count per gene) for the different cells.
coverage <- colSums(counts)/nGenes
ord <- order(groups)
bar.positions <- barplot(coverage[ord],col=groups[ord],xaxt='n',ylab="Counts per gene")
axis(side=1,at=c(bar.positions[nES/2],bar.positions[nES+nMEF/2]),labels=c("ESC","MEF"),tick=FALSE)
We can see that on average the expression is higher for the MEF cells. We’ll filter out those cells with very low coverage
counts <- counts[,coverage>1]
cell.labels <- cell.labels[coverage>1]
groups <- factor(cell.labels,levels=c("ESC","MEF"))
coverage <- coverage[coverage>1]
nCells <- length(cell.labels)
We can use a violin plot to visualize the distributions of the normalized counts for the most highly expressed genes.
counts.norm <- t(apply(counts,1,function(x) x/coverage)) # simple normalization method
top.genes <- tail(order(rowSums(counts.norm)),10)
expression <- log2(counts.norm[top.genes,]+1) # add a pseudocount of 1
library(caroline)
violins(as.data.frame(t(expression)),connect=c(),deciles=FALSE,xlab="",ylab="log2 expression")
## Loading required package: sm
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:sm':
##
## muscle
Some of these genes show a clear bimodal pattern of expression, indicating the presence of two subpopulations of cells.
Negative binomial model for counts
Perhaps the simplest statistical model for count data is the Poisson, which has only one parameter. Under a Poisson model, the variance of the expression for a particular gene is equal to its mean expression. However, due to a variety of types of noise (both biological and technical), a better fit for read count data is usually obtained by using a negative binomial model, for which the variance can be written as:
variance = mean + overdispersion x mean^2
Since the overdispersion is a positive number, the variance under the negative binomial model is always higher than for the Poisson.
On our dataset, the overdispersion is clearly greater than zero for almost all genes, suggesting that the negative binomial will indeed be a better fit:
means <- apply(counts.norm,1,mean)
excess.var <- apply(counts,1,var)-means
excess.var[excess.var < 0] <- NA
overdispersion <- excess.var / means^2
hist(log2(overdispersion),main="Variance of read counts is higher than Poisson")
Running DESeq
library(DESeq)
Several of the steps in the following analyses takes around 5 minutes to run on the full 92-cell dataset, using 2 cores on Orchestra. In order to save time for the purposes of this tutorial session, we will work with a subset of the data, sampling 20 cells from each of the groups:
es.cell.subset <- which(groups=="ESC")[(1:20)*2]
mef.cell.subset <- which(groups=="MEF")[(1:20)*2]
cell.subset <- c(es.cell.subset,mef.cell.subset)
counts <- counts[,cell.subset]
cell.labels <- cell.labels[cell.subset]
groups <- factor(cell.labels,levels=c("ESC","MEF"))
nCells <- length(cell.labels)
First we will use our count
and groups
objects to create a DESeq CountDataSet
object:
cds <- newCountDataSet(counts, groups)
DESeq estimates size factors to measure the library size for each cell.
cds <- estimateSizeFactors(cds)
Generally the estimated size factors are linearly related to the average read count per gene (coverage), but are estimated using a more robust method:
plot(sizeFactors(cds),colSums(counts)/nGenes)
Estimating the overdispersion
In order to compute $p$-values for differential expression, DESeq requires an estimate of the overdispersion for each gene. Since we have $40$ cells in total, we may wish to rely on the empirical estimates of the dispersion:
cds <- estimateDispersions(cds, sharingMode="gene-est-only")
## Warning in .local(object, ...): in estimateDispersions: sharingMode=='gene-
## est-only' will cause inflated numbers of false positives unless you have
## many replicates.
In cases where the empirical estimates may not be so reliable (e.g. when there are very few cells), we can pool information from the other cells in each group in order to obtain a more robust estimator for the dispersion
cds.pooled <- estimateDispersions(cds, method="per-condition",fitType="local")
# check the fit
plotDispEsts(cds.pooled,name="ESC")
plotDispEsts(cds.pooled,name="MEF")
Differential expression analysis
We are now ready to carry out the differential expression analysis.
de.test <- nbinomTest(cds, "ESC", "MEF")
We will also repeat the analysis using the pooled estimates for the dispersion parameters, to see what kind of difference this makes to the results:
de.test.pooled <- nbinomTest(cds.pooled, "ESC", "MEF")
We can then extract the top differentially expressed genes, using a false discovery rate of $0.05$:
find.significant.genes <- function(de.result,alpha=0.05) {
# filter out significant genes based on FDR adjusted p-values
filtered <- de.result[(de.result$padj < alpha) & !is.infinite(de.result$log2FoldChange) & !is.nan(de.result$log2FoldChange),]
# order by p-value, and print out only the gene name, mean count, and log2 fold change
sorted <- filtered[order(filtered$pval),c(1,2,6)]
}
de.genes <- find.significant.genes(de.test)
de.genes.pooled <- find.significant.genes(de.test.pooled)
This procedure pulls out a lot of genes as differentially expressed:
length(de.genes[,1])
## [1] 2974
length(de.genes.pooled[,1])
## [1] 778
Examining differentially expressed genes
We can examine the top $15$ of these
head(de.genes.pooled,n=15)
## id baseMean log2FoldChange
## 11228 Dppa5a 630.24357 -6.901726
## 11100 Rpl4 24974.59204 -3.094853
## 12725 Brca1 43.80824 -11.325013
## 13612 Gm2373 43.59149 -11.883117
## 11928 D930048N14Rik 35.79928 -11.465529
## 5668 Anxa3 346.09488 7.073363
## 1985 Prnp 330.81187 7.271655
## 10457 Tdh 169.22606 -8.306162
## 7112 Myadm 414.66382 6.417668
## 7775 Fanci 28.98334 -10.730172
## 15969 Pou5f1 73.01968 -7.278936
## 5056 Gm13242 48.38516 -7.926445
## 10507 RP23-103I12.13 271.73314 6.696415
## 6967 Ccnd2 677.09732 6.169517
## 13700 Fst 263.88969 5.866691
In this case, three genes show an obvious relationship to stem-cell differentiation:
Gene Function
—— ——–
Dppa5a developmental pluripotency associated
Pou5f1 self-renewal of undifferentiated ES cells
Tdh mitochondrial; highly expressed in ES cells
Running DESeq2
The previous analysis showed you all the different steps involved in carrying out a differential expression analysis with DESeq. The authors of the package recently released an updated version, which includes some modifications to the models, and functions for simplifying the above pipeline.
library(DESeq2)
With DESeq2, we first create a DESeqDataSet object from the count data, using the group allocation factor object:
dds <- DESeqDataSetFromMatrix(counts,DataFrame(groups), ~groups)
We’ll now allow DESeq2 to use two cores, to speed up the subsequent steps
library("BiocParallel")
register(MulticoreParam(2))
We can then carry out differential expression analysis between the two groups. This will automatically carry out the model fitting steps, which may take a few minutes.
dds <- DESeq(dds,parallel=TRUE)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates: 2 workers
## mean-dispersion relationship
## final dispersion estimates, MLE betas: 2 workers
## fitting model and testing: 2 workers
## -- replacing outliers and refitting for 7382 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res <- results(dds)
As before, we can then extract the top differentially expressed genes, using a false discovery rate of $0.05$. We’ll first modify the filter function, since the column names are different for DESeq2 output:
find.significant.genes <- function(de.result,alpha=0.05) {
# filter out significant genes based on FDR adjusted p-values
filtered <- de.result[(de.result$padj < alpha) & !is.infinite(de.result$log2FoldChange) & !is.nan(de.result$log2FoldChange) & !is.na(de.result$padj),]
# order by p-value, and print out only the gene name, mean count, and log2 fold change
sorted <- filtered[order(filtered$padj),c(1,2,6)]
}
de2.genes <- find.significant.genes(res)
We can then examine how different the top genes are between DESeq and DESeq2
intersect(de.genes.pooled[1:20,1],rownames(de2.genes)[1:20])
## [1] "Dppa5a" "Prnp" "Tdh" "Ccnd2"
Although there is some agreement, the disparity here shows that the results of these types of analyses can be sensitive to the model specification, hence it is important to assess whether the results are reproducible using different methods.
Running SCDE
An alternative approach to modeling the error noise in single cells and testing for differential expression is implemented by the scde package.
One of the major sources of technical noise in single-cell RNA seq data are dropout events, whereby genes with a non-zero expression are not detected in some cells due to failure to amplify the RNA.
The package SCDE (Single-Cell Differential Expression) explicitly models this type of event, estimating the probability of a dropout event for each gene, in each cell. The probability of differential expression is then computed after accounting for dropouts.
library(scde)
First we will set the number of cores to the number we selected when opening up the interactive queue:
n.cores <- 2
We then fit the SCDE model to the data. Since we fit models separately for each cell in SCDE, we pass in unnormalized counts.
The fitting process relies on a subset of robust genes that are detected in multiple cross-cell comparisons. Since the groups
argument is supplied, the error models for the two cell types are fit independently (using two different sets of robust genes). If the groups
argument is omitted, the models will be fit using a common set.
For the purposes of this tutorial, we will not run the model fitting step now, since it takes around 10 minutes when only using two cores.
scde.fitted.model <- scde.error.models(counts=counts,groups=groups,n.cores=n.cores,save.model.plots=F)
save(scde.fitted.model,file="scde_fit.RData")
Instead we will load a pre-computed fit
load("/groups/pklab/scw/scw2015/data/scde_fit.RData")
We also need to define a prior distribution for the gene expression magnitude. We will use the default provided by SCDE
scde.prior <- scde.expression.prior(models=scde.fitted.model,counts=counts)
Differential expression analysis
Using the fitted model and prior, we can now compute $p$-values for differential expression for each gene
ediff <- scde.expression.difference(scde.fitted.model,counts,scde.prior,groups=groups,n.cores=n.cores)
p.values <- 2*pnorm(abs(ediff$Z),lower.tail=F) # 2-tailed p-value
p.values.adj <- 2*pnorm(abs(ediff$cZ),lower.tail=F) # Adjusted to control for FDR
significant.genes <- which(p.values.adj<0.05)
length(significant.genes)
## [1] 2124
The adjusted $p$-values are rescaled in order to control for false discovery rate (FDR) rather than the proportion of false positives. This is one way of dealing with the issue of multiple hypothesis testing. As well as correcting for multiple testing, we can also instruct SCDE to correct for any known batch effects. Examples of this procedure can be found in the online tutorial for SCDE.
We can now extract fold differences for the differentially expressed genes, with lower and upper bounds, and FDR-adjusted p-values:
ord <- order(p.values.adj[significant.genes]) # order by p-value
de <- cbind(ediff[significant.genes,1:3],p.values.adj[significant.genes])[ord,]
colnames(de) <- c("Lower bound","log2 fold change","Upper bound","p-value")
Examining differentially expressed genes
Examining the top $15$ most significant differentially expressed genes, there are now additional genes related to stem cell differentiation
de[1:15,]
## Lower bound log2 fold change Upper bound p-value
## Thbs1 -4.360279 -3.494635 -2.628992 8.935458e-10
## S100a6 -4.520583 -3.687000 -2.821357 8.935458e-10
## Cald1 -5.129740 -3.815244 -2.821357 8.935458e-10
## Efcc1 3.815244 5.514470 7.758731 8.935458e-10
## Gapdh 3.013722 3.943487 4.841192 8.935458e-10
## Apoe 4.520583 5.835079 7.534305 8.935458e-10
## Zfp42 4.232035 5.803018 9.361775 8.935458e-10
## Hmgb2 4.232035 5.578592 7.021331 8.935458e-10
## Tdh 5.578592 7.053392 10.002993 8.935458e-10
## Tpm1 -4.584705 -3.558757 -2.532809 8.935458e-10
## Dppa5a 5.642714 6.572479 7.406062 8.935458e-10
## Sparc -4.520583 -3.494635 -2.596931 8.935458e-10
## Col1a1 -5.290044 -3.815244 -2.725174 8.935458e-10
## Timp2 -5.161801 -3.879366 -2.821357 8.935458e-10
## Gm2373 7.117514 8.880862 9.842688 8.935458e-10
Gene Function
—— ——–
Dppa5a developmental pluripotency associated
Pou5f1 self-renewal of undifferentiated ES cells
Tdh mitochondrial; highly expressed in ES cells
Zfp42 used as a marker for undifferentiated pluripotent stem cells
Utf1 undifferentiated ES cell transcription factor
Overlap between genes found by DESeq and SCDE
We can also examine how many of the top $20$ genes found by both methods are the same:
intersect(rownames(de)[1:20],de.genes[1:20,1])
## [1] "Cald1" "Dppa5a"
intersect(rownames(de)[1:20],de.genes.pooled[1:20,1])
## [1] "Tdh" "Dppa5a" "Gm2373" "Pou5f1" "Utf1"
For our example, estimating the dispersion using the pooled method in DESeq yields more genes in common with SCDE, and the four that are annotated all have some connection to stem-cell differentiation.
Expression differences for individual genes
SCDE also provides facilities for more closely examining the expression differences for individual genes. Examining two of the genes shown above, we can see clear differences in the posterior distributions for expression magnitude.
scde.test.gene.expression.difference("Tdh",models=scde.fitted.model,counts=counts,prior=scde.prior)
## lb mle ub ce Z cZ
## Tdh 5.482409 6.98927 10.22742 5.482409 7.160813 7.160813
scde.test.gene.expression.difference("Pou5f1",models=scde.fitted.model,counts=counts,prior=scde.prior)
## lb mle ub ce Z cZ
## Pou5f1 4.969435 9.265592 9.970932 4.969435 7.160813 7.160813
On these plots, the coloured curves in the background show the distributions inferred for individual cells, and the dark curves denote overall distributions. By combining information from all the cells, the model is able to infer the overall distribution with high confidence.