Single-Cell Differential Expression Analysis

In this vignette, we show you how perform single cell differential expression analysis using single cell RNA-seq data with the scde package.

The scde package implements routines for fitting individual error models for single-cell RNA-seq measurements. Briefly, the read counts observed for each gene are modeled using a mixture of a negative binomial (NB) distribution (for the amplified/detected transcripts) and low-level Poisson distribution (for the unobserved or background-level signal of genes that failed to amplify or were not detected for other reasons). These models can then be used to identify robustly differentially expressed genes between groups of cells. For more information, please refer to the original manuscript by Kharchenko et al..

Preparing data

The analysis starts with a matrix of read counts. Depending on the protocol, these may be raw numbers of reads mapped to each gene, or count values adjusted for potential biases (sequence dependency, splice variant coverage, etc. - the values must be integers). The scde package includes a subset of the ES/MEF cell dataset published by Islam et al.. The subset includes first 20 ES and MEF cells. Here we load the cells and define a factor separating ES and MEF cell types:

# load example dataset
data(es.mef.small)

# factor determining cell types
sg <- factor(gsub("(MEF|ESC).*", "\\1", colnames(es.mef.small)), levels = c("ESC", "MEF"))
# the group factor should be named accordingly
names(sg) <- colnames(es.mef.small)  
table(sg)

## sg
## ESC MEF 
##  20  20

# clean up the dataset
cd <- clean.counts(es.mef.small, min.lib.size=1000, min.reads = 1, min.detected = 1)

Fitting error models

As a next step we fit the error models on which all subsequent calculations will rely. The fitting process relies on a subset of robust genes that are detected in multiple cross-cell comparisons. Here we supply the groups = sg argument, so that the error models for the two cell types are fit independently (using two different sets of "robust" genes). If the groups argument is omitted, the models will be fit using a common set.

Note this step takes a considerable amount of time unless multiple cores are used.

# EVALUATION NOT NEEDED
# calculate models
o.ifm <- scde.error.models(counts = cd, groups = sg, n.cores = 1, threshold.segmentation = TRUE, save.crossfit.plots = FALSE, save.model.plots = FALSE, verbose = 1)

For the purposes of this vignette, the model has been precomputed and can simply be loaded.

data(o.ifm)

The o.ifm is a dataframe with error model coefficients for each cell (rows).

head(o.ifm)

Here, corr.a and corr.b are slope and intercept of the correlated component fit, conc.* refer to the concomitant fit, corr.theta is the NB over-dispersion, and fail.r is the background Poisson rate (fixed).

Particularly poor cells may result in abnormal fits, most commonly showing negative corr.a, and should be removed.

# filter out cells that don't show positive correlation with
# the expected expression magnitudes (very poor fits)
valid.cells <- o.ifm$corr.a > 0
table(valid.cells)

## valid.cells
## TRUE 
##   40

o.ifm <- o.ifm[valid.cells, ]

Here, all the fits were valid.

Finally, we need to define an expression magnitude prior for the genes. Its main function, however, is to define a grid of expression magnitude values on which the numerical calculations will be carried out.

# estimate gene expression prior
o.prior <- scde.expression.prior(models = o.ifm, counts = cd, length.out = 400, show.plot = FALSE)

Here we used a grid of 400 points, and let the maximum expression magnitude be determined by the default 0.999 quantile (use max.value parameter to specify the maximum expression magnitude explicitly - on log10 scale).

Testing for differential expression

To test for differential expression, we first define a factor that specifies which two groups of cells are to be compared. The factor elements correspond to the rows of the model matrix (o.ifm), and can contain NA values (i.e. cells that won't be included in either group). Here we key off the the ES and MEF names.

# define two groups of cells
groups <- factor(gsub("(MEF|ESC).*", "\\1", rownames(o.ifm)), levels  =  c("ESC", "MEF"))
names(groups) <- row.names(o.ifm)
# run differential expression tests on all genes.
ediff <- scde.expression.difference(o.ifm, cd, o.prior, groups  =  groups, n.randomizations  =  100, n.cores  =  1, verbose  =  1)

## comparing groups:
## 
## ESC MEF 
##  20  20 
## calculating difference posterior
## summarizing differences

# top upregulated genes (tail would show top downregulated ones)
head(ediff[order(ediff$Z, decreasing  =  TRUE), ])

##                     lb      mle        ub       ce        Z       cZ
## Dppa5a        8.075220 9.984631 11.575807 8.075220 7.160813 5.989598
## Pou5f1        5.370220 7.200073  9.189043 5.370220 7.160328 5.989598
## Gm13242       5.688455 7.677425  9.785734 5.688455 7.159979 5.989598
## Tdh           5.807793 8.075220 10.302866 5.807793 7.159589 5.989598
## Ift46         5.449779 7.359190  9.228822 5.449779 7.150242 5.989598
## 4930509G22Rik 5.409999 7.478528  9.785734 5.409999 7.115605 5.978296

The columns in the table above are:

mle: log2 fold expression difference value, reported as maximum likelihood estimate (mle)
lb, ub: 95% lower bound (lb) and upper bound (ub) of the log2 fold expression difference value
ce: conservative estiamte of the log2 fold expression difference value (upper or lower bound, whichever is closer to 0, or 0 if the 95% confidence interval includes 0)
Z: statistical significance of the expression differences, expressed as a signed Z score (can be directly converted to P values; roughly a p-value of 1e-2 would correspond to a Z score around 2.3, and a p-value of 1e-3 to 3.1; One can convert Z scores back to p-values using pnorm(Z,lower.tail=F) for upregulated genes and pnorm(Z,lower.tail=T) for downregulated genes; Z scores are signed, so that the positive values correspond to up-regulated genes, negative to downregulated)
cZ: Z score adjusted for multiple hypothesis testing

# write out a table with all the results, showing most significantly different genes (in both directions) on top
write.table(ediff[order(abs(ediff$Z), decreasing = TRUE), ], file = "results.txt", row.names = TRUE, col.names = TRUE, sep = "\t", quote = FALSE)

One can also browse the differentially expressed genes, along with the associated posterior distributions, using a web browser application. The command below will start an internal web server accessible on port 1299 and call up a browser to point to the resulting page (note: internal browser doesn't work under RStudio)

scde.browse.diffexp(ediff, o.ifm, cd, o.prior, groups = groups, name = "diffexp1", port = 1299)

Alternatively we can run the differential expression on a single gene, and visualize the results:

scde.test.gene.expression.difference("Tdh", models = o.ifm, counts = cd, prior = o.prior)

##           lb     mle       ub       ce        Z       cZ
## Tdh 5.728235 8.03544 10.30287 5.728235 7.151425 7.151425

The top and the bottom plots show expression posteriors derived from individual cells (colored lines) and joint posteriors (black lines). The middle plot shows posterior of the expression fold difference between the two cell groups, highlighting the 95% credible interval by the red shading.

Correcting for batch effects

When the data combines cells that were measured in different batches, it is sometimes necessary to explicitly account for the expression differences that could be explained by the batch composition of the cell groups being compared. The example below makes up a random batch composition for the ES/MEF cells, and re-test the expression difference.

batch <- as.factor(ifelse(rbinom(nrow(o.ifm), 1, 0.5) == 1, "batch1", "batch2"))
# check the interaction between batches and cell types (shouldn't be any)
table(groups, batch)

##       batch
## groups batch1 batch2
##    ESC     11      9
##    MEF      8     12

# test the Tdh gene again
scde.test.gene.expression.difference("Tdh", models = o.ifm, counts = cd, prior = o.prior, batch = batch)

##           lb      mle       ub       ce        Z       cZ
## Tdh 3.659705 7.796764 12.01338 3.659705 3.782082 3.782082

In the plot above, the grey lines are used to show posterior distributions based on the batch composition alone. The expression magnitude posteriors (top and bottom plots) look very similar, and as a result the log2 expression ratio posterior is close to 0. The thin black line shows log2 expression ratio posterior before correction. The batch correction doesn't shift the location, but increases uncertainty in the ratio estimate (since we're controlling for another factor).

Similarly, batch correction can be performed when calculating expression differences for the entire dataset:

# test for all of the genes
ediff.batch <- scde.expression.difference(o.ifm, cd, o.prior, groups = groups, batch = batch, n.randomizations = 100, n.cores = 1, return.posteriors = TRUE, verbose = 1)

## controlling for batch effects. interaction:
##       batch
## groups batch1 batch2
##    ESC     11      9
##    MEF      8     12
## calculating batch posteriors
## calculating batch differences
## calculating difference posterior
## summarizing differences
## adjusting for batch effects

More detailed functions

The scde.expression.difference method can return a more extensive set of results, including joint posteriors and the expression fold difference posteriors for all of the exam ined genes:

The joint posteriors can also be obtained explicitly for a particular set of cells:

# calculate joint posterior for ESCs (set return.individual.posterior.modes=T if you need p.modes)
jp <- scde.posteriors(models = o.ifm[grep("ESC",rownames(o.ifm)), ], cd, o.prior, n.cores = 1)

The error models fit the intercept and the slope of the NB "correlated" component, providing more consistent expression magnitude estimates among the cells. These can be obtain ed with a quick helper function:

# get expression magntiude estimates
o.fpm <- scde.expression.magnitude(o.ifm, counts = cd)

Drop-out probabilities (as a function of expression magnitudes) for different cells are useful for assessing the quality of the measurements:

# get failure probabilities on the expresison range
o.fail.curves <- scde.failure.probability(o.ifm, magnitudes = log((10^o.prior$x)-1))
par(mfrow = c(1,1), mar = c(3.5,3.5,0.5,0.5), mgp = c(2.0,0.65,0), cex = 1)
plot(c(), c(), xlim=range(o.prior$x), ylim=c(0,1), xlab="expression magnitude (log10)", ylab="drop-out probability")
invisible(apply(o.fail.curves[, grep("ES",colnames(o.fail.curves))], 2, function(y) lines(x = o.prior$x, y = y,col = "orange")))
invisible(apply(o.fail.curves[, grep("MEF", colnames(o.fail.curves))], 2, function(y) lines(x = o.prior$x, y = y, col = "dodgerblue")))

The drop-out probabilities (at a given expression magnitude, or at an observed count) can be useful in subsequent analysis

# get failure probabilities on the expresison range
o.fail.curves <- scde.failure.probability(o.ifm, magnitudes = log((10^o.prior$x)-1))
# get self-fail probabilities (at a given observed count)
p.self.fail <- scde.failure.probability(models = o.ifm, counts = cd)

Adjusted distance meaures

The dependency of drop-out probability on the average expression magntiude captured by the cell-speicifc models can be used to adjust cell-to-cell similarity measures, for insta nce in the context of cell clustering. Several such measures are explored below.

Direct drop-out

Direct weighting downweights the contribution of a given gene to the cell-to-cell distance based on the probability that the given measurement is a drop-out event (i.e. belongs to the drop-out component) - the "self-fail" probability shown in the previous section. To estimate the adjusted distance, we will simulate the drop-out events, replacing them with NA values, and calculating correlation using the remaining points:

p.self.fail <- scde.failure.probability(models = o.ifm, counts = cd)
# simulate drop-outs
# note: using 10 sampling rounds for illustration here. ~500 or more should be used.
n.simulations <- 10; k <- 0.9;
cell.names <- colnames(cd); names(cell.names) <- cell.names;
dl <- mclapply(1:n.simulations,function(i) {
  scd1 <- do.call(cbind,lapply(cell.names,function(nam) {
    x <- cd[,nam];
    # replace predicted drop outs with NAs
    x[!as.logical(rbinom(length(x),1,1-p.self.fail[,nam]*k))] <- NA;
    x;
    }))
  rownames(scd1) <- rownames(cd); 
  # calculate correlation on the complete observation pairs
  cor(log10(scd1+1),use="pairwise.complete.obs");
}, mc.cores = 1)
# calculate average distance across sampling rounds
direct.dist <- as.dist(1-Reduce("+",dl)/length(dl))

Reciprocal weighting

The reciprocal weighting of the Pearson correlation will give increased weight to pairs of observations where a gene expressed (on average) at a level x1 observed in a cell c1 would not be likely to fail in a cell c2, and vice versa:

# load boot package for the weighted correlation implementation
require(boot)
k <- 0.95;
reciprocal.dist <- as.dist(1 - do.call(rbind, mclapply(cell.names, function(nam1) {
  unlist(lapply(cell.names, function(nam2) {
    # reciprocal probabilities
    f1 <- scde.failure.probability(models = o.ifm[nam1,,drop = FALSE], magnitudes = o.fpm[, nam2])
    f2 <- scde.failure.probability(models = o.ifm[nam2,,drop = FALSE], magnitudes = o.fpm[, nam1])
    # weight factor
    pnf <- sqrt((1-f1)*(1-f2))*k +(1-k); 
    boot::corr(log10(cbind(cd[, nam1], cd[, nam2])+1), w = pnf)
    }))
},mc.cores = 1)), upper = FALSE)

Mode-relative weighting

A more reliable reference magnitude against which drop-out likelihood could be assessed would be an estimate of the average expression magnitude, such as joint posterior mode. Below we estimate p.mode.fail, a probability that a drop-out event could be observed at the level of average expression magntiude in a given cell. For each measurement we then reduce it weight if it indeed dropped out in a cell where we expect it to drop-out given its average expression magnitude (p.self.fail*p.mode.fail). However we do want to give high weight to measurements where the drop-out was not observed, even though it was exected based on the average expression magnitude, so the overall weight expression is (1-p.self.fail*sqrt(p.self.fail*p.mode.fail)) (other formulations are clearly possible here).

# reclculate posteriors with the individual posterior modes 
jp <- scde.posteriors(models = o.ifm, cd, o.prior, return.individual.posterior.modes = TRUE, n.cores = 1)
# find joint posterior modes for each gene - a measure of MLE of group-average expression
jp$jp.modes <- log(as.numeric(colnames(jp$jp)))[max.col(jp$jp)]
p.mode.fail <- scde.failure.probability(models = o.ifm, magnitudes = jp$jp.modes)
# weight matrix
matw <- 1-sqrt(p.self.fail*sqrt(p.self.fail*p.mode.fail))
# magnitude matrix (using individual posterior modes here)
mat <- log10(exp(jp$modes)+1);
# weighted distance
mode.fail.dist <- as.dist(1-do.call(rbind,mclapply(cell.names,function(nam1) {
  unlist(lapply(cell.names,function(nam2) {
    corr(cbind(mat[, nam1], mat[, nam2]), w = sqrt(sqrt(matw[, nam1]*matw[, nam2])))
  }))
}, mc.cores = 1)), upper = FALSE)

SCDE

The SCDE package implements a set of statistical methods for analyzing single-cell RNA-seq data, including differential expression analysis (Kharchenko et al.) and pathway and geneset overdispersion analysis (Fan et al.)