Exercise: Day 5 - Expression Analysis with RNA-seq
Get the data
Download the following data of from https://www.dropbox.com/sh/gcsyyf4wumou1gq/AADTP8Z3Vgv7Teja1iEI6ls-a?dl=0
PRAD_raw_counts.txt - Raw counts on 86 matched tumor-normal samples and 55635 genes
PRAD_sample_info.txt - Labels associated to the samples (“Primary Tumor” and “Solid Tissue”)
Load the data into R
Do this by either first downloading the file and providing the local path:
rm(list=ls()) # clean the workspace
setwd("path_to_files") #replace location with local path_to_file
prad_data <- read.table("PRAD_raw_counts.txt",sep="\t",header=T)
RPKM values
Assuming the gene lengths defined below, write a function that, receiving in input the counts from each sample and the gene lengths, calculate the RPKM values
head(prad_data[,1:5])
## TCGA_HC_8260_11A TCGA_HC_8259_11A TCGA_EJ_7123_11A TCGA_G9_6496_01A TCGA_EJ_7781_01A
## TSPAN6 3829 3990 2770 2666 4454
## TNMD 10 23 24 5 6
## DPM1 1555 2108 1987 551 1531
## SCYL3 1096 1598 1477 426 1792
## C1orf112 236 279 307 75 273
## FGR 222 382 765 203 338
gene_lengths<-rep(c(100,1000),dim(prad_data)[1])[dim(prad_data)[1]]
returnRPKM <- function(counts, gene_lengths) {
....
return(rpkm)
}
prad_data_rpkm<-apply(prad_data,2,function(x) returnRPKM(x,gene_lengths))
head(prad_data_rpkm[,1:5])
## TCGA_HC_8260_11A TCGA_HC_8259_11A TCGA_EJ_7123_11A TCGA_G9_6496_01A TCGA_EJ_7781_01A
## TSPAN6 689.840938 602.398597 233.542303 638.641255 666.9722899
## TNMD 1.801622 3.472473 2.023471 1.197752 0.8984809
## DPM1 280.152170 318.259710 167.526555 131.992247 229.2623655
## SCYL3 197.457735 241.261393 124.527791 102.048453 268.3462828
## C1orf112 42.518271 42.122609 25.883569 17.966277 40.8808790
## FGR 39.996001 57.673249 64.498145 48.628723 50.6144216
Data handling and diagnostic plots
A. Starting from the raw count data, filter out the genes characterized by a maximum expression level across the samples which is not above 10 and log2 transform the data, using an offset =1
....
dim(prad_data_filt)
## [1] 29741 86
B. Perform the PCA analysis and plot the first two components, highlighting Primary Tumor vs Solid Tissue Normal classification (NOTE: use pch=19 and col=condition)
samp_info<-read.table("PRAD_sample_info.txt",sep="\t",header=T)
condition<-as.factor(samp_info[,2])
....
C. Write a function able to display, for a sample given as input, its MA plot versus the “median sample”, i.e. the sample of the median expression values across the genes (NOTE: use the parameter pch=”.” in the plot, and abline to highlight the 0 line)
MAplot<-function(data,n) #n = number indicating the sample
{
....
}
MAplot(prad_data_filt,n=1)
D. Install and load the package “arrayQualityMetrics”, follow the instructions below and look at the resulting report.
source("https://bioconductor.org/biocLite.R")
biocLite("arrayQualityMetrics")
library(arrayQualityMetrics)
condition<-samp_info[,2]
ggt<-data.frame(Group=condition)
row.names(ggt)<-samp_info[,1]
lab<-new("AnnotatedDataFrame",data=ggt)
ddt<-new("ExpressionSet",exprs=as.matrix(prad_data_filt),phenoData=lab)
datx<-prepdata(ddt,intgroup="Group",do.logtransform=FALSE)
arrayQualityMetrics(ddt,outdir="PRAD_raw_counts",intgroup="Group")
Normalization with DESeq2
Install and load the library DESeq2 and use the functions “DESeqDataSetFromMatrix”,”estimateSizeFactors”” and “counts” to obtain the normalized count, starting from the filtered raw count data, NOT log2 transformed. Generate the QC report (using the log2 transformed data plus offset=1) for these data and look how the dignostic plots change with respect to the report obtained from the raw data.
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
library("DESeq2")
ll<-apply(prad_data,1,max)
ind<-which(ll>10)
prad_data_filt<-prad_data[ind,]
condition<-as.factor(samp_info[,2])
ddsMat <- DESeqDataSetFromMatrix(countData = prad_data_filt,DataFrame(condition), ~ condition)
.... #create matnorm
head(matnorm[,1:5])
## TCGA_HC_8260_11A TCGA_HC_8259_11A TCGA_EJ_7123_11A TCGA_G9_6496_01A
## TSPAN6 4043.35607 3585.40222 1801.94582 5351.86640
## TNMD 10.55982 20.66773 15.61253 10.03726
## DPM1 1642.05241 1894.24258 1292.58713 1106.10592
## SCYL3 1157.35656 1435.95808 960.82093 855.17445
## C1orf112 249.21181 250.70858 199.71024 150.55888
## FGR 234.42806 343.26407 497.64930 407.51271
TCGA_EJ_7781_01A
TSPAN6 3956.467495
TNMD 5.329772
DPM1 1359.980183
SCYL3 1591.825270
C1orf112 242.504631
FGR 300.243829
condition<-as.character(samp_info[,2])
ggt<-data.frame(Group=condition)
row.names(ggt)<-as.character(samp_info[,1])
lab<-new("AnnotatedDataFrame",data=ggt)
ddt<-new("ExpressionSet",exprs=log2(matnorm+1),phenoData=lab)
datx<-prepdata(ddt,intgroup="Group",do.logtransform=FALSE)
arrayQualityMetrics(ddt,outdir="PRAD_norm_counts",intgroup="Group")
Differential analysis with DESeq2
Firstly, it is better to be sure that the “Solid Tissue Normal” is the first level in the condition factor, so that the default log2 fold changes are calculated as tumor over normal. The function relevel achieves this:
ddsMat$condition <- relevel(dds$condition, "Solid Tissue Normal")
The DESeq2 analysis can now be run with a single call to the function DESeq:
dds <- DESeq(ddsMat)
Calling “results” without any arguments will extract the estimated log2 fold changes and p values:
res <- results(dds)
As res is a DataFrame object, it carries metadata with information on the meaning of the columns:
mcols(res, use.names=TRUE)
## DataFrame with 6 rows and 2 columns
## type
## <character>
## baseMean intermediate
## log2FoldChange results
## lfcSE results
## stat results
## pvalue results
## padj results
## description
## <character>
## baseMean mean of normalized counts for all samples
## log2FoldChange log2 fold change (MAP): condition Primary Tumor vs Solid Tissue Normal
## lfcSE standard error: condition Primary Tumor vs Solid Tissue Normal
## stat Wald statistic: condition Primary Tumor vs Solid Tissue Normal
## pvalue Wald test p-value: condition Primary Tumor vs Solid Tissue Normal
## padj BH adjusted p-values
The first column, baseMean, is a just the average of the normalized count values, dividing by size factors, taken over all samples. The remaining four columns refer to a specific contrast, namely the comparison of the tumor samples over the normal samples for the factor variable condition. Look at the help page for results.
The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of (2^1.5)≈2.82.
This estimate has an uncertainty associated with it, which is available in the column lfcSE, the standard error estimate for the log2 fold change estimate. DESeq2 performs for each gene a hypothesis test to see whether evidence is sufficient to decide against the null hypothesis that there is no effect of the tumor on the gene and that the observed difference between tumor and control was merely caused by experimental variability (i.e., the type of variability that you can just as well expect between different samples in the same condition group). As usual in statistics, the result of this test is reported as a p value, and it is found in the column pvalue.
To look at the genes with a p value below 0.05:
sum(res$pvalue < 0.05, na.rm=TRUE)
## [1] 16044
Now, assume for a moment that the null hypothesis is true for all genes, i.e., no gene is affected by the condition tumor vs normal. Then, by the definition of p value, we expect up to 5% of the genes to have a p value below 0.05. However, we have more genes selected than the 5% of the total. DESeq2 uses the Benjamini-Hochberg (BH) adjustment as described in the base R p.adjust function in order to decrease the number of false positives.
A. Look at the number of genes with adjusted p-value below 5%.
....
## [1] 14369
To select the number of significant genes:
resSig <- subset(res, padj < 0.05)
B. Use DEseq2 function plotMA and use a threshold for alpha = 0.05
Final Exercise Download the GBM data from the same link reported above and repeat the pipeline. Use DESeq2 with the condition tumor vs. normal and try to combine this condition with the age or the gender as in the example:
ddsMat <- DESeqDataSetFromMatrix(countData = gbm_data_filt,.... ~ condition+gender)