Exercise: Day 10 - Gene Set Enrichment
Day 10: Gene set enrichment and pathways
Getting the data for the exercises
For these exercises, we’ll be working with differential gene expression from the TCGA Prostate Cancer Dataset. You may already have the data stored in an .RData
file - if that’s the case, please load it in.
Otherwise, you can load it in with the following commands, which will produce two objects, the dataset prad
and the TCGA annotations annot
. You can find this file here.
pradde <- read.table('RES_DESeq2_PRAD.txt', header=T, sep='\t', quote='"')
Be sure to familiarize yourself with the characteristics of the data.
Exercise 1: Gene Ontology and Gene Set Enrichment Analysis
As you learned in class today, you can apply GO analysis to short lists of genes to determine if they share a specific function or context. We’ll be using the topGO
package throughout these exercises, so please install it before continuing.
topGO
is fairly complex, due to the ability to run many different analyses - we’ll be running two different analyses, GO Enrichment and a GSEA-type analysis (Kolmogorov-Smirnov).
1.a. Building a topGO
object
At the end of this exercise, we’ll have an object of type topGOdata
, which we can subject to a variety of enrichment tests. We’ll begin by creating each of the data structures required for the object definition. Here’s the template of a new topGOdata
object.
topGOdata <- new( # New creates an object of a defined class
'topGOdata', # The class of object to create
ontology = 'BP', # We can choose from Molecular Function (MF),
# Biological Process (BP),
# and Cellular Component (CC)
allGenes = geneList, # The p-values of all genes in your analysis
geneSel = function(), # A function for defining the subset you'd like to analyze
annot = annotFUN(), # A function defining your annotation
# and some other arguments
)
First, we’ll build the allGenes
input. Create a named vector, with the p-values for differential expression as the data, and the gene names as names. It should look like this:
## TSPAN6 TNMD DPM1 SCYL3 C1orf112
## 9.533633e-01 2.546213e-05 2.868157e-03 2.519265e-02 7.661085e-01
## FGR
## 3.308286e-02
NB: Be sure to remove NA values!
Next, let’s build the geneSel
function. We’ll name it selFun
. We’d like it to return a boolean vector where the top 250 most differentially expressed genes are marked as TRUE
and all others are FALSE
. For this function do not reference the names, only the value! Define it and check that your output looks like this:
# Get the first 5 genes from selFun
head(selFun(geneList))
## TSPAN6 TNMD DPM1 SCYL3 C1orf112 FGR
## FALSE FALSE FALSE FALSE FALSE FALSE
Now that we’ve gotten this far, we need to put together our object! We can fill in the call to new()
with our newly developed inputs. We’re also going to add some extra arguments that help define our object. These arguments help topGO
understand what our annotations for each score are (they’re Gene Symbols in this case). See ?annFUN.org
for more information.
GOdata <- new('topGOdata',
ontology='BP', # We'll use Biological Process for this example
allGenes=geneList,
geneSel = selFun,
annot=annFUN.org,
mapping = 'org.Hs.eg.db', # The annotation package for the human genome
ID = 'symbol' # We're using gene symbols
)
Be sure to take a good look at the output of the object creation call, and get a feel for what it provides!
1.b. Running enrichment tests using topGO
Now that we have our topGOdata
object, we can very straightforwardly run analyses. The specific analyses you run are up to you, but please try at least two. As an example, the classical Fisher’s Exact Test and KS Test we learned about in class are run below. Information on additional types of analysis are available in the package’s vignette.
Let’s take a look at the outputs:
resultFisher
##
## Description:
## Ontology: BP
## 'classic' algorithm with the 'fisher' test
## 14718 GO terms scored: 62 terms with p < 0.01
## Annotation data:
## Annotated genes: 15234
## Significant genes: 183
## Min. no. of genes annotated to a GO: 1
## Nontrivial nodes: 3519
resultKS
##
## Description:
## Ontology: BP
## 'classic' algorithm with the 'ks' test
## 14718 GO terms scored: 500 terms with p < 0.01
## Annotation data:
## Annotated genes: 15234
## Significant genes: 183
## Min. no. of genes annotated to a GO: 1
## Nontrivial nodes: 14718
There’s a lot of information here, including the number of GO terms that were scored and the number that were significant! Luckily, there’s a quick way to compare the top predictions between different methods in topGO
:
# Make the table
resultTable <- GenTable(GOdata,
classicFisher = resultFisher,
classicKS = resultKS,
topNodes = 50,
ranksOf = 'classicFisher'
)
# Look at the first few lines
head(resultTable)
## GO.ID Term Annotated
## 1 GO:0042127 regulation of cell proliferation 1459
## 2 GO:0070252 actin-mediated cell contraction 92
## 3 GO:0001765 membrane raft assembly 10
## 4 GO:0030048 actin filament-based movement 111
## 5 GO:0019065 receptor-mediated endocytosis of virus b... 3
## 6 GO:0033484 nitric oxide homeostasis 3
## Significant Expected Rank in classicFisher classicFisher classicKS
## 1 36 17.53 1 2.2e-05 0.00325
## 2 7 1.11 2 0.00012 3.8e-07
## 3 3 0.12 3 0.00019 0.17693
## 4 7 1.33 4 0.00038 1.5e-05
## 5 2 0.04 5 0.00043 0.07405
## 6 2 0.04 6 0.00043 0.04322
Exercise 2. Adding your results to your shiny application
Now that we’ve run through the GO Analysis techniques, we’re going to add on to the shiny applications from Day 9. I’ve produced an updated version of the shiny app with one potential direction you can go for visualization. I’ve used the navlistPanel
layout feature to separate our Day 10 from our Day 9 work. I’ve also included a tabular version of the results from today.
At this point, you should be familiar enough with how Shiny works to give this a try yourself. Since this is the last day, feel free to explore what Shiny has to offer and add as many new features as you’d like. The code for my version will be available at 8pm as usual.