Exercise: Day 10 - Gene Set Enrichment

Back to Lecture

Day 10: Gene set enrichment and pathways

Getting the data for the exercises

For these exercises, we’ll be working with differential gene expression from the TCGA Prostate Cancer Dataset. You may already have the data stored in an .RData file - if that’s the case, please load it in.

Otherwise, you can load it in with the following commands, which will produce two objects, the dataset prad and the TCGA annotations annot. You can find this file here.

pradde <- read.table('RES_DESeq2_PRAD.txt', header=T, sep='\t', quote='"')

Be sure to familiarize yourself with the characteristics of the data.

Exercise 1: Gene Ontology and Gene Set Enrichment Analysis

As you learned in class today, you can apply GO analysis to short lists of genes to determine if they share a specific function or context. We’ll be using the topGO package throughout these exercises, so please install it before continuing.

topGO is fairly complex, due to the ability to run many different analyses - we’ll be running two different analyses, GO Enrichment and a GSEA-type analysis (Kolmogorov-Smirnov).

1.a. Building a topGO object

At the end of this exercise, we’ll have an object of type topGOdata, which we can subject to a variety of enrichment tests. We’ll begin by creating each of the data structures required for the object definition. Here’s the template of a new topGOdata object.

topGOdata <- new( # New creates an object of a defined class
    'topGOdata', # The class of object to create
    ontology = 'BP', # We can choose from Molecular Function (MF), 
    # Biological Process (BP), 
    # and Cellular Component (CC)
    allGenes = geneList, # The p-values of all genes in your analysis
    geneSel = function(), # A function for defining the subset you'd like to analyze
    annot = annotFUN(), # A function defining your annotation
    # and some other arguments
)

First, we’ll build the allGenes input. Create a named vector, with the p-values for differential expression as the data, and the gene names as names. It should look like this:

##       TSPAN6         TNMD         DPM1        SCYL3     C1orf112 
## 9.533633e-01 2.546213e-05 2.868157e-03 2.519265e-02 7.661085e-01 
##          FGR 
## 3.308286e-02

NB: Be sure to remove NA values!

Next, let’s build the geneSel function. We’ll name it selFun. We’d like it to return a boolean vector where the top 250 most differentially expressed genes are marked as TRUE and all others are FALSE. For this function do not reference the names, only the value! Define it and check that your output looks like this:

# Get the first 5 genes from selFun
head(selFun(geneList))

##   TSPAN6     TNMD     DPM1    SCYL3 C1orf112      FGR 
##    FALSE    FALSE    FALSE    FALSE    FALSE    FALSE

Now that we’ve gotten this far, we need to put together our object! We can fill in the call to new() with our newly developed inputs. We’re also going to add some extra arguments that help define our object. These arguments help topGO understand what our annotations for each score are (they’re Gene Symbols in this case). See ?annFUN.org for more information.

GOdata <- new('topGOdata',
               ontology='BP', # We'll use Biological Process for this example
               allGenes=geneList,
               geneSel = selFun,
               annot=annFUN.org,
               mapping = 'org.Hs.eg.db', # The annotation package for the human genome
               ID = 'symbol' # We're using gene symbols
)

Be sure to take a good look at the output of the object creation call, and get a feel for what it provides!

1.b. Running enrichment tests using topGO

Now that we have our topGOdata object, we can very straightforwardly run analyses. The specific analyses you run are up to you, but please try at least two. As an example, the classical Fisher’s Exact Test and KS Test we learned about in class are run below. Information on additional types of analysis are available in the package’s vignette.

Let’s take a look at the outputs:

resultFisher

## 
## Description:  
## Ontology: BP 
## 'classic' algorithm with the 'fisher' test
## 14718 GO terms scored: 62 terms with p < 0.01
## Annotation data:
##     Annotated genes: 15234 
##     Significant genes: 183 
##     Min. no. of genes annotated to a GO: 1 
##     Nontrivial nodes: 3519

resultKS

## 
## Description:  
## Ontology: BP 
## 'classic' algorithm with the 'ks' test
## 14718 GO terms scored: 500 terms with p < 0.01
## Annotation data:
##     Annotated genes: 15234 
##     Significant genes: 183 
##     Min. no. of genes annotated to a GO: 1 
##     Nontrivial nodes: 14718

There’s a lot of information here, including the number of GO terms that were scored and the number that were significant! Luckily, there’s a quick way to compare the top predictions between different methods in topGO:

# Make the table
resultTable <- GenTable(GOdata, 
         classicFisher = resultFisher, 
         classicKS = resultKS, 
         topNodes = 50, 
         ranksOf = 'classicFisher'
)

# Look at the first few lines
head(resultTable)

##        GO.ID                                        Term Annotated
## 1 GO:0042127            regulation of cell proliferation      1459
## 2 GO:0070252             actin-mediated cell contraction        92
## 3 GO:0001765                      membrane raft assembly        10
## 4 GO:0030048               actin filament-based movement       111
## 5 GO:0019065 receptor-mediated endocytosis of virus b...         3
## 6 GO:0033484                    nitric oxide homeostasis         3
##   Significant Expected Rank in classicFisher classicFisher classicKS
## 1          36    17.53                     1       2.2e-05   0.00325
## 2           7     1.11                     2       0.00012   3.8e-07
## 3           3     0.12                     3       0.00019   0.17693
## 4           7     1.33                     4       0.00038   1.5e-05
## 5           2     0.04                     5       0.00043   0.07405
## 6           2     0.04                     6       0.00043   0.04322

Exercise 2. Adding your results to your shiny application

Now that we’ve run through the GO Analysis techniques, we’re going to add on to the shiny applications from Day 9. I’ve produced an updated version of the shiny app with one potential direction you can go for visualization. I’ve used the navlistPanel layout feature to separate our Day 10 from our Day 9 work. I’ve also included a tabular version of the results from today.

At this point, you should be familiar enough with how Shiny works to give this a try yourself. Since this is the last day, feel free to explore what Shiny has to offer and add as many new features as you’d like. The code for my version will be available at 8pm as usual.

Updated: