# Exercise: Day 7 - Gene Correlation Networks

## Data Setup and Installation

### Install the WGCNA package

We will be using tools from the R package WGCNA, which was developed by Steve Horvath and Peter Langfelder at UCLA. Use the following code to install this package and its dependencies:

```
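# biocLite() has been retired; Bioconductor packages are now installed via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") # run if necessary
BiocManager::install(c("AnnotationDbi", "impute", "GO.db", "preprocessCore")) # run if necessary
install.packages("WGCNA")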
```

### Get the data

For this exercise, we will construct gene correlation networks using the prostate cancer expression dataset (The Molecular Taxonomy of Primary Prostate Cancer. Cell 163:1011 (2015). <doi:10.1016/j.cell.2015.10.025>).

The dataset contains read counts for 86 matched tumor-normal samples and 55635 genes. You should have generated a normalized version of this dataset on day 5. If not, download the file “PRAD_norm_counts.txt” from the dropbox.

### Load the data into R

```
setwd("path_to_files") # replace with the local path to your files
prad_data <- read.table("PRAD_norm_counts.txt", sep="\t", header=TRUE)
```

## Finding Significant Correlations

First, let’s use the expression data to discover significant correlations between genes. For simplicity, we’ll reduce the dataset to the first 24 rows:

```
reducedExpr <- t(prad_data[1:24,]) # samples as rows, genes as columns
```

**A.** Calculate all pairwise correlations for this set of genes. Which gene pair has the highest absolute correlation?
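As a starting point for part A, here is a minimal sketch of one way to find the strongest pair in a correlation matrix. It runs on simulated data, not the PRAD counts: `simExpr` and its planted `gene1`/`gene2` correlation are made up for illustration, and you would substitute `reducedExpr` in your own solution.

```r
# Simulated stand-in for reducedExpr: 86 samples (rows) x 24 genes (columns).
set.seed(1)
simExpr <- matrix(rnorm(86 * 24), nrow = 86,
                  dimnames = list(NULL, paste0("gene", 1:24)))
simExpr[, "gene2"] <- simExpr[, "gene1"] + rnorm(86, sd = 0.3)  # planted correlation

corMat <- cor(simExpr)  # 24 x 24 Pearson correlation matrix

# Keep each pair once and mask the diagonal (self-correlations are always 1).
absCor <- abs(corMat)
absCor[lower.tri(absCor, diag = TRUE)] <- NA

# Row and column index of the largest remaining absolute correlation.
top <- which(absCor == max(absCor, na.rm = TRUE), arr.ind = TRUE)
topPair <- c(rownames(corMat)[top[1, "row"]], colnames(corMat)[top[1, "col"]])
print(topPair)
print(corMat[top[1, "row"], top[1, "col"]])
```

Masking the lower triangle avoids reporting the same pair twice and keeps the trivial diagonal out of the ranking.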

**B.** Use the Fisher *z*-transform to evaluate the statistical significance of each correlation. How many correlations are significant after correcting for multiple testing (using a Bonferroni correction)?
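For part B, recall that the Fisher transform z = atanh(r) is approximately normal with standard error 1/sqrt(n - 3) when the true correlation is zero. The sketch below applies that test to simulated data (the matrix `simExpr` is an assumption for illustration, not the PRAD dataset) and uses the two-sided normal p-value, which is one common way to set up the test.

```r
# Simulated stand-in: 86 samples x 24 genes, with one planted correlated pair.
set.seed(1)
n <- 86
simExpr <- matrix(rnorm(n * 24), nrow = n)
simExpr[, 2] <- simExpr[, 1] + rnorm(n, sd = 0.3)

corMat <- cor(simExpr)
r <- corMat[upper.tri(corMat)]   # each of the 24 * 23 / 2 = 276 pairs once

z <- atanh(r)                    # Fisher z-transform
zStat <- z * sqrt(n - 3)         # approximately N(0, 1) under rho = 0
pRaw <- 2 * pnorm(-abs(zStat))   # two-sided p-values

# Bonferroni: multiply each p-value by the number of tests (capped at 1).
pBonf <- p.adjust(pRaw, method = "bonferroni")
nSig <- sum(pBonf < 0.05)
print(nSig)
```

The number of tests is the number of unique gene pairs, not the number of cells in the matrix, so the diagonal and the duplicated lower triangle are excluded before correcting.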

**C.** Visualize the correlation matrix with a heatmap. How do different clustering approaches affect the ordering of rows and columns?

**D.** Take the top 10 correlated pairs of genes, and re-evaluate the correlation by bootstrapping. Calculate the 95% confidence intervals for each correlation.
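For part D, a percentile bootstrap is one simple option: resample the samples with replacement, recompute the correlation each time, and take the 2.5% and 97.5% quantiles. The sketch below does this for a single simulated gene pair; `x`, `y`, and `nBoot = 2000` are illustrative choices, not values prescribed by the exercise.

```r
# Simulated expression values for one gene pair across 86 samples.
set.seed(1)
n <- 86
x <- rnorm(n)
y <- x + rnorm(n, sd = 0.5)

nBoot <- 2000
bootCor <- replicate(nBoot, {
  idx <- sample(n, replace = TRUE)  # resample samples, keeping x and y paired
  cor(x[idx], y[idx])
})

# Percentile 95% confidence interval for the correlation.
ci <- quantile(bootCor, probs = c(0.025, 0.975))
print(ci)
```

The same index vector must be used for both genes in each replicate; resampling them independently would destroy the pairing and push the correlation toward zero.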

## Visualizing Correlation Networks

Next, we’ll use the R package igraph to build and display a correlation network.

```
# install.packages("igraph") # if necessary
library(igraph)
```

**A.** Create and plot a weighted igraph graph from the previously calculated pairwise correlation matrix. How many vertices are there? How many edges?
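One possible way to start part A is to treat the correlation matrix as a weighted adjacency matrix. The sketch below assumes igraph is installed and uses a simulated correlation matrix (`simExpr` is made up for illustration). The circular layout is chosen only because it does not depend on edge weights, some of which are negative here.

```r
library(igraph)

# Simulated stand-in for the 24-gene correlation matrix (not the PRAD data).
set.seed(1)
simExpr <- matrix(rnorm(86 * 24), nrow = 86,
                  dimnames = list(NULL, paste0("gene", 1:24)))
corMat <- cor(simExpr)

# weighted = TRUE stores each correlation as the edge attribute "weight";
# diag = TRUE keeps the self-correlations as loop edges.
g <- graph_from_adjacency_matrix(corMat, mode = "undirected",
                                 weighted = TRUE, diag = TRUE)

nVertices <- vcount(g)
nEdges <- ecount(g)
nLoops <- sum(which_loop(g))
print(c(vertices = nVertices, edges = nEdges, loops = nLoops))

# Edge width scaled by absolute correlation; layout ignores the weights.
plot(g, layout = layout_in_circle(g), edge.width = 3 * abs(E(g)$weight))
```

Because every entry of a correlation matrix is usually non-zero, the resulting graph is nearly complete, which is why the later parts ask you to simplify and threshold it.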

**B.** Try modifying the *layout* argument when plotting the igraph. How does this change the relationship between vertices and edges?

**C.** How many loop edges exist in this network? Simplify the graph by removing them, and plot the results.

**D.** Color edges red or green, corresponding to negative and positive correlations, respectively. (Check the docs if you need help.)

**E.** If you remove edges with weights less than 0.5, how many edges remain? What about if you remove edges whose weight is less than the smallest significant correlation (in absolute value)? Plot your results.

**F.** Examine the node degrees of the resulting network. Which vertex has the highest degree? How many different groups are formed by the vertices in this network, and how many vertices are included in the largest group?

## Building Gene Correlation Networks With WGCNA

Finally, we’ll use WGCNA to build a gene correlation network on the reduced expression dataset.

**A.** First, pick a soft thresholding power β for the network. Plot the mean connectivity and scale-free topology fit index as a function of β. Try to find the lowest power at which the scale-free topology fit curve flattens out.
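A rough sketch of part A, assuming WGCNA is installed and using its `pickSoftThreshold()` function on simulated data. The matrix `simExpr`, its shared latent factor, and the power range 1 to 10 are illustrative assumptions; with the real data you would pass `reducedExpr` and may need a wider range of powers.

```r
library(WGCNA)

# Simulated stand-in: 86 samples x 24 genes; the first 12 share a latent factor.
set.seed(1)
latent <- rnorm(86)
simExpr <- matrix(rnorm(86 * 24), nrow = 86,
                  dimnames = list(NULL, paste0("gene", 1:24)))
simExpr[, 1:12] <- simExpr[, 1:12] + latent

powers <- 1:10
sft <- pickSoftThreshold(simExpr, powerVector = powers, verbose = 0)
fit <- sft$fitIndices  # columns: power, R^2, slope, truncated R^2, mean k, ...

par(mfrow = c(1, 2))
# Signed scale-free fit index: R^2 counts as a good fit only if the slope is negative.
plot(fit[, 1], -sign(fit[, 3]) * fit[, 2], type = "b",
     xlab = "Soft threshold (power)", ylab = "Scale-free topology fit, signed R^2")
# Mean connectivity falls as the power increases.
plot(fit[, 1], fit[, 5], type = "b",
     xlab = "Soft threshold (power)", ylab = "Mean connectivity")
```

With only 24 genes the fit curve can be noisy, so treat the point where it flattens as a judgment call, not a sharp cutoff.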

**B.** Construct the gene network, using the power determined in step A and a minimum module size of 3. How many modules are identified by the algorithm?

**C.** Plot the network dendrogram with the module assignments and gene labels.