Exercise: Day 1 - Introduction to R

Get the data

Download the following data from https://tcga-data.nci.nih.gov/docs/publications/gbm_exp/.

unifiedScaledFiltered.txt - Filtered unified gene expression estimate for 202 samples and 1740 genes

Load the data into R

Do this by either first downloading the file and providing the local path:

#replace location with local path_to_file
my_data <- read.table("../unifiedScaledFiltered.txt", sep="\t", header=TRUE)

...or by reading the file directly from the remote location:

#install.packages("curl") #run if necessary (curl is a CRAN package)
library(curl)
my_data <- read.table(curl("https://tcga-data.nci.nih.gov/docs/publications/gbm_exp/unifiedScaledFiltered.txt"), sep="\t", header=TRUE)
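
Hint: it can help to check what read.table did with the gene names before going further. The sketch below uses a made-up three-line file (the names tmp, toy, geneA, s1 and so on are hypothetical, and the real file's exact layout is an assumption). When the header line has one fewer field than the data lines, read.table uses the first column as row names.

```r
# Hypothetical miniature file standing in for the real download.
tmp <- tempfile(fileext = ".txt")
writeLines(c("s1\ts2\ts3",
             "geneA\t1\t2\t3",
             "geneB\t4\t5\t6"), tmp)

# Same call shape as above: tab-separated, first line is the header.
toy <- read.table(tmp, sep = "\t", header = TRUE)

dim(toy)        # number of rows, then number of columns
rownames(toy)   # the first column was picked up as row names
head(toy)       # first few rows
```

If the real file instead labels its first column in the header, passing row.names = 1 to read.table is one way to get the same result.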

Data handling

A. Verify the number of columns and the number of rows. Are the samples the rows or the columns?

## [1] 1740  202

## [1] 202

## [1] 1740
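
Hint: a minimal sketch on a toy matrix (toy and its row and column names are invented for illustration). The functions dim, nrow and ncol work the same way on the full data.

```r
# Toy matrix: 2 "genes" (rows) by 3 "samples" (columns).
toy <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

dim(toy)    # rows first, then columns
ncol(toy)   # number of columns
nrow(toy)   # number of rows
```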

B. What is the meaning of each number in the matrix? Verify your answer by computing and evaluating the mean for each column and row. Show a small subset of the results.

##         FSTL1          MMP2         BBOX1          GCSH          EDN1 
##  2.970297e-07 -9.900990e-08  3.465347e-07 -2.475248e-07  2.475248e-07 
##         CXCR4 
## -9.900990e-08

## TCGA.02.0001.01C.01 TCGA.02.0002.01A.01 TCGA.02.0003.01A.01 
##         0.177161391         0.312827282         0.286643540 
## TCGA.02.0004.01A.01 TCGA.02.0006.01B.01 TCGA.02.0007.01A.01 
##         0.142784023         0.394513443         0.002934747
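
Hint: per-gene means that are essentially zero, as in the first output above, would suggest that each gene was centered across samples. The sketch below shows that effect on invented numbers (raw and centered are hypothetical names, not part of the data set).

```r
# Toy matrix: 2 genes (rows) by 3 samples (columns), made-up values.
raw <- matrix(c(5, 7, 9,
                2, 2, 8),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Subtract each gene's mean from its own row.
centered <- raw - rowMeans(raw)

rowMeans(centered)   # mean per gene: zero after centering
colMeans(centered)   # mean per sample: generally not zero
```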

C. What are the minimum and maximum values for each gene? For each sample? Show a small subset of the results for each calculation.

##   FSTL1    MMP2   BBOX1    GCSH    EDN1   CXCR4 
## 2.57283 3.05396 1.84297 1.62990 2.09741 2.15870

##    FSTL1     MMP2    BBOX1     GCSH     EDN1    CXCR4 
## -2.92885 -2.80437 -4.53120 -1.89212 -1.28751 -3.04210

## TCGA.02.0001.01C.01 TCGA.02.0002.01A.01 TCGA.02.0003.01A.01 
##             3.60942             3.74107             3.99387 
## TCGA.02.0004.01A.01 TCGA.02.0006.01B.01 TCGA.02.0007.01A.01 
##             6.92775             3.94497             5.00777

## TCGA.02.0001.01C.01 TCGA.02.0002.01A.01 TCGA.02.0003.01A.01 
##            -4.41469            -4.67118            -2.75794 
## TCGA.02.0004.01A.01 TCGA.02.0006.01B.01 TCGA.02.0007.01A.01 
##            -3.49348            -4.75532            -4.34894
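
Hint: apply with margin 1 works row by row and with margin 2 column by column. A minimal sketch on a toy matrix (all names and values are invented):

```r
# Toy matrix: 2 genes (rows) by 3 samples (columns).
toy <- matrix(c(-1.5,  0.2, 2.0,
                 0.7, -3.1, 1.1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

gene_max   <- apply(toy, 1, max)   # one value per row (gene)
gene_min   <- apply(toy, 1, min)
sample_max <- apply(toy, 2, max)   # one value per column (sample)
sample_min <- apply(toy, 2, min)

head(gene_max)     # head() shows only the first few entries
head(sample_min)
```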

D. Which gene has the largest standard deviation?

## RPS4Y1 
##    567

## [1] 3.3467
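
Hint: which.max returns the position (and name, if there is one) of the largest element. A minimal sketch on invented values:

```r
# Toy matrix: 3 genes (rows) by 4 samples (columns).
toy <- matrix(c(1,  1, 1,  1,
                0,  2, 0,  2,
                0, 10, 0, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              paste0("s", 1:4)))

gene_sd <- apply(toy, 1, sd)   # standard deviation per gene

which.max(gene_sd)   # name and row number of the most variable gene
max(gene_sd)         # its standard deviation
```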

E. What is the overall highest expression value? Which gene and for which sample?

## [1] 6.92775

## TCGA.02.0004.01A.01 
##                   4

## ASPN 
##  989

## [1] 6.92775

## [1] 6.92775
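
Hint: one possible approach, sketched on a toy matrix with invented values, is to ask which for array indices of the maximum. If the data are held in a data frame, converting with as.matrix first is assumed here.

```r
# Toy matrix: 2 genes (rows) by 3 samples (columns).
toy <- matrix(c(1, 9, 3,
                4, 5, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

top <- max(toy)                             # overall highest value
pos <- which(toy == top, arr.ind = TRUE)    # its row and column index

top_gene   <- rownames(toy)[pos[1, "row"]]
top_sample <- colnames(toy)[pos[1, "col"]]

top
top_gene
top_sample
toy[top_gene, top_sample]   # cross-check: same value as top
```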

Plotting

Use the original data matrix of 202 samples and 1740 genes.

A. Plot a histogram of the gene expression values across samples for the EGFR gene.

B. Plot an x-y scatterplot of EGFR versus IDH1 expression levels.
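
Hint for A and B: hist and plot both accept plain numeric vectors. The sketch below uses simulated numbers, not real expression values; toy is a hypothetical stand-in whose rows merely borrow the gene names.

```r
set.seed(1)
# Simulated stand-in: 2 rows named like the genes, 50 made-up samples.
toy <- rbind(EGFR = rnorm(50), IDH1 = rnorm(50))

# A. Histogram of one gene across samples.
h <- hist(toy["EGFR", ],
          main = "EGFR (simulated values)",
          xlab = "Scaled expression")

# B. Scatterplot of one gene against another.
plot(toy["EGFR", ], toy["IDH1", ],
     xlab = "EGFR", ylab = "IDH1",
     main = "EGFR versus IDH1 (simulated values)")
```

If my_data is a data frame rather than a matrix, a single row is itself a data frame, so wrapping it in as.numeric() before plotting is likely needed.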

C. Gather a subset of genes that encode zinc-finger proteins, whose gene names contain the letters “ZNF”. Hint: the function “grep” can be used to search for string matches. Also, you can use the function “rownames” to return the row names.

How many ZNF genes are there?

## [1] 17

What are their row numbers?

##  [1]   24   56  238  322  626  629  634  756  758  968  993 1085 1342 1361
## [15] 1492 1615 1641

What are their names?

##  [1] "ZNF83"  "ZNF228" "ZNF43"  "ZNF184" "ZNF423" "ZNF573" "ZNF415"
##  [8] "ZNF124" "ZNF217" "ZNF536" "ZNF84"  "ZNF22"  "ZNF711" "ZNF177"
## [15] "ZNF91"  "ZNF292" "ZNF659"
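
Hint: the names listed above all contain "ZNF", so that is the pattern used in this minimal sketch. The vector gene_names is a hypothetical stand-in for rownames(my_data).

```r
# Hypothetical stand-in for the row names of the data matrix.
gene_names <- c("EGFR", "ZNF83", "IDH1", "ZNF228", "TP53")

hits <- grep("ZNF", gene_names)   # positions of the matching names

length(hits)       # how many matches
hits               # their positions
gene_names[hits]   # their names
```

Calling grep with value = TRUE returns the matching names directly instead of their positions.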

Create a boxplot showing their expression levels versus all of the genes. Use average values for each sample.
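
Hint: one reading of this task is to average over genes within each sample, once for the ZNF subset and once for all genes, and then draw the two sets of per-sample averages side by side. The sketch below does that on simulated values; the gene names are invented.

```r
set.seed(2)
# Simulated stand-in: 6 genes (rows) by 10 samples (columns).
toy <- matrix(rnorm(60), nrow = 6,
              dimnames = list(c("ZNF_a", "ZNF_b", "G1", "G2", "G3", "G4"),
                              paste0("s", 1:10)))

znf_rows <- grep("ZNF", rownames(toy))

# One average per sample for each group of genes.
per_sample <- list(
  ZNF = colMeans(toy[znf_rows, , drop = FALSE]),
  All = colMeans(toy)
)

b <- boxplot(per_sample, ylab = "Mean expression per sample")
```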

Functions

Use the original data matrix of 202 samples and 1740 genes, or first test on a smaller vector:

my_vector <- c(10,5,2,6,8,4,1,9,3,7)

A. Create a function that performs a selection sort on a vector and returns the vector in ascending order.

##  [1]  1  2  3  4  5  6  7  8  9 10
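
Hint: one possible shape for such a function is sketched below. The name selection_sort is hypothetical. At each position i, it finds the smallest value in the unsorted tail and swaps it into place.

```r
selection_sort <- function(x) {
  n <- length(x)
  if (n < 2) return(x)            # nothing to sort
  for (i in seq_len(n - 1)) {
    min_pos <- i                  # position of the smallest value seen so far
    for (j in (i + 1):n) {
      if (x[j] < x[min_pos]) min_pos <- j
    }
    # Swap the smallest remaining value into position i.
    tmp <- x[i]
    x[i] <- x[min_pos]
    x[min_pos] <- tmp
  }
  x
}

my_vector <- c(10,5,2,6,8,4,1,9,3,7)
selection_sort(my_vector)
```

Comparing the result with the built-in sort(my_vector) is a quick way to check the function.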

B. Modify the function to return the indexes of the original vector in the order that sorts its values ascending.

##  [1]  7  3  9  6  2  4 10  5  8  1
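
Hint: a minimal sketch of one way to do this is to leave the values untouched and run the same selection sort on a vector of positions instead. The name selection_order is hypothetical.

```r
selection_order <- function(x) {
  n <- length(x)
  idx <- seq_len(n)               # positions 1..n, rearranged in place of x
  if (n < 2) return(idx)
  for (i in seq_len(n - 1)) {
    min_pos <- i
    for (j in (i + 1):n) {
      # Compare the values that the positions point at.
      if (x[idx[j]] < x[idx[min_pos]]) min_pos <- j
    }
    # Swap positions, not values.
    tmp <- idx[i]
    idx[i] <- idx[min_pos]
    idx[min_pos] <- tmp
  }
  idx
}

my_vector <- c(10,5,2,6,8,4,1,9,3,7)
selection_order(my_vector)
my_vector[selection_order(my_vector)]   # the sorted values
```

Selection sort is not stable, so with tied values the result may differ from the built-in order().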