Single Cell Workshop


Setting up Orchestra

To get on Orchestra you want to connect via ssh with X11 forwarding turned on. X11 forwarding lets the Orchestra machines open windows on your local machine, which is useful for looking at the data.

Note for Mac users: you can use the Terminal application/utility to connect to Orchestra as described below. In addition, you will also need to download and install XQuartz for X11 forwarding.

Note for Windows users: you will need to download and install Putty to connect to Orchestra, and XMing for X11 forwarding. [Follow these instructions.]

Connect to Orchestra using X11 forwarding:

$ ssh -X your_user_name@orchestra.med.harvard.edu

Orchestra is set up with separate login nodes and compute nodes. You don't want to do any work on the login node: it is set aside for non-computationally-intensive tasks, and running code there will make Orchestra slow for everybody. Here is how to connect to a compute node in interactive mode:

$ bsub -n 2 -Is -q interactive bash

Do that and you're ready to roll. You should see that you are now connected to a compute node named after an instrument, like clarinet or bassoon.
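For example, your shell prompt should change to something like the following (the exact node name will vary; clarinet002 here is made up):

your_user_name@clarinet002:~$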

Notice that we used the -n 2 option to allow two cores to be used for the analysis (in general you can set this to larger numbers if required, but we’ll leave it at 2 for today so as to avoid overloading the system).

Setup

The first thing we will do is check out a copy of the workshop material from GitHub into your home directory.

$ git clone https://github.com/hms-dbmi/scw.git
$ cd scw/scw2015

This repository will remain accessible after the workshop, so you can download the code onto your own machines later.

Next we will run a small setup script to set the environment variables needed to run the programs we will use on Orchestra:

$ source setup.sh

If you now list the directory contents with the ls command, you will see a directory for each of the four tutorials we will be covering this afternoon, containing R markdown code that you can use to re-run the analyses later. Today we will interactively go through all the steps outlined in these files.

Since the data files needed for the analyses are fairly large, these are not stored in the repository, and we must copy them over from another directory on Orchestra.

Introduction to the data

The raw data we will be using for this part of the workshop lives in /groups/pklab/scw/scw2015/ES.MEF.data/subset:

$ ls /groups/pklab/scw/scw2015/ES.MEF.data/subset

FASTQ files are the standard format for sequenced reads, and that is the format you will receive from the sequencing center after they sequence your cDNA libraries.
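Each read in a FASTQ file takes up four lines: a header starting with @, the base calls, a separator line starting with +, and a quality string with one character per base. For example (this record is made up for illustration):

@read_1 instrument_and_flowcell_info
GATTTGGGGTTCAAAGCAGTA
+
IIIIHHHGGGFFFEEEDDDCC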


The 4 FASTQ files that we will be working with for this module contain 1000 reads each from ~100 samples. The samples are from a single-cell RNA-seq experiment where researchers were looking at differences between expression in mouse embryonic fibroblasts (MEF) and embryonic stem (ES) cells from mice.



Note on multiplexing and demultiplexing

Libraries prepared from hundreds of single cells can be sequenced in the same lane (multiplexed). During the preparation of these libraries each cell is given a distinct "barcode", a short nucleotide sequence that will enable us to separate the cells out after sequencing. Illumina provides adaptors with barcodes, and software that can demultiplex the data for you. So, if you use these Illumina barcodes, the sequencing center will return demultiplexed FASTQ files to you.

However, Illumina offers only ~96 distinct barcode combinations (as of March 2015). For single cell work where we are interested in simultaneously sequencing more than 96 cells, such as with many of the more recent droplet-based microfluidics approaches, we need additional barcodes. To this end, many groups design their own sets of barcodes; since Illumina's software is unable to use these to separate the samples, you will have to perform the demultiplexing yourself after receiving the data from the sequencing center.

This is outside the scope of this workshop, but it is important to note that this will add an additional step prior to the three steps listed below.
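For reference, custom demultiplexing can be as simple as splitting the reads by barcode. Below is a minimal sketch (don't type this in), assuming the barcode is the first 6 bases of each read and that barcodes.txt and multiplexed.fq, both hypothetical files, hold one barcode per line and the pooled reads respectively:

# don't type this in; a minimal demultiplexing sketch
while read barcode; do
    awk -v bc="$barcode" '
        NR % 4 == 1 { h = $0 }                    # header line
        NR % 4 == 2 { s = $0 }                    # sequence line
        NR % 4 == 3 { p = $0 }                    # "+" separator line
        NR % 4 == 0 && substr(s, 1, 6) == bc {    # quality line; barcode matches
            print h; print s; print p; print $0
        }' multiplexed.fq > "$barcode".fq
done < barcodes.txt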



We’ll be taking this small subset of reads and performing the following steps:

  1. looking at them to make sure they are of good quality
  2. aligning them to the mouse genome
  3. producing a table of the number of reads aligning to each gene for each sample

The counts table generated from step 3 will be the starting point for the more interesting downstream functional analyses. We’ll use these subsetted ES and MEF files to demonstrate the workflow; then, we’ll look at pre-computed counts results on a full set of samples for the functional analyses.

Copy across data into your own directories

We will now copy the test data over into your alignment directory.

$ cd alignment
$ cp -r /groups/pklab/scw/scw2015/ES.MEF.data/subset .

These commands mean: change into the alignment directory (cd), then recursively copy (cp -r) the subset data directory into the current directory (the trailing .).

Quality control

At the start of any data analysis it is important to poke around in your data to see where the warts are. We expect single-cell datasets to be extra messy: failures in preparing the libraries, failures in adequately capturing the transcriptome of the cell, libraries that sequence the same reads repeatedly, and other issues. In particular we expect the libraries not to be very complex; the reads are just tags of the end of a transcript, so for each transcript there are a limited number of positions where a read can be placed. We'll see how this feature skews some of the more commonly used quality metrics.

For RNA-Seq data many common issues can be detected right off the bat just by looking at some features of the raw reads. The most commonly used program to look at the raw reads is FastQC. To run FastQC on the cluster we have to load the necessary module:

$ module load seq/fastqc/0.11.3

FastQC is pretty fast, especially on small files, so we can run it on one of the full files (instead of just on the subset). First let's copy one of those files over:

$ mkdir ~/scw/scw2015/alignment/fastq
$ cp /groups/pklab/scw/scw2015/ES.MEF.data/fastq/L139_ESC_1.fq ~/scw/scw2015/alignment/fastq/
$ cd ~/scw/scw2015/alignment/fastq

Now we can run FastQC on the file by typing:

$ fastqc L139_ESC_1.fq

And look at the nice HTML report it generates with Firefox:

$ firefox L139_ESC_1_fastqc.html

Let's analyse some of the plots. One thing that stands out in the report is a set of overrepresented sequences. What are those sequences? We can search for the reads that contain one of those enriched sequences with grep (globally search a regular expression and print), which prints out every line in a file that matches a search string. grep the L139_ESC_1.fq file like this:

$ grep ACTTGAA L139_ESC_1.fq

You should see a lot of sequences that look like this: TTGCTAGATATCAGTTGCCTTCTTTTGTTCAGCTAAGGAACTTGAA

If we BLAST this sequence against the mouse genome, we come up empty, so it is some kind of contaminant sequence; it isn't clear where it comes from. The protocol listed here doesn't have many clues either. If we could figure out what these sequences are, it would help troubleshoot the preparation protocol, and we might be able to align more reads. As it is, these sequences are unlikely to align to the mouse genome, so they mostly represent wasted sequencing.



FastQC reports and the implications for data quality (more information)



Alignment

For aligning RNA-seq reads it is necessary to use an aligner that is splice-aware: reads crossing splice junctions have gaps when aligned to the genome, and the aligner has to be able to handle that possibility. There are a wide variety of aligners that handle spliced reads, but the two most commonly used are Tophat and STAR. They have similar accuracy, but STAR is much, much faster than Tophat at the cost of using much more RAM; to align to the human genome you need ~40 GB of RAM, so it isn't something you can run on your laptop or in other RAM-restricted computing environments.

For this exercise we will use Tophat, so let’s load the module first.

$ module load seq/tophat/2.1.0

To align reads with Tophat you need three things:

  1. The genome sequence of the organism you are working with, in FASTA format.
  2. A gene annotation for your genome in Gene Transfer Format (GTF), for example from UCSC.
  3. The FASTQ file of reads you want to align.

First you must make a Bowtie2 index of the genome sequence; this lets Tophat quickly find regions of the genome where each read might align. We have done this step already, so don't type in these commands, but if you need to do it on your own, here is how:

# don't type this in
$ bowtie2-build your_fasta_file genome_name

We will be using precomputed indices for the mm10 genome today:

/groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/
/groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf

Finally, we’ve precomputed an index of the gene sequences from the ref-transcripts.gtf file here:

 /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/tophat2_trans/

We will use this precomputed index of the gene sequences instead of the GTF file because it is much faster; you can use either and they have the same output.
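For reference, the equivalent command using the GTF directly would look something like this sketch (don't run it today; Tophat would rebuild the transcriptome index from the GTF, which is slow):

# don't type this in
$ tophat -p 2 -o L139_ESC_1-tophat --no-coverage-search -G /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome L139_ESC_1.subset.fastq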

Now we’re ready to align the reads to the mm10 genome. We will align two ESC samples and two MEF samples:

$ cd ~/scw/scw2015/alignment/subset
    
$ bsub -J L139_ESC_1 -W 00:20 -n 2 -q training "tophat -p 2 -o L139_ESC_1-tophat --no-coverage-search --transcriptome-index=/groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/tophat2_trans/genes /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome L139_ESC_1.subset.fastq; mv L139_ESC_1-tophat/accepted_hits.bam L139_ESC_1-tophat/L139_ESC_1.bam"
    
$ bsub -J L139_ESC_2 -W 00:20 -n 2 -q training "tophat -p 2 -o L139_ESC_2-tophat --no-coverage-search --transcriptome-index=/groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/tophat2_trans/genes /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome L139_ESC_2.subset.fastq; mv L139_ESC_2-tophat/accepted_hits.bam L139_ESC_2-tophat/L139_ESC_2.bam"
    
$ bsub -J L139_MEF_49 -W 00:20 -n 2 -q training "tophat -p 2 -o L139_MEF_49-tophat --no-coverage-search --transcriptome-index=/groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/tophat2_trans/genes /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome L139_MEF_49.subset.fastq; mv L139_MEF_49-tophat/accepted_hits.bam L139_MEF_49-tophat/L139_MEF_49.bam"
    
$ bsub -J L139_MEF_50 -W 00:20 -n 2 -q training "tophat -p 2 -o L139_MEF_50-tophat --no-coverage-search --transcriptome-index=/groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/tophat2_trans/genes /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome L139_MEF_50.subset.fastq; mv L139_MEF_50-tophat/accepted_hits.bam L139_MEF_50-tophat/L139_MEF_50.bam"

Each of these jobs should complete in about seven to ten minutes. Since we ran them all in parallel on the cluster, the whole set should also take about seven to ten minutes instead of 30-40. Full samples would take hours.

-J names the job so you can see what it is when you run bjobs to check the status of your jobs.

-W 00:20 tells the scheduler the job should take about 20 minutes.

-q training submits the job to the training queue. [Note: If you are using your own account, please use -q short to submit the jobs to the short queue; the training queue can only be used by people using the training accounts.]

The syntax of the tophat command is:

tophat -p <NUMBER OF THREADS> --no-coverage-search --transcriptome-index=<PATH TO TRANSCRIPTOME INDEX> <PATH TO GENOME INDEX> <PATH TO INPUT FASTQ>

At the end we tack on (after the ";") a command to rename (mv) the Tophat output file accepted_hits.bam to something more evocative.

This method of submitting one job at a time is fine for a small number of samples, but if you wanted to run a full set of hundreds of cells, doing this manually for every sample would be a waste of time and prone to errors. You can submit all of the jobs automatically by writing a loop:

# don't type this in, it is here for future reference
for file in *.fastq; do
    samplename=$(basename $file .fastq)
    bsub -W 00:20 -n 2 -q short "tophat -p 2 -o $samplename-tophat --no-coverage-search --transcriptome-index=/groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/tophat2_trans/genes /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/genome $file; mv $samplename-tophat/accepted_hits.bam $samplename-tophat/$samplename.bam"
done

This will loop over all of the files with a .fastq extension in the current directory and align them with Tophat. We’ll skip ahead now to doing some quality control of the alignments and finally counting the reads mapping to each feature.

Quality checking the alignments

There are several tools for spot-checking the alignments. It is common to run RNA-SeQC on the alignment files to generate some alignment stats and to determine the overall coverage, how many genes were detected, and so on. Another option for a suite of quality-checking tools is RSeQC. For real experiments it is worth looking at the output of these tools to see if anything seems amiss.
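For a quick first pass you can also compute simple alignment statistics with samtools flagstat (a sketch for later; the module name and version below are assumptions about what is installed on Orchestra):

# don't type this in today
$ module load seq/samtools/1.2
$ samtools flagstat L139_ESC_1-tophat/L139_ESC_1.bam   # totals of mapped, duplicated, etc. reads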

Counting reads with featureCounts

The last step is to count the number of reads mapping to the features we are interested in. Quantification can be done at multiple levels: from counting the number of reads supporting a specific splicing event, to the number of reads aligning to an isoform of a gene, to the total reads mapping to a known gene. We'll be quantifying the last of these, the total number of reads that can be uniquely assigned to a known gene; essentially we take the location of each read alignment on the genome and put it together with the location of each gene on the genome (this information is contained in the GTF/annotation file). There are several tools for this; we will use featureCounts because it is very fast and accurate.

$ module load seq/subread/1.4.6-p3			#featureCounts is part of the subread package
    
$ featureCounts --primary -a /groups/shared_databases/igenome/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf -o combined.featureCounts L139_ESC_1-tophat/L139_ESC_1.bam L139_ESC_2-tophat/L139_ESC_2.bam L139_MEF_49-tophat/L139_MEF_49.bam L139_MEF_50-tophat/L139_MEF_50.bam

--primary tells featureCounts to count only the primary alignment for reads that map multiple times to the genome. This ensures we do not double count reads that map to multiple places in the genome.

We need to massage the format of this file before we can use it. We'll keep the first column, which contains the gene IDs, and every column after the 6th, which hold the counts for each sample.
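To see why, note that featureCounts writes a comment line first (recording the program call), then a header line whose first six columns describe each gene, followed by one count column per BAM file. The first lines of combined.featureCounts look roughly like this (abbreviated):

# Program:featureCounts v1.4.6-p3; Command: ...
Geneid  Chr  Start  End  Strand  Length  L139_ESC_1-tophat/L139_ESC_1.bam  ...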

$ sed 1d combined.featureCounts | cut -f1,7- | sed s/Geneid/id/ > combined.counts

This pipeline means: stream edit (sed) the file combined.featureCounts, with

1d deleting the first line

-f1,7- (the cut step) keeping the first column and everything from the 7th column on

s/Geneid/id/ (a second sed) changing the phrase "Geneid" to "id".

This outputs a table with I rows of genes and J columns of samples; each entry in the table is the number of reads that can be uniquely assigned to gene i in sample j. This file is now ready and in the correct format for loading into R.
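As a quick sanity check (optional), you can confirm that every row of combined.counts has five fields, the gene id plus one count column for each of our four samples:

$ awk '{print NF}' combined.counts | sort -u    # should print a single value: 5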