1 Introduction

1.1 Example datasets and single cell RNA-seq techniques

This document aims to provide a workflow for processing and QC of single cell RNA-seq data. Because this is a relatively new area with many on-going development of new methods/appraoches. This document will be updated to reflect such new developmets. There are many other important topics for scRNA-seq analysis, such as imputatoin or clustring. We will provide separate workflows for those topics. One publicly available datasets will be used for this workflow.

PBMC.68k: around 68k peripheral blood mononuclear cells (PBMCs) from one donor (donor A) (Zheng et al. 2017).

Multiple techniques are available to generate scRNA-seq data. For example, Ziegenhain et al. (2017) have compared 6 popular techniques (see Figure 2 and Table S1 of their paper for more details):

  • CEL-seq2/C1
  • Drop-seq
  • MARS-seq
  • SCRB-seq
  • Smart-seq/C1
  • Smart-seq2

Among the many differences of these techniques, the most important ones are:

  • the trade-off between using unique molecular identifiers (UMIs) versus sequecing full lenght RNA, and
  • the trade-off between capturing more cells or more genes per cell.

An UMI is a randomly generated barcode (4-10bp) to label each transcript molecule before amplification. Therefore, by counting UMIs instead of actual reads, one can remove most noise and bias due to amplification. However, UMI is only available for techniques that sequence 5’ or 3’ ends of the transcript molecule, and thus cannot sequence full length RNA. Among the aformentioned techniques, Smart-seq/C1 and Smart-seq2 sequence full length RNA, hence cannot use UMI.

Drop-seq (e.g., prducts from 10x Genomics) can capture much more cells than other techiniques, but has low depth within each cell. In other words, compared with otehr techniques, Drop-seq captures more cells and samller number of genes witin each cell.

1.2 An overview of workflow

The analysis of scRNA-seq data may be divided into two steps. The first step is to generate a count matrix for all the genes and all the cells of one individual (the left panel of the following flowchart). For example, a matrix where each row corresponds to a gene and each column corresponds to a cell. The first step also invovles generation of fastq files and map the reads in the fastq file and obtain a bam file, as well as QC on fastq files, bam files, and the read count data matrix.

The second step is data analysis based on this count matrix, typicially include imputation, clustering, and differential expression (the right panel of the following flowchart). These data analysis steps are very active reasearch topics and many computational methods are avaialble. In the followign figure, we list a few popular methods for each step.

A flow chart of the workflow for analyzing scRNA-seq data.

A flow chart of the workflow for analyzing scRNA-seq data.

2 Generate read count matrix

2.1 Use Cell Ranger to process scRNA data from 10x Genomics

Cell Ranger are a set of pipelines to process and analzye 10x Genomics’s scRNA-seq data. It was developed and maintained by 10x Genomics. Here we provide some example code for data analysis and more details can be found in the support webpage by 10x Genomics. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger

2.1.1 Generate fastq files

The raw outputs of a study is often saved in one folder (e.g., 180605_D00300_0562_BCCKLNANXX), and they include several files and folders like the following.

-rw-r--r--  392069  180605_D00300_0562_BCCKLNANXX.jpg
drwxr-xr-x  0   BarcodeImages/
-rw-r--r--  46  Basecalling_Netcopy_complete_Read1.txt
-rw-r--r--  46  Basecalling_Netcopy_complete_Read2.txt
-rw-r--r--  46  Basecalling_Netcopy_complete_Read3.txt
-rw-r--r--  46  Basecalling_Netcopy_complete.txt
drwxr-xr-x  181 Config/
drwxr-xr-x  85  Data/
-rw-r--r--  3458    First_Base_Report.htm
-rw-r--r--  46  ImageAnalysis_Netcopy_complete_Read1.txt
-rw-r--r--  46  ImageAnalysis_Netcopy_complete_Read2.txt
-rw-r--r--  46  ImageAnalysis_Netcopy_complete_Read3.txt
-rw-r--r--  46  ImageAnalysis_Netcopy_complete.txt
drwxr-xr-x  268 InterOp/
drwxr-xr-x  9477    Logs/
drwxr-xr-x  41  PeriodicSaveRates/
drwxr-xr-x  71  Recipe/
-rw-r--r--  46  RTAComplete.txt
-rw-r--r--  826 RunInfo.xml
-rw-r--r--  5838    runParameters.xml

The raw data are saved in foldrer Data in the format of Illumina sequencer’s base call files (BCLs). We need to use cellranger mkfastq command to generate fastq files:

#!/bin/bash

ml boost/1.55.0
ml bcl2fastq
path_input='/jane_doe/illumina/180605_D00300_0562_BCCKLNANXX'
path_output='/jane_doe/study_x/ABC'

path_to_cellranger/cellranger mkfastq \
  --run="$path_to_input" \
  --csv=path_to_csv/cellranger_sample_index.csv \
  --output-dir="$path_output" \
  --localcores=6 \
  --localmem=300

The above code was run on a linux cluster with bash shell. The command ml is a shortcut for module load. The modules of boost and Illumina’s bcl2fastq are needed to run mkfastq.

The file cellranger_sample_index.csv gives the mapping between sample name (Sample) and sample index for library construction (Index) . The column Lane specifies the lane(s) of the flowcell where one sample has used. It can be a number (e.g., 1), a range (e.g., 2-4) or ’*’ for all lanes in the flowcell. Here is an example of the file cellranger_sample_index.csv:

Lane,Sample,Index
*,ABC_baseline1,SI-GA-D2
*,ABC_baseline2,SI-GA-E2
...

localcores and localmem set the maximum number of cores and maximum memory (GB) for the pipeline.

2.1.2 Mapping RNA-seq reads and generate read counts

Next we use cell ranger count command to map the RNA-seq reads to reference genome (human hg38 in this example) and count the number of RNA-seq fragments per gene per cell.

#!/bin/bash
                   
path_to_cellranger/cellranger count \
  --id=ABC_baseline1 \
  --sample=ABC_baseline1 \
  --transcriptome=path_to_reference_genome/refdata-cellranger-GRCh38-1.2.0 \
  --fastqs=/jane_doe/study_x/ABC \
  --localcores=6 \
  --localmem=300

where

  • id is a unique ID to be used to name the output.
  • sample should match to one sample name in the file cellranger_sample_index.csv.
  • transcriptome is the location where the reference genome data was saved. This reference genome was downloaded from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest.
  • fastqs specifies the location where fastq files were saved, here it matches with the output-dir of mkfastq command.

2.2 Alternative pipeline

Without using Cell Ranger, one can obtain the read counts in three steps.

  1. Use bcl2fastq to obtain the fastq files. Then map RNA-seq reads using

  2. Map RNA-seq reads to reference genome using alignment software such as STAR or tophat, or pseudo-alignment methods (e.g. Kallisto, Salmon) for well annotated transcriptomes, such as human or mouse.

  3. Count the number of reads per gene per sample, for example, by softare HT-seq or R function summarizeOverlaps, as shown in the following codes.

library("GenomicFeatures")
library("GenomicAlignments")

path = "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/"
file = "gencode.v27.annotation.gtf.gz"
gtfFile = paste(path, file, sep="")

txdb = makeTxDbFromGFF(file=gtfFile, format="gtf",  organism="Homo sapiens")
saveDb(txdb, file="Homo_sapiens_gencode_v27.sqlite")

txdb  = loadDb("Homo_sapiens_gencode_v27.sqlite")
genes = exonsBy(txdb, by="gene")

filenames = list.files(pattern=".bam$")
bamfiles  = BamFileList(filenames, yieldSize=1000000)

se = summarizeOverlaps(features=genes, reads=bamfiles, mode="Union",
             singleEnd=FALSE, ignore.strand=FALSE, fragments=TRUE )

as1 = assay(se)

write.table(as1, file = "gene_level_counts.txt", append = FALSE,
  quote = FALSE, sep = "\t", eol = "\n", row.names = TRUE,
  col.names = TRUE)

The quality control for fastq or bam files can be performed using pipelines developed for bulk RNAseq data, such as FastQC for fastq files and RSeQC for bam files.

3 Quality control and normalization for count matrix

A few libraries are needed in the followign code. The main QC anlaysis is carried out by scater.

library(DropletUtils)
library(biomaRt)
library(dplyr)
library(scater)

3.1 Zheng et al. (2017) 68k PBMC data

3.1.1 Data preparation

This dataset was generated using 10x Genomics platform (Zheng et al. 2017) The read count data were downloaded from the link of “Gene / cell matrix (raw)” from https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/fresh_68k_pbmc_donor_a.

The downloaded zip file were unzipped, renamed as raw_gene_bc_matrices, and saved in folder ~/research/scRNAseq/data/Zheng_2016/donorA_hg19/outs/. In addition, we also downloaded the “Summary CSV” file, renamed it as metrics_summary.csv, and saved in the same folder. Renaming and orgnizing the folder strucrture is to recontrcut the file struture of cell ranger output.

└── outs
    ├── metrics_summary.csv
    └── raw_gene_bc_matrices
        └── hg19
            ├── barcodes.tsv
            ├── genes.tsv
            └── matrix.mtx

The count matrix was saved as three files, where barcodes.tsv saves barcode information, genes.tsv saves gene information, and matrix.mtx saves the count data in MatrixMarket format.

This dataset can be loaded into R by function load_cellranger_matrix from R package cellrangerRkit, which requires the file metrics_summary.csv. Here we load them using function read10xCounts from R package DropletUtils, and obtain gene anntation using R package biomaRt.

path1 = "~/research/scRNAseq/data/Zheng_2016/donorA_hg19/"
path2 = paste0(path1, "outs/raw_gene_bc_matrices/hg19/")
sce   = read10xCounts(path2, col.names=TRUE)
sce
## class: SingleCellExperiment 
## dim: 32738 5898240 
## metadata(0):
## assays(1): counts
## rownames(32738): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000215616 ENSG00000215611
## rowData names(2): ID Symbol
## colnames(5898240): AAACATACAAAACG-1 AAACATACAAAAGC-1 ...
##   TTTGCATGTTTGGG-8 TTTGCATGTTTGTC-8
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
anno.file = "~/research/scRNAseq/workflow/data/gene.annoation.rds"
if(file.exists(anno.file)){
  gene.annotation = readRDS(anno.file)
}else{
  ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
  
  attr.string = c('ensembl_gene_id', 'hgnc_symbol', 'chromosome_name')
  attr.string = c(attr.string, 'start_position', 'end_position', 'strand')
  attr.string = c(attr.string, 'description', 'percentage_gene_gc_content')
  attr.string = c(attr.string, 'gene_biotype')
  
  rowData(sce)[1:2,]
  gene.annotation = getBM(attributes=attr.string, 
                          filters =  'ensembl_gene_id', 
                          values = rowData(sce)$ID, 
                          mart = ensembl)
}

dim(gene.annotation)
## [1] 31188     9
gene.annotation[1:2,]
##   ensembl_gene_id hgnc_symbol chromosome_name start_position end_position
## 1 ENSG00000004487       KDM1A               1       23019443     23083689
## 2 ENSG00000007923     DNAJC11               1        6634168      6701924
##   strand
## 1      1
## 2     -1
##                                                                             description
## 1                             lysine demethylase 1A [Source:HGNC Symbol;Acc:HGNC:29079]
## 2 DnaJ heat shock protein family (Hsp40) member C11 [Source:HGNC Symbol;Acc:HGNC:25570]
##   percentage_gene_gc_content   gene_biotype
## 1                      39.17 protein_coding
## 2                      46.60 protein_coding
t1 = table(gene.annotation$ensembl_gene_id)
t2 = t1[t1 > 1]
t2 
## named integer(0)
gene.annotation[which(gene.annotation$ensembl_gene_id %in% names(t2)),]
## [1] ensembl_gene_id            hgnc_symbol               
## [3] chromosome_name            start_position            
## [5] end_position               strand                    
## [7] description                percentage_gene_gc_content
## [9] gene_biotype              
## <0 rows> (or 0-length row.names)
gene.annotation = distinct(gene.annotation, ensembl_gene_id, 
                           .keep_all = TRUE)

dim(gene.annotation)
## [1] 31188     9
gene.annotation[1:2,]
##   ensembl_gene_id hgnc_symbol chromosome_name start_position end_position
## 1 ENSG00000004487       KDM1A               1       23019443     23083689
## 2 ENSG00000007923     DNAJC11               1        6634168      6701924
##   strand
## 1      1
## 2     -1
##                                                                             description
## 1                             lysine demethylase 1A [Source:HGNC Symbol;Acc:HGNC:29079]
## 2 DnaJ heat shock protein family (Hsp40) member C11 [Source:HGNC Symbol;Acc:HGNC:25570]
##   percentage_gene_gc_content   gene_biotype
## 1                      39.17 protein_coding
## 2                      46.60 protein_coding
table(gene.annotation$chromosome_name)
## 
##                  1                 10                 11 
##               2996               1181               1848 
##                 12                 13                 14 
##               1659                590               1042 
##                 15                 16                 17 
##               1018               1422               1727 
##                 18                 19                  2 
##                627               1951               2132 
##                 20                 21                 22 
##                807                435                674 
##                  3                  4                  5 
##               1685               1331               1602 
##                  6                  7                  8 
##               1548               1384               1275 
##                  9 CHR_HSCHR22_1_CTG7 CHR_HSCHR4_6_CTG12 
##               1147                  1                  1 
##         KI270713.1                 MT                  X 
##                  1                 13                986 
##                  Y 
##                105
table(gene.annotation$gene_biotype)
## 
##           3prime_overlapping_ncRNA                          antisense 
##                                 10                               5095 
##      bidirectional_promoter_lncRNA                          IG_V_gene 
##                                 25                                  1 
##                    IG_V_pseudogene                            lincRNA 
##                                  1                               6584 
##                              miRNA                           misc_RNA 
##                                  1                                  5 
##             polymorphic_pseudogene               processed_pseudogene 
##                                  5                                 18 
##               processed_transcript                     protein_coding 
##                                119                              19120 
##                              scRNA                     sense_intronic 
##                                  1                                 17 
##                  sense_overlapping                             snoRNA 
##                                  5                                  6 
##                                TEC   transcribed_processed_pseudogene 
##                                  5                                 34 
##     transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene 
##                                 36                                 72 
##             unprocessed_pseudogene                           vaultRNA 
##                                 27                                  1
## some genes do not have annotation because their ids are retired
gene.missing = setdiff(rowData(sce)$ID, gene.annotation$ensembl_gene_id)
length(gene.missing)
## [1] 1550
gene.missing[1:6]
## [1] "ENSG00000237683" "ENSG00000235249" "ENSG00000236743" "ENSG00000231709"
## [5] "ENSG00000239664" "ENSG00000223659"
w2kp = match(gene.annotation$ensembl_gene_id, rowData(sce)$ID)
sce  = sce[w2kp,]
dim(sce)
## [1]   31188 5898240
table(gene.annotation$ensembl_gene_id == rowData(sce)$ID)
## 
##  TRUE 
## 31188
rowData(sce)  = gene.annotation
rownames(sce) = uniquifyFeatureNames(rowData(sce)$ensembl_gene_id, 
                                     rowData(sce)$hgnc_symbol)

3.1.2 Identify low quallity cells

An important QC step for scRNA-seq data analysis is to identify low quality or empty cells. For 10x Genomics data, The emptyDrops function in R package DropletUtils can be used detect empty cells, given the count matrix of all barcodes. That is why we loaded the raw data matrix intead of filtered data matrix.

bcrank = barcodeRanks(counts(sce))

# Only showing unique points for plotting speed.
uniq = !duplicated(bcrank$rank)

par(mar=c(5,4,2,1), bty="n")
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy", 
     xlab="Rank", ylab="Total UMI count", cex=0.5, cex.lab=1.2)

abline(h=bcrank$inflection, col="darkgreen", lty=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2)

legend("left", legend=c("Inflection", "Knee"), bty="n", 
       col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)

bcrank$inflection
## CAGCAATGCATTCT-7 
##              469
bcrank$knee
## TTGACACTCTGCTC-8 
##             1124
summary(bcrank$total)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    16.91     0.00 10467.00
table(bcrank$total >= bcrank$knee)
## 
##   FALSE    TRUE 
## 5852487   45753
table(bcrank$total >= bcrank$inflection)
## 
##   FALSE    TRUE 
## 5830401   67839
set.seed(100)
e.out = emptyDrops(counts(sce))
e.out
## DataFrame with 5898240 rows and 5 columns
##                      Total   LogProb    PValue   Limited       FDR
##                  <integer> <numeric> <numeric> <logical> <numeric>
## AAACATACAAAACG-1         5        NA        NA        NA        NA
## AAACATACAAAAGC-1         0        NA        NA        NA        NA
## AAACATACAAACAG-1         4        NA        NA        NA        NA
## AAACATACAAACGA-1         2        NA        NA        NA        NA
## AAACATACAAAGCA-1         0        NA        NA        NA        NA
## ...                    ...       ...       ...       ...       ...
## TTTGCATGTTTCGT-8         0        NA        NA        NA        NA
## TTTGCATGTTTCTG-8         0        NA        NA        NA        NA
## TTTGCATGTTTGCT-8         0        NA        NA        NA        NA
## TTTGCATGTTTGGG-8         0        NA        NA        NA        NA
## TTTGCATGTTTGTC-8         0        NA        NA        NA        NA
is.cell = (e.out$FDR <= 0.01)
par(mar=c(5,4,1,1), mfrow=c(1,2), bty="n")
plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
    xlab="Total UMI count", ylab="-Log Probability", cex=0.2)
abline(v = bcrank$inflection, col="darkgreen")
abline(v = bcrank$knee, col="dodgerblue")
legend("bottomright", legend=c("Inflection", "Knee"), bty="n", 
       col=c("darkgreen", "dodgerblue"), lty=1, cex=1.2)

plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
    xlab="Total UMI count", ylab="-Log Probability", cex=0.2, xlim=c(0,2000), ylim=c(0,2000))
abline(v = bcrank$inflection, col="darkgreen")
abline(v = bcrank$knee, col="dodgerblue")

From the above anlaysis, some cells with very small number of UMI’s may also have small FDR suggesting the distribution of UMI counts are different from what is expected from ambient profile. We choose a more conservative strategy, to keep the cells with total number of UMI larger than the inflection point estimate (bcrank$inflection=bcrank$inflection) and FDR < 0.01.

table(colnames(sce) == rownames(e.out))
## 
##    TRUE 
## 5898240
table(e.out$FDR <= 0.01, useNA="ifany")
## 
##   FALSE    TRUE    <NA> 
##   10541   67464 5820235
table(is.cell, e.out$Total >= bcrank$inflection)
##        
## is.cell FALSE  TRUE
##   FALSE  1775  8766
##   TRUE   8391 59073
w2kp = which(is.cell & e.out$Total >= bcrank$inflection)
sce = sce[,w2kp]
dim(sce)
## [1] 31188 59073

Now we have reduced the ~6 million barcodes to around 60,000 (potential) cells. Note that Zheng et al. (2017) use a cutoff of total number of UMI \(\geq\) 396 and obtain around 78,000 cells.

Next step we apply more QC based on a set of features per cell.

library(data.table)
ribo.file = "~/research/scRNAseq/workflow/data/ribosome_genes.txt"
ribo = fread(ribo.file)
dim(ribo)
## [1] 164  12
ribo[1:2,]
##    HGNC ID Approved Symbol          Approved Name   Status
## 1:   10298           RPL10  ribosomal protein L10 Approved
## 2:   10299          RPL10A ribosomal protein L10a Approved
##    Previous Symbols                                Synonyms Chromosome
## 1:                  NOV, QM, DXS648E, DXS648, FLJ23544, L10       Xq28
## 2:            NEDD6                            Csa-19, L10A    6p21.31
##    Accession Numbers RefSeq IDs Gene Family Tag Gene family description
## 1:          AB007170  NM_006013             RPL    L ribosomal proteins
## 2:            U12404  NM_007104             RPL    L ribosomal proteins
##    Gene family ID
## 1:            729
## 2:            729
is.mito = which(rowData(sce)$chromosome_name == "MT")
is.ribo = which(rowData(sce)$hgnc_symbol %in% ribo$`Approved Symbol`)
length(is.mito)
## [1] 13
length(is.ribo)
## [1] 161
sce = calculateQCMetrics(sce, feature_controls=list(Mt=is.mito, Ri=is.ribo))
colnames(colData(sce))
##  [1] "Sample"                                        
##  [2] "Barcode"                                       
##  [3] "is_cell_control"                               
##  [4] "total_features_by_counts"                      
##  [5] "log10_total_features_by_counts"                
##  [6] "total_counts"                                  
##  [7] "log10_total_counts"                            
##  [8] "pct_counts_in_top_50_features"                 
##  [9] "pct_counts_in_top_100_features"                
## [10] "pct_counts_in_top_200_features"                
## [11] "pct_counts_in_top_500_features"                
## [12] "total_features"                                
## [13] "log10_total_features"                          
## [14] "pct_counts_top_50_features"                    
## [15] "pct_counts_top_100_features"                   
## [16] "pct_counts_top_200_features"                   
## [17] "pct_counts_top_500_features"                   
## [18] "total_features_by_counts_endogenous"           
## [19] "log10_total_features_by_counts_endogenous"     
## [20] "total_counts_endogenous"                       
## [21] "log10_total_counts_endogenous"                 
## [22] "pct_counts_endogenous"                         
## [23] "pct_counts_in_top_50_features_endogenous"      
## [24] "pct_counts_in_top_100_features_endogenous"     
## [25] "pct_counts_in_top_200_features_endogenous"     
## [26] "pct_counts_in_top_500_features_endogenous"     
## [27] "total_features_endogenous"                     
## [28] "log10_total_features_endogenous"               
## [29] "pct_counts_top_50_features_endogenous"         
## [30] "pct_counts_top_100_features_endogenous"        
## [31] "pct_counts_top_200_features_endogenous"        
## [32] "pct_counts_top_500_features_endogenous"        
## [33] "total_features_by_counts_feature_control"      
## [34] "log10_total_features_by_counts_feature_control"
## [35] "total_counts_feature_control"                  
## [36] "log10_total_counts_feature_control"            
## [37] "pct_counts_feature_control"                    
## [38] "pct_counts_in_top_50_features_feature_control" 
## [39] "pct_counts_in_top_100_features_feature_control"
## [40] "total_features_feature_control"                
## [41] "log10_total_features_feature_control"          
## [42] "pct_counts_top_50_features_feature_control"    
## [43] "pct_counts_top_100_features_feature_control"   
## [44] "total_features_by_counts_Mt"                   
## [45] "log10_total_features_by_counts_Mt"             
## [46] "total_counts_Mt"                               
## [47] "log10_total_counts_Mt"                         
## [48] "pct_counts_Mt"                                 
## [49] "total_features_Mt"                             
## [50] "log10_total_features_Mt"                       
## [51] "total_features_by_counts_Ri"                   
## [52] "log10_total_features_by_counts_Ri"             
## [53] "total_counts_Ri"                               
## [54] "log10_total_counts_Ri"                         
## [55] "pct_counts_Ri"                                 
## [56] "pct_counts_in_top_50_features_Ri"              
## [57] "pct_counts_in_top_100_features_Ri"             
## [58] "total_features_Ri"                             
## [59] "log10_total_features_Ri"                       
## [60] "pct_counts_top_50_features_Ri"                 
## [61] "pct_counts_top_100_features_Ri"
par(mfrow=c(2,2), mar=c(5, 4, 1, 1), bty="n")
hist(log10(sce$total_counts), xlab="log10(Library sizes)", main="", 
    breaks=20, col="grey80", ylab="Number of cells")

hist(log10(sce$total_features), xlab="log10(# of expressed genes)", 
     main="", breaks=20, col="grey80", ylab="Number of cells")

hist(sce$pct_counts_Ri, xlab="Ribosome prop. (%)",
    ylab="Number of cells", breaks=40, main="", col="grey80")

hist(sce$pct_counts_Mt, xlab="Mitochondrial prop. (%)", 
    ylab="Number of cells", breaks=80, main="", col="grey80")

par(mfrow=c(2,2), mar=c(5, 4, 1, 1), bty="n")
smoothScatter(log10(sce$total_counts), log10(sce$total_features), 
     xlab="log10(Library sizes)", ylab="log10(# of expressed genes)", 
     nrpoints=500, cex=0.5)
smoothScatter(log10(sce$total_counts), sce$pct_counts_Ri, 
     xlab="log10(Library sizes)", ylab="Ribosome prop. (%)",
     nrpoints=500, cex=0.5)
abline(h=10,  lty=1)

smoothScatter(log10(sce$total_counts), sce$pct_counts_Mt, 
     xlab="log10(Library sizes)", ylab="Mitochondrial prop. (%)",
     nrpoints=500, cex=0.5)
abline(h=5,  lty=1)

smoothScatter(sce$pct_counts_Ri, sce$pct_counts_Mt, 
     xlab="Ribosome prop. (%)", ylab="Mitochondrial prop. (%)",
     nrpoints=500, cex=0.5)
abline(h=5,  lty=1)
abline(v=10, lty=1)

It is not clear why some cells have very low proportion of ribosome genes. To be a little bit conservative, we remove those genes with low proportion of ribosomeal genes (\(<10\%\)) or high proportion of Mitochondrial genes (\(>5\%\)).

table(sce$pct_counts_Mt > 5, sce$pct_counts_Ri < 10)
##        
##         FALSE  TRUE
##   FALSE 58763   140
##   TRUE    161     9
sce.lq = sce[,which(sce$pct_counts_Mt > 5 | sce$pct_counts_Ri < 10)]
dim(sce.lq)
## [1] 31188   310
sce = sce[,which(sce$pct_counts_Mt <= 5 | sce$pct_counts_Ri >= 10)]
dim(sce)
## [1] 31188 59064

3.1.3 Summarize gene-level information

rowData(sce)[1:2,]
## DataFrame with 2 rows and 20 columns
##   ensembl_gene_id hgnc_symbol chromosome_name start_position end_position
##       <character> <character>     <character>      <integer>    <integer>
## 1 ENSG00000004487       KDM1A               1       23019443     23083689
## 2 ENSG00000007923     DNAJC11               1        6634168      6701924
##      strand
##   <integer>
## 1         1
## 2        -1
##                                                                             description
##                                                                             <character>
## 1                             lysine demethylase 1A [Source:HGNC Symbol;Acc:HGNC:29079]
## 2 DnaJ heat shock protein family (Hsp40) member C11 [Source:HGNC Symbol;Acc:HGNC:25570]
##   percentage_gene_gc_content   gene_biotype is_feature_control
##                    <numeric>    <character>          <logical>
## 1                      39.17 protein_coding              FALSE
## 2                       46.6 protein_coding              FALSE
##   is_feature_control_Mt is_feature_control_Ri         mean_counts
##               <logical>             <logical>           <numeric>
## 1                 FALSE                 FALSE  0.0130008633385811
## 2                 FALSE                 FALSE 0.00890423713033027
##     log10_mean_counts n_cells_by_counts pct_dropout_by_counts total_counts
##             <numeric>         <integer>             <numeric>    <numeric>
## 1 0.00560981549159529               701       98.813332656205          768
## 2 0.00384994595981834               499      99.1552824471417          526
##   log10_total_counts n_cells_counts pct_dropout_counts
##            <numeric>      <integer>          <numeric>
## 1   2.88592633980143            701    98.813332656205
## 2   2.72181061521255            499   99.1552824471417
min(rowData(sce)$mean_counts)
## [1] 0
min(rowData(sce)$mean_counts[rowData(sce)$mean_counts>0])
## [1] 1.692821e-05
min(rowData(sce)$n_cells_counts)
## [1] 0
par(mfrow=c(1,3), mar=c(5,4,1,1))
hist(log10(rowData(sce)$mean_counts+1e-6), col="grey80",  main="", 
     breaks=40, xlab="log10(ave # of UMI + 1e-6)")
hist(log10(rowData(sce)$n_cells_counts+1), col="grey80", main="", 
     breaks=40, xlab="log10(# of expressed cells + 1)")
smoothScatter(log10(rowData(sce)$mean_counts+1e-6), 
              log10(rowData(sce)$n_cells_counts + 1), 
              xlab="log10(ave # of UMI + 1e-6)", 
              ylab="log10(# of expressed cells + 1)")

tb1 = table(rowData(sce)$n_cells_counts)
tb1[1:11]
## 
##     0     1     2     3     4     5     6     7     8     9    10 
## 11589  1618   841   590   443   333   277   233   241   195   150

We remove those genes that are expressed in zero or only one cell. The variable strand need to be renamed, otherwise there is an error message that such a variabel name cannot be used.

names(rowData(sce))[6] = "strand_n"
sce = sce[which(rowData(sce)$n_cells_counts > 1),]
dim(sce)
## [1] 17981 59064

Next we check those highly expressed genes

par(mar=c(5,4,1,1))
od1 = order(rowData(sce)$mean_counts, decreasing = TRUE)
barplot(rowData(sce)$mean_counts[od1[20:1]], las=1, 
        names.arg=rowData(sce)$hgnc_symbol[od1[20:1]], 
        horiz=TRUE, cex.names=0.5, cex.axis=0.7, 
        xlab="ave # of UMI")

3.1.4 Normalization

A simple solution for normalization and stablizing expression varaince across genes is to tranform the count data by log(count/size.factor + 1). One may calcualte size.factor per cell as the total number of UMIs, and this assumes the total expression are the same across all the cells. However, the total expression of each cell may vary with respect to cell type and/or cell size, and the computeSumFactors function in R package scran provides a more sophisicated way to calcualte size.factor to allow such variaation across cells (Lun, Bach, and Marioni 2016). computeSumFactors can use initial clustering of cells to normalize expression within and beetween clusters. Within a cluster, it estimates the size factor for many groups of cells so that there are more groups than cells, and then it can calcualte the size factor per cell using a lienar deconvolution system.

As shown in the following plot, the final size factor estimation is indeed highly correlated with the naive definition by total count.

Finally, the command normalize(sce) adds the normalized expression into the variable sce.

library(scran)
date()
## [1] "Sun Aug 19 23:54:47 2018"
clusters = quickCluster(sce, min.mean=0.1, method="igraph")
date()
## [1] "Mon Aug 20 00:25:19 2018"
sce      = computeSumFactors(sce, cluster=clusters, min.mean=0.1)
date()
## [1] "Mon Aug 20 00:38:20 2018"
summary(sizeFactors(sce))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1220  0.7673  0.9160  1.0000  1.1168 10.3884
par(mfrow=c(1,2), mar=c(5,4,2,1), bty="n")
smoothScatter(sce$total_counts, sizeFactors(sce), log="xy", 
              xlab="total counts", ylab="size factors")
plot(sce$total_counts, sizeFactors(sce), log="xy", 
     xlab="total counts", ylab="size factors", 
     cex=0.3, pch=20, col=rgb(0.1,0.2,0.7,0.3))

sce = normalize(sce)

3.2 Dimension reduction

For dimension reduction, such as calculating PCA or performing TSNE, we should start by identifying a subset of genes with high level of biological signal relative to background (technical) noise. The decomposeVar function from R/cran is designed for this task.

new.trend = makeTechTrend(x=sce)
fit = trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))

par(mfrow=c(1,1), mar=c(5,4,2,1), bty="n")
plot(fit$mean, fit$var, pch=20, col=rgb(0.1,0.2,0.7,0.6), 
     xlab="log(mean)", ylab="var")
curve(fit$trend(x), col="orange", lwd=2, add=TRUE)
curve(new.trend(x), col="red", lwd=2, add=TRUE)
legend("topright", legend=c("Poisson noise", "observed trend"), 
       lty=1, lwd=2, col=c("red", "orange"), bty="n")

fit$trend = new.trend
dec = decomposeVar(fit=fit)
top.dec = dec[order(dec$bio, decreasing=TRUE),]
plotExpression(sce, features=rownames(top.dec)[1:10])

When performing PCA, we can use all the genes or just those genes with high signal-to-noise ratio. TSNE analysis is usually based on the top PCs rather than the original gene expression data. We first perform PCA using all the genes and the function denoisePCA can automatically select the PCs based on modeling of technical noise.

date()
## [1] "Mon Aug 20 00:40:47 2018"
sce = denoisePCA(sce, technical=new.trend, approx=TRUE)
date()
## [1] "Mon Aug 20 00:50:11 2018"
dim(reducedDim(sce, "PCA"))
## [1] 59064    45
plot(log10(attr(reducedDim(sce), "percentVar")), xlab="PC",
     ylab="log10(Prop of variance explained)", pch=20, cex=0.6, 
     col=rgb(0.8, 0.2, 0.2, 0.5))
abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")

df_pcs = data.frame(reducedDim(sce, "PCA"))
df_pcs$log10_total_features = colData(sce)$log10_total_features

gp1 = ggplot(df_pcs, aes(PC1,PC2,col=log10_total_features)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  scale_colour_gradient(low="lightblue",high="red") +
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

date()
## [1] "Mon Aug 20 00:50:13 2018"
sce = runTSNE(sce, use_dimred="PCA", perplexity=30, rand_seed=100)
date()
## [1] "Mon Aug 20 01:08:44 2018"
df_tsne = data.frame(reducedDim(sce, "TSNE"))
df_tsne$log10_total_features = colData(sce)$log10_total_features

gp1 = ggplot(df_tsne, aes(X1,X2,col=log10_total_features)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  scale_colour_gradient(low="lightblue",high="red") +
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

Next we only select around top 1000 genes for the PCA and use the top 50 PCs for TSNE projection.

library(svd)
library(Rtsne)

summary(dec$bio)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0011881  0.0000261  0.0003834  0.0050907  0.0018510  2.0471504
dec1 = dec
dec1$bio[which(dec$bio < 1e-8)] = 1e-8
dec1$FDR[which(dec$FDR < 1e-100)] = 1e-100

par(mfrow=c(1,2))
hist(log10(dec1$bio), breaks=100, main="")
hist(log10(dec1$FDR), breaks=100, main="")

summary(dec$FDR[dec$bio > 0.001])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000000 0.0000000 0.0000339 0.0000000 0.0693136
table(dec$FDR < 1e-10, dec$bio > 0.01)
##        
##         FALSE  TRUE
##   FALSE  3852     5
##   TRUE  13089  1035
w2kp = which(dec$FDR < 1e-10 & dec$bio > 0.01)
sce_sub = sce[w2kp,]
sce_sub
## class: SingleCellExperiment 
## dim: 1035 59064 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(1035): TNFRSF1B CDC42 ... HMGN1 ATP5PO
## rowData names(20): ensembl_gene_id hgnc_symbol ... n_cells_counts
##   pct_dropout_counts
## colnames(59064): AAACATACACCCAA-1 AAACATACCCCTCA-1 ...
##   TTTGCATGCTGCAA-8 TTTGCATGTGGTAC-8
## colData names(61): Sample Barcode ...
##   pct_counts_top_50_features_Ri pct_counts_top_100_features_Ri
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
edat = t(as.matrix(logcounts(sce_sub)))
edat = scale(edat)
dim(edat)
## [1] 59064  1035
edat[1:2,1:3]
##                    TNFRSF1B      CDC42       ENO1
## AAACATACACCCAA-1 -0.2220264 -0.3758056 -0.7806183
## AAACATACCCCTCA-1 -0.2220264 -0.3758056 -0.7806183
date()
## [1] "Mon Aug 20 01:09:10 2018"
ppk  = propack.svd(edat,neig=50)
date()
## [1] "Mon Aug 20 01:11:13 2018"
pca = t(ppk$d*t(ppk$u))

df_pcs = data.frame(pca)
df_pcs$log10_total_features = colData(sce_sub)$log10_total_features
df_pcs[1:2,]
##          X1         X2         X3         X4        X5        X6
## 1 -1.938437 2.14847107  0.9914546 -0.5089818 -1.400226 -2.205035
## 2 -6.093417 0.08501296 -0.5933283 -0.5807430 -1.655475  1.886403
##           X7       X8         X9       X10       X11        X12       X13
## 1 -0.8143478 2.783945 -2.9836858 1.1779709 0.9085121 -0.9685727 -1.421720
## 2 -2.1541784 1.413880 -0.5692589 0.8448131 2.6189810  1.5308592 -1.651077
##         X14        X15       X16        X17        X18       X19
## 1  1.341527 -0.1550409  1.421038 -0.1670475 -0.1453368 0.9407139
## 2 -1.432342 -0.9404134 -2.862906 -0.1586555  1.0596564 0.9069979
##          X20       X21        X22       X23        X24        X25
## 1  0.6789775 -2.208472 -0.7156601 0.1852175  1.7249985 -2.1441272
## 2 -0.6383841  1.238179  0.9927511 0.6050383 -0.9835777  0.7632581
##          X26      X27       X28      X29        X30       X31       X32
## 1  0.5064031 0.172789 0.3513667 2.265742 -0.3584804 1.4068027 0.2128577
## 2 -0.1382440 1.353238 2.0299458 1.189431  0.2047557 0.4277797 1.3450210
##          X33        X34       X35        X36         X37        X38
## 1  0.0124918 -0.9527684 0.6947430 -0.8310609 -0.07669705 -2.1599937
## 2 -2.3776169  0.4908630 0.5240087  1.7471721  0.07878119 -0.5164361
##          X39        X40         X41         X42        X43        X44
## 1  0.7357511  0.6094890 -2.41474788 -0.41392187 -0.5895618 0.05231965
## 2 -0.2466020 -0.7048429 -0.02699798 -0.07840753  0.8481294 1.37930577
##         X45        X46       X47        X48       X49        X50
## 1 0.9999528 0.46641941  1.730882 -0.9973974 2.1186948 -1.3535233
## 2 1.3630837 0.04804352 -1.725742 -0.2842565 0.2822561 -0.3334095
##   log10_total_features
## 1             2.696356
## 2             2.673021
gp1 = ggplot(df_pcs, aes(X1,X2,col=log10_total_features)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  scale_colour_gradient(low="lightblue",high="red") +
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

set.seed(100)
date()
## [1] "Mon Aug 20 01:11:15 2018"
tsne = Rtsne(pca, pca = FALSE)
date()
## [1] "Mon Aug 20 01:32:07 2018"
df_tsne = data.frame(tsne$Y)
df_tsne$log10_total_features = colData(sce_sub)$log10_total_features
dim(df_tsne)
## [1] 59064     3
df_tsne[1:2,]
##           X1       X2 log10_total_features
## 1   1.577535 -8.10810             2.696356
## 2 -16.022960 11.87136             2.673021
gp1 = ggplot(df_tsne, aes(X1,X2,col=log10_total_features)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  scale_colour_gradient(low="lightblue",high="red") +
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

reducedDims(sce_sub) = SimpleList(PCA=pca, TSNE=tsne$Y)
sce_sub
## class: SingleCellExperiment 
## dim: 1035 59064 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(1035): TNFRSF1B CDC42 ... HMGN1 ATP5PO
## rowData names(20): ensembl_gene_id hgnc_symbol ... n_cells_counts
##   pct_dropout_counts
## colnames(59064): AAACATACACCCAA-1 AAACATACCCCTCA-1 ...
##   TTTGCATGCTGCAA-8 TTTGCATGTGGTAC-8
## colData names(61): Sample Barcode ...
##   pct_counts_top_50_features_Ri pct_counts_top_100_features_Ri
## reducedDimNames(2): PCA TSNE
## spikeNames(0):

3.2.1 Clustering

There are many methods for clustering of single cell RNA-seq data. The performance of each method may also depend on pre-processing steps, such as performing imputation or not. We wil compare these methods in a seperate document. Here we just illustrate the clustering reuslts using a simple kmeans method on the top 50 PCs.

k10_50_pcs = kmeans(reducedDim(sce_sub, "PCA"), centers=10, 
                    iter.max=150, algorithm="MacQueen")
names(k10_50_pcs)
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
dim(k10_50_pcs$centers)
## [1] 10 50
df_tsne$cluster_kmean = as.factor(k10_50_pcs$cluster)
cols = c("#FB9A99","#FF7F00","yellow","orchid","grey",
         "red","dodgerblue2","tan4","green4","#99c9fb")

gp1 = ggplot(df_tsne, aes(X1,X2,col=cluster_kmean)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  scale_color_manual(values=cols) + 
  guides(color = guide_legend(override.aes = list(size=3)))

gp1

An alternative popular clustreing method is a graph based method that first construct a graph for all the cells and then identify clusters of cells by identifying densely connected subgraphs (C. Xu and Su 2015). One possible implementation is the following approach

snn.gr   = buildSNNGraph(sce_sub, use.dimred="PCA")
clusters = igraph::cluster_walktrap(snn.gr)

Though this implementation is very slow for this large dataset. It took more than 3 hours on a MacbookPro with 2.8 G i7 core and 16G memmory. The results of graphical model based on clustering and Kmeans have large overlap in some cluters. Though in general, graph-based identify many small clusters.

Compare the clustering resutls of Kmeans vs. a graph based method. Each column corresponds to one of 31 clusters by graph-based method. Each row corresponds to one of the 10 clusters by kmeans

Compare the clustering resutls of Kmeans vs. a graph based method. Each column corresponds to one of 31 clusters by graph-based method. Each row corresponds to one of the 10 clusters by kmeans

Next we identify the most likely cell type of each cell by comparing the expression of each cell versus gene expression for 11 cell types. We chose to use a small set of genes that are known to be differentially expressed across immune cell types. This gene set is constructed by the union of the signature genes used by CIBERSORT (Newman et al. 2015) and 31 known cell type markers.

purified_ref_11 = readRDS("data/expression_11_cell_types_Zheng_2016.rds")
row2kp = rowData(sce)$ensembl_gene_id %in% rownames(purified_ref_11)
sce_ct = sce[which(row2kp),]
dim(sce_ct)
## [1] 17525 59064
lm22 = fread("data/LM22.txt")
dim(lm22)
## [1] 547  23
lm22[1:2,1:5]
##    Gene symbol B cells naive B cells memory Plasma cells T cells CD8
## 1:       ABCB4     555.71345       10.74423     7.225819     4.31128
## 2:       ABCB9      15.60354       22.09479   653.392328    24.22372
genes2use = c("CD19", "IGKC", "IGLC2", "MS4A1", "MME", "HLA-DRB1", 
              "CD80", "CD3E", "CD4", "CD8A", "IL2RA", "SELL", "CCR7", 
              "IL7R", "FOXP3", "CD14", "CD33", "NCAM1", "CXCR3", "FAS", 
              "FASLG", "CD44", "PDCD1", "CD274", "PDCD1LG2", "EOMES", 
              "TBX21", "IFNG", "GZMB", "CD38", "TNFRSF9")
length(genes2use)
## [1] 31
genes2use = union(lm22$`Gene symbol`, genes2use)
length(genes2use)
## [1] 558
genes2use = intersect(genes2use, rowData(sce_ct)$hgnc_symbol)
sce_ct = sce_ct[match(genes2use, rowData(sce_ct)$hgnc_symbol),]
dim(sce_ct)
## [1]   443 59064
mat1 = match(rowData(sce_ct)$ensembl_gene_id, rownames(purified_ref_11))
purified_ref_11 = purified_ref_11[mat1,]
dim(purified_ref_11)
## [1] 443  11
purified_ref_11[1:3,]
##                        CD34+     CD56+ NK CD4+/CD45RA+/CD25- Naive T
## ENSG00000005471 1.392797e-07 2.224760e-07               7.764879e-08
## ENSG00000150967 1.149058e-06 8.157454e-07               2.096517e-06
## ENSG00000072818 1.382003e-04 2.611127e-04               3.502737e-04
##                 CD4+/CD25 T Reg CD8+/CD45RA+ Naive Cytotoxic
## ENSG00000005471    7.703224e-08                 5.542298e-08
## ENSG00000150967    9.243869e-07                 1.053037e-06
## ENSG00000072818    3.631300e-04                 3.087614e-04
##                 CD4+/CD45RO+ Memory CD8+ Cytotoxic T      CD19+ B
## ENSG00000005471        6.178376e-08     6.062095e-08 2.561460e-05
## ENSG00000150967        1.297459e-06     1.091177e-06 7.805004e-07
## ENSG00000072818        3.296781e-04     3.293536e-04 2.230102e-04
##                 CD4+ T Helper2 CD14+ Monocyte    Dendritic
## ENSG00000005471   6.457025e-08   5.463552e-07 2.442730e-06
## ENSG00000150967   2.066248e-06   1.256617e-05 4.885460e-06
## ENSG00000072818   3.481628e-04   4.480113e-05 5.129733e-05

We calculate the rank-based spearman correlation beween each cell and 11 cell type-specifid gene expression profiles. It is interesting to note that for many cells, the second largest correlation is very similar to the largest one and thus this assignment of cell type is with considerable amount of uncertainty.

edat = as.matrix(logcounts(sce_ct))
dim(edat)
## [1]   443 59064
edat[1:2,1:2]
##       AAACATACACCCAA-1 AAACATACCCCTCA-1
## ABCB4                0                0
## ABCB9                0                0
cor.ct = cor(edat, purified_ref_11, method="spearman")
dim(cor.ct)
## [1] 59064    11
summary(c(cor.ct))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1390  0.1057  0.2241  0.1981  0.2896  0.5280
largest2 <- function(v){
  sort(v, decreasing=TRUE)[2]
}

cor_top1 = apply(cor.ct, 1, max)
cor_top2 = apply(cor.ct, 1, largest2)

summary(cor_top1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01088 0.28193 0.31130 0.30971 0.33982 0.52800
summary(cor_top1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01088 0.28193 0.31130 0.30971 0.33982 0.52800
summary(cor_top1 - cor_top2)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 6.000e-08 1.964e-03 6.242e-03 1.634e-02 1.949e-02 2.223e-01
summary((cor_top1 - cor_top2)/cor_top1)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000002 0.0065773 0.0204692 0.0524379 0.0611228 1.2657452
par(mar=c(5,4,1,1), bty="n")
plot(cor_top1, cor_top2, pch=20, col=rgb(0.8, 0.2, 0.2, 0.5), 
     cex=0.2, xlab="largest correlation", 
     ylab="2nd largest correlation")
abline(0,1)
abline(0, 0.8, lty=2)
abline(v=0.2, lty=2)
text(0.5, 0.38, pos=1, "y=0.8x")

df_tsne$ct = colnames(cor.ct)[apply(cor.ct, 1, which.max)]

.set_pbmc_color_11<-function() {
  myColors <- c( "dodgerblue2",
                 "green4", 
                 "#6A3D9A", # purple
                 "grey",
                 "tan4",
                 "yellow", 
                 "#FF7F00", # orange
                 "black",
                 "#FB9A99", # pink
                 "orchid",
                 "red")
  id <- c("CD19+ B", 
          "CD14+ Monocyte",
          "Dendritic",
          "CD56+ NK",
          "CD34+", 
          "CD4+/CD25 T Reg",
          "CD4+/CD45RA+/CD25- Naive T",
          "CD4+/CD45RO+ Memory",
          "CD4+ T Helper2",
          "CD8+/CD45RA+ Naive Cytotoxic",
          "CD8+ Cytotoxic T")
  names(myColors)<-id
  scale_colour_manual(name = "ct",values = myColors)
}

gp1 = ggplot(df_tsne,aes(X1,X2,col=ct)) + 
  geom_point(size=0.2,alpha=0.6) + theme_classic() + 
  .set_pbmc_color_11() + 
  guides(color = guide_legend(override.aes = list(size=3)))
gp1

4 Session information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Rtsne_0.13                  svd_0.4.1                  
##  [3] scran_1.8.4                 data.table_1.11.4          
##  [5] scater_1.8.3                ggplot2_3.0.0              
##  [7] dplyr_0.7.6                 biomaRt_2.36.1             
##  [9] DropletUtils_1.0.3          SingleCellExperiment_1.2.0 
## [11] SummarizedExperiment_1.10.1 DelayedArray_0.6.4         
## [13] matrixStats_0.54.0          Biobase_2.40.0             
## [15] GenomicRanges_1.32.6        GenomeInfoDb_1.16.0        
## [17] IRanges_2.14.10             S4Vectors_0.18.3           
## [19] BiocGenerics_0.26.0         BiocParallel_1.14.2        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6             bit64_0.9-7             
##  [3] progress_1.2.0           httr_1.3.1              
##  [5] rprojroot_1.3-2          dynamicTreeCut_1.63-1   
##  [7] tools_3.5.0              backports_1.1.2         
##  [9] irlba_2.3.2              R6_2.2.2                
## [11] DT_0.4                   KernSmooth_2.23-15      
## [13] vipor_0.4.5              DBI_1.0.0               
## [15] lazyeval_0.2.1           colorspace_1.3-2        
## [17] withr_2.1.2              tidyselect_0.2.4        
## [19] gridExtra_2.3            prettyunits_1.0.2       
## [21] bit_1.1-14               compiler_3.5.0          
## [23] labeling_0.3             scales_1.0.0            
## [25] stringr_1.3.1            digest_0.6.15           
## [27] rmarkdown_1.10           XVector_0.20.0          
## [29] pkgconfig_2.0.1          htmltools_0.3.6         
## [31] limma_3.36.2             highr_0.7               
## [33] htmlwidgets_1.2          rlang_0.2.1             
## [35] RSQLite_2.1.1            FNN_1.1.2.1             
## [37] shiny_1.1.0              DelayedMatrixStats_1.2.0
## [39] bindr_0.1.1              RCurl_1.95-4.11         
## [41] magrittr_1.5             GenomeInfoDbData_1.1.0  
## [43] Matrix_1.2-14            Rcpp_0.12.18            
## [45] ggbeeswarm_0.6.0         munsell_0.5.0           
## [47] Rhdf5lib_1.2.1           viridis_0.5.1           
## [49] stringi_1.2.4            yaml_2.2.0              
## [51] edgeR_3.22.3             zlibbioc_1.26.0         
## [53] rhdf5_2.24.0             plyr_1.8.4              
## [55] grid_3.5.0               blob_1.1.1              
## [57] promises_1.0.1           shinydashboard_0.7.0    
## [59] crayon_1.3.4             lattice_0.20-35         
## [61] cowplot_0.9.3            hms_0.4.2               
## [63] locfit_1.5-9.1           knitr_1.20              
## [65] pillar_1.3.0             igraph_1.2.2            
## [67] rjson_0.2.20             reshape2_1.4.3          
## [69] XML_3.98-1.15            glue_1.3.0              
## [71] evaluate_0.11            httpuv_1.4.5            
## [73] gtable_0.2.0             purrr_0.2.5             
## [75] assertthat_0.2.0         mime_0.5                
## [77] xtable_1.8-2             later_0.7.3             
## [79] viridisLite_0.3.0        tibble_1.4.2            
## [81] AnnotationDbi_1.42.1     beeswarm_0.2.3          
## [83] memoise_1.1.0            tximport_1.8.0          
## [85] bindrcpp_0.2.2           statmod_1.4.30

Reference

Lun, Aaron TL, Karsten Bach, and John C Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell Rna Sequencing Data with Many Zero Counts.” Genome Biology 17 (1). BioMed Central: 75.

Newman, Aaron M, Chih Long Liu, Michael R Green, Andrew J Gentles, Weiguo Feng, Yue Xu, Chuong D Hoang, Maximilian Diehn, and Ash A Alizadeh. 2015. “Robust Enumeration of Cell Subsets from Tissue Expression Profiles.” Nature Methods 12 (5). Nature Publishing Group: 453.

Xu, Chen, and Zhengchang Su. 2015. “Identification of Cell Types from Single-Cell Transcriptomes Using a Novel Clustering Method.” Bioinformatics 31 (12). Oxford University Press: 1974–80.

Zheng, Grace XY, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8. Nature Publishing Group: 14049.

Ziegenhain, Christoph, Beate Vieth, Swati Parekh, Björn Reinius, Amy Guillaumet-Adkins, Martha Smets, Heinrich Leonhardt, Holger Heyn, Ines Hellmann, and Wolfgang Enard. 2017. “Comparative Analysis of Single-Cell Rna Sequencing Methods.” Molecular Cell 65 (4). Elsevier: 631–43.