This document aims to provide a workflow for processing and QC of single cell RNA-seq data. Because this is a relatively new area with many on-going development of new methods/appraoches. This document will be updated to reflect such new developmets. There are many other important topics for scRNA-seq analysis, such as imputatoin or clustring. We will provide separate workflows for those topics. One publicly available datasets will be used for this workflow.
PBMC.68k: around 68k peripheral blood mononuclear cells (PBMCs) from one donor (donor A) (Zheng et al. 2017).
Multiple techniques are available to generate scRNA-seq data. For example, Ziegenhain et al. (2017) have compared 6 popular techniques (see Figure 2 and Table S1 of their paper for more details):
Among the many differences of these techniques, the most important ones are:
An UMI is a randomly generated barcode (4-10bp) to label each transcript molecule before amplification. Therefore, by counting UMIs instead of actual reads, one can remove most noise and bias due to amplification. However, UMI is only available for techniques that sequence 5’ or 3’ ends of the transcript molecule, and thus cannot sequence full length RNA. Among the aformentioned techniques, Smart-seq/C1 and Smart-seq2 sequence full length RNA, hence cannot use UMI.
Drop-seq (e.g., prducts from 10x Genomics) can capture much more cells than other techiniques, but has low depth within each cell. In other words, compared with otehr techniques, Drop-seq captures more cells and samller number of genes witin each cell.
The analysis of scRNA-seq data may be divided into two steps. The first step is to generate a count matrix for all the genes and all the cells of one individual (the left panel of the following flowchart). For example, a matrix where each row corresponds to a gene and each column corresponds to a cell. The first step also invovles generation of fastq files and map the reads in the fastq file and obtain a bam file, as well as QC on fastq files, bam files, and the read count data matrix.
The second step is data analysis based on this count matrix, typicially include imputation, clustering, and differential expression (the right panel of the following flowchart). These data analysis steps are very active reasearch topics and many computational methods are avaialble. In the followign figure, we list a few popular methods for each step.
A flow chart of the workflow for analyzing scRNA-seq data.
Cell Ranger are a set of pipelines to process and analzye 10x Genomics’s scRNA-seq data. It was developed and maintained by 10x Genomics. Here we provide some example code for data analysis and more details can be found in the support webpage by 10x Genomics. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger
The raw outputs of a study is often saved in one folder (e.g., 180605_D00300_0562_BCCKLNANXX
), and they include several files and folders like the following.
-rw-r--r-- 392069 180605_D00300_0562_BCCKLNANXX.jpg
drwxr-xr-x 0 BarcodeImages/
-rw-r--r-- 46 Basecalling_Netcopy_complete_Read1.txt
-rw-r--r-- 46 Basecalling_Netcopy_complete_Read2.txt
-rw-r--r-- 46 Basecalling_Netcopy_complete_Read3.txt
-rw-r--r-- 46 Basecalling_Netcopy_complete.txt
drwxr-xr-x 181 Config/
drwxr-xr-x 85 Data/
-rw-r--r-- 3458 First_Base_Report.htm
-rw-r--r-- 46 ImageAnalysis_Netcopy_complete_Read1.txt
-rw-r--r-- 46 ImageAnalysis_Netcopy_complete_Read2.txt
-rw-r--r-- 46 ImageAnalysis_Netcopy_complete_Read3.txt
-rw-r--r-- 46 ImageAnalysis_Netcopy_complete.txt
drwxr-xr-x 268 InterOp/
drwxr-xr-x 9477 Logs/
drwxr-xr-x 41 PeriodicSaveRates/
drwxr-xr-x 71 Recipe/
-rw-r--r-- 46 RTAComplete.txt
-rw-r--r-- 826 RunInfo.xml
-rw-r--r-- 5838 runParameters.xml
The raw data are saved in foldrer Data
in the format of Illumina sequencer’s base call files (BCLs). We need to use cellranger mkfastq
command to generate fastq files:
#!/bin/bash
ml boost/1.55.0
ml bcl2fastq
path_input='/jane_doe/illumina/180605_D00300_0562_BCCKLNANXX'
path_output='/jane_doe/study_x/ABC'
path_to_cellranger/cellranger mkfastq \
--run="$path_to_input" \
--csv=path_to_csv/cellranger_sample_index.csv \
--output-dir="$path_output" \
--localcores=6 \
--localmem=300
The above code was run on a linux cluster with bash shell. The command ml
is a shortcut for module load
. The modules of boost
and Illumina’s bcl2fastq
are needed to run mkfastq
.
The file cellranger_sample_index.csv
gives the mapping between sample name (Sample
) and sample index for library construction (Index
) . The column Lane
specifies the lane(s) of the flowcell where one sample has used. It can be a number (e.g., 1), a range (e.g., 2-4) or ’*’ for all lanes in the flowcell. Here is an example of the file cellranger_sample_index.csv
:
Lane,Sample,Index
*,ABC_baseline1,SI-GA-D2
*,ABC_baseline2,SI-GA-E2
...
localcores
and localmem
set the maximum number of cores and maximum memory (GB) for the pipeline.
Next we use cell ranger count
command to map the RNA-seq reads to reference genome (human hg38 in this example) and count the number of RNA-seq fragments per gene per cell.
#!/bin/bash
path_to_cellranger/cellranger count \
--id=ABC_baseline1 \
--sample=ABC_baseline1 \
--transcriptome=path_to_reference_genome/refdata-cellranger-GRCh38-1.2.0 \
--fastqs=/jane_doe/study_x/ABC \
--localcores=6 \
--localmem=300
where
id
is a unique ID to be used to name the output.sample
should match to one sample name in the file cellranger_sample_index.csv
.transcriptome
is the location where the reference genome data was saved. This reference genome was downloaded from https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest.fastqs
specifies the location where fastq files were saved, here it matches with the output-dir
of mkfastq
command.Without using Cell Ranger, one can obtain the read counts in three steps.
Use bcl2fastq to obtain the fastq files. Then map RNA-seq reads using
Map RNA-seq reads to reference genome using alignment software such as STAR or tophat, or pseudo-alignment methods (e.g. Kallisto, Salmon) for well annotated transcriptomes, such as human or mouse.
Count the number of reads per gene per sample, for example, by softare HT-seq or R function summarizeOverlaps
, as shown in the following codes.
library("GenomicFeatures")
library("GenomicAlignments")
path = "ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/"
file = "gencode.v27.annotation.gtf.gz"
gtfFile = paste(path, file, sep="")
txdb = makeTxDbFromGFF(file=gtfFile, format="gtf", organism="Homo sapiens")
saveDb(txdb, file="Homo_sapiens_gencode_v27.sqlite")
txdb = loadDb("Homo_sapiens_gencode_v27.sqlite")
genes = exonsBy(txdb, by="gene")
filenames = list.files(pattern=".bam$")
bamfiles = BamFileList(filenames, yieldSize=1000000)
se = summarizeOverlaps(features=genes, reads=bamfiles, mode="Union",
singleEnd=FALSE, ignore.strand=FALSE, fragments=TRUE )
as1 = assay(se)
write.table(as1, file = "gene_level_counts.txt", append = FALSE,
quote = FALSE, sep = "\t", eol = "\n", row.names = TRUE,
col.names = TRUE)
The quality control for fastq or bam files can be performed using pipelines developed for bulk RNAseq data, such as FastQC for fastq files and RSeQC for bam files.
A few libraries are needed in the followign code. The main QC anlaysis is carried out by scater
.
library(DropletUtils)
library(biomaRt)
library(dplyr)
library(scater)
This dataset was generated using 10x Genomics platform (Zheng et al. 2017) The read count data were downloaded from the link of “Gene / cell matrix (raw)” from https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/fresh_68k_pbmc_donor_a.
The downloaded zip file were unzipped, renamed as raw_gene_bc_matrices
, and saved in folder ~/research/scRNAseq/data/Zheng_2016/donorA_hg19/outs/
. In addition, we also downloaded the “Summary CSV” file, renamed it as metrics_summary.csv
, and saved in the same folder. Renaming and orgnizing the folder strucrture is to recontrcut the file struture of cell ranger output.
└── outs
├── metrics_summary.csv
└── raw_gene_bc_matrices
└── hg19
├── barcodes.tsv
├── genes.tsv
└── matrix.mtx
The count matrix was saved as three files, where barcodes.tsv
saves barcode information, genes.tsv
saves gene information, and matrix.mtx
saves the count data in MatrixMarket format.
This dataset can be loaded into R by function load_cellranger_matrix
from R package cellrangerRkit
, which requires the file metrics_summary.csv
. Here we load them using function read10xCounts
from R package DropletUtils
, and obtain gene anntation using R package biomaRt
.
path1 = "~/research/scRNAseq/data/Zheng_2016/donorA_hg19/"
path2 = paste0(path1, "outs/raw_gene_bc_matrices/hg19/")
sce = read10xCounts(path2, col.names=TRUE)
sce
## class: SingleCellExperiment
## dim: 32738 5898240
## metadata(0):
## assays(1): counts
## rownames(32738): ENSG00000243485 ENSG00000237613 ...
## ENSG00000215616 ENSG00000215611
## rowData names(2): ID Symbol
## colnames(5898240): AAACATACAAAACG-1 AAACATACAAAAGC-1 ...
## TTTGCATGTTTGGG-8 TTTGCATGTTTGTC-8
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
anno.file = "~/research/scRNAseq/workflow/data/gene.annoation.rds"
if(file.exists(anno.file)){
gene.annotation = readRDS(anno.file)
}else{
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
attr.string = c('ensembl_gene_id', 'hgnc_symbol', 'chromosome_name')
attr.string = c(attr.string, 'start_position', 'end_position', 'strand')
attr.string = c(attr.string, 'description', 'percentage_gene_gc_content')
attr.string = c(attr.string, 'gene_biotype')
rowData(sce)[1:2,]
gene.annotation = getBM(attributes=attr.string,
filters = 'ensembl_gene_id',
values = rowData(sce)$ID,
mart = ensembl)
}
dim(gene.annotation)
## [1] 31188 9
gene.annotation[1:2,]
## ensembl_gene_id hgnc_symbol chromosome_name start_position end_position
## 1 ENSG00000004487 KDM1A 1 23019443 23083689
## 2 ENSG00000007923 DNAJC11 1 6634168 6701924
## strand
## 1 1
## 2 -1
## description
## 1 lysine demethylase 1A [Source:HGNC Symbol;Acc:HGNC:29079]
## 2 DnaJ heat shock protein family (Hsp40) member C11 [Source:HGNC Symbol;Acc:HGNC:25570]
## percentage_gene_gc_content gene_biotype
## 1 39.17 protein_coding
## 2 46.60 protein_coding
t1 = table(gene.annotation$ensembl_gene_id)
t2 = t1[t1 > 1]
t2
## named integer(0)
gene.annotation[which(gene.annotation$ensembl_gene_id %in% names(t2)),]
## [1] ensembl_gene_id hgnc_symbol
## [3] chromosome_name start_position
## [5] end_position strand
## [7] description percentage_gene_gc_content
## [9] gene_biotype
## <0 rows> (or 0-length row.names)
gene.annotation = distinct(gene.annotation, ensembl_gene_id,
.keep_all = TRUE)
dim(gene.annotation)
## [1] 31188 9
gene.annotation[1:2,]
## ensembl_gene_id hgnc_symbol chromosome_name start_position end_position
## 1 ENSG00000004487 KDM1A 1 23019443 23083689
## 2 ENSG00000007923 DNAJC11 1 6634168 6701924
## strand
## 1 1
## 2 -1
## description
## 1 lysine demethylase 1A [Source:HGNC Symbol;Acc:HGNC:29079]
## 2 DnaJ heat shock protein family (Hsp40) member C11 [Source:HGNC Symbol;Acc:HGNC:25570]
## percentage_gene_gc_content gene_biotype
## 1 39.17 protein_coding
## 2 46.60 protein_coding
table(gene.annotation$chromosome_name)
##
## 1 10 11
## 2996 1181 1848
## 12 13 14
## 1659 590 1042
## 15 16 17
## 1018 1422 1727
## 18 19 2
## 627 1951 2132
## 20 21 22
## 807 435 674
## 3 4 5
## 1685 1331 1602
## 6 7 8
## 1548 1384 1275
## 9 CHR_HSCHR22_1_CTG7 CHR_HSCHR4_6_CTG12
## 1147 1 1
## KI270713.1 MT X
## 1 13 986
## Y
## 105
table(gene.annotation$gene_biotype)
##
## 3prime_overlapping_ncRNA antisense
## 10 5095
## bidirectional_promoter_lncRNA IG_V_gene
## 25 1
## IG_V_pseudogene lincRNA
## 1 6584
## miRNA misc_RNA
## 1 5
## polymorphic_pseudogene processed_pseudogene
## 5 18
## processed_transcript protein_coding
## 119 19120
## scRNA sense_intronic
## 1 17
## sense_overlapping snoRNA
## 5 6
## TEC transcribed_processed_pseudogene
## 5 34
## transcribed_unitary_pseudogene transcribed_unprocessed_pseudogene
## 36 72
## unprocessed_pseudogene vaultRNA
## 27 1
## some genes do not have annotation because their ids are retired
gene.missing = setdiff(rowData(sce)$ID, gene.annotation$ensembl_gene_id)
length(gene.missing)
## [1] 1550
gene.missing[1:6]
## [1] "ENSG00000237683" "ENSG00000235249" "ENSG00000236743" "ENSG00000231709"
## [5] "ENSG00000239664" "ENSG00000223659"
w2kp = match(gene.annotation$ensembl_gene_id, rowData(sce)$ID)
sce = sce[w2kp,]
dim(sce)
## [1] 31188 5898240
table(gene.annotation$ensembl_gene_id == rowData(sce)$ID)
##
## TRUE
## 31188
rowData(sce) = gene.annotation
rownames(sce) = uniquifyFeatureNames(rowData(sce)$ensembl_gene_id,
rowData(sce)$hgnc_symbol)
An important QC step for scRNA-seq data analysis is to identify low quality or empty cells. For 10x Genomics data, The emptyDrops
function in R package DropletUtils
can be used detect empty cells, given the count matrix of all barcodes. That is why we loaded the raw data matrix intead of filtered data matrix.
bcrank = barcodeRanks(counts(sce))
# Only showing unique points for plotting speed.
uniq = !duplicated(bcrank$rank)
par(mar=c(5,4,2,1), bty="n")
plot(bcrank$rank[uniq], bcrank$total[uniq], log="xy",
xlab="Rank", ylab="Total UMI count", cex=0.5, cex.lab=1.2)
abline(h=bcrank$inflection, col="darkgreen", lty=2)
abline(h=bcrank$knee, col="dodgerblue", lty=2)
legend("left", legend=c("Inflection", "Knee"), bty="n",
col=c("darkgreen", "dodgerblue"), lty=2, cex=1.2)
bcrank$inflection
## CAGCAATGCATTCT-7
## 469
bcrank$knee
## TTGACACTCTGCTC-8
## 1124
summary(bcrank$total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 16.91 0.00 10467.00
table(bcrank$total >= bcrank$knee)
##
## FALSE TRUE
## 5852487 45753
table(bcrank$total >= bcrank$inflection)
##
## FALSE TRUE
## 5830401 67839
set.seed(100)
e.out = emptyDrops(counts(sce))
e.out
## DataFrame with 5898240 rows and 5 columns
## Total LogProb PValue Limited FDR
## <integer> <numeric> <numeric> <logical> <numeric>
## AAACATACAAAACG-1 5 NA NA NA NA
## AAACATACAAAAGC-1 0 NA NA NA NA
## AAACATACAAACAG-1 4 NA NA NA NA
## AAACATACAAACGA-1 2 NA NA NA NA
## AAACATACAAAGCA-1 0 NA NA NA NA
## ... ... ... ... ... ...
## TTTGCATGTTTCGT-8 0 NA NA NA NA
## TTTGCATGTTTCTG-8 0 NA NA NA NA
## TTTGCATGTTTGCT-8 0 NA NA NA NA
## TTTGCATGTTTGGG-8 0 NA NA NA NA
## TTTGCATGTTTGTC-8 0 NA NA NA NA
is.cell = (e.out$FDR <= 0.01)
par(mar=c(5,4,1,1), mfrow=c(1,2), bty="n")
plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
xlab="Total UMI count", ylab="-Log Probability", cex=0.2)
abline(v = bcrank$inflection, col="darkgreen")
abline(v = bcrank$knee, col="dodgerblue")
legend("bottomright", legend=c("Inflection", "Knee"), bty="n",
col=c("darkgreen", "dodgerblue"), lty=1, cex=1.2)
plot(e.out$Total, -e.out$LogProb, col=ifelse(is.cell, "red", "black"),
xlab="Total UMI count", ylab="-Log Probability", cex=0.2, xlim=c(0,2000), ylim=c(0,2000))
abline(v = bcrank$inflection, col="darkgreen")
abline(v = bcrank$knee, col="dodgerblue")
From the above anlaysis, some cells with very small number of UMI’s may also have small FDR suggesting the distribution of UMI counts are different from what is expected from ambient profile. We choose a more conservative strategy, to keep the cells with total number of UMI larger than the inflection point estimate (bcrank$inflection=bcrank$inflection
) and FDR < 0.01.
table(colnames(sce) == rownames(e.out))
##
## TRUE
## 5898240
table(e.out$FDR <= 0.01, useNA="ifany")
##
## FALSE TRUE <NA>
## 10541 67464 5820235
table(is.cell, e.out$Total >= bcrank$inflection)
##
## is.cell FALSE TRUE
## FALSE 1775 8766
## TRUE 8391 59073
w2kp = which(is.cell & e.out$Total >= bcrank$inflection)
sce = sce[,w2kp]
dim(sce)
## [1] 31188 59073
Now we have reduced the ~6 million barcodes to around 60,000 (potential) cells. Note that Zheng et al. (2017) use a cutoff of total number of UMI \(\geq\) 396 and obtain around 78,000 cells.
Next step we apply more QC based on a set of features per cell.
library(data.table)
ribo.file = "~/research/scRNAseq/workflow/data/ribosome_genes.txt"
ribo = fread(ribo.file)
dim(ribo)
## [1] 164 12
ribo[1:2,]
## HGNC ID Approved Symbol Approved Name Status
## 1: 10298 RPL10 ribosomal protein L10 Approved
## 2: 10299 RPL10A ribosomal protein L10a Approved
## Previous Symbols Synonyms Chromosome
## 1: NOV, QM, DXS648E, DXS648, FLJ23544, L10 Xq28
## 2: NEDD6 Csa-19, L10A 6p21.31
## Accession Numbers RefSeq IDs Gene Family Tag Gene family description
## 1: AB007170 NM_006013 RPL L ribosomal proteins
## 2: U12404 NM_007104 RPL L ribosomal proteins
## Gene family ID
## 1: 729
## 2: 729
is.mito = which(rowData(sce)$chromosome_name == "MT")
is.ribo = which(rowData(sce)$hgnc_symbol %in% ribo$`Approved Symbol`)
length(is.mito)
## [1] 13
length(is.ribo)
## [1] 161
sce = calculateQCMetrics(sce, feature_controls=list(Mt=is.mito, Ri=is.ribo))
colnames(colData(sce))
## [1] "Sample"
## [2] "Barcode"
## [3] "is_cell_control"
## [4] "total_features_by_counts"
## [5] "log10_total_features_by_counts"
## [6] "total_counts"
## [7] "log10_total_counts"
## [8] "pct_counts_in_top_50_features"
## [9] "pct_counts_in_top_100_features"
## [10] "pct_counts_in_top_200_features"
## [11] "pct_counts_in_top_500_features"
## [12] "total_features"
## [13] "log10_total_features"
## [14] "pct_counts_top_50_features"
## [15] "pct_counts_top_100_features"
## [16] "pct_counts_top_200_features"
## [17] "pct_counts_top_500_features"
## [18] "total_features_by_counts_endogenous"
## [19] "log10_total_features_by_counts_endogenous"
## [20] "total_counts_endogenous"
## [21] "log10_total_counts_endogenous"
## [22] "pct_counts_endogenous"
## [23] "pct_counts_in_top_50_features_endogenous"
## [24] "pct_counts_in_top_100_features_endogenous"
## [25] "pct_counts_in_top_200_features_endogenous"
## [26] "pct_counts_in_top_500_features_endogenous"
## [27] "total_features_endogenous"
## [28] "log10_total_features_endogenous"
## [29] "pct_counts_top_50_features_endogenous"
## [30] "pct_counts_top_100_features_endogenous"
## [31] "pct_counts_top_200_features_endogenous"
## [32] "pct_counts_top_500_features_endogenous"
## [33] "total_features_by_counts_feature_control"
## [34] "log10_total_features_by_counts_feature_control"
## [35] "total_counts_feature_control"
## [36] "log10_total_counts_feature_control"
## [37] "pct_counts_feature_control"
## [38] "pct_counts_in_top_50_features_feature_control"
## [39] "pct_counts_in_top_100_features_feature_control"
## [40] "total_features_feature_control"
## [41] "log10_total_features_feature_control"
## [42] "pct_counts_top_50_features_feature_control"
## [43] "pct_counts_top_100_features_feature_control"
## [44] "total_features_by_counts_Mt"
## [45] "log10_total_features_by_counts_Mt"
## [46] "total_counts_Mt"
## [47] "log10_total_counts_Mt"
## [48] "pct_counts_Mt"
## [49] "total_features_Mt"
## [50] "log10_total_features_Mt"
## [51] "total_features_by_counts_Ri"
## [52] "log10_total_features_by_counts_Ri"
## [53] "total_counts_Ri"
## [54] "log10_total_counts_Ri"
## [55] "pct_counts_Ri"
## [56] "pct_counts_in_top_50_features_Ri"
## [57] "pct_counts_in_top_100_features_Ri"
## [58] "total_features_Ri"
## [59] "log10_total_features_Ri"
## [60] "pct_counts_top_50_features_Ri"
## [61] "pct_counts_top_100_features_Ri"
par(mfrow=c(2,2), mar=c(5, 4, 1, 1), bty="n")
hist(log10(sce$total_counts), xlab="log10(Library sizes)", main="",
breaks=20, col="grey80", ylab="Number of cells")
hist(log10(sce$total_features), xlab="log10(# of expressed genes)",
main="", breaks=20, col="grey80", ylab="Number of cells")
hist(sce$pct_counts_Ri, xlab="Ribosome prop. (%)",
ylab="Number of cells", breaks=40, main="", col="grey80")
hist(sce$pct_counts_Mt, xlab="Mitochondrial prop. (%)",
ylab="Number of cells", breaks=80, main="", col="grey80")
par(mfrow=c(2,2), mar=c(5, 4, 1, 1), bty="n")
smoothScatter(log10(sce$total_counts), log10(sce$total_features),
xlab="log10(Library sizes)", ylab="log10(# of expressed genes)",
nrpoints=500, cex=0.5)
smoothScatter(log10(sce$total_counts), sce$pct_counts_Ri,
xlab="log10(Library sizes)", ylab="Ribosome prop. (%)",
nrpoints=500, cex=0.5)
abline(h=10, lty=1)
smoothScatter(log10(sce$total_counts), sce$pct_counts_Mt,
xlab="log10(Library sizes)", ylab="Mitochondrial prop. (%)",
nrpoints=500, cex=0.5)
abline(h=5, lty=1)
smoothScatter(sce$pct_counts_Ri, sce$pct_counts_Mt,
xlab="Ribosome prop. (%)", ylab="Mitochondrial prop. (%)",
nrpoints=500, cex=0.5)
abline(h=5, lty=1)
abline(v=10, lty=1)
It is not clear why some cells have very low proportion of ribosome genes. To be a little bit conservative, we remove those genes with low proportion of ribosomeal genes (\(<10\%\)) or high proportion of Mitochondrial genes (\(>5\%\)).
table(sce$pct_counts_Mt > 5, sce$pct_counts_Ri < 10)
##
## FALSE TRUE
## FALSE 58763 140
## TRUE 161 9
sce.lq = sce[,which(sce$pct_counts_Mt > 5 | sce$pct_counts_Ri < 10)]
dim(sce.lq)
## [1] 31188 310
sce = sce[,which(sce$pct_counts_Mt <= 5 | sce$pct_counts_Ri >= 10)]
dim(sce)
## [1] 31188 59064
rowData(sce)[1:2,]
## DataFrame with 2 rows and 20 columns
## ensembl_gene_id hgnc_symbol chromosome_name start_position end_position
## <character> <character> <character> <integer> <integer>
## 1 ENSG00000004487 KDM1A 1 23019443 23083689
## 2 ENSG00000007923 DNAJC11 1 6634168 6701924
## strand
## <integer>
## 1 1
## 2 -1
## description
## <character>
## 1 lysine demethylase 1A [Source:HGNC Symbol;Acc:HGNC:29079]
## 2 DnaJ heat shock protein family (Hsp40) member C11 [Source:HGNC Symbol;Acc:HGNC:25570]
## percentage_gene_gc_content gene_biotype is_feature_control
## <numeric> <character> <logical>
## 1 39.17 protein_coding FALSE
## 2 46.6 protein_coding FALSE
## is_feature_control_Mt is_feature_control_Ri mean_counts
## <logical> <logical> <numeric>
## 1 FALSE FALSE 0.0130008633385811
## 2 FALSE FALSE 0.00890423713033027
## log10_mean_counts n_cells_by_counts pct_dropout_by_counts total_counts
## <numeric> <integer> <numeric> <numeric>
## 1 0.00560981549159529 701 98.813332656205 768
## 2 0.00384994595981834 499 99.1552824471417 526
## log10_total_counts n_cells_counts pct_dropout_counts
## <numeric> <integer> <numeric>
## 1 2.88592633980143 701 98.813332656205
## 2 2.72181061521255 499 99.1552824471417
min(rowData(sce)$mean_counts)
## [1] 0
min(rowData(sce)$mean_counts[rowData(sce)$mean_counts>0])
## [1] 1.692821e-05
min(rowData(sce)$n_cells_counts)
## [1] 0
par(mfrow=c(1,3), mar=c(5,4,1,1))
hist(log10(rowData(sce)$mean_counts+1e-6), col="grey80", main="",
breaks=40, xlab="log10(ave # of UMI + 1e-6)")
hist(log10(rowData(sce)$n_cells_counts+1), col="grey80", main="",
breaks=40, xlab="log10(# of expressed cells + 1)")
smoothScatter(log10(rowData(sce)$mean_counts+1e-6),
log10(rowData(sce)$n_cells_counts + 1),
xlab="log10(ave # of UMI + 1e-6)",
ylab="log10(# of expressed cells + 1)")
tb1 = table(rowData(sce)$n_cells_counts)
tb1[1:11]
##
## 0 1 2 3 4 5 6 7 8 9 10
## 11589 1618 841 590 443 333 277 233 241 195 150
We remove those genes that are expressed in zero or only one cell. The variable strand need to be renamed, otherwise there is an error message that such a variabel name cannot be used.
names(rowData(sce))[6] = "strand_n"
sce = sce[which(rowData(sce)$n_cells_counts > 1),]
dim(sce)
## [1] 17981 59064
Next we check those highly expressed genes
par(mar=c(5,4,1,1))
od1 = order(rowData(sce)$mean_counts, decreasing = TRUE)
barplot(rowData(sce)$mean_counts[od1[20:1]], las=1,
names.arg=rowData(sce)$hgnc_symbol[od1[20:1]],
horiz=TRUE, cex.names=0.5, cex.axis=0.7,
xlab="ave # of UMI")
A simple solution for normalization and stablizing expression varaince across genes is to tranform the count data by log(count/size.factor + 1). One may calcualte size.factor per cell as the total number of UMIs, and this assumes the total expression are the same across all the cells. However, the total expression of each cell may vary with respect to cell type and/or cell size, and the computeSumFactors
function in R package scran provides a more sophisicated way to calcualte size.factor to allow such variaation across cells (Lun, Bach, and Marioni 2016). computeSumFactors
can use initial clustering of cells to normalize expression within and beetween clusters. Within a cluster, it estimates the size factor for many groups of cells so that there are more groups than cells, and then it can calcualte the size factor per cell using a lienar deconvolution system.
As shown in the following plot, the final size factor estimation is indeed highly correlated with the naive definition by total count.
Finally, the command normalize(sce)
adds the normalized expression into the variable sce
.
library(scran)
date()
## [1] "Sun Aug 19 23:54:47 2018"
clusters = quickCluster(sce, min.mean=0.1, method="igraph")
date()
## [1] "Mon Aug 20 00:25:19 2018"
sce = computeSumFactors(sce, cluster=clusters, min.mean=0.1)
date()
## [1] "Mon Aug 20 00:38:20 2018"
summary(sizeFactors(sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1220 0.7673 0.9160 1.0000 1.1168 10.3884
par(mfrow=c(1,2), mar=c(5,4,2,1), bty="n")
smoothScatter(sce$total_counts, sizeFactors(sce), log="xy",
xlab="total counts", ylab="size factors")
plot(sce$total_counts, sizeFactors(sce), log="xy",
xlab="total counts", ylab="size factors",
cex=0.3, pch=20, col=rgb(0.1,0.2,0.7,0.3))
sce = normalize(sce)
For dimension reduction, such as calculating PCA or performing TSNE, we should start by identifying a subset of genes with high level of biological signal relative to background (technical) noise. The decomposeVar
function from R/cran is designed for this task.
new.trend = makeTechTrend(x=sce)
fit = trendVar(sce, use.spikes=FALSE, loess.args=list(span=0.05))
par(mfrow=c(1,1), mar=c(5,4,2,1), bty="n")
plot(fit$mean, fit$var, pch=20, col=rgb(0.1,0.2,0.7,0.6),
xlab="log(mean)", ylab="var")
curve(fit$trend(x), col="orange", lwd=2, add=TRUE)
curve(new.trend(x), col="red", lwd=2, add=TRUE)
legend("topright", legend=c("Poisson noise", "observed trend"),
lty=1, lwd=2, col=c("red", "orange"), bty="n")
fit$trend = new.trend
dec = decomposeVar(fit=fit)
top.dec = dec[order(dec$bio, decreasing=TRUE),]
plotExpression(sce, features=rownames(top.dec)[1:10])
When performing PCA, we can use all the genes or just those genes with high signal-to-noise ratio. TSNE analysis is usually based on the top PCs rather than the original gene expression data. We first perform PCA using all the genes and the function denoisePCA
can automatically select the PCs based on modeling of technical noise.
date()
## [1] "Mon Aug 20 00:40:47 2018"
sce = denoisePCA(sce, technical=new.trend, approx=TRUE)
date()
## [1] "Mon Aug 20 00:50:11 2018"
dim(reducedDim(sce, "PCA"))
## [1] 59064 45
plot(log10(attr(reducedDim(sce), "percentVar")), xlab="PC",
ylab="log10(Prop of variance explained)", pch=20, cex=0.6,
col=rgb(0.8, 0.2, 0.2, 0.5))
abline(v=ncol(reducedDim(sce, "PCA")), lty=2, col="red")
df_pcs = data.frame(reducedDim(sce, "PCA"))
df_pcs$log10_total_features = colData(sce)$log10_total_features
gp1 = ggplot(df_pcs, aes(PC1,PC2,col=log10_total_features)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
scale_colour_gradient(low="lightblue",high="red") +
guides(color = guide_legend(override.aes = list(size=3)))
gp1
date()
## [1] "Mon Aug 20 00:50:13 2018"
sce = runTSNE(sce, use_dimred="PCA", perplexity=30, rand_seed=100)
date()
## [1] "Mon Aug 20 01:08:44 2018"
df_tsne = data.frame(reducedDim(sce, "TSNE"))
df_tsne$log10_total_features = colData(sce)$log10_total_features
gp1 = ggplot(df_tsne, aes(X1,X2,col=log10_total_features)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
scale_colour_gradient(low="lightblue",high="red") +
guides(color = guide_legend(override.aes = list(size=3)))
gp1
Next we only select around top 1000 genes for the PCA and use the top 50 PCs for TSNE projection.
library(svd)
library(Rtsne)
summary(dec$bio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0011881 0.0000261 0.0003834 0.0050907 0.0018510 2.0471504
dec1 = dec
dec1$bio[which(dec$bio < 1e-8)] = 1e-8
dec1$FDR[which(dec$FDR < 1e-100)] = 1e-100
par(mfrow=c(1,2))
hist(log10(dec1$bio), breaks=100, main="")
hist(log10(dec1$FDR), breaks=100, main="")
summary(dec$FDR[dec$bio > 0.001])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000000 0.0000000 0.0000339 0.0000000 0.0693136
table(dec$FDR < 1e-10, dec$bio > 0.01)
##
## FALSE TRUE
## FALSE 3852 5
## TRUE 13089 1035
w2kp = which(dec$FDR < 1e-10 & dec$bio > 0.01)
sce_sub = sce[w2kp,]
sce_sub
## class: SingleCellExperiment
## dim: 1035 59064
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(1035): TNFRSF1B CDC42 ... HMGN1 ATP5PO
## rowData names(20): ensembl_gene_id hgnc_symbol ... n_cells_counts
## pct_dropout_counts
## colnames(59064): AAACATACACCCAA-1 AAACATACCCCTCA-1 ...
## TTTGCATGCTGCAA-8 TTTGCATGTGGTAC-8
## colData names(61): Sample Barcode ...
## pct_counts_top_50_features_Ri pct_counts_top_100_features_Ri
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
edat = t(as.matrix(logcounts(sce_sub)))
edat = scale(edat)
dim(edat)
## [1] 59064 1035
edat[1:2,1:3]
## TNFRSF1B CDC42 ENO1
## AAACATACACCCAA-1 -0.2220264 -0.3758056 -0.7806183
## AAACATACCCCTCA-1 -0.2220264 -0.3758056 -0.7806183
date()
## [1] "Mon Aug 20 01:09:10 2018"
ppk = propack.svd(edat,neig=50)
date()
## [1] "Mon Aug 20 01:11:13 2018"
pca = t(ppk$d*t(ppk$u))
df_pcs = data.frame(pca)
df_pcs$log10_total_features = colData(sce_sub)$log10_total_features
df_pcs[1:2,]
## X1 X2 X3 X4 X5 X6
## 1 -1.938437 2.14847107 0.9914546 -0.5089818 -1.400226 -2.205035
## 2 -6.093417 0.08501296 -0.5933283 -0.5807430 -1.655475 1.886403
## X7 X8 X9 X10 X11 X12 X13
## 1 -0.8143478 2.783945 -2.9836858 1.1779709 0.9085121 -0.9685727 -1.421720
## 2 -2.1541784 1.413880 -0.5692589 0.8448131 2.6189810 1.5308592 -1.651077
## X14 X15 X16 X17 X18 X19
## 1 1.341527 -0.1550409 1.421038 -0.1670475 -0.1453368 0.9407139
## 2 -1.432342 -0.9404134 -2.862906 -0.1586555 1.0596564 0.9069979
## X20 X21 X22 X23 X24 X25
## 1 0.6789775 -2.208472 -0.7156601 0.1852175 1.7249985 -2.1441272
## 2 -0.6383841 1.238179 0.9927511 0.6050383 -0.9835777 0.7632581
## X26 X27 X28 X29 X30 X31 X32
## 1 0.5064031 0.172789 0.3513667 2.265742 -0.3584804 1.4068027 0.2128577
## 2 -0.1382440 1.353238 2.0299458 1.189431 0.2047557 0.4277797 1.3450210
## X33 X34 X35 X36 X37 X38
## 1 0.0124918 -0.9527684 0.6947430 -0.8310609 -0.07669705 -2.1599937
## 2 -2.3776169 0.4908630 0.5240087 1.7471721 0.07878119 -0.5164361
## X39 X40 X41 X42 X43 X44
## 1 0.7357511 0.6094890 -2.41474788 -0.41392187 -0.5895618 0.05231965
## 2 -0.2466020 -0.7048429 -0.02699798 -0.07840753 0.8481294 1.37930577
## X45 X46 X47 X48 X49 X50
## 1 0.9999528 0.46641941 1.730882 -0.9973974 2.1186948 -1.3535233
## 2 1.3630837 0.04804352 -1.725742 -0.2842565 0.2822561 -0.3334095
## log10_total_features
## 1 2.696356
## 2 2.673021
gp1 = ggplot(df_pcs, aes(X1,X2,col=log10_total_features)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
scale_colour_gradient(low="lightblue",high="red") +
guides(color = guide_legend(override.aes = list(size=3)))
gp1
set.seed(100)
date()
## [1] "Mon Aug 20 01:11:15 2018"
tsne = Rtsne(pca, pca = FALSE)
date()
## [1] "Mon Aug 20 01:32:07 2018"
df_tsne = data.frame(tsne$Y)
df_tsne$log10_total_features = colData(sce_sub)$log10_total_features
dim(df_tsne)
## [1] 59064 3
df_tsne[1:2,]
## X1 X2 log10_total_features
## 1 1.577535 -8.10810 2.696356
## 2 -16.022960 11.87136 2.673021
gp1 = ggplot(df_tsne, aes(X1,X2,col=log10_total_features)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
scale_colour_gradient(low="lightblue",high="red") +
guides(color = guide_legend(override.aes = list(size=3)))
gp1
reducedDims(sce_sub) = SimpleList(PCA=pca, TSNE=tsne$Y)
sce_sub
## class: SingleCellExperiment
## dim: 1035 59064
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(1035): TNFRSF1B CDC42 ... HMGN1 ATP5PO
## rowData names(20): ensembl_gene_id hgnc_symbol ... n_cells_counts
## pct_dropout_counts
## colnames(59064): AAACATACACCCAA-1 AAACATACCCCTCA-1 ...
## TTTGCATGCTGCAA-8 TTTGCATGTGGTAC-8
## colData names(61): Sample Barcode ...
## pct_counts_top_50_features_Ri pct_counts_top_100_features_Ri
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
There are many methods for clustering of single cell RNA-seq data. The performance of each method may also depend on pre-processing steps, such as performing imputation or not. We wil compare these methods in a seperate document. Here we just illustrate the clustering reuslts using a simple kmeans method on the top 50 PCs.
k10_50_pcs = kmeans(reducedDim(sce_sub, "PCA"), centers=10,
iter.max=150, algorithm="MacQueen")
names(k10_50_pcs)
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
dim(k10_50_pcs$centers)
## [1] 10 50
df_tsne$cluster_kmean = as.factor(k10_50_pcs$cluster)
cols = c("#FB9A99","#FF7F00","yellow","orchid","grey",
"red","dodgerblue2","tan4","green4","#99c9fb")
gp1 = ggplot(df_tsne, aes(X1,X2,col=cluster_kmean)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
scale_color_manual(values=cols) +
guides(color = guide_legend(override.aes = list(size=3)))
gp1
An alternative popular clustreing method is a graph based method that first construct a graph for all the cells and then identify clusters of cells by identifying densely connected subgraphs (C. Xu and Su 2015). One possible implementation is the following approach
snn.gr = buildSNNGraph(sce_sub, use.dimred="PCA")
clusters = igraph::cluster_walktrap(snn.gr)
Though this implementation is very slow for this large dataset. It took more than 3 hours on a MacbookPro with 2.8 G i7 core and 16G memmory. The results of graphical model based on clustering and Kmeans have large overlap in some cluters. Though in general, graph-based identify many small clusters.
Compare the clustering resutls of Kmeans vs. a graph based method. Each column corresponds to one of 31 clusters by graph-based method. Each row corresponds to one of the 10 clusters by kmeans
Next we identify the most likely cell type of each cell by comparing the expression of each cell versus gene expression for 11 cell types. We chose to use a small set of genes that are known to be differentially expressed across immune cell types. This gene set is constructed by the union of the signature genes used by CIBERSORT (Newman et al. 2015) and 31 known cell type markers.
purified_ref_11 = readRDS("data/expression_11_cell_types_Zheng_2016.rds")
row2kp = rowData(sce)$ensembl_gene_id %in% rownames(purified_ref_11)
sce_ct = sce[which(row2kp),]
dim(sce_ct)
## [1] 17525 59064
lm22 = fread("data/LM22.txt")
dim(lm22)
## [1] 547 23
lm22[1:2,1:5]
## Gene symbol B cells naive B cells memory Plasma cells T cells CD8
## 1: ABCB4 555.71345 10.74423 7.225819 4.31128
## 2: ABCB9 15.60354 22.09479 653.392328 24.22372
genes2use = c("CD19", "IGKC", "IGLC2", "MS4A1", "MME", "HLA-DRB1",
"CD80", "CD3E", "CD4", "CD8A", "IL2RA", "SELL", "CCR7",
"IL7R", "FOXP3", "CD14", "CD33", "NCAM1", "CXCR3", "FAS",
"FASLG", "CD44", "PDCD1", "CD274", "PDCD1LG2", "EOMES",
"TBX21", "IFNG", "GZMB", "CD38", "TNFRSF9")
length(genes2use)
## [1] 31
genes2use = union(lm22$`Gene symbol`, genes2use)
length(genes2use)
## [1] 558
genes2use = intersect(genes2use, rowData(sce_ct)$hgnc_symbol)
sce_ct = sce_ct[match(genes2use, rowData(sce_ct)$hgnc_symbol),]
dim(sce_ct)
## [1] 443 59064
mat1 = match(rowData(sce_ct)$ensembl_gene_id, rownames(purified_ref_11))
purified_ref_11 = purified_ref_11[mat1,]
dim(purified_ref_11)
## [1] 443 11
purified_ref_11[1:3,]
## CD34+ CD56+ NK CD4+/CD45RA+/CD25- Naive T
## ENSG00000005471 1.392797e-07 2.224760e-07 7.764879e-08
## ENSG00000150967 1.149058e-06 8.157454e-07 2.096517e-06
## ENSG00000072818 1.382003e-04 2.611127e-04 3.502737e-04
## CD4+/CD25 T Reg CD8+/CD45RA+ Naive Cytotoxic
## ENSG00000005471 7.703224e-08 5.542298e-08
## ENSG00000150967 9.243869e-07 1.053037e-06
## ENSG00000072818 3.631300e-04 3.087614e-04
## CD4+/CD45RO+ Memory CD8+ Cytotoxic T CD19+ B
## ENSG00000005471 6.178376e-08 6.062095e-08 2.561460e-05
## ENSG00000150967 1.297459e-06 1.091177e-06 7.805004e-07
## ENSG00000072818 3.296781e-04 3.293536e-04 2.230102e-04
## CD4+ T Helper2 CD14+ Monocyte Dendritic
## ENSG00000005471 6.457025e-08 5.463552e-07 2.442730e-06
## ENSG00000150967 2.066248e-06 1.256617e-05 4.885460e-06
## ENSG00000072818 3.481628e-04 4.480113e-05 5.129733e-05
We calculate the rank-based spearman correlation beween each cell and 11 cell type-specifid gene expression profiles. It is interesting to note that for many cells, the second largest correlation is very similar to the largest one and thus this assignment of cell type is with considerable amount of uncertainty.
edat = as.matrix(logcounts(sce_ct))
dim(edat)
## [1] 443 59064
edat[1:2,1:2]
## AAACATACACCCAA-1 AAACATACCCCTCA-1
## ABCB4 0 0
## ABCB9 0 0
cor.ct = cor(edat, purified_ref_11, method="spearman")
dim(cor.ct)
## [1] 59064 11
summary(c(cor.ct))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1390 0.1057 0.2241 0.1981 0.2896 0.5280
largest2 <- function(v){
sort(v, decreasing=TRUE)[2]
}
cor_top1 = apply(cor.ct, 1, max)
cor_top2 = apply(cor.ct, 1, largest2)
summary(cor_top1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01088 0.28193 0.31130 0.30971 0.33982 0.52800
summary(cor_top1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01088 0.28193 0.31130 0.30971 0.33982 0.52800
summary(cor_top1 - cor_top2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.000e-08 1.964e-03 6.242e-03 1.634e-02 1.949e-02 2.223e-01
summary((cor_top1 - cor_top2)/cor_top1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000002 0.0065773 0.0204692 0.0524379 0.0611228 1.2657452
par(mar=c(5,4,1,1), bty="n")
plot(cor_top1, cor_top2, pch=20, col=rgb(0.8, 0.2, 0.2, 0.5),
cex=0.2, xlab="largest correlation",
ylab="2nd largest correlation")
abline(0,1)
abline(0, 0.8, lty=2)
abline(v=0.2, lty=2)
text(0.5, 0.38, pos=1, "y=0.8x")
df_tsne$ct = colnames(cor.ct)[apply(cor.ct, 1, which.max)]
.set_pbmc_color_11<-function() {
myColors <- c( "dodgerblue2",
"green4",
"#6A3D9A", # purple
"grey",
"tan4",
"yellow",
"#FF7F00", # orange
"black",
"#FB9A99", # pink
"orchid",
"red")
id <- c("CD19+ B",
"CD14+ Monocyte",
"Dendritic",
"CD56+ NK",
"CD34+",
"CD4+/CD25 T Reg",
"CD4+/CD45RA+/CD25- Naive T",
"CD4+/CD45RO+ Memory",
"CD4+ T Helper2",
"CD8+/CD45RA+ Naive Cytotoxic",
"CD8+ Cytotoxic T")
names(myColors)<-id
scale_colour_manual(name = "ct",values = myColors)
}
gp1 = ggplot(df_tsne,aes(X1,X2,col=ct)) +
geom_point(size=0.2,alpha=0.6) + theme_classic() +
.set_pbmc_color_11() +
guides(color = guide_legend(override.aes = list(size=3)))
gp1
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rtsne_0.13 svd_0.4.1
## [3] scran_1.8.4 data.table_1.11.4
## [5] scater_1.8.3 ggplot2_3.0.0
## [7] dplyr_0.7.6 biomaRt_2.36.1
## [9] DropletUtils_1.0.3 SingleCellExperiment_1.2.0
## [11] SummarizedExperiment_1.10.1 DelayedArray_0.6.4
## [13] matrixStats_0.54.0 Biobase_2.40.0
## [15] GenomicRanges_1.32.6 GenomeInfoDb_1.16.0
## [17] IRanges_2.14.10 S4Vectors_0.18.3
## [19] BiocGenerics_0.26.0 BiocParallel_1.14.2
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7
## [3] progress_1.2.0 httr_1.3.1
## [5] rprojroot_1.3-2 dynamicTreeCut_1.63-1
## [7] tools_3.5.0 backports_1.1.2
## [9] irlba_2.3.2 R6_2.2.2
## [11] DT_0.4 KernSmooth_2.23-15
## [13] vipor_0.4.5 DBI_1.0.0
## [15] lazyeval_0.2.1 colorspace_1.3-2
## [17] withr_2.1.2 tidyselect_0.2.4
## [19] gridExtra_2.3 prettyunits_1.0.2
## [21] bit_1.1-14 compiler_3.5.0
## [23] labeling_0.3 scales_1.0.0
## [25] stringr_1.3.1 digest_0.6.15
## [27] rmarkdown_1.10 XVector_0.20.0
## [29] pkgconfig_2.0.1 htmltools_0.3.6
## [31] limma_3.36.2 highr_0.7
## [33] htmlwidgets_1.2 rlang_0.2.1
## [35] RSQLite_2.1.1 FNN_1.1.2.1
## [37] shiny_1.1.0 DelayedMatrixStats_1.2.0
## [39] bindr_0.1.1 RCurl_1.95-4.11
## [41] magrittr_1.5 GenomeInfoDbData_1.1.0
## [43] Matrix_1.2-14 Rcpp_0.12.18
## [45] ggbeeswarm_0.6.0 munsell_0.5.0
## [47] Rhdf5lib_1.2.1 viridis_0.5.1
## [49] stringi_1.2.4 yaml_2.2.0
## [51] edgeR_3.22.3 zlibbioc_1.26.0
## [53] rhdf5_2.24.0 plyr_1.8.4
## [55] grid_3.5.0 blob_1.1.1
## [57] promises_1.0.1 shinydashboard_0.7.0
## [59] crayon_1.3.4 lattice_0.20-35
## [61] cowplot_0.9.3 hms_0.4.2
## [63] locfit_1.5-9.1 knitr_1.20
## [65] pillar_1.3.0 igraph_1.2.2
## [67] rjson_0.2.20 reshape2_1.4.3
## [69] XML_3.98-1.15 glue_1.3.0
## [71] evaluate_0.11 httpuv_1.4.5
## [73] gtable_0.2.0 purrr_0.2.5
## [75] assertthat_0.2.0 mime_0.5
## [77] xtable_1.8-2 later_0.7.3
## [79] viridisLite_0.3.0 tibble_1.4.2
## [81] AnnotationDbi_1.42.1 beeswarm_0.2.3
## [83] memoise_1.1.0 tximport_1.8.0
## [85] bindrcpp_0.2.2 statmod_1.4.30
Lun, Aaron TL, Karsten Bach, and John C Marioni. 2016. “Pooling Across Cells to Normalize Single-Cell Rna Sequencing Data with Many Zero Counts.” Genome Biology 17 (1). BioMed Central: 75.
Newman, Aaron M, Chih Long Liu, Michael R Green, Andrew J Gentles, Weiguo Feng, Yue Xu, Chuong D Hoang, Maximilian Diehn, and Ash A Alizadeh. 2015. “Robust Enumeration of Cell Subsets from Tissue Expression Profiles.” Nature Methods 12 (5). Nature Publishing Group: 453.
Xu, Chen, and Zhengchang Su. 2015. “Identification of Cell Types from Single-Cell Transcriptomes Using a Novel Clustering Method.” Bioinformatics 31 (12). Oxford University Press: 1974–80.
Zheng, Grace XY, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8. Nature Publishing Group: 14049.
Ziegenhain, Christoph, Beate Vieth, Swati Parekh, Björn Reinius, Amy Guillaumet-Adkins, Martha Smets, Heinrich Leonhardt, Holger Heyn, Ines Hellmann, and Wolfgang Enard. 2017. “Comparative Analysis of Single-Cell Rna Sequencing Methods.” Molecular Cell 65 (4). Elsevier: 631–43.