Comment on page
4. QC and normalization
load R libraries
library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
load the expression matrix and sample annotations
mx <- read.table("GSE71008.NvsCRC.reads.txt", sep = "\t")
anno <- read.table("GSE71008.NvsCRC.anno.txt", sep = "\t", header=T)
# assign class
anno$Class <- "NC"
anno[which(anno$Stage=="1S"),]$Class <- "S1"
anno[which(anno$Stage=="2S"),]$Class <- "S2"
anno[which(anno$Stage=="3S"),]$Class <- "S3"
anno[which(anno$Stage=="4S"),]$Class <- "S4"
Inspect a small portion of the expression matrix and sample annotations
head(mx[,1:5])
s1S1 s1S2 s1S3 s2S1 s2S2
A-NT2RP7011570 11 10 6 0 9
C-ADG04260 1 1 1 1 1
C-ADG07684 2 3 5 3 5
C-ASTRO3000154 9 9 10 5 11
C-BRACE2001543 0 0 0 0 0
C-BRACE2001954 1 2 1 1 1
head(anno)
CancerType Stage Individual Class
1 Colorectal_Cancer 1S s1S1 S1
2 Colorectal_Cancer 1S s1S2 S1
3 Colorectal_Cancer 1S s1S3 S1
4 Colorectal_Cancer 1S s1S4 S1
5 Colorectal_Cancer 1S s1S5 S1
6 Colorectal_Cancer 1S s1S9 S1
We create SCE object to standardize the analysis using both
SingleCellExperiment
(SCE) and scater
packages.anno_NvsEachStage <- anno
mx_NvsEachStage <- mx
reads_NvsEachStage <- SingleCellExperiment(
assays = list(counts = as.matrix(mx_NvsEachStage)),
colData = anno_NvsEachStage)
Remove genes that are not expressed in any samples
keep_feature <- rowSums(counts(reads_NvsEachStage) > 0) > 0
reads_NvsSeach <- reads_NvsEachStage[keep_feature, ]
Define control genes, usually should be ERCC spike-in. In our case, we use three most stably expressed RNA transcripts: miR-99a-5p, miR-30a-5p and miR-221-3p. See more details in paper.
isSpike(reads_NvsEachStage, "stableRNA") <- rownames(reads_NvsEachStage) %in%
c("mature_miRNA:hsa-miR-99a-5p", "mature_miRNA:hsa-miR-30a-5p",
"mature_miRNA:hsa-miR-221-3p")
Considering the heterogeneities and batch effect between samples, we need to filter out untreated samples and genes.
Calculate the quality metrics:
reads_NvsEachStage <- calculateQCMetrics(
reads_NvsEachStage,
feature_controls = list(
stableRNA = isSpike(reads_NvsEachStage, "stableRNA")
)
)
hist(reads_NvsEachStage$total_counts,breaks = 100)
abline(v=990000, col="red")
filter_by_total_counts <- (reads_NvsEachStage$total_counts > 990000)
table(filter_by_total_counts)

filter_by_total_counts
FALSE TRUE
9 141
hist(reads_NvsEachStage$total_features,breaks = 100)
abline(v=2500, col="red")
filter_by_expr_features <- (reads_NvsEachStage$total_features > 2500)
table(filter_by_expr_features)

filter_by_expr_features
FALSE TRUE
10 140
Another measure of cell quality is the ratio between spike-in /control RNAs and endogenous RNAs. This ratio can be used to estimate the total amount of RNA in the samples. Samples with a high level of spike-in / control RNAs had low starting amounts of RNA, likely due to the RNA being degraded.
plotPhenoData(
reads_NvsEachStage,
aes_string(
x = "total_features",
y = "pct_counts_stableRNA",
colour = "Class"
)
)

filter out samples with too high spike-in / control RNA.
filter_by_endoCtrl <- reads_NvsEachStage$pct_counts_stableRNA < 10
table(filter_by_endoCtrl)
Based on previous analysis, we can define a sample filter:
reads_NvsEachStage$use <- (
# sufficient features (genes)
filter_by_expr_features &
# sufficient molecules counted
filter_by_total_counts &
# sufficient endogenous RNA
filter_by_endoCtrl
)
table(reads_NvsEachStage$use)
It is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. In our case, we consider the top 50 expressed genes.
plotQC(reads_NvsEachStage, type = "highest-expression")

It is typically a good idea to remove genes whose expression level is considered “undetectable”.
filter_genes <- apply(
counts(reads_NvsEachStage[, colData(reads_NvsEachStage)$use]),
1,
function(x) length(x[x >= 2]) >= 20
)
table(filter_genes)
rowData(reads_NvsEachStage)$use <- filter_genes
dim(reads_NvsEachStage[rowData(reads_NvsEachStage)$use, colData(reads_NvsEachStage)$use])
assay(reads_NvsEachStage, "logcounts_raw") <- log2(counts(reads_NvsEachStage) + 1)
reads_NvsEachStage.qc <- reads_NvsEachStage[rowData(reads_NvsEachStage)$use, colData(reads_NvsEachStage)$use]
# save the data
saveRDS(reads_NvsEachStage.qc, file = "GSE71008.reads_NvsEachStage.clean.rds")
The easiest way to overview the data is by transforming it using the principal component analysis and then visualize the first two principal components.
First, we compare the PCA results before and after QC.
Before QC
endog_genes <- !rowData(reads_NvsEachStage)$is_feature_control
plotPCA(
reads_NvsEachStage[endog_genes, ],
exprs_values = "counts",
colour_by = "Class",
size_by = "total_features"
)

After QC
reads_NvsEachStage.qc <- reads_NvsEachStage[rowData(reads_NvsEachStage)$use, colData(reads_NvsEachStage)$use]
endog_genes <- !rowData(reads_NvsEachStage.qc)$is_feature_control
plotPCA(
reads_NvsEachStage.qc[endog_genes, ],
exprs_values = "counts",
colour_by = "Class",
size_by = "total_features"
)

By default only the top 500 most variable genes are used by scater to calculate the PCA. This can be adjusted by changing the
ntop
argument.use top 100 genes
plotPCA(
reads_NvsEachStage.qc[endog_genes, ],
exprs_values = "counts",
colour_by = "Class",
size_by = "total_features",
ntop = 100
)

tSNE (t-Distributed Stochastic Neighbor Embedding) combines dimensionality reduction (e.g. PCA) with random walks on the nearest-neighbour network to map high dimensional data to a 2-dimensional space while preserving local distances between samples. In contrast with PCA, tSNE is a stochastic algorithm which means running the method multiple times on the same dataset will result in different plots. Due to the non-linear and stochastic nature of the algorithm, tSNE is more difficult to intuitively interpret tSNE. To ensure reproducibility, we fix the “seed” of the random-number generator in the code below so that we always get the same plot.
Before QC
plotTSNE(
reads_NvsEachStage[endog_genes, ],
exprs_values = "counts",
perplexity = 40,
colour_by = "Class",
size_by = "total_features",
rand_seed = 123456,
ntop = 100
)

After QC
plotTSNE(
reads_NvsEachStage.qc[endog_genes, ],
exprs_values = "counts",
perplexity = 40,
colour_by = "Class",
size_by = "total_features",
rand_seed = 123456,
ntop = 100
)

Furthermore, tSNE requires you to provide a value of
perplexity
which reflects the number of neighbours used to build the nearest-neighbour network; a high value creates a dense network which clumps samples together while a low value makes the network more sparse allowing groups of samples to separate from each other. scater
uses a default perplexity of the total number of cells divided by five (rounded down).set perplexity = 10
plotTSNE(
reads_NvsEachStage.qc[endog_genes, ],
exprs_values = "counts",
perplexity = 10,
colour_by = "Class",
size_by = "total_features",
rand_seed = 123456,
ntop = 100
)



please reference this paper:
scater
allows us to normalize raw counts using function normaliseExprs()
To compare the efficiency of different normalization methods we will use visual inspection of
PCA
plots and calculation of cell-wise relative log expression via scater
’s plotRLE()
function. Namely, cells with many (few) reads have higher (lower) than median expression for most genes resulting in a positive (negative) RLE across the cell, whereas normalized cells have an RLE close to zero. Example of a RLE function in R
:calc_sample_RLE <-
function (expr_mat, spikes = NULL)
{
RLE_gene <- function(x) {
if (median(unlist(x)) > 0) {
log((x + 1)/(median(unlist(x)) + 1))/log(2)
}
else {
rep(NA, times = length(x))
}
}
if (!is.null(spikes)) {
RLE_matrix <- t(apply(expr_mat[-spikes, ], 1, RLE_gene))
}
else {
RLE_matrix <- t(apply(expr_mat, 1, RLE_gene))
}
sample_RLE <- apply(RLE_matrix, 2, median, na.rm = T)
return(sample_RLE)
}
First, we filter out cleaned dataset
reads_NvsEachStage.qc <- reads_NvsEachStage[rowData(reads_NvsEachStage)$use, colData(reads_NvsEachStage)$use]
The simplest way to normalize this data is to convert it to counts per million (CPM) by dividing each column by its total then multiplying by 1,000,000.
logcounts(reads_NvsEachStage.qc) <- log2(calculateCPM(reads_NvsEachStage.qc, use.size.factors = FALSE) + 1)
plotPCA(
reads_NvsEachStage.qc[endog_genes, ],
exprs_values = "logcounts",
colour_by = "Class",
size_by = "total_features"
)
plotRLE(
reads_NvsEachStage.qc[endog_genes, ],
exprs_mats = list(Raw = "counts", CPM = "logcounts"),
exprs_logged = c(TRUE, TRUE),
colour_by = "Class"
)


Another method is called TMM is the weighted trimmed mean of M-values (to the reference) proposed by edgeR. The M-values in question are the gene-wise log2 fold changes between individual samples.
reads_NvsEachStage.qc <- normaliseExprs(
reads_NvsEachStage.qc,
method = "TMM",
feature_set = endog_genes,
return_log = TRUE,
return_norm_as_exprs = TRUE
)
plotPCA(
reads_NvsEachStage.qc[endog_genes, ],
exprs_values = "normcounts",
colour_by = "Class",
size_by = "total_features"
)
plotRLE(
reads_NvsEachStage.qc[endog_genes, ],
exprs_mats = list(Raw = "counts", TMM = "normcounts"),
exprs_logged = c(TRUE, TRUE),
colour_by = "Class"
)


The size factor (SF) was proposed and popularized by DESeq. First the geometric mean of each gene across all cells is calculated.
reads_NvsEachStage.qc <- normaliseExprs(
reads_NvsEachStage.qc,
method = "RLE",
feature_set = endog_genes,
return_log = TRUE,
return_norm_as_exprs = TRUE,
exprs_values = exprs
)
plotPCA(
reads_NvsEachStage.qc[endog_genes, ],
exprs_values = "normcounts",
colour_by = "Class",
size_by = "total_features"
)
plotRLE(
reads_NvsEachStage.qc[endog_genes, ],
exprs_mats = list(Raw = "counts", SF = "normcounts"),
exprs_logged = c(TRUE, TRUE),
colour_by = "Class"
)


scran
package implements a variant on CPM specialized for single-cell data. Briefly this method deals with the problem of vary large numbers of zero values per cell by pooling cells together calculating a normalization factor (similar to CPM) for the sum of each pool. Since each cell is found in many different pools, cell-specific factors can be deconvoluted from the collection of pool-specific factors using linear algebra.
library(scran)
# define cluster for each sample
sampleLables <- c()
for(i in colnames(reads_NvsEachStage.qc)){tmp <- as.character(anno[which(anno$Individual==i),"Class"]); sampleLables <- c(sampleLables,tmp)}
sampleLables <- replace(sampleLables, which(sampleLables=="S1"),1)
sampleLables <- replace(sampleLables, which(sampleLables=="S2"),2)
sampleLables <- replace(sampleLables, which(sampleLables=="S3"),3)
sampleLables <- replace(sampleLables, which(sampleLables=="S4"),4)
sampleLables <- replace(sampleLables, which(sampleLables=="NC"),5)
sampleLables <- as.numeric(sampleLables)
reads_NvsEachStage.qc <- computeSumFactors(
reads_NvsEachStage.qc,
sizes = 10,
clusters = sampleLables
)
reads_NvsEachStage.qc <- normalize(reads_NvsEachStage.qc)
plotPCA(
reads_NvsEachStage.qc,
exprs_values = "logcounts",
colour_by = "Class",
size_by = "total_features"
)
plotTSNE(
reads_NvsEachStage.qc,
exprs_values = "logcounts",
perplexity = 10,
colour_by = "Class",
size_by = "total_features",
rand_seed = 123456
)
plotRLE(
reads_NvsEachStage.qc,
exprs_mats = list(Raw = "counts", scran = "logcounts"),
exprs_logged = c(TRUE, TRUE),
colour_by = "Class"
)



Defines a S4 class for storing data from single-cell experiments. This includes specialized methods to store and retrieve spike-in information, dimensionality reduction coordinates and size factors for each cell, along with the usual metadata for genes and libraries.
A collection of tools for doing various analyses of single-cell RNA-seq gene expression data, with a focus on quality control.
Implements functions for low-level analyses of single-cell RNA-seq data. Methods are provided for normalization of cell-specific biases, assignment of cell cycle phase, detection of highly variable and significantly correlated genes, correction of batch effects, identification of marker genes, and other common tasks in single-cell analysis workflows.