Define the control genes. These would usually be ERCC spike-ins; in our case, we instead use the three most stably expressed RNA transcripts: miR-99a-5p, miR-30a-5p and miR-221-3p. See the paper for more details.
Another measure of cell quality is the ratio between spike-in/control RNAs and endogenous RNAs. This ratio can be used to estimate the total amount of RNA in the samples. Samples with a high level of spike-in/control RNAs had low starting amounts of RNA, likely due to the RNA being degraded.
plotPhenoData(
    reads_NvsEachStage,
    aes_string(
        x = "total_features",
        y = "pct_counts_stableRNA",
        colour = "Class"
    )
)
Filter out samples whose proportion of spike-in/control RNA is too high.
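A minimal base-R sketch of this filter, assuming a plain count matrix with RNAs in rows and samples in columns. The toy matrix, the helper name `pct_control_counts` and the 50% cutoff are illustrative assumptions, not values from the analysis; a sensible cutoff has to be chosen from the distribution seen in the data.

```r
# Hypothetical helper: percentage of each sample's counts coming from control RNAs.
pct_control_counts <- function(counts, control_ids) {
  control_total <- colSums(counts[control_ids, , drop = FALSE])
  100 * control_total / colSums(counts)
}

# Toy count matrix (illustrative values only): 4 RNAs x 3 samples,
# filled column by column, so each group of four numbers is one sample.
counts <- matrix(
  c(10, 30, 30, 30,
    60, 20, 10, 10,
     5,  5, 45, 45),
  nrow = 4,
  dimnames = list(c("miR-99a-5p", "miR-30a-5p", "geneA", "geneB"),
                  c("S1", "S2", "S3"))
)
control_ids <- c("miR-99a-5p", "miR-30a-5p")

pct_control <- pct_control_counts(counts, control_ids)

# Assumed cutoff, for illustration only: drop samples at or above 50% control RNA.
keep_sample <- pct_control < 50
counts_filtered <- counts[, keep_sample, drop = FALSE]
```

Here sample S2 is dominated by control RNA and is the only one removed.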
It is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. In our case, we look at the 50 most highly expressed genes.
plotQC(reads_NvsEachStage, type = "highest-expression")
Gene filtering
It is typically a good idea to remove genes whose expression level is considered “undetectable”.
dim(reads_NvsEachStage[rowData(reads_NvsEachStage)$use, colData(reads_NvsEachStage)$use])
assay(reads_NvsEachStage, "logcounts_raw") <- log2(counts(reads_NvsEachStage) + 1)
reads_NvsEachStage.qc <- reads_NvsEachStage[rowData(reads_NvsEachStage)$use, colData(reads_NvsEachStage)$use]
# save the data
saveRDS(reads_NvsEachStage.qc, file = "GSE71008.reads_NvsEachStage.clean.rds")
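The code above relies on a precomputed `use` flag. As a hedged illustration of how such a gene-level flag might be derived, the sketch below marks a gene as detectable when it has more than `min_count` reads in at least `min_samples` samples. The helper name, both thresholds and the toy matrix are assumptions made for this example, not the criteria used in the analysis.

```r
# Hypothetical rule: a gene is "detectable" if it has more than `min_count`
# reads in at least `min_samples` samples.
is_detectable <- function(counts, min_count = 1, min_samples = 2) {
  rowSums(counts > min_count) >= min_samples
}

# Toy count matrix (illustrative values only): 3 genes x 3 samples.
counts <- matrix(
  c(0, 0, 5,
    3, 4, 0,
    0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("S1", "S2", "S3"))
)

keep_gene <- is_detectable(counts)

# Keep only the detectable genes.
counts_detectable <- counts[keep_gene, , drop = FALSE]
```

Only `geneB` passes here, because it is the only gene with more than one read in two or more samples.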
4.4 Visualization
The PCA plot
The easiest way to get an overview of the data is to transform it using principal component analysis (PCA) and then visualize the first two principal components.
First, we compare the PCA results before and after QC.
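For readers who want to see the transformation itself, here is a minimal sketch using base R's `prcomp()` on a simulated count matrix. The simulated data and variable names are assumptions for illustration; in the workflow the same idea would be applied to the log-transformed counts before and after QC.

```r
set.seed(1)  # only to make the simulated toy data reproducible

# Toy counts: 50 genes x 8 samples (simulated, illustrative only).
counts <- matrix(rpois(50 * 8, lambda = 20), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0("S", 1:8)))
log_counts <- log2(counts + 1)

# prcomp() expects observations (samples) in rows, so transpose the matrix.
pca <- prcomp(t(log_counts), center = TRUE, scale. = FALSE)

# Coordinates of each sample on the first two principal components.
pc_scores <- pca$x[, 1:2]

# Fraction of the total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Calling `plot(pc_scores)` then draws the samples in the space of the first two components, and `var_explained[1:2]` indicates how much of the variation that view captures.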
tSNE (t-Distributed Stochastic Neighbor Embedding) combines dimensionality reduction (e.g. PCA) with random walks on the nearest-neighbour network to map high-dimensional data to a 2-dimensional space while preserving local distances between samples. In contrast with PCA, tSNE is a stochastic algorithm, which means that running the method multiple times on the same dataset will result in different plots. Due to the non-linear and stochastic nature of the algorithm, tSNE plots are more difficult to interpret intuitively. To ensure reproducibility, we fix the “seed” of the random-number generator in the code below so that we always get the same plot.
Furthermore, tSNE requires you to provide a value of perplexity which reflects the number of neighbours used to build the nearest-neighbour network; a high value creates a dense network which clumps samples together while a low value makes the network more sparse allowing groups of samples to separate from each other. scater uses a default perplexity of the total number of cells divided by five (rounded down).
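The two practical points above, fixing the seed and choosing the perplexity, can be illustrated without running tSNE at all. In the sketch below, `random_layout()` is a hypothetical stand-in for any stochastic embedding (it is not tSNE), and the sample count of 48 is an assumed value.

```r
n_samples <- 48                      # assumed sample count, for illustration only
perplexity <- floor(n_samples / 5)   # the default rule described above

# Stand-in for a stochastic embedding (NOT tSNE): random 2-D coordinates.
random_layout <- function(n) matrix(rnorm(2 * n), ncol = 2)

# Same seed, same result.
set.seed(123456)
run1 <- random_layout(n_samples)
set.seed(123456)
run2 <- random_layout(n_samples)

# Without resetting the seed, the next run differs.
run3 <- random_layout(n_samples)

same_with_seed <- identical(run1, run2)
same_without_seed <- identical(run1, run3)
```

In a real analysis the tSNE call would take the place of `random_layout()`, with `set.seed()` issued immediately before it.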
scater allows us to normalize raw counts using the function normaliseExprs().
To compare the efficiency of different normalization methods, we will use visual inspection of PCA plots and the calculation of cell-wise relative log expression (RLE) via scater’s plotRLE() function. Namely, cells with many (few) reads have higher (lower) than median expression for most genes, resulting in a positive (negative) RLE across the cell, whereas normalized cells have an RLE close to zero. Example of an RLE function in R:
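The sketch below is one way such a function might be written in base R; it is an illustration under stated assumptions, not scater's implementation. The helper name `calc_cell_rle`, the pseudocount of 1 and the choice to skip genes whose median is zero are all assumptions of this example.

```r
# Hypothetical helper: per-sample (cell-wise) relative log expression.
# expr_mat: numeric matrix with genes in rows and samples in columns.
calc_cell_rle <- function(expr_mat) {
  gene_medians <- apply(expr_mat, 1, median)
  # Assumption of this sketch: genes with a zero median are skipped.
  expressed <- gene_medians > 0
  # log2 ratio of each value to its gene's median (pseudocount of 1);
  # the median vector is recycled down the rows of every column.
  log_ratio <- log2((expr_mat[expressed, , drop = FALSE] + 1) /
                    (gene_medians[expressed] + 1))
  # Summarise each sample by the median of its log ratios.
  apply(log_ratio, 2, median)
}

# Toy data: the same expression profile sequenced at three depths.
base_profile <- c(10, 20, 30, 40, 50)
expr_mat <- outer(base_profile, c(0.5, 1, 2))
colnames(expr_mat) <- c("shallow", "median_depth", "deep")

cell_rle <- calc_cell_rle(expr_mat)
```

The shallow sample gets a negative RLE, the deep sample a positive one, and the sample at median depth sits at zero, which is the pattern described above for unnormalized data.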
The simplest way to normalize this data is to convert it to counts per million (CPM) by dividing each column by its total then multiplying by 1,000,000.
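As a minimal sketch of that calculation in base R (the helper name `calc_cpm` and the toy matrix are assumptions for illustration):

```r
# Hypothetical helper: counts per million.
# counts: numeric matrix with genes in rows and samples in columns.
calc_cpm <- function(counts) {
  # Divide each column by its total, then scale to one million.
  sweep(counts, 2, colSums(counts), "/") * 1e6
}

# Toy count matrix (illustrative values only): 2 genes x 2 samples.
counts <- matrix(c(1, 3,
                   20, 20),
                 nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("S1", "S2")))

cpm <- calc_cpm(counts)
```

After this transformation every column sums to one million, so samples with different sequencing depths become directly comparable.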
Another method, TMM, is the weighted trimmed mean of M-values (to the reference), implemented in the edgeR package. The M-values in question are the gene-wise log2 fold changes between individual samples.
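To make the idea concrete, the sketch below computes a deliberately simplified scaling factor for one sample against a reference. It is not edgeR's implementation: it uses an unweighted trimmed mean, trims only on the M-values, and skips the final rescaling of factors across samples. The helper name `tmm_factor_simple`, the 30% trim and the toy counts are assumptions of this example.

```r
# Hypothetical, simplified TMM-style factor (unweighted; NOT edgeR's algorithm).
tmm_factor_simple <- function(sample_counts, ref_counts, trim = 0.3) {
  # Use only genes observed in both the sample and the reference.
  keep <- sample_counts > 0 & ref_counts > 0
  # M-values: gene-wise log2 ratios of library-size-scaled counts.
  m_values <- log2((sample_counts[keep] / sum(sample_counts)) /
                   (ref_counts[keep] / sum(ref_counts)))
  # Trimmed mean drops the most extreme M-values at both ends.
  2^mean(m_values, trim = trim)
}

ref <- seq(100, 1000, by = 100)   # toy reference sample, 10 genes

# Same composition, three times the depth: no correction needed.
same_composition <- 3 * ref
factor_same <- tmm_factor_simple(same_composition, ref)

# One gene strongly up-regulated: it inflates the library size,
# and the trimmed mean ignores it when estimating the factor.
one_gene_up <- ref
one_gene_up[10] <- 20000
factor_up <- tmm_factor_simple(one_gene_up, ref)
```

In the second case the library size multiplied by `factor_up` equals the reference library size, so the nine unchanged genes are no longer pushed down by the single dominant gene. For real analyses, edgeR's `calcNormFactors()` with `method = "TMM"` should be used rather than this sketch.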
The scran package implements a variant of CPM specialized for single-cell data. Briefly, this method deals with the problem of very large numbers of zero values per cell by pooling cells together and calculating a normalization factor (similar to CPM) for the sum of each pool. Since each cell is found in many different pools, cell-specific factors can be deconvoluted from the collection of pool-specific factors using linear algebra.
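The deconvolution step can be illustrated with a toy linear system. This is a sketch of the idea only, not scran's algorithm: it assumes each pool's factor is exactly the sum of its members' factors, builds pools as sliding windows over a ring of cells, and starts from known "true" factors so that the recovery can be checked. All names and values are assumptions of this example.

```r
n_cells <- 10
# Assumed ground-truth cell factors, used only to check the recovery.
true_factors <- c(0.5, 0.8, 1.0, 1.2, 0.9, 1.5, 0.7, 1.1, 1.3, 1.0)

# Hypothetical helper: one pool per starting cell, each pool being a window
# of `size` consecutive cells on a ring. Returns a pools x cells 0/1 matrix.
make_pools <- function(n, size) {
  t(sapply(seq_len(n), function(start) {
    members <- ((start - 1 + 0:(size - 1)) %% n) + 1
    membership <- numeric(n)
    membership[members] <- 1
    membership
  }))
}

# Every cell appears in several pools of two different sizes.
pool_design <- rbind(make_pools(n_cells, 3), make_pools(n_cells, 4))

# Toy assumption: a pool's factor is the sum of its members' factors.
pool_factors <- as.vector(pool_design %*% true_factors)

# Solve the overdetermined linear system by least squares.
estimated_factors <- qr.solve(pool_design, pool_factors)
```

Because each cell contributes to many pool equations, the system has more equations than unknowns and the cell-level factors are recovered from pool-level quantities alone.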
Defines an S4 class for storing data from single-cell experiments. This includes specialized methods to store and retrieve spike-in information, dimensionality reduction coordinates and size factors for each cell, along with the usual metadata for genes and libraries.
Implements functions for low-level analyses of single-cell RNA-seq data. Methods are provided for normalization of cell-specific biases, assignment of cell cycle phase, detection of highly variable and significantly correlated genes, correction of batch effects, identification of marker genes, and other common tasks in single-cell analysis workflows.