One of the main challenges when analyzing scRNA-seq data is the presence of zeros, or dropouts. The dropouts are assumed to have arisen for three possible reasons:
The gene was not expressed in the sample and hence there are no transcripts to sequence
The gene was expressed, but for some reason the transcripts were lost somewhere prior to sequencing
The gene was expressed and transcripts were captured and turned into cDNA, but the sequencing depth was not sufficient to produce any reads.
In this chapter, we would like to use R package "scImpute" to preform imputation on expression matrix (check paper).
first, we attached the R library and load the expression matrix after QC (saved in last chapter).
In this chapter, we only analyze the samples with batch information.
then, we normalized the matrix using package scran
library(scran)# Cluster similar cells based on rank correlations in their gene expression profilesqclust <-quickCluster(reads_NvsEachStage.qc.impute, min.size =10)# Methods to normalize single-cell RNA-seq data by deconvolving size factors from cell pools.reads_NvsEachStage.qc.impute <-computeSumFactors( reads_NvsEachStage.qc.impute, sizes =10, clusters = qclust)reads_NvsEachStage.qc.impute <-normalize(reads_NvsEachStage.qc.impute)
There is a large number of potential confounders, artifacts and biases in exRNA-seq data. One of the main challenges in analyzing exRNA-seq data stems from the fact that it is difficult to carry out a true technical replicate to distinguish biological and technical variability.
Correlations with PCs
Let’s first look again at the PCA plot of the QCed dataset:
scater allows one to identify principal components that correlate with experimental and QC variables of interest (it ranks principle components by R^2 from a linear model regressing PC value against the variable of interest).
Let’s test whether some of the variables correlate with any of the PCs.
Detected genes
plotQC( reads_NvsEachStage.qc.impute, type ="find-pcs", exprs_values ="logcounts_raw", variable ="total_features")
Indeed, we can see that PC1 can be almost completely explained by the number of detected genes. In fact, it was also visible on the PCA plot above. This is a well-known issue in scRNA-seq and was described here.
Explanatory variables
scater can also compute the marginal R^2 for each variable when fitting a linear model regressing expression values for each gene against just that variable, and display a density plot of the gene-wise marginal R^2 values for the variables.
plotQC( reads_NvsEachStage.qc.impute, type ="expl", exprs_values ="logcounts", variables =c("total_features","total_counts","RNA_Isolation_batch","Class","library_preparation_day"))
we founded that the "RNA_Isolation_batch" have substantial explanatory power for many genes, so these variables are good candidates for conditioning out in a normalisation step.
5.3 Remove the confounders
Factors contributing to technical noise frequently appear as “batch effects” where samples processed on different days or by different technicians systematically vary from one another.
Remove Unwanted Variation (RUVSeq)
We will use the Remove Unwanted Variation (RUVSeq). Briefly, RUVSeq works as follows. For nn samples and JJ genes, consider the following generalized linear model (GLM), where the RNA-Seq read counts are regressed on both the known covariates of interest and unknown factors of unwanted variation:
RUVg uses negative control genes (e.g. ERCCs), assumed to have constant expression across samples;
RUVs uses centered (technical) replicate/negative control samples for which the covariates of interest are constant;
RUVr uses residuals, e.g., from a first-pass GLM regression of the counts on the covariates of interest.
first, we need to construct one matrix to describe the sample class
Evaluate and compare confounder removal strategies
We consider three different metrics which are all reasonable based on our knowledge of the experimental design. Depending on the biological question that you wish to address, it is important to choose a metric that allows you to evaluate the confounders that are likely to be the biggest concern for the given situation.
Method 1: tSNE/PCA plot
We evaluate the effectiveness of the normalization by inspecting the PCA plot where colour corresponds the technical replicates and shape corresponds to different biological samples. Separation of biological samples and interspersed batches indicates that technical variation has been removed.
We can also examine the effectiveness of correction using the relative log expression (RLE) across samples to confirm technical noise has been removed from the dataset. Note RLE only evaluates whether the number of genes higher and lower than average are equal for each samples - i.e. systemic technical effects. Random technical noise between batches may not be detected by RLE.
scImpute is developed to accurately and robustly impute the dropout values in scRNA-seq data. scImpute can be applied to raw read count matrix before the users perform downstream analyses.
This package implements the remove unwanted variation (RUV) methods of Risso et al. (2014) for the normalization of RNA-Seq read counts between samples.