One of the most common types of analyses when working with bulk RNA-seq data is to identify differentially expressed genes. By comparing the genes that change between two conditions. For example in our case, the two conditions should be the normal and cancer patients. Several different methods, e.g. DESeq2 and edgeR, have been developed for RNA-seq.
During the differential expression analysis, we exploring the similarities and dissimilarities between samples and visualizing differential expressed genes.
3.1 Normalization
3.1.1 widly used quantifies
raw counts
RPKM: Reads Per Kilobase of exon model per Million mapped reads (每千个碱基的转录每百万映射读取的reads)
FPKM: Fragments Per Kilobase of exon model per Million mapped fragments(每千个碱基的转录每百万映射读取的fragments, 对于Pair-end sequencing, two paired reads should be mapped simultaneously)
TPM:Transcripts Per Kilobase of exon model per Million mapped reads (每千个碱基的转录每百万映射读取的Transcripts)
TPMi=(Ni/Li)*1000000/sum(Ni/Li+……..+ Nm/Lm)
RPM/CPM: Reads/Counts of exon model per Million mapped reads (每百万映射读取的reads)
RPM=total exon reads / mapped reads (Millions)
3.1.2 Relative vs absolute expression
As shown above, we have been estimating the relative abundance, i.e. what proportion of transcripts in a sample belong to a particular isoform. Can we estimate the absolute abundance from RNA-Seq data?
Consider the two cells drawn below. The colored squiggly lines represent individual transcripts of the corresponding isoforms.
Cell B has twice as many transcripts of each isoform as cell A. If we conduct RNA-Seq experiments in the two cells, we would get samples from essentially the same distribution. We wouldn’t be able to tell the cells apart based on their RNA-Seq.
Here’s a trick: during library preparation, we add a known amount of an artificial RNA or DNA that is not produced by the studied organism (the blue squiggle below), then we can compare all abundances against it. This artificially introduced material is called a spike-in.
If we regard the spike-in as isoform 0, with the known absolute abundance of T0 transcripts, then the absolute abundance of isoform i can be estimated as:
Ti = T0 * (RPMi / RPM0)
Whether we care about absolute or relative expression depends on the biological question in hand. However, looking at relative expression alone can produce unexpected results.
Suppose again that only two isoforms are being expressed, red and yellow. In condition A, the two isoforms are equally expressed. In condition B, the yellow isoform’s expression doubles, while the red isoform’s expression is not affected at all.
Now let’s look at the relative expression:
Based on this chart, we might conclude that the red isoform is also differentially expressed between the two conditions. Technically this is true, as long as we are talking about relative expression, but this is only a consequence of the overexpression of the yellow isoform.
Notes
Selecting Between-Sample RNA-Seq Normalization Methods from the Perspective of their Assumptions
Marie-Agnès Dillies, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, Volume 14, Issue 6, 1 November 2013, Pages 671–683 link
Catalina A Vallejos, et al. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nature Methods volume 14, pages 565–571 link
3.2 Models of RNA-seq data
3.2.1 negative binomial model
The most common model of RNA-seq data is the negative binomial model:
set.seed(1)
hist(
rnbinom(
1000,
mu = 10,
size = 100),
col = "grey50",
xlab = "Read Counts",
main = "Negative Binomial"
)
Mean: μ = mu
Variance: σ2 = mu + mu2/size
It is parameterized by the mean expression (mu) and the dispersion (size), which is inversely related to the variance.
3.2.2 possion models
In possion distribution, which presumes the variance and mean [ ie. expression in our case] are equal.
hist(
rpois(1000, 10),
col = "grey50",
xlab = "Read Counts",
main = "Possion"
)
Mean: μ = mu
Variance: σ2 = mu
3.2.3 zero-inflated negative binomial models
A raw negative binomial model does not fit full-length transcript data as well due to the high dropout rates relative to the non-zero read counts.
d <- 0.5;
counts <- rnbinom(
1000,
mu = 10,
size = 100
)
counts[runif(1000) < d] <- 0
hist(
counts,
col = "grey50",
xlab = "Read Counts",
main = "Zero-inflated NB"
)
3.3 Commonly used test and tools
3.3.0 requirements
To preform differential expression analysis, we usually need two files:
file 1: expression matrix
raw counts, rpkm, rpm for each gene and samples
file 2: experimental design
the experimental design or conditions for each samples
DESeq and EdgeR are very similar and both assume that no genes are differentially expressed. DEseq uses a 'geometric' normalisation strategy, whereas EdgeR is a log-based method. Both normalise data initially via the calculation of size and normalisation factors, respectively.
3.3.1 DESeq2
The package DESeq2 provides methods to test for differential expression by use of negative binomial generalized linear models Usage
Calculate differential expression (requires R/Bioconductor and the package edgeR to be installed!!)
# Name the samples in the same order as they were used in the analyzeRepeats.pl command.
# Replicates should have the same name (i.e. cond1)
getDiffExpression.pl raw.txt cond1 cond1 cond2 cond2 > diffExp.output.txt
Based on the differential expression results, we use MA-plot to explore the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples
plotMA(deg, ylim=c(-5,5))
dev.off()
Plot normalized counts for differential expressed gene
it is useful to examine the counts of reads for a single gene across the groups.
Hierarchical clustering for differential expressed genes
Once we have normalized the data and perfromed the differential expression analysis, we can cluster the samples relevant to the biological questions. It is a hard problem to do the unsupervised clustering without prior knowledge. That is, we need to identify groups of samples based on the similarities of the transcriptomes.
we first filter out significantly differential expressed genes:
Though rpkm is not suitable for differential expressed gene identification, but it is useful for visualisation. get the rpkm value for these differential expressed genes
plot to explore the differential expression results. Github.
3.5 Gene Ontology (GO) term enrichment
Gene Ontology (GO) term enrichment is a technique for interpreting sets of genes making use of the Gene Ontology system of classification, in which genes are assigned to a set of predefined bins depending on their functional characteristics.