3. Differential Expression Analysis & Exploring
-- Yang Eric Li
Aim for this section
One of the most common types of analyses when working with bulk RNA-seq data is to identify differentially expressed genes. By comparing the genes that change between two conditions. For example in our case, the two conditions should be the normal and cancer patients. Several different methods, e.g. DESeq2 and edgeR, have been developed for RNA-seq.
During the differential expression analysis, we exploring the similarities and dissimilarities between samples and visualizing differential expressed genes.
3.1 Normalization
3.1.1 widly used quantifies
raw counts
RPKM: Reads Per Kilobase of exon model per Million mapped reads (每千个碱基的转录每百万映射读取的reads)
FPKM: Fragments Per Kilobase of exon model per Million mapped fragments(每千个碱基的转录每百万映射读取的fragments, 对于Pair-end sequencing, two paired reads should be mapped simultaneously)
TPM:Transcripts Per Kilobase of exon model per Million mapped reads (每千个碱基的转录每百万映射读取的Transcripts)
RPM/CPM: Reads/Counts of exon model per Million mapped reads (每百万映射读取的reads)
3.1.2 Relative vs absolute expression
As shown above, we have been estimating the relative abundance, i.e. what proportion of transcripts in a sample belong to a particular isoform. Can we estimate the absolute abundance from RNA-Seq data?
Consider the two cells drawn below. The colored squiggly lines represent individual transcripts of the corresponding isoforms.
Cell B has twice as many transcripts of each isoform as cell A. If we conduct RNA-Seq experiments in the two cells, we would get samples from essentially the same distribution. We wouldn’t be able to tell the cells apart based on their RNA-Seq.
Here’s a trick: during library preparation, we add a known amount of an artificial RNA or DNA that is not produced by the studied organism (the blue squiggle below), then we can compare all abundances against it. This artificially introduced material is called a spike-in.
If we regard the spike-in as isoform 0, with the known absolute abundance of T0 transcripts, then the absolute abundance of isoform i can be estimated as:
Ti = T0 * (RPMi / RPM0)
Whether we care about absolute or relative expression depends on the biological question in hand. However, looking at relative expression alone can produce unexpected results.
Suppose again that only two isoforms are being expressed, red and yellow. In condition A, the two isoforms are equally expressed. In condition B, the yellow isoform’s expression doubles, while the red isoform’s expression is not affected at all.
Now let’s look at the relative expression:
Based on this chart, we might conclude that the red isoform is also differentially expressed between the two conditions. Technically this is true, as long as we are talking about relative expression, but this is only a consequence of the overexpression of the yellow isoform.
Notes
Selecting Between-Sample RNA-Seq Normalization Methods from the Perspective of their Assumptions
Marie-Agnès Dillies, et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, Volume 14, Issue 6, 1 November 2013, Pages 671–683 link
Catalina A Vallejos, et al. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nature Methods volume 14, pages 565–571 link
3.2 Models of RNA-seq data
3.2.1 negative binomial model
The most common model of RNA-seq data is the negative binomial model:
Mean: μ = mu
Variance: σ2 = mu + mu2/size
It is parameterized by the mean expression (mu) and the dispersion (size), which is inversely related to the variance.
3.2.2 possion models
In possion distribution, which presumes the variance and mean [ ie. expression in our case] are equal.
Mean: μ = mu
Variance: σ2 = mu
3.2.3 zero-inflated negative binomial models
A raw negative binomial model does not fit full-length transcript data as well due to the high dropout rates relative to the non-zero read counts.
3.3 Commonly used test and tools
3.3.0 requirements
To preform differential expression analysis, we usually need two files:
file 1: expression matrix
raw counts, rpkm, rpm for each gene and samples
file 2: experimental design
the experimental design or conditions for each samples
the expression matrix looks like:
the experimental design looks like:
DESeq and EdgeR are very similar and both assume that no genes are differentially expressed. DEseq uses a 'geometric' normalisation strategy, whereas EdgeR is a log-based method. Both normalise data initially via the calculation of size and normalisation factors, respectively.
3.3.1 DESeq2
The package DESeq2 provides methods to test for differential expression by use of negative binomial generalized linear models Usage
Tips Analyzing RNA-seq data with DESeq2
Notes
We shared scripts on github.
DESeq2 normalization: R package DESeq2.Github.
3.3.2 edgeR
Usage
Tips edgeR Users Guide
Notes
We shared scripts on github.
TMM normalization: R package edgeR.Github.
3.3.3 using homer
Calculate differential expression (requires R/Bioconductor and the package edgeR to be installed!!) # Name the samples in the same order as they were used in the analyzeRepeats.pl command. # Replicates should have the same name (i.e. cond1) getDiffExpression.pl raw.txt cond1 cond1 cond2 cond2 > diffExp.output.txt
3.3.4 Wilcox/Mann-Whitney-U Test
Usage 1. normalize the reads by library size (edgeR) 2. identify differential expressed gene using wilcoxon.test()
We shared scripts on github.
RPM normalization: R package edgeR to calculate RPM, then test by R function wilcoxon.test().Github.
3.4 Exploring the results
Taking results from DESeq2 as an example, we exploring the similarities and dissimilarities between samples and differential expressed genes
Heatmap for count matrix
we use heatmap to explore the normalized count matrix
PCA analysis
PCA is useful for visualizing the overall effect of experimental covariates and batch effects.
check distance between samples
we apply the dist function to the transpose of the transformed count matrix to get sample-to-sample distances.
A heatmap of this distance matrix gives us an overview over similarities and dissimilarities between samples.
MA-plot
Based on the differential expression results, we use MA-plot to explore the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples
Plot normalized counts for differential expressed gene
it is useful to examine the counts of reads for a single gene across the groups.
Hierarchical clustering for differential expressed genes
Once we have normalized the data and perfromed the differential expression analysis, we can cluster the samples relevant to the biological questions. It is a hard problem to do the unsupervised clustering without prior knowledge. That is, we need to identify groups of samples based on the similarities of the transcriptomes.
we first filter out significantly differential expressed genes:
Though rpkm is not suitable for differential expressed gene identification, but it is useful for visualisation. get the rpkm value for these differential expressed genes
Notes
We shared scripts on github.
plot to explore the differential expression results. Github.
3.5 Gene Ontology (GO) term enrichment
Gene Ontology (GO) term enrichment is a technique for interpreting sets of genes making use of the Gene Ontology system of classification, in which genes are assigned to a set of predefined bins depending on their functional characteristics.
GO_Molecular_Function
Cellular_Component
Biological_Process
KEGG pathway
Differerntial expression analysis for mRNAs:
filter out highly expressed gene in HCC samples and get geneIDs:
Do GO term enrichment analysis using R scripts:
R script for GO term enrichment analysis:
Notes
We shared scripts on github.
3.6 Homeworks
Level I:
why we usually consider the relative expression rather than absolute expression. In what case, should we consider the absolute expression value.
learn how to calculate differential expression using edgeR, DESeq2 and Wilcox/Mann-Whitney-U Test.
draw venn plot to show the difference between three methods.
Level II:
identify differential expressed genes for other RNA types.
between differential conditions: 1) NC vs. HCC (Before and After); 2) Before vs. After; 3) NC vs. Before vs. After
filter out differential expressed genes between three conditions (NC, Before and After) and try to cluster samples.
Last updated