2. Construction of expression matrix
-- Yang Eric Li
2.0 convert coordinates for bam file
In the last chapter, we use two strategy to align the reads to the genome and different RNA types. As we extract the sequence for each RNA type and mapped reads to them, thus, the coordinates for each reads stored in bam file is not the position of where they aligned to the genome, but the transcripts for each RNA types. We use RSEM ( https://github.com/deweylab/RSEM) to solve these problem.
# build bowtie2 index using RSEM
rsem-prepare-reference --gtf /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gtf --bowtie2 /BioII/lulab_b/shared/genomes/human_hg38/sequence/GRCh38.p12.genome.fa miRNA.indexDir/
# align reads to miRNA
bowtie2 -p 4 --sensitive-local --no-unal --un NC_1.unAligned.fq -x miRNA.indexDir/ NC_1.rRNA_exon.unmapped.fastq -S NC_1.miRNA.sam
# convert the coordinates in bam files
rsem-tbam2gbam miRNA.indexDir/ NC_1.miRNA.sam NC_1.miRNA.rsem.bam2.1 Visualization: Genome Broswer
2.1.1 Build bedGraph/bigwig files
method 1: use BEDtools and UCSC Kent Utilities
# sort bam file
samtools sort NC_1.miRNA.rsem.bam > NC_1.miRNA.sorted.bam
# build bam index
samtools index NC_1.miRNA.sorted.bam
# create bedGraph
bedtools genomecov -ibam NC_1.miRNA.sorted.bam -bga -split -scale 1.0 | sort -k1,1 -k2,2n > NC_1.miRNA.sorted.bedGraph
# convert bedGraph to bigWig
bedGraphToBigWig NC_1.miRNA.sorted.bedGraph /BioII/lulab_b/shared/genomes/human_hg38/sequence/hg38.chrom.sizes NC_1.miRNA.sorted.bwNote http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
method 2: use homer
1.Make tag directories for each experiment
If the experiment is strand specific paired end sequencing, add "-sspe" to the end. If it's unstranded paired-end sequencing, no extra options are needed. makeTagDirectory tags_Dir/ inputfile.sam -format sam -sspe
2.Make bedGraph visualization files for each tag directory
2.1.2 Integrative Genomics Viewer(IGV)

in our case, try to load the bam and bigwig format file

2.1.3 UCSC genome browser

2.2 Quantify gene expression
2.2.1 Required files
alignment: bam/sam format
convert sam to bam
annotaion: gtf/gff/gff3 format
convert gff to gtf
2.2.2 count for raw counts
Tool 1: HTSeq
Given a file with aligned sequencing reads(.sam/.bam) and a list of genomic features(.gtf), a common task is to count how many reads map to each feature(gene).
Usage
in our case
3 overlap resolution models
*

We shared our snakemake package used for exRNA-seq expression matrix construction.Github.
Tips
--nonunique
--nonunique none (default): the read (or read pair) is counted as ambiguous and not counted for any features. Also, if the read (or read pair) aligns to more than one location in the reference, it is scored as alignment_not_unique.
--nonunique all: the read (or read pair) is counted as ambiguous and is also counted in all features to which it was assigned. Also, if the read (or read pair) aligns to more than one location in the reference, it is scored as alignment_not_unique and also separately for each location.
Notice that when using --nonunique all the sum of all counts will not be equal to the number of reads (or read pairs), because those with multiple alignments or overlaps get scored multiple times.
Notes
-m/--mode {mode}
--nonunique={none/all}
-s/--stranded {yes/no/reverse}.
-a {minaqual}.
-t/--type {feature type}. (defult: exon)
-i/--idattr {id attribute}, GFF attribute to be used as feature ID. (defult: gene_id)
Tool 2: featureCounts
Usage
Summarize a BAM format dataset:
Summarize multiple datasets at the same time:
in our case
Tips By default, featureCounts does not count reads overlapping with more than one feature. Users can use the -O option to instruct featureCounts to count such reads (they will be assigned to all their overlapping features or meta-features).
2.2.2 count for RPKM/FPKM/CPM
RPKM: Reads Per Kilobase of exon model per Million mapped reads (每千个碱基的转录每百万映射读取的reads)
FPKM: Fragments Per Kilobase of exon model per Million mapped fragments(每千个碱基的转录每百万映射读取的fragments, 对于Pair-end sequencing, two paired reads should be mapped simultaneously)
RPM/CPM: Reads/Counts of exon model per Million mapped reads (每百万映射读取的reads)
Tool: homer
Usage
0.Align FASTQ reads using STAR or similar 'splicing aware' genome alignment algorithm
1.Make tag directories for each experiment
If the experiment is strand specific paired end sequencing, add "-sspe" to the end. If it's unstranded paired-end sequencing, no extra options are needed. makeTagDirectory Exp1/ inputfile.sam -format sam -sspe
2. Quantify gene expression across all experiments for clustering and reporting (-rpkm / -rpm / -log2 / -quantile / -sqrt):
3. Quantify gene expression as integer counts for differential expression (-noadj)
in our case
Tips: in default, homer do not use gff format file but gtf format. look into the difference between:
/BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gtf /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gff
2.2.3 count for TPM
TPM:Transcripts Per Kilobase of exon model per Million mapped reads (每千个碱基的转录每百万映射读取的Transcripts)
Merge expression matrix of different RNA types together, then convert RPKM to TPM.
R script:
2.2.4 Merge to expression matrix
use linux command (cut, awk, sed, paste...) generate expression matrix
use homer:
2.3 Cleaning the expression matrix
load expression matrix
check the library size
check total detected genes
filter genes
Purpose Filtering low-expression genes improved DEG detection sensitivity.
Methods
Filter the genes with a total counts < = threshold;
Filter genes with maximum normalized counts across all samples < = threshold;
Filter genes with mean normalized counts across all samples < = threshold.
Criteria Retain genes: (Reads >= 1 counts) >= 20% samples.
check correlations between samples

Scripts summary
build bedGraph/bigWig for visualization
count for expression value
generate expression matrix
cleaning the expression matrix
Homeworks
scripts and files:
https://github.com/lulab/training/tree/master/proj_exRNA/example_small/constructExpMx
cnode: /BioII/lulab_b/shared/shared_scripts/bioinfoTraining/constructExpMx
Level I:
1. learn how to construct the expression matrix.
2. compare the difference between HTSeq, featureCounts and homer
3. check the bam/bigWig file using IGV
4. QC and cleaning the expression matrix
Level II:
reference scripts: cnode: /home/younglee/projects/hcc_example/bin
1. convert bam/sam files coordinates to genome coordinates (for sequential mapping)
2. construct expression matrix for all RNA types (miRNA, piRNA, mRNA, lncRNA...)
3. QC and cleaning the expression matrix
Level III:
quantify miRNA using bam files from common (map to hg38 genome directly) and sequential mapping (map to multiple RNA types), respectively.
Compare the differences and explain why. (using RSEM: https://github.com/deweylab/RSEM, please check rsem-prepare-reference and rsem-tbam2gbam functions)
Last updated