1.preprocessing_mapping_QC

1. Preprocessing, Mapping and QC

How to process RNA-seq data.

0) Understand your data

Before processing, make sure you have a correct understanding of what data you are analysing.

Basic information

  • Sequencing date

  • Contributor: department or labratory

  • Accession: GEO/SRA/PRJNA numbers in ncbi or other dataset ID(clear and brief)

  • Data source: download path or home-made

sample information

  • Organism

  • Disease state: cancer type or healthy

  • Tissue or other source (cell)

  • Molecular: DNA or RNA, total RNA , polyA RNA, or small RNA ?

Sequencing strategy

  • Sequencing platform

  • Library layout: Single-end V.S. Paired-end ?

  • Insert length: 50/100/150 ?

  • Strand specific ?

  • Size selection ?

  • Enrichment: Poly-A enriched or total RNA(ribosomal RNA removed) ?

  • Cellular localization: whole cell or intranuclear of cytoplasmic ?

When analyses published datasets, you can get all these information from ncbi website.

Our example HCC exRNA-seq data

Basic information

  • Sequencing date: 2017.11-

  • Contributor: Lulab Tsinghua University

  • Accession: NA

  • Data source: home-made

sample information

  • Organism: Homo sapiens

  • Disease state: hepatocellular carcinoma(HCC)

  • Tissue or other source: plasma

  • Molecular: total RNA(fragment)

Sequencing strategy

  • Sequencing platform: Illunima HiSeq X

  • Library layout: Paired-end 150

  • Insert length(RNA length): ~30 nt

  • Strand specific: Yes

  • Size selection: none

  • Enrichment: total RNA without ribosomal RNA removed

  • Cellular localization: whole extracellular circulating RNA

Flow chart of exRNA-seq processing method

Input/output of each procedure

Step

Input

Tool/script

Output

Note

1.Preprocessing

00.rawdata/*.fastq

-

02.rRNA/*/*.no_rRNA.fastq

~/projects/exRNA/ hcc_examples/

1.1 index

~/genome/human_hg38/gtf/*.gtf

-

~/genome/human_hg38/index/bowtie2_index/*.bt2

-

1.1.1 .gtf to .fa

~/genome/human_hg38/sequence/GRCh38.p12.genome.fa ~/genome/human_hg38/gtf/*.gtf

bedtools

*.fa

-

1.1.2 .fa to .bt2

*.fa

bowtie2-build

*.bt2

-

1.2 fastqc

00.rawdata/*.fastq

fastqc

00.rawdata/*_fastqc.html

check raw reads' quality

1.3 Remove 3' adapter and trim reads

00.rawdata/*.fastq

cutadapt

01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq

trim low quality ends (plus a hard cutoff: 50nt)

1.4 2nd fastqc

01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq

fastqc

01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3_fastqc.html

make sure the low quality reads have been removed and/or trimmed

1.5 Remove ribosomal RNA

01.cutAdapter/sampleID/sampleID.trimmed.cutAdapt3.fastq + index file

bowtie2

02.rRNA/sampleID/sampleID.rRNA.sam 02.rRNA/sampleID/sampleID.no_rRNA.fastq

-

2.Mapping

02.rRNA/sampleID/sampleID.no_rRNA.fastq + index files

bowtie2

03.mapping/sampleID/01.miRNA/sampleID.miRNA.sam 03.mapping/sampleID/01.miRNA/sampleID.no_miRNA.fastq -> 03.mapping/sampleID/02.piRNA/sampleID.piRNA.sam 03.mapping/sampleID/02.piRNA/sampleID.no_piRNA.fastq -> ...

map to different RNA annotations step by step

1) Preprocessing

Quality control of raw reads, and extract the clean RNA sequence.

0. File format

0.1 Annotation file format (.gtf or .gff)

The Gene transfer format (GTF) is a tab-delimited file format used to hold information about gene annotation.

Example

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
  • You can get Gencode27 annotation files at: /BioII/lulab_b/shared/genomes/human_hg38/gtf/

  • Annotation file(.gtf/.gff) download source or prossed method, please refer to README file under this directory.

0.2 FASTQ Format (.fastq or .fq)

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

Example

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88

Example file location

/BioII/lulab_b/shared/projects/exRNA/hcc_examples/

Present working directory:

your own project directory: ~/projects/exRNA/hcc_examples/

Create symbolic links:

mkdir 00.rawdata
cd 00.rawdata
ln -s /BioII/lulab_b/shared/projects/exRNA/hcc_examples/fastq/* .

Create a file under ~/projects/exRNA/hcc_examples/ to record name of these files, named as "sample_name", like:

AfterSurgery_1
AfterSurgery_2
AfterSurgery_3
BeforeSurgery_1
BeforeSurgery_2
BeforeSurgery_3
NC_1
NC_2
NC_3

1.1 build index

1.1.1 Link genomic annotation file to your home directory

Present working directory: your own home.

mkdir -p genome/human_hg38/index/bowtie2_index
mkdir genome/human_hg38/gtf
mkdir genome/human_hg38/sequence
ln -s /BioII/lulab_b/shared/genomes/human_hg38/gtf/* ./genome/human_hg38/gtf/
ln -s /BioII/lulab_b/shared/genomes/human_hg38/sequence/GRCh38.p12.genome.fa ./genome/human_hg38/sequence/
ln -s /BioII/lulab_b/shared/genomes/human_hg38/index/bowtie2_index/bowtie2_hg38_index/ genome/human_hg38/index/bowtie2_index/

1.1.2 Get RNA sequence(.fa) from annotation file(.gtf)

Use rRNA as example.

  • Tool: bedtools - getfasta

  • Annotation file: rRNA_exon.gtf

Present working directory: ~/genome/human_hg38/index/bowtie2_index/

mkdir 00.bowtie2_rRNA_index
cd 00.bowtie2_rRNA_index/
bedtools getfasta -fi ../../../sequence/GRCh38.p12.genome.fa -bed ../../../gtf/rRNA_exon.gff -fo rRNA.fa

1.1.3 Build bowtie2 index

Use rRNA as example.

Present working directory: ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index

bowtie2-build rRNA.fa rRNA

Output files:

rRNA.1.bt2
rRNA.2.bt2
rRNA.3.bt2
rRNA.4.bt2
rRNA.rev.1.bt2
rRNA.rev.2.bt2

NOTE: Use exon sequences to build index.

1.2 fastqc

Use tool fastqc to test the quality of raw reads.

mkdir ../fastqc_output
fastqc -q -o ../fastqc_output AfterSurgery_1.fastq
fefe

Processing multiple files.

Write the following command to "fastqc.sh"

#!/bin/bash
for i in `cat ../sample_name`; do fastqc ${i}.fastq; done

Run "fastqc.sh"

bash fastqc.sh &

Check the fastqc results

  • Confirm the fastqc results files *_fastqc.html and *_fastqc.zip;

  • Copy *_fastqc.html to your computer and open in a web browser;

  • Check the quality of raw reads;

  • Filter the low quality samples, which median of "per base sequence quality" less than 28(median line under green area).

fastqc results.

1.3 Remove adapters and trim reads

The raw reads always with adapters, which should be removed to get clean reads.

Use cutadapt toolkit to remove adapters and trim some low quality reads. _Present working directory: ~/projects/exRNA/hcc_examples/

mkdir -p 01.cutAdapter/NC_1
cutadapt -u -100 -q 30,30 --trim-n -m 15 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --too-short-output=./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.TooShort.fastq -o ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq ./00.rawdata/NC_1/NC_1.fastq >./01.cutAdapter/NC_1/NC_1.cutAdapt3.log

Above is for one sample, you can use bash script to process multiple samples.

You should check the cutadapt log file (sample.cutAdapt3.log) to get information about adapter removed reads. cutadapt log file.

Get more information about cutadapt toolkit, click here: [cutadapt.] (http://cutadapt.readthedocs.io/en/stable/index.html)

1.4 2nd fastQC

A second FastQC run is performed to ensure that the previous quality trimming and/or adapter removal steps successfully conserved high quality reads without being too stringent.

Run fastqc of the adapter removed reads

fastqc ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq

Differences in results with 1st fastQC:

  • The per-base quality scores should be different.

  • The Sequencing adapters are no longer identified as over-represented.

1.5 Remove ribosomal RNA mapped reads

Why remove ribosomal RNA ?

  • Ribosomal RNA(rRNA) hold more than 80-90% of total RNA.

  • Large proportions of rRNA will have an effect on the usable number of effective reads obtained from the samples.

  • rRNA over-expressed samples should be filtered.

Toolkit: Bowtie2 Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning short reads.

Mapping reads to rRNA

Present working directory: ~/projects/exRNA/hcc_examples

mkdir -p 02.rRNA/NC_1
bowtie2 --sensitive-local --norc --no-unal -x ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index/rRNA -S 02.rRNA/NC_1/NC_1.rRNA.sam --un ./02.rRNA/NC_1/NC_1.no_rRNA.fastq ./01.cutAdapter/NC_1/NC_1.trimmed.cutAdapt3.fastq > 02.rRNA/NC_1/NC_1.rRNA.log

If want get a .bam file output, use samtools:

samtools view -bS in.sam >out.bam

Check the log file (NC_1.rRNA.log) to calculate the rRNA mapping ratio. bowtie2 log file. Above is for one sample, you can use bash script to process multiple samples.

2) Mapping

Mapping reads to human different RNA types.

  • Input: 02.rRNA/*/*.no_rRNA.fastq

  • Tool: bowtie2.

  • Working directory:~/projects/exRNA/hcc_examples/

mkdir -p 03.mapping/NC_1
  • Ordered mapping to avoid multiple assign reads to different RNA types:

  • 01.miRNA

  • 02.piRNA

  • 03.Y RNA

  • 04.snRNA

  • 05.srpRNA

  • 06.tRNA

  • 07.lncRNA

  • 08.mRNA

  • 09.other human genome

  • 10.non-human genome

  • Use the unmapped .fastq file from last step as input to the ordered mapping.

  • Methods: you can follow "1.4 mapping reads to rRNA" above.

01. Use rRNA removed reads as input to mapping miRNA

mkdir 03.mapping/NC_1/01.miRNA
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/01.bowtie2_miRNA_index/miRNA -S 03.mapping/NC_1/01.miRNA/NC_1.miRNA.sam --un ./03.mapping/NC_1/01.miRNA/NC_1.no_miRNA.fastq ./02.rRNA/NC_1/NC_1.no_rRNA.fastq > 03.mapping/NC_1/01.miRNA/NC_1.miRNA.log

02. Use miRNA removed reads as input to mapping piRNA

mkdir 03.mapping/NC_1/02.piRNA
bowtie2 --sensitive-local -x ~/genome/human_hg38/index/bowtie2_index/02.bowtie2_piRNA_index/piRNA -S 03.mapping/NC_1/02.piRNA/NC_1.piRNA.sam --un ./03.mapping/NC_1/02.piRNA/NC_1.no_miRNA.fastq ./02.rRNA/NC_1/02.piRNA/NC_1.no_rRNA.fastq > 03.mapping/NC_1/02.piRNA/NC_1.miRNA.log

03.Y_RNA, 04.snRNA, 05.srpRNA, 06.tRNA, 07.lncRNA, 08.mRNA, 09.hg38

  • Follow the methods above.

  • Reads can't mapped to these 8 types RNA would be assigned as "other human genome".

  • Reads can't mapped to human genome would be assigned as "non-human".

3) QC (Quality Control)

When processing the real data from some experiments, there are many things can affect the RNA-seq results. So it is important to take all possibilities into consideration, and filter the low quality sample in downstream analyses.

After preprocessing and mapping, you can do some basic statistics of the prossesed datasets.

3.1 Data summary

3.1.1 Summarize the pre-processing and mapping results before downstream analyses.

The summary should contain the following part at least :

Sample_info

  • sample_id

  • sequenced date or batch number

  • Raw reads number (raw fastq file or cutadapt log file)

Preprocessing

  • Clean reads number and percentage of raw reads (cutadapt log file)

  • rRNA number and percentage of clean reads (rRNA mapping log file of bowtie2)

  • kept reads (rRNA mapping log file of bowtie2)

Mapping ratio

  • human genome (human genome mapping log file of bowtie2)

  • miRNA

  • piRNA

  • Y RNA

  • snRNA

  • srpRNA

  • tRNA

  • lncRNA

  • mRNA

  • Other human genome region (difference between whole genome and mapped RNA)

  • Non-human genome (human genome mapping log file of bowtie2)

Reference.! You can combine some bash command or use your own code to summarize the data processing information. Here we shared a simple bash script for your reference. Github link.

3.1.2 Visualize the reads length distribution by scatter plot/bar chart.

Reference.

3.1.3 Check whether the mapping ratio are consistent with priori knowledge or published reports. If not, there might have some bias in your sample, which should be filtered.

You can combine some bash command or use your own code to summarise reads length. Here we shared a simple bash script for your reference.

3.2 Reads length distribution

3.2.1 Calculate the reads length :

  • raw reads (raw fastq file).

  • clean reads (rRNA removed and low-quality trimmed reads).

  • mapped reads of miRNA, piRNA, Y RNA, snRNA, srpRNA, tRNA, lncRNA, mRNA (mapped result sam file).

    NOTE: The sam/bam files record mapping information of all reads, include mapped and unmapped. Take care of the "flag" in 2nd column of sam/bam file. file format.

You can combine some bash command or use your own code to summarize reads length. Here we shared a simple bash script for your reference. Github link.

3.2.2 Visualize the reads length distribution by scatter plot/bar chart.

Reference.

3.2.3 Check whether the reads length distribution are consistent with priori knowledge or published reports. If not, there might have some bias in your sample, which should be filtered.

3.3 Quality control criteria:

You can define a set of criteria to QC and filter your data. Here we shared our QC criteria for your reference.

Check point

Threshold

Input

Raw reads quality

reads quality >28 (median lines in green area)

Check fastqc results(*.html)

Clean reads number

> 10 million

cutadapt log file

rRNA ratio

< 10%

bowtie2 rRNA mapping log file

human genome mapped ratio

> 65%

bowtie2 human genome mapping log file

other genome region ratio

< 10%

difference between whole genome and mapped RNA

4) More practise and reference

  • Snakemake is a good workflow management system to create reproducible and scalable data analyses. Snakemake.

  • There is a Snakemake example to identify circRNA,location: /BioII/lulab_b/shared/shared_scripts/Snakemake/examples.Identify_circRNA/

  • We shared our snakemake package used to exRNA-seq basic analyses.Github link.

    *

    sharing programming tips

  • terminal: grey background, Monaco 18 (aequilate font, suitable for programming);

  • take care of file path and name when using shared scripts.

  • Slides.

    5) Homework

Level 1

  • Learn more about Illunima RNA-seq types, library preparation and analysis methods.

  • hcc_example samples:

  • Follow this tutorial step by step, and check your results by comparing with Siqi's results(location:wangsiqi/project/hcc_example)

  • hcc_lulab samples, summarize 61 samples(shared in github) and show your team's results:

  • Data summary table (ref: 3.1.1).

  • Mapping ratio plot (ref: 3.1.2).

  • Reads length distribution plot (ref: 3.2.2).

  • Filter low quality samples (ref: 3.3).

  • (Advanced) More plots to show you results. Upload your scripts if using R package.

  • (Advanced) Remove 5` adapter from sample_R2.fastq file, compare with R1 clean reads. Think the potentaial reason if they are inconsistent.

  • (Advanced) comparing gft generated fasta file and miRbase 22 fasta file.download

grep hsa ~/Downloads/hairpin.fa |wc -l
1917
grep -c ">" RNAs_fasta/miRNA.fa
1869

Level 2:

  • Write and upload a bash script to complete above analysis.

Level 3:

  • Learn snakemake usage and create your own snakefile.