1.preprocessing_mapping_QC
1. Preprocessing, Mapping and QC
How to process RNA-seq data.
0) Understand your data
Before processing, make sure you have a correct understanding of what data you are analysing.
Basic information
Sequencing date
Contributor: department or labratory
Accession: GEO/SRA/PRJNA numbers in ncbi or other dataset ID(clear and brief)
Data source: download path or home-made
sample information
Organism
Disease state: cancer type or healthy
Tissue or other source (cell)
Molecular: DNA or RNA, total RNA , polyA RNA, or small RNA ?
Sequencing strategy
Sequencing platform
Library layout: Single-end V.S. Paired-end ?
Insert length: 50/100/150 ?
Strand specific ?
Size selection ?
Enrichment: Poly-A enriched or total RNA(ribosomal RNA removed) ?
Cellular localization: whole cell or intranuclear of cytoplasmic ?
When analyses published datasets, you can get all these information from ncbi website.
Our example HCC exRNA-seq data
Basic information
Sequencing date: 2017.11-
Contributor: Lulab Tsinghua University
Accession: NA
Data source: home-made
sample information
Organism: Homo sapiens
Disease state: hepatocellular carcinoma(HCC)
Tissue or other source: plasma
Molecular: total RNA(fragment)
Sequencing strategy
Sequencing platform: Illunima HiSeq X
Library layout: Paired-end 150
Insert length(RNA length): ~30 nt
Strand specific: Yes
Size selection: none
Enrichment: total RNA without ribosomal RNA removed
Cellular localization: whole extracellular circulating RNA
Flow chart of exRNA-seq processing method
Input/output of each procedure
1) Preprocessing
Quality control of raw reads, and extract the clean RNA sequence.
0. File format
0.1 Annotation file format (.gtf or .gff)
The Gene transfer format (GTF) is a tab-delimited file format used to hold information about gene annotation.
Example
You can get Gencode27 annotation files at: /BioII/lulab_b/shared/genomes/human_hg38/gtf/
Annotation file(.gtf/.gff) download source or prossed method, please refer to README file under this directory.
0.2 FASTQ Format (.fastq or .fq)
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.
Example
Example file location
Present working directory:
Create symbolic links:
Create a file under ~/projects/exRNA/hcc_examples/ to record name of these files, named as "sample_name", like:
1.1 build index
1.1.1 Link genomic annotation file to your home directory
Present working directory: your own home.
1.1.2 Get RNA sequence(.fa) from annotation file(.gtf)
Use rRNA as example.
Tool: bedtools - getfasta
Annotation file: rRNA_exon.gtf
Present working directory: ~/genome/human_hg38/index/bowtie2_index/
1.1.3 Build bowtie2 index
Use rRNA as example.
Present working directory: ~/genome/human_hg38/index/bowtie2_index/00.bowtie2_rRNA_index
Output files:
NOTE: Use exon sequences to build index.
1.2 fastqc
Use tool fastqc to test the quality of raw reads.
Processing multiple files.
Write the following command to "fastqc.sh"
Run "fastqc.sh"
Check the fastqc results
Confirm the fastqc results files *_fastqc.html and *_fastqc.zip;
Copy *_fastqc.html to your computer and open in a web browser;
Check the quality of raw reads;
Filter the low quality samples, which median of "per base sequence quality" less than 28(median line under green area).
1.3 Remove adapters and trim reads
The raw reads always with adapters, which should be removed to get clean reads.
Use cutadapt toolkit to remove adapters and trim some low quality reads. _Present working directory: ~/projects/exRNA/hcc_examples/
Above is for one sample, you can use bash script to process multiple samples.
Get more information about cutadapt toolkit, click here: [cutadapt.] (http://cutadapt.readthedocs.io/en/stable/index.html)
1.4 2nd fastQC
A second FastQC run is performed to ensure that the previous quality trimming and/or adapter removal steps successfully conserved high quality reads without being too stringent.
Run fastqc of the adapter removed reads
Differences in results with 1st fastQC:
The per-base quality scores should be different.
The Sequencing adapters are no longer identified as over-represented.
1.5 Remove ribosomal RNA mapped reads
Why remove ribosomal RNA ?
Ribosomal RNA(rRNA) hold more than 80-90% of total RNA.
Large proportions of rRNA will have an effect on the usable number of effective reads obtained from the samples.
rRNA over-expressed samples should be filtered.
Toolkit: Bowtie2 Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning short reads.
Mapping reads to rRNA
Present working directory: ~/projects/exRNA/hcc_examples
If want get a .bam file output, use samtools:
2) Mapping
Mapping reads to human different RNA types.
Input: 02.rRNA/*/*.no_rRNA.fastq
Tool: bowtie2.
Working directory:~/projects/exRNA/hcc_examples/
Ordered mapping to avoid multiple assign reads to different RNA types:
01.miRNA
02.piRNA
03.Y RNA
04.snRNA
05.srpRNA
06.tRNA
07.lncRNA
08.mRNA
09.other human genome
10.non-human genome
Use the unmapped .fastq file from last step as input to the ordered mapping.
Methods: you can follow "1.4 mapping reads to rRNA" above.
01. Use rRNA removed reads as input to mapping miRNA
02. Use miRNA removed reads as input to mapping piRNA
03.Y_RNA, 04.snRNA, 05.srpRNA, 06.tRNA, 07.lncRNA, 08.mRNA, 09.hg38
Follow the methods above.
Reads can't mapped to these 8 types RNA would be assigned as "other human genome".
Reads can't mapped to human genome would be assigned as "non-human".
3) QC (Quality Control)
When processing the real data from some experiments, there are many things can affect the RNA-seq results. So it is important to take all possibilities into consideration, and filter the low quality sample in downstream analyses.
After preprocessing and mapping, you can do some basic statistics of the prossesed datasets.
3.1 Data summary
3.1.1 Summarize the pre-processing and mapping results before downstream analyses.
The summary should contain the following part at least :
Sample_info
sample_id
sequenced date or batch number
Raw reads number (raw fastq file or cutadapt log file)
Preprocessing
Clean reads number and percentage of raw reads (cutadapt log file)
rRNA number and percentage of clean reads (rRNA mapping log file of bowtie2)
kept reads (rRNA mapping log file of bowtie2)
Mapping ratio
human genome (human genome mapping log file of bowtie2)
miRNA
piRNA
Y RNA
snRNA
srpRNA
tRNA
lncRNA
mRNA
Other human genome region (difference between whole genome and mapped RNA)
Non-human genome (human genome mapping log file of bowtie2)
3.1.2 Visualize the reads length distribution by scatter plot/bar chart.
3.1.3 Check whether the mapping ratio are consistent with priori knowledge or published reports. If not, there might have some bias in your sample, which should be filtered.
You can combine some bash command or use your own code to summarise reads length. Here we shared a simple bash script for your reference.
3.2 Reads length distribution
3.2.1 Calculate the reads length :
raw reads (raw fastq file).
clean reads (rRNA removed and low-quality trimmed reads).
mapped reads of miRNA, piRNA, Y RNA, snRNA, srpRNA, tRNA, lncRNA, mRNA (mapped result sam file).
NOTE: The sam/bam files record mapping information of all reads, include mapped and unmapped. Take care of the "flag" in 2nd column of sam/bam file. file format.
You can combine some bash command or use your own code to summarize reads length. Here we shared a simple bash script for your reference. Github link.
3.2.2 Visualize the reads length distribution by scatter plot/bar chart.
3.2.3 Check whether the reads length distribution are consistent with priori knowledge or published reports. If not, there might have some bias in your sample, which should be filtered.
3.3 Quality control criteria:
You can define a set of criteria to QC and filter your data. Here we shared our QC criteria for your reference.
4) More practise and reference
Snakemake is a good workflow management system to create reproducible and scalable data analyses. Snakemake.
There is a Snakemake example to identify circRNA,location: /BioII/lulab_b/shared/shared_scripts/Snakemake/examples.Identify_circRNA/
We shared our snakemake package used to exRNA-seq basic analyses.Github link.
*
sharing programming tips
terminal: grey background, Monaco 18 (aequilate font, suitable for programming);
take care of file path and name when using shared scripts.
5) Homework
Level 1
Learn more about Illunima RNA-seq types, library preparation and analysis methods.
hcc_example samples:
Follow this tutorial step by step, and check your results by comparing with Siqi's results(location:wangsiqi/project/hcc_example)
hcc_lulab samples, summarize 61 samples(shared in github) and show your team's results:
Data summary table (ref: 3.1.1).
Mapping ratio plot (ref: 3.1.2).
Reads length distribution plot (ref: 3.2.2).
Filter low quality samples (ref: 3.3).
(Advanced) More plots to show you results. Upload your scripts if using R package.
(Advanced) Remove 5` adapter from sample_R2.fastq file, compare with R1 clean reads. Think the potentaial reason if they are inconsistent.
(Advanced) comparing gft generated fasta file and miRbase 22 fasta file.download
Level 2:
Write and upload a bash script to complete above analysis.
Level 3:
Learn snakemake usage and create your own snakefile.
Last updated