1.bash_basics

bash: How to set up multiple jobs as a pipeline

1. Creating a script

You may like to create one directory like to hold your scripts

mkdir ~/scripts, and to put this path to .bashrc or .bash_profile

In order to ensure that no confusion can rise, bash script names often end in ".sh".

create one example bash scripts

vi ~/scripts/myBash1.sh

!/bin/bash

clear echo "This is information provided by mysystem.sh. Program starts now."

echo "Hello, $USER"

echo "Today's date is date, this is week date +"%V"." echo "This is uname -s running on a uname -m processor."

echo "This is the uptime information:" uptime

echo "I'm creating two variables" USERS=uptime | cut -d "," -f 3 VALUE="4" echo "There are$USERS have used this computer." echo "This is the number: $VALUE"

(Tips: if you use vim, you may like to activate syntax highlighting, type ":syntax enable" in vim, you can add this setting to your .vimrc file to make it permanent.)

2. Running a script

The script should have execute permissions for the correct owners in order to be runnable.

chmod u+x ~/scripts/myBash1.sh

type ~/scripts/myBash1.sh, bash ~/scripts/myBash1.sh or bash -x ~/scripts/myBash1.sh to run the script.

> ~/scripts/myBash1.sh
This is information provided by mysystem.sh. Program starts now.

Hello, liyang

Today's date is Wed Mar 21 22:07:26 CST 2018, this is week 12.
This is Darwin running on a x86_64 processor.

This is the uptime information:
22:07 up 30 days, 12:42, 14 users, load averages: 1.41 1.67 1.74

I'm creating two variables
There are 14 users have used this computer.
This is the number: 4

3. Bash basics

3.1 Variables (page 299)

To set a variable in the shell, use

VARNAME="value"

Setting and exporting is usually done in one step:

export VARNAME="value"

3.2 Quoting characters (page 327)

Escape characters:

echo $date
echo \$date

Single quotes:

echo '$date'

Double quotes:

echo "$date"
echo "date"
echo "I said: \"Hello World!\""

3.3 Shell expansion (page 325)

Brace expansion {}:

echo sp{el,il,al}l

Variable expansion $:

echo $SHELL
echo ${FRANKY:=Franky}

Command substitution:

echo $(date) echo date echo date

Arithmetic expansion:

echo $((365*24))
echo $[365*24]

3.4 Regular expressions (page 346)

Operator

Effect

Matches any single character.

The preceding item is optional and will be matched, at most, once.

The preceding item will be matched zero or more times.

The preceding item will be matched one or more times.

{N}

The preceding item is matched exactly N times.

{N,}

The preceding item is matched N or more times.

{N,M}

The preceding item is matched at least N times, but not more than M times.

represents the range if it's not first or last in a list or the ending point of a range in a list.

Matches the empty string at the beginning of a line; also represents the characters not in the range of a list.

Matches the empty string at the end of a line.

Matches the empty string at the edge of a word.

Matches the empty string provided it's not at the edge of a word.

Match the empty string at the beginning of word.

Match the empty string at the end of word.

3.5 grep, awk, sed, pipe(|), cut, sort, uniq, join, cat, paste (Week 1)

3.6 Conditional statements (page 379)

general:

Primary

Meaning

[-a FILE]

True if FILE exists.

[-b FILE]

True if FILE exists and is a block-special file.

[-c FILE]

True if FILE exists and is a character-special file.

[-d FILE]

True if FILE exists and is a directory.

[-e FILE]

True if FILE exists.

[-f FILE]

True if FILE exists and is a regular file.

[-g FILE]

True if FILE exists and its SGID bit is set.

[-h FILE]

True if FILE exists and is a symbolic link.

[-k FILE]

True if FILE exists and its sticky bit is set.

[-p FILE]

True if FILE exists and is a named pipe (FIFO).

[-r FILE]

True if FILE exists and is readable.

[-s FILE]

True if FILE exists and has a size greater than zero.

[-t FD]

True if file descriptor FD is open and refers to a terminal.

[-u FILE]

True if FILE exists and its SUID (set user ID) bit is set.

[-w FILE]

True if FILE exists and is writable.

[-x FILE]

True if FILE exists and is executable.

[-O FILE]

True if FILE exists and is owned by the effective user ID.

[-G FILE]

True if FILE exists and is owned by the effective group ID.

[-L FILE]

True if FILE exists and is a symbolic link.

[-N FILE]

True if FILE exists and has been modified since it was last read.

[-S FILE]

True if FILE exists and is a socket.

[FILE1 -nt FILE2]

True if FILE1 has been changed more recently than FILE2, or if FILE1 exists and FILE2 does not.

[FILE1 -ot FILE2]

True if FILE1 is older than FILE2, or is FILE2 exists and FILE1 does not.

[FILE1 -ef FILE2]

True if FILE1 and FILE2 refer to the same device and inode numbers.

for loop:

for NAME [in LIST ]; do COMMANDS; done

while loop:

while CONTROL-COMMAND; do CONSEQUENT-COMMANDS; done

until loop:

until TEST-COMMAND; do CONSEQUENT-COMMANDS; done

break and continue

3.7 Functions

FUNCTION () { COMMANDS; }

4. Example

Assuming that you have 5 sequencing data, you are trying to check the mapping quality for each sample based on the output log. The output log is looks like:

> ll
drwxr-xr-x@ 7 liyang staff 238 Mar 22 01:08 01.input
drwxr-xr-x@ 3 liyang staff 102 Mar 22 01:24 02.output
-rw-r--r-- 1 liyang staff 145 Mar 22 01:27 README
drwxr-xr-x 3 liyang staff 102 Mar 22 01:26 bin

> cd 01.input
> ls
sample1 sample2 sample3 sample4 sample5

> cd sample1

> cat Log.final.out
Started job on | Dec 22 13:14:21
Started mapping on | Dec 22 14:10:01
Finished on | Dec 22 14:49:53
Mapping speed, Million of reads per hour | 29.16

Number of input reads | 19372132
Average input read length | 51
UNIQUE READS:
Uniquely mapped reads number | 17395896
Uniquely mapped reads % | 89.80%
Average mapped length | 50.93
Number of splices: Total | 5038
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 3677
Number of splices: GC/AG | 158
Number of splices: AT/AC | 8
Number of splices: Non-canonical | 1195
Mismatch rate per base, % | 0.25%
Deletion rate per base | 0.00%
Deletion average length | 1.48
Insertion rate per base | 0.00%
Insertion average length | 1.36
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1080034
% of reads mapped to multiple loci | 5.58%
Number of reads mapped to too many loci | 189582
% of reads mapped to too many loci | 0.98%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 2.54%
% of reads unmapped: other | 1.10%

Based on these information, you need to write one bash script to extract the number of input reads, uniquely mapped reads number and multi-mapped reads number, and generate one summary file. Usually, the uniquely mapped ratio were used to measure the mapping quality. The sample do not pass the criteria should be labeled.

!/usr/bin/bash

set -o nounset set -o errexit

echo "$OPTIND start at $OPTIND"

while getopts ":i:o:n:p:" optname; do case $optname in i) input="$OPTARG";; o) outputDir="$OPTARG";; n) cutoff="$OPTARG";; p) prefix="$OPTARG";; ?) echo "Usage: basename $0 -i input -o outputDir -n cutoff -p prefix";; :) echo "No argument value for option $OPTARG";; esac

echo "$OPTIND is now $OPTIND"

echo $

done;

Initialize variables

if [ $# -eq 8 ]; then outputDir="${outputDir%*/}" echo "The input file is "basename ${input} echo "The output directory in "${outputDir} echo "The cutoff is "${cutoff} echo "the prefix for output file is "${prefix}

get values from input file

totalN=cat ${input} | grep 'Number of input reads' | cut -f 2 uniqN=cat ${input} | grep 'Uniquely mapped reads number' | cut -f 2 ratio=bc <<< "scale=4; $uniqN/$totalN" multiN=cat ${input} | grep 'Number of reads mapped to multiple loci' | cut -f 2

if (( $(echo "$ratio > $cutoff" | bc -l) )); then echo "The mapping result is pass the cutoff" echo -e "${totalN}\t${uniqN}\t${multiN}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,"pass"}' >> $outputDir/$prefix.sta else echo "The mapping result is NOT pass the cutoff" echo -e "${totalN}\t${uniqN}\t${multiN}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,"fail"}' >> $outputDir/$prefix.sta fi

echo "Job finished!" fi

Run the script:

for i in `ls 01.input/`;

do echo $i;

bash bin/sta.sh -i 01.input/$i/Log.final.out -o 02.output/ -n 0.9 -p summary;

done

Homework

level 1: type the code in your computer and understand the meaning for each command.

download link: Week_2_files/bash_example.zip

level 2: try to write one bash script to check the md5 of files in folder and output the file name of the truncated file.

> cd homework/checkMD5/
> ls
file1 file2 file3 file4 file5 md5sum.txt

> cat md5sum.txt
MD5 (./file1) = 15e8d15469ba992caac192b8684396a3
MD5 (./file2) = 85dc26fde0917b4dbba6339607b63d31
MD5 (./file3) = 62fdd313368e8125115ef53a3caaf95b
MD5 (./file4) = d09c83de50bb2da6c0c824fe44ba887a
MD5 (./file5) = db648585dd242f899818247643e599dd

download link: Week_2_files/homework/checkMD5.zip

level 3: try to write one bash script your own.

Reference

https://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html

PreviousTable of Content Next2.R_basics

Last updated 7 years ago