1.bash_basics

bash: How to set up multiple jobs as a pipeline

1. Creating a script

You may like to create one directory like to hold your scripts

mkdir ~/scripts, and to put this path to .bashrc or .bash_profile

In order to ensure that no confusion can rise, bash script names often end in ".sh".

create one example bash scripts

vi ~/scripts/myBash1.sh

!/bin/bash

clear echo "This is information provided by mysystem.sh. Program starts now."

echo "Hello, $USER"

echo "Today's date is date, this is week date +"%V"." echo "This is uname -s running on a uname -m processor."

echo "This is the uptime information:" uptime

echo "I'm creating two variables" USERS=uptime | cut -d "," -f 3 VALUE="4" echo "There are$USERS have used this computer." echo "This is the number: $VALUE"

(Tips: if you use vim, you may like to activate syntax highlighting, type ":syntax enable" in vim, you can add this setting to your .vimrc file to make it permanent.)

2. Running a script

The script should have execute permissions for the correct owners in order to be runnable.

chmod u+x ~/scripts/myBash1.sh

type ~/scripts/myBash1.sh, bash ~/scripts/myBash1.sh or bash -x ~/scripts/myBash1.sh to run the script.

> ~/scripts/myBash1.sh
This is information provided by mysystem.sh. Program starts now.
Hello, liyang
Today's date is Wed Mar 21 22:07:26 CST 2018, this is week 12.
This is Darwin running on a x86_64 processor.
This is the uptime information:
22:07 up 30 days, 12:42, 14 users, load averages: 1.41 1.67 1.74
I'm creating two variables
There are 14 users have used this computer.
This is the number: 4

3. Bash basics

3.1 Variables (page 299)

To set a variable in the shell, use

VARNAME="value"

Setting and exporting is usually done in one step:

export VARNAME="value"

3.2 Quoting characters (page 327)

  • Escape characters:

echo $date
echo \$date
  • Single quotes:

echo '$date'
  • Double quotes:

echo "$date"
echo "date"
echo "I said: \"Hello World!\""

3.3 Shell expansion (page 325)

  • Brace expansion {}:

echo sp{el,il,al}l
  • Variable expansion $:

echo $SHELL
echo ${FRANKY:=Franky}
  • Command substitution:

echo $(date) echo date echo date

  • Arithmetic expansion:

echo $((365*24))
echo $[365*24]

3.4 Regular expressions (page 346)

Operator

Effect

.

Matches any single character.

?

The preceding item is optional and will be matched, at most, once.

*

The preceding item will be matched zero or more times.

+

The preceding item will be matched one or more times.

{N}

The preceding item is matched exactly N times.

{N,}

The preceding item is matched N or more times.

{N,M}

The preceding item is matched at least N times, but not more than M times.

-

represents the range if it's not first or last in a list or the ending point of a range in a list.

^

Matches the empty string at the beginning of a line; also represents the characters not in the range of a list.

$

Matches the empty string at the end of a line.

\b

Matches the empty string at the edge of a word.

\B

Matches the empty string provided it's not at the edge of a word.

\<

Match the empty string at the beginning of word.

\>

Match the empty string at the end of word.

3.5 grep, awk, sed, pipe(|), cut, sort, uniq, join, cat, paste (Week 1)

3.6 Conditional statements (page 379)

  • general:

Primary

Meaning

[-a FILE]

True if FILE exists.

[-b FILE]

True if FILE exists and is a block-special file.

[-c FILE]

True if FILE exists and is a character-special file.

[-d FILE]

True if FILE exists and is a directory.

[-e FILE]

True if FILE exists.

[-f FILE]

True if FILE exists and is a regular file.

[-g FILE]

True if FILE exists and its SGID bit is set.

[-h FILE]

True if FILE exists and is a symbolic link.

[-k FILE]

True if FILE exists and its sticky bit is set.

[-p FILE]

True if FILE exists and is a named pipe (FIFO).

[-r FILE]

True if FILE exists and is readable.

[-s FILE]

True if FILE exists and has a size greater than zero.

[-t FD]

True if file descriptor FD is open and refers to a terminal.

[-u FILE]

True if FILE exists and its SUID (set user ID) bit is set.

[-w FILE]

True if FILE exists and is writable.

[-x FILE]

True if FILE exists and is executable.

[-O FILE]

True if FILE exists and is owned by the effective user ID.

[-G FILE]

True if FILE exists and is owned by the effective group ID.

[-L FILE]

True if FILE exists and is a symbolic link.

[-N FILE]

True if FILE exists and has been modified since it was last read.

[-S FILE]

True if FILE exists and is a socket.

[FILE1 -nt FILE2]

True if FILE1 has been changed more recently than FILE2, or if FILE1 exists and FILE2 does not.

[FILE1 -ot FILE2]

True if FILE1 is older than FILE2, or is FILE2 exists and FILE1 does not.

[FILE1 -ef FILE2]

True if FILE1 and FILE2 refer to the same device and inode numbers.

  • for loop:

for NAME [in LIST ]; do COMMANDS; done

  • while loop:

while CONTROL-COMMAND; do CONSEQUENT-COMMANDS; done

  • until loop:

until TEST-COMMAND; do CONSEQUENT-COMMANDS; done

  • break and continue

3.7 Functions

FUNCTION () { COMMANDS; }

4. Example

Assuming that you have 5 sequencing data, you are trying to check the mapping quality for each sample based on the output log. The output log is looks like:

> ll
[email protected] 7 liyang staff 238 Mar 22 01:08 01.input
[email protected] 3 liyang staff 102 Mar 22 01:24 02.output
-rw-r--r-- 1 liyang staff 145 Mar 22 01:27 README
drwxr-xr-x 3 liyang staff 102 Mar 22 01:26 bin
> cd 01.input
> ls
sample1 sample2 sample3 sample4 sample5
> cd sample1
> cat Log.final.out
Started job on | Dec 22 13:14:21
Started mapping on | Dec 22 14:10:01
Finished on | Dec 22 14:49:53
Mapping speed, Million of reads per hour | 29.16
Number of input reads | 19372132
Average input read length | 51
UNIQUE READS:
Uniquely mapped reads number | 17395896
Uniquely mapped reads % | 89.80%
Average mapped length | 50.93
Number of splices: Total | 5038
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 3677
Number of splices: GC/AG | 158
Number of splices: AT/AC | 8
Number of splices: Non-canonical | 1195
Mismatch rate per base, % | 0.25%
Deletion rate per base | 0.00%
Deletion average length | 1.48
Insertion rate per base | 0.00%
Insertion average length | 1.36
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1080034
% of reads mapped to multiple loci | 5.58%
Number of reads mapped to too many loci | 189582
% of reads mapped to too many loci | 0.98%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 2.54%
% of reads unmapped: other | 1.10%

Based on these information, you need to write one bash script to extract the number of input reads, uniquely mapped reads number and multi-mapped reads number, and generate one summary file. Usually, the uniquely mapped ratio were used to measure the mapping quality. The sample do not pass the criteria should be labeled.

!/usr/bin/bash

set -o nounset set -o errexit

echo "$OPTIND start at $OPTIND"

while getopts ":i:o:n:p:" optname; do case $optname in i) input="$OPTARG";; o) outputDir="$OPTARG";; n) cutoff="$OPTARG";; p) prefix="$OPTARG";; ?) echo "Usage: basename $0 -i input -o outputDir -n cutoff -p prefix";; :) echo "No argument value for option $OPTARG";; esac

echo "$OPTIND is now $OPTIND"

echo $

done;

Initialize variables

if [ $# -eq 8 ]; then outputDir="${outputDir%*/}" echo "The input file is "basename ${input} echo "The output directory in "${outputDir} echo "The cutoff is "${cutoff} echo "the prefix for output file is "${prefix}

get values from input file

totalN=cat ${input} | grep 'Number of input reads' | cut -f 2 uniqN=cat ${input} | grep 'Uniquely mapped reads number' | cut -f 2 ratio=bc <<< "scale=4; $uniqN/$totalN" multiN=cat ${input} | grep 'Number of reads mapped to multiple loci' | cut -f 2

if (( $(echo "$ratio > $cutoff" | bc -l) )); then echo "The mapping result is pass the cutoff" echo -e "${totalN}\t${uniqN}\t${multiN}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,"pass"}' >> $outputDir/$prefix.sta else echo "The mapping result is NOT pass the cutoff" echo -e "${totalN}\t${uniqN}\t${multiN}" | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$3,$2/$1,$2+$3,($2+$3)/$1,"fail"}' >> $outputDir/$prefix.sta fi

echo "Job finished!" fi

Run the script:

for i in `ls 01.input/`;

do echo $i;

bash bin/sta.sh -i 01.input/$i/Log.final.out -o 02.output/ -n 0.9 -p summary;

done

Homework

level 1: type the code in your computer and understand the meaning for each command.

download link: Week_2_files/bash_example.zip

level 2: try to write one bash script to check the md5 of files in folder and output the file name of the truncated file.

> cd homework/checkMD5/
> ls
file1 file2 file3 file4 file5 md5sum.txt
> cat md5sum.txt
MD5 (./file1) = 15e8d15469ba992caac192b8684396a3
MD5 (./file2) = 85dc26fde0917b4dbba6339607b63d31
MD5 (./file3) = 62fdd313368e8125115ef53a3caaf95b
MD5 (./file4) = d09c83de50bb2da6c0c824fe44ba887a
MD5 (./file5) = db648585dd242f899818247643e599dd

download link: Week_2_files/homework/checkMD5.zip

level 3: try to write one bash script your own.

Reference

https://www.tldp.org/LDP/Bash-Beginners-Guide/html/Bash-Beginners-Guide.html