Editor : Anita
Reviewer: Lucas
DOC_ID : P02-0002

GATK

The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.

These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.

Learn more

Environment requirement

Hardware : CPU : 10 cores
RAM : 45 Gb
Software :The pipeline is developed based on GA System. All software used in the pipeline are included in pre-defined software environments (GA application collection ). Please install suitable environment before start the pipeline(How to do it!) .

Reference library

Genome reference set

Build	Name	Content
hg38	GATK-hg38	ref=Homo_sapiens_assembly38.fasta ;knownSite1=dbsnp_138.hg38.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
b37	GATK-b37	ref=human_g1k_v37.fasta ;knownSite1=dbsnp_138.b37.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.b37.vcf.gz
hg19	GATK-hg19	ref=ucsc.hg19.fasta.fasta ;knownSite1=dbsnp_138.hg19.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz

Annovar Reference

Build	Name	Content
hg38	hg38	ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg38”
b37	b37	ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg19”
hg19	hg19	ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg19”

Project folder structure

Project Name project.sh
- Raw SingleEnd: SampleName_.fastq.gz
  PairedEnd: SampleName-R1.fastq.gz, SampleName-R2.fastq.gz
- Processed       SampleName_bqsr.bam, _bqsr.bam.bai
                  SampleName.g.vcf, .g.vcf.idx
                  SampleName.vcf, vcf.idx
                   ExperimentName.vcf , vcf.idx
- QC fastqc : sample_R1/R2_fastqc.html
  qualimap : sample folder
- Log Summary.log
  service_id.ER, OU
- Report Annovar : Exp01_anno.vcf
  MultiQC : multiqc_report.html*:Files prepared by users are marked in Red

Manifest File

None

The diagram of workflow

GATK-GMP : Mapping raw reads to a reference genome and create bam file for following small indel variant calling, structural variants analysis. The module remove duplicates, adaptor to reduce bias from library preparation. The base quality score are also recalibrate according GATK’s algorithm. (Ref)

QC_QualiMap : Qualimap is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data. (Ref )

QC_Multiqc : Aggregate results from bioinformatics analyses across many samples into a single report, MultiQC searches a given directory for analysis logs and compiles a HTML report. (Ref )

GATK_VC : The moduel detect SNPs and indels by using GATK HaplotypeCaller. Reads with variant are re-de novo assembly to artificial haplotype, Read are then mapped to artifical haplotype. The moduel add Gather g.VCF from parellelled GATK HaplotypeCaller. Combine multiple g.VCF of the same sample into one g.vcf file (Ref)

GATK_GVCF : GenotypeGVCFs generate genotype calling with GVCF files produced from the HaplotypeCaller. For multifple sample, GVCF files can be combined by CombineGVCFs. (Ref)

GATK_MU2V : Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.This workflow requires a paired BAM files ( tumor and normal sample). the BAM format follow the GATK Best Practices for data pre-processing document. The SM parameter should be set as sample name. Module GATK_GMP module is suggested to generate formated BAM files. The module execute 4 step for somatic variants calling (Ref)

AN_ANNO : In this step, the “_export.vcf” file output from the previous step is added to the gene annotation with the Annovar program to generate the “_multianno.vcf” file after annotation.(Ref)

SV_Delly2GermlineCALL : To discover germ-line structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2. the structural variants can be visualized using Delly-maze and Delly-suave.(Ref)

SV_Delly2merge : Merge results from Delly2 Germline variant callers and generate genotype call.

SV_Delly2SomaticCALL : To discover somatic structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2. the structural variants can be visualized using Delly-maze and Delly-suave.(Ref)

Module list

Module code	Supported References	Node	module Information
GATK_GMP.sh	GATK-hg38, -b37, -hg19	ngs96G	GATK genome mapping
QC_QualiMap.sh	Any	ngs48G	Qualimap
QC_Multiqc.sh	Any	ngs16G	Combine QC resultes
GATK_VC.sh	GATK-hg38, -b37, -hg19	ngs48G	GATK halpotypecaller variant calling to g.vcf
GATK_GVCF.sh	GATK-hg38, -b37, -hg19	ngs48G	GATK CombineGVCFs, GenotypeGVCFs
GATK_MU2V.sh	GATK-hg38, -b37	ngs48G	GATK Mutect2, GetPileupSummaries, CalculateContamination and FilterMutectCalls from bqsr_bam to variant_filted.vcf.gz

Workflow script configuration

Color theme of Script
Gray(Dark Gray): Script comment
Red (Pale Red): module name
Green (Dark Emerald): Variables defined by use

The workflow script demonstrate genome variant calling pipeline implemented by using GATK and Annotation modules. The work script are included several functional section. The detail of each section show below.

1. Parameter setup :

# *****************************************
# ***** Global parameters ******
# *****************************************
# set your email for job message senting
email = user@your.email
# set your project account for submit job
projectID = MST-XXXXXXXX

2. Reads Mapping :

The is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.

Ref : https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery

# ****************************************************
# ****** Reads Mapping ******
# ***SampleN_R1,R2 => SampleN_bqsr.bam***
# ****************************************************

# ===== Mapping (fastq to _bqsr.bam) =====
GATKGMP_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFile=Sample01,sampleName=Sample01‘ GATK_GMP.sh)
echo -e $(date)’\tSample01:GATK_GMP_1\t’$GATKGMP_1’\tSubmit’ >> log/Summary.log

GATKGMP_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFile=Sample02,sampleName=Sample02‘ GATK_GMP.sh)
echo -e $(date)’\tSample02:GATK_GMP_2\t’$GATKGMP_2’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
GATK_GMP.sh	Module of genome mapping	分析的模組需存放在[modules]資料夾中
inFile	The name of raw read files of the sample:File format : .fastq or .fastq.gz File path : raw/	例如: inFile = Sample01 會在 raw/ 讀取: 1: Sample01_R1.fastq.gz 2: Sample01_R2.fastq.gz
refGenome	在執行分析時選用的基因參考資料庫	目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
sampleName	The name of output bam file of the sampleFile format : *.bam File path : processed/	例如: sampleName = Sample01 會在 processed/生成Sample01_bqsr.bam

3. Mapping Quality Check:

Bam file quality checking

Qualimap is a Java application that supports user-friendly quality control of mapping data, by considering sequence features and their genomic properties. Qualimap takes sequence alignment data and provides graphical and statistical analyses for the evaluation of data.

Ref : http://qualimap.bioinfo.cipf.es/

# *******************************************************
# ******** Mapping Quality Check ********
# ***SampleN_bqsr.bam => QC/SampleN folder***
# *******************************************************

QM_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,inFile=Sample01_bqsr.bam,sampleName=Sample01‘ QC_QualiMap.sh )
echo -e $(date)’\tSample01:QM_1\t’$QM_1’\tSubmit’ >> log/Summary.log

QM_2=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,inFile=Sample02_bqsr.bam,sampleName=Sample02‘ QC_QualiMap.2.X.Y.sh )
echo -e $(date)’\tSample02:QM_2\t’$QM_2’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
QC_QualiMap.sh	Module of Bam file QC	分析的模組需存放在[modules]資料夾中
inFile	The name of Mapping files of the sample:File format : *.bam File path : processed/	例如: inFile = Sample01_bqsr.bam 會在 processed/ 讀取: 1: Sample01_bqsr.bam
sampleName	The folder name of mapping file QCFile path: QC/	例如: sampleName = Sample01 會在 QC/ 建立Sample01的folder :QC/Sample01/

Generate QC report

Aggregate results from bioinformatics analyses across many samples into a single report, MultiQC searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools.

# *****************************************
# ******** Generate report ********
# ******** Multiqc module ********
# *****************************************
MQC=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v “projDir=’$(pwd)’/,qcDir=QC/,rptDir=report/,rptName=multiqc” QC_Multiqc.2.X.Y.sh )
echo -e $(date)’\tMQC\t’$MQC’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
QC_Multiqc.sh	Module of Bam file and fastq QC	分析的模組需存放在[modules]資料夾中
qcDir	The folder name of analysis QC dataFile format : “_fastqc.zip”File path : QC/	例如: qcDir=QC/會在 QC/ 讀取所有檔名符合 “_fastqc.zip”的資料
rptDir	The folder name of save QC resultFile format : “_report.html”File path : report/	例如: rptDir=report/會將生成的結果存放在 report/
rptName	The file name of save QC result File format : “_report.html”	例如: rptName=multiqc 會在report/產生名為multiqc的file :1 : report/multiqc_report.html

4A. Germline Short Variant Detecting :

Short variant calling

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.

Ref : https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-

# *************************************************
# ****** Variant calling *****
# ***SampleN_bqsr.bam =>SampleN.g.vcf***
# *************************************************

# =====variant calling (_bqsr.bam to .g.vcf) =====
GATKVC_1=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=Sample01‘ GATK_VC.sh )
echo -e $(date)’\tSample01:GATK_VC_1\t’$GATKVC_1’\tSubmit’ >> log/Summary.log

GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=Sample02‘ GATK_VC.sh )
echo -e $(date)’\tSample02:GATK_VC_2\t’$GATKVC_2’\tSubmit’ >> log/Summary.log

For Target sequencing, e.g. Whole exome sequencing or onco-panel, User can provided target sequencing region in BED format by using -intervals parameter

# *************************************************
# ****** Variant calling *****
# ***SampleN_bqsr.bam =>SampleN.g.vcf***
# *************************************************
# =====(!!!Custom intevals!!!) variant calling (_bqsr.bam to .g.vcf) =====
GATKVC_1=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample01‘ GATK_VC.sh )
echo -e $(date)’\tSample01:GATK_VC_1\t’$GATKVC_1’\tSubmit’ >> log/Summary.log

GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample02‘ GATK_VC.sh )
echo -e $(date)’\tSample02:GATK_VC_2\t’$GATKVC_2’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
GATK_VC.2.X.Y.sh	The module name of GATK variants calling	分析的模組需存放在[modules]資料夾中
refGenome	在執行分析時選用的基因參考資料庫	目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
sampleName	The sample name of mapped files :Input file format : _bqsr.bam Output file format : .g.vcf File path : processed/	例如: sampleName=Sample01模組會在 processed 資料夾裡讀取Sample1_bqsr.bam 進行分析，其結果輸出為Sample01.g.vcf
intervals	Add interval conditions as needed when performing analysis : (Ref)Whole genome analysis: Intervals are not required but they can help speed up analysis by eliminating “difficult” regions and enabling parallelismExome analysis and other targeted sequencing: You must provide the list of targets, with padding, to exclude off-target noise. This will also speed up analysis and enable parallelism.	GATK supports several types of interval list formats : (Ref)Picard-style .interval_listGATK-style .listBED files with extension .bedVCF files

Genotype calling :

GenotypeGVCFs merges gVCF output files produced from the HaplotypeCaller or from combining such gVCF files using CombineGVCFs. Only gVCF files resulting from HaplotypeCaller or CombineGVCFs can be used as input for this tool.

(Ref : http://www.genomicdataanalysis.com/genotypegvcfs/)

# **********************************************************
# ********* Genotype calling *********
# ***SampleN.g.vcf.gz => SampleN_combined.vcf***
# **********************************************************

# ==Genotype calling (sample1~N.g.vcf => CombineGVCF)==

GATKGVCF=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFiles=’Sample01.g.vcf&&Sample02.g.vcf’,output=Exp01‘ GATK_GVCF.2.X.Y.sh )
echo -e $(date)’\t*.g.vcf to do combine\t’$GATKGVCF’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
GATK_GVCF.2.X.Y.sh	The module name of GATK genotype calling	分析的模組需存放在[modules]資料夾中
refGenome	在執行分析時選用的基因參考資料庫	目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
inFiles	The name of the sample of combine files :File format : .g.vcf File path : processed/►若要合併多個 g.vcf 可以用 “&&” 連結.*	例如: inFiles =’Sample01.g.vcf&&Sample02.g.vcf’ 會在 processed/ 讀取: 1: processed/Sample01.g.vcf 2: processed/Sample02.g.vcf
output	The name of the sample of through combine files :File format : *_combined.vcf File path : processed/	例如:output=Exp01 則資料會存放在 processed/ :1: processed/Exp01_combined.vcf

5A. Germline Structure Variant Detecting :

Structure variant call:

To discover germline structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2.

Ref : https://github.com/dellytools/delly

# ***********************************************
# ****** Variant calling *****
# ***SampleN_bqsr.bam =>SampleN.bcf***
# ***********************************************

# ===== Delly2GermlineCALL( _bqsr.bam to .bcf) =====

DG_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,sampleName=Sample01,refGenome=GATK-hg38‘ SV_Delly2GermlineCALL.2.X.Y.sh )
echo -e $(date)’\tSample01:Delly2GermlineCALL_1\t’$DG_1’\tSubmit’ >> log/Summary.log

DG_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,sampleName=Sample02,refGenome=GATK-hg38‘ SV_Delly2GermlineCALL.2.X.Y.sh)
echo -e $(date)’\tSample02:Delly2GermlineCALL_2\t’$DG_2’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
SV_Delly2GermlineCALL.2.X.Y.sh	module of structure variant calling	分析的模組需存放在[modules]資料夾中
refGenome	在執行分析時選用的基因參考資料庫	目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
sampleName	The sample name of mapped files :Input file format : _bqsr.bam Output file format : .bcf File path : processed/	例如: sampleName=Sample01, 模組會在 processed 資料夾裡讀取 Sample1_bqsr.bam 進行分析，其結果輸出為Sample01.bcf。

Genotype call:

Merge results from Delly2 Germline variant callers and generate genotype call.

Ref : https://github.com/dellytools/delly

# ===== Delly2merge( .bcf to .vcf) =====

DM=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=”Sample01&&Sample02“,outputName=Exp01‘ SV_Delly2merge.2.X.Y.sh)
echo -e $(date)’\tExp01:Delly2merge_1\t’$DM’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
SV_Delly2merge.2.X.Y.sh	module of merge bcf file	分析的模組需存放在[modules]資料夾中
refGenome	在執行分析時選用的基因參考資料庫	目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
sampleName	The name of bcf files : File path : processed/ File format : .bcf ►若要合併多個 .bcf 可以用 “&&” 連結.*	例如:單一檔案：sampleName=Sample01 多個檔案：sampleName=Sample01&&Sample02模組會在 processed 資料夾裡讀取 Sample01.bcf(and Sample02.bcf) 進行合併。
outputName	The name of output file : File path : processed/ File format : *_geno_merge.vcf	例如：outputName=Exp01模組會將資料以”Exp01_geno_merge.vcf”存放在[processed]資料夾中。

4B. Somatic Short Variant Detecting :

Short variant call :

Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.

Ref : https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-

# **************************************************
# ****** Variant calling *****
# ***SampleN_bqsr.bam =>SampleN.vcf.gz***
# **************************************************

# =====variant calling (_bqsr.bam to _.vcf.gz) =====

GATKMU2V=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,ControlName=Sample01, TestName=Sample02, OutName=Case01‘ GATK_MU2V.2.X.Y.sh)
echo -e $(date)’\tSample01&Sample02:GATK_MU2V\t’$GATKMU2V’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
GATK_MU2V.2.X.Y.sh	The module name of GATK somatic variants calling	分析的模組需存放在[modules]資料夾中
refGenome	在執行分析時選用的基因參考資料庫	目前支援GATK-hg38及GATK-b37 基因資料庫
ControlName	The name of narmal sample files :File format : *_bqsr.bam File path : processed/	例如: ControlName=Sample01 會在 processed/ 讀取 Sample01_bqsr.bam 作為對照組
TestName	The name of test sample files :File format : *_bqsr.bam File path : processed/	例如: TestName=Sample02 會在 processed/ 讀取 Sample02_bqsr.bam 作為測試組
OutName	The name of somatic variant calling output files : File format : _mutect2_VF.vcf.gz/_segments.table File path : processed/	例如: OutName=Case01 會在 processed/ 資料夾裡寫入:Case01_mutect2_VF.vcf.gzCase01_segments.table

5B. Somatic Structure Variant Detecting:

Somatic structure variant call :

To discover somatic structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2.

Ref : https://github.com/dellytools/delly

# ===== Delly2SomaticCALL( _bqsr.bam to .vcf) =====

DS=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,controlName=Sample01,testName=Sample02‘ SV_Delly2SomaticCALL.2.X.Y.sh)
echo -e $(date)’\tExp01:delly2SomaticCALL\t’$DM’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
SV_Delly2SomaticCALL.2.X.Y.sh	module of structure variant calling	分析的模組需存放在[modules]資料夾中
refGenome	在執行分析時選用的基因參考資料庫	目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
controlName	欲執行分析的對照樣品名 File format : *_bqsr.bam File path : processed/	例如: controlName=Sample01, 模組會在 processed 資料夾裡讀取 Sample01_bqsr.bam 作為對照組，對測試樣品變異分析作比較，若變異在對照組中也有發現，則該位點會從輸出結果中排除。
testName	欲執行分析的測試樣品名 File format : _bqsr.bam File path : processed/ Output file format : _Somatic.vcf	例如: testName=Sample02, 模組會在 processed 資料夾裡讀取 Sample02_bqsr.bam 作為測試組進行分析。其結果會以VCF資料格式以為”Sample2_Somatic.vcf”, 資料會存放在[processed]資料夾。

6. Variant Annotation :

將變異位點檔案以Annovar 程式加上基因註解，輸出註解檔。

# ==== Annovar annotation ( _export.vcf to _anno.vcf) ====

ANNO=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=hg38,inFiles=Exp01_export.vcf,output=Exp01‘ AN_ANNO.2.X.Y.sh)
echo -e $(date)’\tAnnovar export vars\t’$ANNO’\tSubmit’ >> log/Summary.log

Parameter	Description	Remark
AN_ANNO.2.X.Y.sh	module of annotation vcf	分析的模組需存放在[modules]資料夾中，此模組將變異位點檔案(vcf) 以Annovar 程式加上基因註解，會產生註解後之＂_multianno.vcf＂檔案
refGenome	在執行註解時選用的基因參考資料庫	目前支援hg38及 hg19 (相容於 b37) 基因資料庫
inFiles	欲加上基因註解的變異位點檔案檔名 File format: .vcf ” or .vcf.gz “ File path: processed/	例如: inFiles=Exp01__export.vcf 會在 processed/ 讀取Exp01_export.vcf 進行位點功能註解。
output	輸出的變異位點基因註解檔案名稱 File format: *_multianno.vcf File path: report/	例如: output=Exp01時會在 report/ 輸出的變異位點基因註解檔案 Exp01.hg38_multianno.vcf

Demo Workflow script :

 
#!/bin/bash

# ************* GATK Genome Germline variant calling 
# ************* Platform Taiwania
# ************* Script version 2.0 (Applies to modules from version 2.0.0)
## [projDir] project
##   +-- [rawDir] raw <= Sample1,2,3_R1/_R2.fastq.gz
##   +-- [prcDir] processed  => sampleName_bqsr.bam/.g.vcf.gz
##   +-- [anaDir] analyzed => case.g.vcf.gz
##   +-- [rptDir] report => RawReadQC.html, project.db
##   +-- [tmpDir] temp
##   +-- [logDir] log =>Summary.log
##   +-- [qcDir] QC => .pdf, .txt, .csv


## 1. ##
# *****************************************
# ******     Global parameters       ******
# *****************************************

# set your email for job message senting
email = user@your.email
# set your project account for submit job
projectID = MST-XXXXXXXX  

# *****************************************
# ************* raw read QC   *************
# *****************************************
#===== fastqc module =====
#Use Raw_Read_Quality_Check-Ver2 pipeline


## 2. ##
# *****************************************
# ***           Reads Mapping           ***
# ***SampleN_R1,R2 => SampleN_bqsr.bam  ***
# *****************************************

# ===== Mapping (fastq to _bqsr.bam) =====
GATKGMP_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFile=Sample01,sampleName=Sample01' modules/GATK_GMP.2.X.Y.sh)
echo -e $(date)'\tSample01:GATK_GMP_1\t'$GATKGMP_1'\tSubmit' >> log/Summary.log
GATKGMP_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFile=Sample02,sampleName=Sample02' modules/GATK_GMP.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_GMP_2\t'$GATKGMP_2'\tSubmit' >> log/Summary.log


## 3. ##
# *****************************************
# ***       Mapping Quality Check       *** 
# ***     Bam file quality checking     ***
# ***SampleN_bqsr.bam => qualimap result***
# *****************************************
# ***          Generate report          ***
# ***SampleN_fastqc.zip => multiqc_html ***
# *****************************************

# ====QualiMap  ( _bqsr.bam to qualimap result) ====
QM_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,inFile=Sample01_bqsr.bam,sampleName=Sample01' modules/QC_QualiMap.2.X.Y.sh )
echo -e $(date)'\tSample01:QM_1\t'$QM_1'\tSubmit' >> log/Summary.log
QM_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,inFile=Sample02_bqsr.bam,sampleName=Sample02' modules/QC_QualiMap.2.X.Y.sh)
echo -e $(date)'\tSample02:QM_2\t'$QM_2'\tSubmit' >> log/Summary.log

# ===== Multiqc module =====
MQC=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v "projDir='$(pwd)'/,qcDir=QC/,rptDir=report/,rptName=multiqc" modules/QC_Multiqc.2.X.Y.sh)
echo -e $(date)'\tMQC\t'$MQC'\tSubmit' >> log/Summary.log


## 4A. ##
# *****************************************
# ***          Variant calling          ***
# *** SampleN_bqsr.bam => SampleN.g.vcf ***
# *****************************************
# *****************************************
# * Variant calling (!!Custom intevals!!) *
# *** SampleN_bqsr.bam => SampleN.g.vcf ***
# *****************************************
# ***          Genotype calling         *** 
# **SampleN.g.vcf =>qSampleN_combined.vcf**
# *****************************************

# =====variant calling (_bqsr.bam to _.g.vcf) ===== pick one
GATKVC_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,sampleName=Sample01' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample01:GATK_VC_1\t'$GATKVC_1'\tSubmit' >> log/Summary.log
GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,sampleName=Sample02' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_VC_2\t'$GATKVC_2'\tSubmit' >> log/Summary.log

# =====variant calling !!Custom intevals!! (_bqsr.bam to _.g.vcf) ===== pick one
GATKVC_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample01' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample01:GATK_VC_1\t'$GATKVC_1'\tSubmit' >> log/Summary.log
GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample02' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_VC_2\t'$GATKVC_2'\tSubmit' >> log/Summary.log

# ==Genotype calling (sample1~N.g.vcf => CombineGVCF)==
GATKGVCF=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFiles='Sample01.g.vcf&&Sample02.g.vcf',output=Exp01' modules/GATK_GVCF.2.X.Y.sh )
echo -e $(date)'\tGATKGVCF\t'$GATKGVCF'\tSubmit' >> log/Summary.log


## 6. ##
# ******************************************
# ***         Variant annotation         ***
# ***      _export.vcf => _anno.vcf      ***
# ******************************************

# ==== Annovar annotation ( _export.vcf to _anno.vcf) ====
ANNO=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=hg38,inFiles=Exp01_export.vcf,output=Exp01' modules/AN_ANNO.2.X.Y.sh)
echo -e $(date)'\tAnnovar export vars\t'$ANNO'\tSubmit' >> log/Summary.log

Resources

Demo data set : WES/WGS Demo data set

Report template:
Germline variant calling report (Ref)
Somatic variant calling report (Ref )

基因體研究的全方位解決對策

基因體研究的全方位解決對策

GATK-Genome_Variant_Calling

GATK

Environment requirement

Reference library

Genome reference set

Annovar Reference

Project folder structure

Manifest File

The diagram of workflow

Module list

Workflow script configuration

1. Parameter setup :

2. Reads Mapping :

3. Mapping Quality Check:

Bam file quality checking

Generate QC report

4A. Germline Short Variant Detecting :

Short variant calling

Genotype calling :

5A. Germline Structure Variant Detecting :

Structure variant call:

Genotype call:

4B. Somatic Short Variant Detecting :

Short variant call :

5B. Somatic Structure Variant Detecting:

Somatic structure variant call :

6. Variant Annotation :

Demo Workflow script :

Resources

Leave a comment 取消回覆