GATK-Genome_Variant_Calling


Editor : Anita
Reviewer: Lucas
DOC_ID : P02-0002

GATK 

The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.

These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.

    Learn more   

Environment requirement

  • Hardware :    CPU : 10 cores
        RAM : 45 Gb
  • Software :The pipeline is developed based on GA System. All software used in the pipeline are included in pre-defined software environments (GA application collection ). Please install suitable environment before start the pipeline(How to do it!) . 

Reference library 

Genome reference set

BuildNameContent
hg38GATK-hg38ref=Homo_sapiens_assembly38.fasta ;knownSite1=dbsnp_138.hg38.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
b37GATK-b37ref=human_g1k_v37.fasta ;knownSite1=dbsnp_138.b37.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.b37.vcf.gz
hg19GATK-hg19ref=ucsc.hg19.fasta.fasta ;knownSite1=dbsnp_138.hg19.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz

Annovar Reference

BuildNameContent
hg38hg38ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg38”
b37b37ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg19”
hg19hg19ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg19”

Project folder structure

  • Project Name           project.sh
    • Raw                 SingleEnd: SampleName_.fastq.gz
                        PairedEnd: SampleName-R1.fastq.gz, SampleName-R2.fastq.gz
    • Processed       SampleName_bqsr.bam, _bqsr.bam.bai
                       SampleName.g.vcf, .g.vcf.idx
                       SampleName.vcf, vcf.idx
                       ExperimentName.vcf , vcf.idx
    • QC               fastqc : sample_R1/R2_fastqc.html
                     qualimap : sample folder
    • Log               Summary.log
                     service_id.ER, OU
    • Report            Annovar : Exp01_anno.vcf
                     MultiQC : multiqc_report.html*:Files prepared by users are marked in Red

Manifest File

None

The diagram of workflow

GATK-GMP : Mapping raw reads to a reference genome and create bam file for following  small indel variant calling, structural variants analysis. The module remove duplicates, adaptor to reduce bias from library preparation.  The base quality score are also recalibrate according GATK’s algorithm. (Ref)

QC_QualiMap : Qualimap is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data. (Ref)

QC_Multiqc : Aggregate results from bioinformatics analyses across many samples into a single report, MultiQC searches a given directory for analysis logs and compiles a HTML report. (Ref)

GATK_VC : The moduel detect SNPs and indels by using  GATK HaplotypeCaller. Reads with variant are re-de novo assembly to artificial haplotype, Read are then mapped to artifical haplotype. The moduel  add Gather g.VCF from  parellelled GATK HaplotypeCaller. Combine multiple g.VCF of the same sample into one g.vcf file (Ref)

GATK_GVCF : GenotypeGVCFs generate genotype calling with GVCF files produced from the HaplotypeCaller. For multifple sample, GVCF files can be  combined by CombineGVCFs. (Ref)

GATK_MU2V : Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.This workflow requires a paired BAM files ( tumor and normal sample). the BAM format follow  the GATK Best Practices for data pre-processing  document. The SM parameter should be set as sample name. Module GATK_GMP module is suggested to generate formated BAM files. The module execute 4 step for somatic variants calling (Ref)

AN_ANNO : In this step, the “_export.vcf” file output from the previous step is added to the gene annotation with the Annovar program to generate the “_multianno.vcf” file after annotation.(Ref)

SV_Delly2GermlineCALL :  To discover germ-line structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2. the structural variants can be visualized using Delly-maze and Delly-suave.(Ref)

SV_Delly2merge : Merge results from Delly2  Germline variant callers and generate genotype call.

SV_Delly2SomaticCALL : To discover somatic structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2. the structural variants can be visualized using Delly-maze and Delly-suave.(Ref)

Module list

Module codeSupported ReferencesNode module Information
GATK_GMP.shGATK-hg38, -b37, -hg19ngs96GGATK genome mapping
QC_QualiMap.shAnyngs48GQualimap
QC_Multiqc.shAnyngs16GCombine QC resultes
GATK_VC.shGATK-hg38, -b37, -hg19ngs48GGATK halpotypecaller variant calling to g.vcf
GATK_GVCF.shGATK-hg38, -b37, -hg19ngs48GGATK CombineGVCFs, GenotypeGVCFs
GATK_MU2V.shGATK-hg38, -b37ngs48GGATK Mutect2, GetPileupSummaries,  CalculateContamination and FilterMutectCalls from bqsr_bam to variant_filted.vcf.gz

Workflow script configuration

Color theme of Script 
Gray(Dark Gray): Script comment
Red (Pale Red):  module name
Green (Dark Emerald):  Variables defined by use

The workflow script demonstrate genome variant calling pipeline implemented by using GATK and Annotation modules. The work script are included several functional section. The detail of each section show below.

1. Parameter setup : 

# *****************************************
# *****       Global parameters     ******
# *****************************************
# set your email for job message senting
email = user@your.email
# set your project account for submit job
projectID = MST-XXXXXXXX  

2. Reads Mapping : 

The is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis. 

Ref : https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery

# ****************************************************
# ******                 Reads Mapping              ******
# ***SampleN_R1,R2 => SampleN_bqsr.bam***
# ****************************************************

# ===== Mapping (fastq to _bqsr.bam) =====
GATKGMP_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFile=Sample01,sampleName=Sample01‘ GATK_GMP.sh)
echo -e $(date)’\tSample01:GATK_GMP_1\t’$GATKGMP_1’\tSubmit’ >> log/Summary.log

GATKGMP_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFile=Sample02,sampleName=Sample02‘ GATK_GMP.sh)
echo -e $(date)’\tSample02:GATK_GMP_2\t’$GATKGMP_2’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
GATK_GMP.shModule of genome mapping分析的模組需存放在[modules]資料夾中
inFileThe name of raw read files of the sample:File format :  *.fastq or  *.fastq.gz
File path : raw/
例如: inFile = Sample01 會在 raw/ 讀取:
1: Sample01_R1.fastq.gz
2: Sample01_R2.fastq.gz
refGenome在執行分析時選用的基因參考資料庫目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
sampleNameThe name of output bam file of the sampleFile format :  *.bam
File path : processed/
例如: sampleName = Sample01 會在 processed/生成Sample01_bqsr.bam

3. Mapping Quality Check: 

Bam file quality checking

Qualimap is a Java application that supports user-friendly quality control of mapping data, by considering sequence features and their genomic properties. Qualimap takes sequence alignment data and provides graphical and statistical analyses for the evaluation of data.

Ref : http://qualimap.bioinfo.cipf.es/

# *******************************************************
# ********        Mapping Quality Check          ********
# ***SampleN_bqsr.bam => QC/SampleN folder***
# *******************************************************

QM_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,inFile=Sample01_bqsr.bam,sampleName=Sample01‘ QC_QualiMap.sh )
echo -e $(date)’\tSample01:QM_1\t’$QM_1’\tSubmit’ >> log/Summary.log

QM_2=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,inFile=Sample02_bqsr.bam,sampleName=Sample02‘ QC_QualiMap.2.X.Y.sh )
echo -e $(date)’\tSample02:QM_2\t’$QM_2’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
QC_QualiMap.shModule of Bam file QC分析的模組需存放在[modules]資料夾中
inFileThe name of Mapping files of the sample:File format :  *.bam
File path : processed/
例如: inFile = Sample01_bqsr.bam 會在 processed/ 讀取:
1: Sample01_bqsr.bam
sampleNameThe folder name of  mapping file QCFile path: QC/例如: sampleName = Sample01 會在 QC/ 建立Sample01的folder :QC/Sample01/ 

Generate QC report

Aggregate results from bioinformatics analyses across many samples into a single report, MultiQC searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools.

# *****************************************
# ********     Generate report    ********
# ********     Multiqc module     ********
# *****************************************
MQC=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v “projDir=’$(pwd)’/,qcDir=QC/,rptDir=report/,rptName=multiqc” QC_Multiqc.2.X.Y.sh )
echo -e $(date)’\tMQC\t’$MQC’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
QC_Multiqc.shModule of Bam file and fastq QC分析的模組需存放在[modules]資料夾中
qcDirThe folder name of analysis QC dataFile format :  “_fastqc.zip”File path : QC/例如: qcDir=QC/會在 QC/ 讀取所有檔名符合 “_fastqc.zip”的資料
rptDirThe folder name of save QC resultFile format :  “_report.html”File path : report/例如: rptDir=report/會將生成的結果存放在 report/  
rptNameThe file name of save QC result
File format :  “_report.html”
例如: rptName=multiqc 會在report/產生名為multiqc的file :1 : report/multiqc_report.html 

4A. Germline Short Variant Detecting : 

Short variant calling

Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.

Ref : https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-

# *************************************************
# ******                 Variant calling              *****
# ***SampleN_bqsr.bam =>SampleN.g.vcf***
# *************************************************

# =====variant calling (_bqsr.bam to .g.vcf) =====
GATKVC_1=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=Sample01‘ GATK_VC.sh )
echo -e $(date)’\tSample01:GATK_VC_1\t’$GATKVC_1’\tSubmit’ >> log/Summary.log

GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=Sample02‘ GATK_VC.sh )
echo -e $(date)’\tSample02:GATK_VC_2\t’$GATKVC_2’\tSubmit’ >> log/Summary.log

For Target sequencing, e.g. Whole exome sequencing or onco-panel, User can provided target sequencing region in BED format  by using -intervals parameter

# *************************************************
# ******                 Variant calling              *****
# ***SampleN_bqsr.bam =>SampleN.g.vcf***
# *************************************************
# =====(!!!Custom intevals!!!) variant calling (_bqsr.bam to .g.vcf) =====
GATKVC_1=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample01‘ GATK_VC.sh )
echo -e $(date)’\tSample01:GATK_VC_1\t’$GATKVC_1’\tSubmit’ >> log/Summary.log

GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample02‘ GATK_VC.sh )
echo -e $(date)’\tSample02:GATK_VC_2\t’$GATKVC_2’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
GATK_VC.2.X.Y.shThe module name of GATK variants calling分析的模組需存放在[modules]資料夾中
refGenome在執行分析時選用的基因參考資料庫目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
sampleNameThe sample name of mapped files :Input file format : *_bqsr.bam
Output file  format : *.g.vcf
File path : processed/
例如: sampleName=Sample01模組會在 processed 資料夾裡讀取Sample1_bqsr.bam 進行分析,其結果輸出為Sample01.g.vcf
intervalsAdd interval conditions as needed when performing analysis : (Ref)Whole genome analysis: Intervals are not required but they can help speed up analysis by eliminating “difficult” regions and enabling parallelismExome analysis and other targeted sequencing: You must provide the list of targets, with padding, to exclude off-target noise. This will also speed up analysis and enable parallelism.GATK supports several types of interval list formats : (Ref)Picard-style .interval_listGATK-style .listBED files with extension .bedVCF files


Genotype calling : 

GenotypeGVCFs merges gVCF output files produced from the HaplotypeCaller or from combining such gVCF files using CombineGVCFs. Only gVCF files resulting from HaplotypeCaller or CombineGVCFs can be used as input for this tool. 

(Ref : http://www.genomicdataanalysis.com/genotypegvcfs/)

# **********************************************************
# *********              Genotype calling               *********
# ***SampleN.g.vcf.gz => SampleN_combined.vcf***
# **********************************************************

# ==Genotype calling (sample1~N.g.vcf => CombineGVCF)==

GATKGVCF=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFiles=’Sample01.g.vcf&&Sample02.g.vcf’,output=Exp01‘ GATK_GVCF.2.X.Y.sh )
echo -e $(date)’\t*.g.vcf to do combine\t’$GATKGVCF’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
GATK_GVCF.2.X.Y.shThe module name of GATK genotype calling分析的模組需存放在[modules]資料夾中
refGenome在執行分析時選用的基因參考資料庫目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
inFilesThe name of the sample of combine files :File format : *.g.vcf
File path : processed/►若要合併多個 g.vcf 可以用 “&&” 連結.
例如: inFiles =’Sample01.g.vcf&&Sample02.g.vcf’ 會在 processed/ 讀取:
1: processed/Sample01.g.vcf
2: processed/Sample02.g.vcf
outputThe name of the sample of through combine files :File format : *_combined.vcf 
File path : processed/
例如:output=Exp01 則資料會存放在 processed/ :1: processed/Exp01_combined.vcf

5A. Germline Structure Variant Detecting : 

Structure variant call:

To discover germline structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2.

Ref : https://github.com/dellytools/delly

# ***********************************************
# ******               Variant calling             *****
# ***SampleN_bqsr.bam =>SampleN.bcf***
# ***********************************************

# ===== Delly2GermlineCALL( _bqsr.bam to .bcf) =====

DG_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,sampleName=Sample01,refGenome=GATK-hg38‘ SV_Delly2GermlineCALL.2.X.Y.sh )
echo -e $(date)’\tSample01:Delly2GermlineCALL_1\t’$DG_1’\tSubmit’ >> log/Summary.log

DG_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,sampleName=Sample02,refGenome=GATK-hg38‘ SV_Delly2GermlineCALL.2.X.Y.sh)
echo -e $(date)’\tSample02:Delly2GermlineCALL_2\t’$DG_2’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
SV_Delly2GermlineCALL.2.X.Y.shmodule of structure variant calling分析的模組需存放在[modules]資料夾中
refGenome在執行分析時選用的基因參考資料庫目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
sampleNameThe sample name of mapped files :Input file format : *_bqsr.bam
Output file format : *.bcf
File path : processed/
例如: sampleName=Sample01,
模組會在 processed 資料夾裡讀取  Sample1_bqsr.bam 進行分析,其結果輸出為Sample01.bcf。

Genotype call:

Merge results from Delly2  Germline variant callers and generate genotype call.

Ref : https://github.com/dellytools/delly

# ===== Delly2merge( .bcf to .vcf) =====

DM=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=”Sample01&&Sample02“,outputName=Exp01‘ SV_Delly2merge.2.X.Y.sh)
echo -e $(date)’\tExp01:Delly2merge_1\t’$DM’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
SV_Delly2merge.2.X.Y.shmodule of merge bcf file分析的模組需存放在[modules]資料夾中
refGenome在執行分析時選用的基因參考資料庫目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
sampleNameThe name of bcf files :
File path : processed/
File format :  *.bcf ►若要合併多個 .bcf 可以用 “&&” 連結.
例如:單一檔案:sampleName=Sample01
多個檔案:sampleName=Sample01&&Sample02模組會在 processed 資料夾裡讀取  Sample01.bcf(and Sample02.bcf) 進行合併。
outputNameThe name of output file :
File path : processed/
File format : *_geno_merge.vcf
例如:outputName=Exp01模組會將資料以”Exp01_geno_merge.vcf”存放在[processed]資料夾中。

4B. Somatic Short Variant Detecting : 

Short variant call : 

Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.

Ref : https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-

# **************************************************
# ******                 Variant calling               *****
# ***SampleN_bqsr.bam =>SampleN.vcf.gz***
# **************************************************

# =====variant calling (_bqsr.bam to _.vcf.gz) =====

GATKMU2V=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,ControlName=Sample01, TestName=Sample02, OutName=Case01‘ GATK_MU2V.2.X.Y.sh)
echo -e $(date)’\tSample01&Sample02:GATK_MU2V\t’$GATKMU2V’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
GATK_MU2V.2.X.Y.shThe module name of GATK somatic variants calling分析的模組需存放在[modules]資料夾中
refGenome在執行分析時選用的基因參考資料庫目前支援GATK-hg38及GATK-b37 基因資料庫
ControlNameThe name of narmal sample files :File format : *_bqsr.bam
File path : processed/
例如: ControlName=Sample01 會在 processed/ 讀取
Sample01_bqsr.bam 作為對照組
TestNameThe name of test sample files :File format : *_bqsr.bam
File path : processed/
例如: TestName=Sample02 會在 processed/ 讀取
Sample02_bqsr.bam 作為測試組
OutNameThe name of somatic variant calling output files : File format : *_mutect2_VF.vcf.gz/*_segments.table 
File path : processed/
例如: OutName=Case01 會在 processed/ 資料夾裡寫入:Case01_mutect2_VF.vcf.gzCase01_segments.table 

5B. Somatic Structure Variant Detecting: 

Somatic structure variant call :  

To discover somatic structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2. 

Ref : https://github.com/dellytools/delly

# ===== Delly2SomaticCALL( _bqsr.bam to .vcf) =====

DS=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,controlName=Sample01,testName=Sample02‘ SV_Delly2SomaticCALL.2.X.Y.sh)
echo -e $(date)’\tExp01:delly2SomaticCALL\t’$DM’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
SV_Delly2SomaticCALL.2.X.Y.shmodule of structure variant calling分析的模組需存放在[modules]資料夾中
refGenome在執行分析時選用的基因參考資料庫目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫
controlName欲執行分析的對照樣品名
File format :  *_bqsr.bam
File path : processed/
例如: controlName=Sample01,
模組會在 processed 資料夾裡讀取  Sample01_bqsr.bam 作為對照組,對測試樣品變異分析作比較,若變異在對照組中也有發現,則該位點會從輸出結果中排除。
testName欲執行分析的測試樣品名
File format : *_bqsr.bam
File path : processed/
Output file format : *_Somatic.vcf 
例如: testName=Sample02,
模組會在 processed 資料夾裡讀取  Sample02_bqsr.bam 作為測試組進行分析。其結果會以VCF資料格式以為”Sample2_Somatic.vcf”, 資料會存放在[processed]資料夾。

6. Variant Annotation :  

將變異位點檔案以Annovar 程式加上基因註解,輸出註解檔。

# ==== Annovar annotation ( _export.vcf to _anno.vcf) ====

ANNO=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=hg38,inFiles=Exp01_export.vcf,output=Exp01‘ AN_ANNO.2.X.Y.sh)
echo -e $(date)’\tAnnovar export vars\t’$ANNO’\tSubmit’ >> log/Summary.log

Parameter DescriptionRemark
AN_ANNO.2.X.Y.shmodule of annotation vcf分析的模組需存放在[modules]資料夾中,此模組將變異位點檔案(vcf) 以Annovar 程式加上基因註解,會產生註解後之"_multianno.vcf"檔案
refGenome在執行註解時選用的基因參考資料庫目前支援hg38及 hg19 (相容於 b37) 基因資料庫
inFiles欲加上基因註解的變異位點檔案檔名
File  format: *.vcf ” or *.vcf.gz “
File     path: processed/
例如: inFiles=Exp01__export.vcf 會在 processed/ 讀取Exp01_export.vcf 進行位點功能註解。
output輸出的變異位點基因註解檔案名稱
File  format: *_multianno.vcf 
File     path: report/
例如: output=Exp01時會在 report/ 輸出的變異位點基因註解檔案 Exp01.hg38_multianno.vcf

 

Demo Workflow script : 

 
#!/bin/bash

# ************* GATK Genome Germline variant calling
# ************* Platform Taiwania
# ************* Script version 2.0 (Applies to modules from version 2.0.0)
## [projDir] project
##   +-- [rawDir] raw <= Sample1,2,3_R1/_R2.fastq.gz
##   +-- [prcDir] processed  => sampleName_bqsr.bam/.g.vcf.gz
##   +-- [anaDir] analyzed => case.g.vcf.gz
##   +-- [rptDir] report => RawReadQC.html, project.db
##   +-- [tmpDir] temp
##   +-- [logDir] log =>Summary.log
##   +-- [qcDir] QC => .pdf, .txt, .csv


## 1. ##
# *****************************************
# ******     Global parameters       ******
# *****************************************

# set your email for job message senting
email = user@your.email
# set your project account for submit job
projectID = MST-XXXXXXXX  

# *****************************************
# ************* raw read QC   *************
# *****************************************
#===== fastqc module =====
#Use Raw_Read_Quality_Check-Ver2 pipeline


## 2. ##
# *****************************************
# ***           Reads Mapping           ***
# ***SampleN_R1,R2 => SampleN_bqsr.bam  ***
# *****************************************

# ===== Mapping (fastq to _bqsr.bam) =====
GATKGMP_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFile=Sample01,sampleName=Sample01' modules/GATK_GMP.2.X.Y.sh)
echo -e $(date)'\tSample01:GATK_GMP_1\t'$GATKGMP_1'\tSubmit' >> log/Summary.log
GATKGMP_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFile=Sample02,sampleName=Sample02' modules/GATK_GMP.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_GMP_2\t'$GATKGMP_2'\tSubmit' >> log/Summary.log


## 3. ##
# *****************************************
# ***       Mapping Quality Check       ***
# ***     Bam file quality checking     ***
# ***SampleN_bqsr.bam => qualimap result***
# *****************************************
# ***          Generate report          ***
# ***SampleN_fastqc.zip => multiqc_html ***
# *****************************************

# ====QualiMap  ( _bqsr.bam to qualimap result) ====
QM_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,inFile=Sample01_bqsr.bam,sampleName=Sample01' modules/QC_QualiMap.2.X.Y.sh )
echo -e $(date)'\tSample01:QM_1\t'$QM_1'\tSubmit' >> log/Summary.log
QM_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,inFile=Sample02_bqsr.bam,sampleName=Sample02' modules/QC_QualiMap.2.X.Y.sh)
echo -e $(date)'\tSample02:QM_2\t'$QM_2'\tSubmit' >> log/Summary.log

# ===== Multiqc module =====
MQC=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v "projDir='$(pwd)'/,qcDir=QC/,rptDir=report/,rptName=multiqc" modules/QC_Multiqc.2.X.Y.sh)
echo -e $(date)'\tMQC\t'$MQC'\tSubmit' >> log/Summary.log


## 4A. ##
# *****************************************
# ***          Variant calling          ***
# *** SampleN_bqsr.bam => SampleN.g.vcf ***
# *****************************************
# *****************************************
# * Variant calling (!!Custom intevals!!) *
# *** SampleN_bqsr.bam => SampleN.g.vcf ***
# *****************************************
# ***          Genotype calling         ***
# **SampleN.g.vcf =>qSampleN_combined.vcf**
# *****************************************

# =====variant calling (_bqsr.bam to _.g.vcf) ===== pick one
GATKVC_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,sampleName=Sample01' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample01:GATK_VC_1\t'$GATKVC_1'\tSubmit' >> log/Summary.log
GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,sampleName=Sample02' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_VC_2\t'$GATKVC_2'\tSubmit' >> log/Summary.log

# =====variant calling !!Custom intevals!! (_bqsr.bam to _.g.vcf) ===== pick one
GATKVC_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample01' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample01:GATK_VC_1\t'$GATKVC_1'\tSubmit' >> log/Summary.log
GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample02' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_VC_2\t'$GATKVC_2'\tSubmit' >> log/Summary.log

# ==Genotype calling (sample1~N.g.vcf => CombineGVCF)==
GATKGVCF=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFiles='Sample01.g.vcf&&Sample02.g.vcf',output=Exp01' modules/GATK_GVCF.2.X.Y.sh )
echo -e $(date)'\tGATKGVCF\t'$GATKGVCF'\tSubmit' >> log/Summary.log


## 6. ##
# ******************************************
# ***         Variant annotation         ***
# ***      _export.vcf => _anno.vcf      ***
# ******************************************

# ==== Annovar annotation ( _export.vcf to _anno.vcf) ====
ANNO=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=hg38,inFiles=Exp01_export.vcf,output=Exp01' modules/AN_ANNO.2.X.Y.sh)
echo -e $(date)'\tAnnovar export vars\t'$ANNO'\tSubmit' >> log/Summary.log


 

Resources

Demo data set :  WES/WGS Demo data set

Report template: 
    Germline variant calling report (Ref)
    Somatic variant calling report (Ref )
 

Leave a comment