Editor : Anita
Reviewer: Lucas
DOC_ID : P02-0002
GATK
The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.
These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.
Environment requirement
- Hardware : CPU : 10 cores
RAM : 45 Gb - Software :The pipeline is developed based on GA System. All software used in the pipeline are included in pre-defined software environments (GA application collection ). Please install suitable environment before start the pipeline(How to do it!) .
Reference library
Genome reference set
Build | Name | Content |
hg38 | GATK-hg38 | ref=Homo_sapiens_assembly38.fasta ;knownSite1=dbsnp_138.hg38.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.hg38.vcf.gz |
b37 | GATK-b37 | ref=human_g1k_v37.fasta ;knownSite1=dbsnp_138.b37.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.b37.vcf.gz |
hg19 | GATK-hg19 | ref=ucsc.hg19.fasta.fasta ;knownSite1=dbsnp_138.hg19.vcf.gz ;knownSite2=Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz |
Annovar Reference
Build | Name | Content |
hg38 | hg38 | ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg38” |
b37 | b37 | ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg19” |
hg19 | hg19 | ref = “/pkg/biology/reference/Homo_sapiens/annovar/hg19” |
Project folder structure
- Project Name project.sh
- Raw SingleEnd: SampleName_.fastq.gz
PairedEnd: SampleName-R1.fastq.gz, SampleName-R2.fastq.gz
- Processed SampleName_bqsr.bam, _bqsr.bam.bai
SampleName.g.vcf, .g.vcf.idx
SampleName.vcf, vcf.idx
ExperimentName.vcf , vcf.idx
- QC fastqc : sample_R1/R2_fastqc.html
qualimap : sample folder
- Log Summary.log
service_id.ER, OU
- Report Annovar : Exp01_anno.vcf
MultiQC : multiqc_report.html*:Files prepared by users are marked in Red
- Raw SingleEnd: SampleName_.fastq.gz
Manifest File
None
The diagram of workflow
![](https://genoinfocore.crc.nycu.edu.tw/wp-content/uploads/2022/02/P02-0002-pipeline-chart-200814-1024x605.png)
GATK-GMP : Mapping raw reads to a reference genome and create bam file for following small indel variant calling, structural variants analysis. The module remove duplicates, adaptor to reduce bias from library preparation. The base quality score are also recalibrate according GATK’s algorithm. (Ref)
QC_QualiMap : Qualimap is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data. (Ref)
QC_Multiqc : Aggregate results from bioinformatics analyses across many samples into a single report, MultiQC searches a given directory for analysis logs and compiles a HTML report. (Ref)
GATK_VC : The moduel detect SNPs and indels by using GATK HaplotypeCaller. Reads with variant are re-de novo assembly to artificial haplotype, Read are then mapped to artifical haplotype. The moduel add Gather g.VCF from parellelled GATK HaplotypeCaller. Combine multiple g.VCF of the same sample into one g.vcf file (Ref)
GATK_GVCF : GenotypeGVCFs generate genotype calling with GVCF files produced from the HaplotypeCaller. For multifple sample, GVCF files can be combined by CombineGVCFs. (Ref)
GATK_MU2V : Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.This workflow requires a paired BAM files ( tumor and normal sample). the BAM format follow the GATK Best Practices for data pre-processing document. The SM parameter should be set as sample name. Module GATK_GMP module is suggested to generate formated BAM files. The module execute 4 step for somatic variants calling (Ref)
AN_ANNO : In this step, the “_export.vcf” file output from the previous step is added to the gene annotation with the Annovar program to generate the “_multianno.vcf” file after annotation.(Ref)
SV_Delly2GermlineCALL : To discover germ-line structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2. the structural variants can be visualized using Delly-maze and Delly-suave.(Ref)
SV_Delly2merge : Merge results from Delly2 Germline variant callers and generate genotype call.
SV_Delly2SomaticCALL : To discover somatic structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2. the structural variants can be visualized using Delly-maze and Delly-suave.(Ref)
Module list
Module code | Supported References | Node | module Information |
GATK_GMP.sh | GATK-hg38, -b37, -hg19 | ngs96G | GATK genome mapping |
QC_QualiMap.sh | Any | ngs48G | Qualimap |
QC_Multiqc.sh | Any | ngs16G | Combine QC resultes |
GATK_VC.sh | GATK-hg38, -b37, -hg19 | ngs48G | GATK halpotypecaller variant calling to g.vcf |
GATK_GVCF.sh | GATK-hg38, -b37, -hg19 | ngs48G | GATK CombineGVCFs, GenotypeGVCFs |
GATK_MU2V.sh | GATK-hg38, -b37 | ngs48G | GATK Mutect2, GetPileupSummaries, CalculateContamination and FilterMutectCalls from bqsr_bam to variant_filted.vcf.gz |
Workflow script configuration
Color theme of Script
Gray(Dark Gray): Script comment
Red (Pale Red): module name
Green (Dark Emerald): Variables defined by use
The workflow script demonstrate genome variant calling pipeline implemented by using GATK and Annotation modules. The work script are included several functional section. The detail of each section show below.
1. Parameter setup :
# *****************************************
# ***** Global parameters ******
# *****************************************
# set your email for job message senting
email = user@your.email
# set your project account for submit job
projectID = MST-XXXXXXXX
2. Reads Mapping :
The is the obligatory first phase that must precede all variant discovery. It involves pre-processing the raw sequence data (provided in FASTQ format) to produce analysis-ready BAM files. This involves alignment to a reference genome as well as some data cleanup operations to correct for technical biases and make the data suitable for analysis.
# ****************************************************
# ****** Reads Mapping ******
# ***SampleN_R1,R2 => SampleN_bqsr.bam***
# ****************************************************
# ===== Mapping (fastq to _bqsr.bam) =====
GATKGMP_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFile=Sample01,sampleName=Sample01‘ GATK_GMP.sh)
echo -e $(date)’\tSample01:GATK_GMP_1\t’$GATKGMP_1’\tSubmit’ >> log/Summary.log
GATKGMP_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFile=Sample02,sampleName=Sample02‘ GATK_GMP.sh)
echo -e $(date)’\tSample02:GATK_GMP_2\t’$GATKGMP_2’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
GATK_GMP.sh | Module of genome mapping | 分析的模組需存放在[modules]資料夾中 |
inFile | The name of raw read files of the sample:File format : *.fastq or *.fastq.gz File path : raw/ | 例如: inFile = Sample01 會在 raw/ 讀取: 1: Sample01_R1.fastq.gz 2: Sample01_R2.fastq.gz |
refGenome | 在執行分析時選用的基因參考資料庫 | 目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫 |
sampleName | The name of output bam file of the sampleFile format : *.bam File path : processed/ | 例如: sampleName = Sample01 會在 processed/生成Sample01_bqsr.bam |
3. Mapping Quality Check:
Bam file quality checking
Qualimap is a Java application that supports user-friendly quality control of mapping data, by considering sequence features and their genomic properties. Qualimap takes sequence alignment data and provides graphical and statistical analyses for the evaluation of data.
Ref : http://qualimap.bioinfo.cipf.es/
# *******************************************************
# ******** Mapping Quality Check ********
# ***SampleN_bqsr.bam => QC/SampleN folder***
# *******************************************************
QM_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,inFile=Sample01_bqsr.bam,sampleName=Sample01‘ QC_QualiMap.sh )
echo -e $(date)’\tSample01:QM_1\t’$QM_1’\tSubmit’ >> log/Summary.log
QM_2=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,inFile=Sample02_bqsr.bam,sampleName=Sample02‘ QC_QualiMap.2.X.Y.sh )
echo -e $(date)’\tSample02:QM_2\t’$QM_2’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
QC_QualiMap.sh | Module of Bam file QC | 分析的模組需存放在[modules]資料夾中 |
inFile | The name of Mapping files of the sample:File format : *.bam File path : processed/ | 例如: inFile = Sample01_bqsr.bam 會在 processed/ 讀取: 1: Sample01_bqsr.bam |
sampleName | The folder name of mapping file QCFile path: QC/ | 例如: sampleName = Sample01 會在 QC/ 建立Sample01的folder :QC/Sample01/ |
Generate QC report
Aggregate results from bioinformatics analyses across many samples into a single report, MultiQC searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools.
# *****************************************
# ******** Generate report ********
# ******** Multiqc module ********
# *****************************************
MQC=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v “projDir=’$(pwd)’/,qcDir=QC/,rptDir=report/,rptName=multiqc” QC_Multiqc.2.X.Y.sh )
echo -e $(date)’\tMQC\t’$MQC’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
QC_Multiqc.sh | Module of Bam file and fastq QC | 分析的模組需存放在[modules]資料夾中 |
qcDir | The folder name of analysis QC dataFile format : “_fastqc.zip”File path : QC/ | 例如: qcDir=QC/會在 QC/ 讀取所有檔名符合 “_fastqc.zip”的資料 |
rptDir | The folder name of save QC resultFile format : “_report.html”File path : report/ | 例如: rptDir=report/會將生成的結果存放在 report/ |
rptName | The file name of save QC result File format : “_report.html” | 例如: rptName=multiqc 會在report/產生名為multiqc的file :1 : report/multiqc_report.html |
4A. Germline Short Variant Detecting :
Short variant calling
Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.
# *************************************************
# ****** Variant calling *****
# ***SampleN_bqsr.bam =>SampleN.g.vcf***
# *************************************************
# =====variant calling (_bqsr.bam to .g.vcf) =====
GATKVC_1=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=Sample01‘ GATK_VC.sh )
echo -e $(date)’\tSample01:GATK_VC_1\t’$GATKVC_1’\tSubmit’ >> log/Summary.log
GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=Sample02‘ GATK_VC.sh )
echo -e $(date)’\tSample02:GATK_VC_2\t’$GATKVC_2’\tSubmit’ >> log/Summary.log
For Target sequencing, e.g. Whole exome sequencing or onco-panel, User can provided target sequencing region in BED format by using -intervals parameter
# *************************************************
# ****** Variant calling *****
# ***SampleN_bqsr.bam =>SampleN.g.vcf***
# *************************************************
# =====(!!!Custom intevals!!!) variant calling (_bqsr.bam to .g.vcf) =====
GATKVC_1=$(qsub -P $projectID -W group_list= $projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample01‘ GATK_VC.sh )
echo -e $(date)’\tSample01:GATK_VC_1\t’$GATKVC_1’\tSubmit’ >> log/Summary.log
GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample02‘ GATK_VC.sh )
echo -e $(date)’\tSample02:GATK_VC_2\t’$GATKVC_2’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
GATK_VC.2.X.Y.sh | The module name of GATK variants calling | 分析的模組需存放在[modules]資料夾中 |
refGenome | 在執行分析時選用的基因參考資料庫 | 目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫 |
sampleName | The sample name of mapped files :Input file format : *_bqsr.bam Output file format : *.g.vcf File path : processed/ | 例如: sampleName=Sample01模組會在 processed 資料夾裡讀取Sample1_bqsr.bam 進行分析,其結果輸出為Sample01.g.vcf |
intervals | Add interval conditions as needed when performing analysis : (Ref)Whole genome analysis: Intervals are not required but they can help speed up analysis by eliminating “difficult” regions and enabling parallelismExome analysis and other targeted sequencing: You must provide the list of targets, with padding, to exclude off-target noise. This will also speed up analysis and enable parallelism. | GATK supports several types of interval list formats : (Ref)Picard-style .interval_listGATK-style .listBED files with extension .bedVCF files |
Genotype calling :
GenotypeGVCFs merges gVCF output files produced from the HaplotypeCaller or from combining such gVCF files using CombineGVCFs. Only gVCF files resulting from HaplotypeCaller or CombineGVCFs can be used as input for this tool.
(Ref : http://www.genomicdataanalysis.com/genotypegvcfs/)
# **********************************************************
# ********* Genotype calling *********
# ***SampleN.g.vcf.gz => SampleN_combined.vcf***
# **********************************************************
# ==Genotype calling (sample1~N.g.vcf => CombineGVCF)==
GATKGVCF=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,inFiles=’Sample01.g.vcf&&Sample02.g.vcf’,output=Exp01‘ GATK_GVCF.2.X.Y.sh )
echo -e $(date)’\t*.g.vcf to do combine\t’$GATKGVCF’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
GATK_GVCF.2.X.Y.sh | The module name of GATK genotype calling | 分析的模組需存放在[modules]資料夾中 |
refGenome | 在執行分析時選用的基因參考資料庫 | 目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫 |
inFiles | The name of the sample of combine files :File format : *.g.vcf File path : processed/►若要合併多個 g.vcf 可以用 “&&” 連結. | 例如: inFiles =’Sample01.g.vcf&&Sample02.g.vcf’ 會在 processed/ 讀取: 1: processed/Sample01.g.vcf 2: processed/Sample02.g.vcf |
output | The name of the sample of through combine files :File format : *_combined.vcf File path : processed/ | 例如:output=Exp01 則資料會存放在 processed/ :1: processed/Exp01_combined.vcf |
5A. Germline Structure Variant Detecting :
Structure variant call:
To discover germline structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2.
Ref : https://github.com/dellytools/delly
# ***********************************************
# ****** Variant calling *****
# ***SampleN_bqsr.bam =>SampleN.bcf***
# ***********************************************
# ===== Delly2GermlineCALL( _bqsr.bam to .bcf) =====
DG_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,sampleName=Sample01,refGenome=GATK-hg38‘ SV_Delly2GermlineCALL.2.X.Y.sh )
echo -e $(date)’\tSample01:Delly2GermlineCALL_1\t’$DG_1’\tSubmit’ >> log/Summary.log
DG_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,sampleName=Sample02,refGenome=GATK-hg38‘ SV_Delly2GermlineCALL.2.X.Y.sh)
echo -e $(date)’\tSample02:Delly2GermlineCALL_2\t’$DG_2’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
SV_Delly2GermlineCALL.2.X.Y.sh | module of structure variant calling | 分析的模組需存放在[modules]資料夾中 |
refGenome | 在執行分析時選用的基因參考資料庫 | 目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫 |
sampleName | The sample name of mapped files :Input file format : *_bqsr.bam Output file format : *.bcf File path : processed/ | 例如: sampleName=Sample01, 模組會在 processed 資料夾裡讀取 Sample1_bqsr.bam 進行分析,其結果輸出為Sample01.bcf。 |
Genotype call:
Merge results from Delly2 Germline variant callers and generate genotype call.
Ref : https://github.com/dellytools/delly
# ===== Delly2merge( .bcf to .vcf) =====
DM=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,sampleName=”Sample01&&Sample02“,outputName=Exp01‘ SV_Delly2merge.2.X.Y.sh)
echo -e $(date)’\tExp01:Delly2merge_1\t’$DM’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
SV_Delly2merge.2.X.Y.sh | module of merge bcf file | 分析的模組需存放在[modules]資料夾中 |
refGenome | 在執行分析時選用的基因參考資料庫 | 目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫 |
sampleName | The name of bcf files : File path : processed/ File format : *.bcf ►若要合併多個 .bcf 可以用 “&&” 連結. | 例如:單一檔案:sampleName=Sample01 多個檔案:sampleName=Sample01&&Sample02模組會在 processed 資料夾裡讀取 Sample01.bcf(and Sample02.bcf) 進行合併。 |
outputName | The name of output file : File path : processed/ File format : *_geno_merge.vcf | 例如:outputName=Exp01模組會將資料以”Exp01_geno_merge.vcf”存放在[processed]資料夾中。 |
4B. Somatic Short Variant Detecting :
Short variant call :
Identify somatic short variants (SNVs and Indels) in one or more tumor samples from a single individual, with or without a matched normal sample.
# **************************************************
# ****** Variant calling *****
# ***SampleN_bqsr.bam =>SampleN.vcf.gz***
# **************************************************
# =====variant calling (_bqsr.bam to _.vcf.gz) =====
GATKMU2V=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,ControlName=Sample01, TestName=Sample02, OutName=Case01‘ GATK_MU2V.2.X.Y.sh)
echo -e $(date)’\tSample01&Sample02:GATK_MU2V\t’$GATKMU2V’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
GATK_MU2V.2.X.Y.sh | The module name of GATK somatic variants calling | 分析的模組需存放在[modules]資料夾中 |
refGenome | 在執行分析時選用的基因參考資料庫 | 目前支援GATK-hg38及GATK-b37 基因資料庫 |
ControlName | The name of narmal sample files :File format : *_bqsr.bam File path : processed/ | 例如: ControlName=Sample01 會在 processed/ 讀取 Sample01_bqsr.bam 作為對照組 |
TestName | The name of test sample files :File format : *_bqsr.bam File path : processed/ | 例如: TestName=Sample02 會在 processed/ 讀取 Sample02_bqsr.bam 作為測試組 |
OutName | The name of somatic variant calling output files : File format : *_mutect2_VF.vcf.gz/*_segments.table File path : processed/ | 例如: OutName=Case01 會在 processed/ 資料夾裡寫入:Case01_mutect2_VF.vcf.gzCase01_segments.table |
5B. Somatic Structure Variant Detecting:
Somatic structure variant call :
To discover somatic structure variant such as deletions, tandem duplications, inversions and translocations by using Delly 2.
Ref : https://github.com/dellytools/delly
# ===== Delly2SomaticCALL( _bqsr.bam to .vcf) =====
DS=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=GATK-hg38,controlName=Sample01,testName=Sample02‘ SV_Delly2SomaticCALL.2.X.Y.sh)
echo -e $(date)’\tExp01:delly2SomaticCALL\t’$DM’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
SV_Delly2SomaticCALL.2.X.Y.sh | module of structure variant calling | 分析的模組需存放在[modules]資料夾中 |
refGenome | 在執行分析時選用的基因參考資料庫 | 目前支援GATK-hg38, GATK-b37及GATK-hg19 基因資料庫 |
controlName | 欲執行分析的對照樣品名 File format : *_bqsr.bam File path : processed/ | 例如: controlName=Sample01, 模組會在 processed 資料夾裡讀取 Sample01_bqsr.bam 作為對照組,對測試樣品變異分析作比較,若變異在對照組中也有發現,則該位點會從輸出結果中排除。 |
testName | 欲執行分析的測試樣品名 File format : *_bqsr.bam File path : processed/ Output file format : *_Somatic.vcf | 例如: testName=Sample02, 模組會在 processed 資料夾裡讀取 Sample02_bqsr.bam 作為測試組進行分析。其結果會以VCF資料格式以為”Sample2_Somatic.vcf”, 資料會存放在[processed]資料夾。 |
6. Variant Annotation :
將變異位點檔案以Annovar 程式加上基因註解,輸出註解檔。
# ==== Annovar annotation ( _export.vcf to _anno.vcf) ====
ANNO=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v ‘projDir=’$(pwd)’/,refGenome=hg38,inFiles=Exp01_export.vcf,output=Exp01‘ AN_ANNO.2.X.Y.sh)
echo -e $(date)’\tAnnovar export vars\t’$ANNO’\tSubmit’ >> log/Summary.log
Parameter | Description | Remark |
AN_ANNO.2.X.Y.sh | module of annotation vcf | 分析的模組需存放在[modules]資料夾中,此模組將變異位點檔案(vcf) 以Annovar 程式加上基因註解,會產生註解後之"_multianno.vcf"檔案 |
refGenome | 在執行註解時選用的基因參考資料庫 | 目前支援hg38及 hg19 (相容於 b37) 基因資料庫 |
inFiles | 欲加上基因註解的變異位點檔案檔名 File format: *.vcf ” or *.vcf.gz “ File path: processed/ | 例如: inFiles=Exp01__export.vcf 會在 processed/ 讀取Exp01_export.vcf 進行位點功能註解。 |
output | 輸出的變異位點基因註解檔案名稱 File format: *_multianno.vcf File path: report/ | 例如: output=Exp01時會在 report/ 輸出的變異位點基因註解檔案 Exp01.hg38_multianno.vcf |
Demo Workflow script :
#!/bin/bash
# ************* GATK Genome Germline variant calling
# ************* Platform Taiwania
# ************* Script version 2.0 (Applies to modules from version 2.0.0)
## [projDir] project
## +-- [rawDir] raw <= Sample1,2,3_R1/_R2.fastq.gz
## +-- [prcDir] processed => sampleName_bqsr.bam/.g.vcf.gz
## +-- [anaDir] analyzed => case.g.vcf.gz
## +-- [rptDir] report => RawReadQC.html, project.db
## +-- [tmpDir] temp
## +-- [logDir] log =>Summary.log
## +-- [qcDir] QC => .pdf, .txt, .csv
## 1. ##
# *****************************************
# ****** Global parameters ******
# *****************************************
# set your email for job message senting
email = user@your.email
# set your project account for submit job
projectID = MST-XXXXXXXX
# *****************************************
# ************* raw read QC *************
# *****************************************
#===== fastqc module =====
#Use Raw_Read_Quality_Check-Ver2 pipeline
## 2. ##
# *****************************************
# *** Reads Mapping ***
# ***SampleN_R1,R2 => SampleN_bqsr.bam ***
# *****************************************
# ===== Mapping (fastq to _bqsr.bam) =====
GATKGMP_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFile=Sample01,sampleName=Sample01' modules/GATK_GMP.2.X.Y.sh)
echo -e $(date)'\tSample01:GATK_GMP_1\t'$GATKGMP_1'\tSubmit' >> log/Summary.log
GATKGMP_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFile=Sample02,sampleName=Sample02' modules/GATK_GMP.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_GMP_2\t'$GATKGMP_2'\tSubmit' >> log/Summary.log
## 3. ##
# *****************************************
# *** Mapping Quality Check ***
# *** Bam file quality checking ***
# ***SampleN_bqsr.bam => qualimap result***
# *****************************************
# *** Generate report ***
# ***SampleN_fastqc.zip => multiqc_html ***
# *****************************************
# ====QualiMap ( _bqsr.bam to qualimap result) ====
QM_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,inFile=Sample01_bqsr.bam,sampleName=Sample01' modules/QC_QualiMap.2.X.Y.sh )
echo -e $(date)'\tSample01:QM_1\t'$QM_1'\tSubmit' >> log/Summary.log
QM_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,inFile=Sample02_bqsr.bam,sampleName=Sample02' modules/QC_QualiMap.2.X.Y.sh)
echo -e $(date)'\tSample02:QM_2\t'$QM_2'\tSubmit' >> log/Summary.log
# ===== Multiqc module =====
MQC=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v "projDir='$(pwd)'/,qcDir=QC/,rptDir=report/,rptName=multiqc" modules/QC_Multiqc.2.X.Y.sh)
echo -e $(date)'\tMQC\t'$MQC'\tSubmit' >> log/Summary.log
## 4A. ##
# *****************************************
# *** Variant calling ***
# *** SampleN_bqsr.bam => SampleN.g.vcf ***
# *****************************************
# *****************************************
# * Variant calling (!!Custom intevals!!) *
# *** SampleN_bqsr.bam => SampleN.g.vcf ***
# *****************************************
# *** Genotype calling ***
# **SampleN.g.vcf =>qSampleN_combined.vcf**
# *****************************************
# =====variant calling (_bqsr.bam to _.g.vcf) ===== pick one
GATKVC_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,sampleName=Sample01' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample01:GATK_VC_1\t'$GATKVC_1'\tSubmit' >> log/Summary.log
GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,sampleName=Sample02' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_VC_2\t'$GATKVC_2'\tSubmit' >> log/Summary.log
# =====variant calling !!Custom intevals!! (_bqsr.bam to _.g.vcf) ===== pick one
GATKVC_1=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample01' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample01:GATK_VC_1\t'$GATKVC_1'\tSubmit' >> log/Summary.log
GATKVC_2=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,intervals=WES.bed,sampleName=Sample02' modules/GATK_VC.2.X.Y.sh )
echo -e $(date)'\tSample02:GATK_VC_2\t'$GATKVC_2'\tSubmit' >> log/Summary.log
# ==Genotype calling (sample1~N.g.vcf => CombineGVCF)==
GATKGVCF=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=GATK-hg38,inFiles='Sample01.g.vcf&&Sample02.g.vcf',output=Exp01' modules/GATK_GVCF.2.X.Y.sh )
echo -e $(date)'\tGATKGVCF\t'$GATKGVCF'\tSubmit' >> log/Summary.log
## 6. ##
# ******************************************
# *** Variant annotation ***
# *** _export.vcf => _anno.vcf ***
# ******************************************
# ==== Annovar annotation ( _export.vcf to _anno.vcf) ====
ANNO=$(qsub -P $projectID -W group_list=$projectID -m e -M $email -v 'projDir='$(pwd)'/,refGenome=hg38,inFiles=Exp01_export.vcf,output=Exp01' modules/AN_ANNO.2.X.Y.sh)
echo -e $(date)'\tAnnovar export vars\t'$ANNO'\tSubmit' >> log/Summary.log
Resources
Demo data set : WES/WGS Demo data set
Report template:
Germline variant calling report (Ref)
Somatic variant calling report (Ref )