DOC_ID : T15-0002

GATK_GMP module :

DOC_ID : M05-3000
Editor : Anita
Reviewer : Angela

Function :

Map raw reads to the reference genome and create a BAM file for small variant (SNV/indel) calling or structural variant analysis. The module removes duplicates and adapters to reduce biases introduced during library preparation. Base quality scores are also recalibrated using GATK’s algorithm. The module runs the following steps :

Map to reference genome and sort : The first step is performed per read group and maps each read pair to the reference genome, a synthetic single-stranded representation of common genome sequence that provides a common coordinate framework for all genomic analyses.
- Convert the paired raw read files (_R1.fastq / _R2.fastq) and add the SM (sample) parameter needed for downstream somatic variant calling. Read sequences are then extracted, adapter sequences are clipped, and the result is piped to the genome mapping software, bwa mem.
  - fastq ⇒ _sorted.bam
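
The SM parameter mentioned above is carried in the read-group (@RG) header line, which bwa mem accepts through its -R option. The following is a minimal sketch, not the module's actual code: the helper name make_read_group, the ID and PL values, and the ref.fa placeholder are assumptions, and the mapping command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch only: build a read-group string that carries the SM (sample) tag.
# make_read_group and the ID/PL values are illustrative assumptions.
make_read_group() {
    local sample="$1"
    # Emit literal backslash-t sequences; bwa mem converts them to tabs.
    printf '@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA' "$sample" "$sample"
}

rg="$(make_read_group Sample01)"
# Dry run: print the mapping and sorting pipeline instead of running it.
# ref.fa is a placeholder for the reference FASTA.
printf '%s\n' "bwa mem -R '${rg}' ref.fa Sample01-cleanup_R1.fastq.gz Sample01-cleanup_R2.fastq.gz | samtools sort -o Sample01_sorted.bam -"
```

Printing the command first makes it easy to check the read-group string before any compute time is spent.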
Remove Duplicates : The second step removes duplicate reads, i.e. read pairs whose R1 and R2 reads are identical.
- _sorted.bam ⇒ _dedup.bam
Base (Quality Score) Recalibration : The goal of this procedure is to correct for systematic biases that affect the base quality scores assigned by the sequencer. The first pass calculates the error empirically and finds patterns in how the error varies with basecall features across all bases; the relevant observations are written to a recalibration table. The second pass applies numerical corrections to each individual basecall, based on the patterns identified in the first pass (as recorded in the recalibration table), and writes the recalibrated data to a new BAM or CRAM file.
- _dedup.bam ⇒ _recal.table
- _dedup.bam ⇒ _bqsr.bam
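
The two recalibration passes correspond to the GATK4 tools BaseRecalibrator and ApplyBQSR. The sketch below only prints the two commands for one sample; it is an illustration under assumptions, not the module's own code. The helper name bqsr_commands and the ref.fa and known_sites.vcf.gz placeholders are hypothetical, and the module may pass additional options.

```shell
#!/usr/bin/env bash
# Sketch only: print the two BQSR passes for one sample (dry run).
# bqsr_commands, ref.fa and known_sites.vcf.gz are illustrative assumptions.
bqsr_commands() {
    local sample="$1" ref="$2" known="$3"
    # Pass 1: model the error patterns and write the recalibration table.
    printf '%s\n' "gatk BaseRecalibrator -R ${ref} -I ${sample}_dedup.bam --known-sites ${known} -O ${sample}_recal.table"
    # Pass 2: apply the corrections and write the recalibrated BAM.
    printf '%s\n' "gatk ApplyBQSR -R ${ref} -I ${sample}_dedup.bam --bqsr-recal-file ${sample}_recal.table -O ${sample}_bqsr.bam"
}

bqsr_commands Sample01 ref.fa known_sites.vcf.gz
```

Both passes read the same _dedup.bam; only the second pass produces a new alignment file.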

Ref : https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery

Installation :

All software is included in the GA environment.

Note :

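► Before running the analysis, use CreateProject.sh to create a project folder; see the "Project standard folder structure" document.

► Before running the module, confirm which compute partition (--partition) you belong to : general-node users are advised to use ct224 ; biomedical-node users are advised to use ngs96G (see Note 1).

► To learn how to use the module, run it with the -h option.

#Note 1 : To confirm your user type, log in to NCHC iService and select Member Center / Project Management / My Projects (會員中心/計畫管理/我的計畫). If the project name is "國家生醫數位資料與分析運算雲端服務平台III" (National Biomedical Digital Data and Analysis Computing Cloud Service Platform III), you are a biomedical-node user.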

Description :

Tested environment	GApp0.0.0.2
Software version	gatk4=4.1.8.1, bwa=0.7.17, sambamba=0.7.1, samtools=1.10
Usage(Slurm)	Command in Slurm (Taiwania III) `sbatch -A $projectID --mail-user=$email --export='projDir='$(pwd)'/,refGenome=hg38,inFile=Sample01-cleanup,sampleName=Sample01' modules/GATK_GMP.sh`
Usage(linux console)	Command in linux console `bash modules/GATK_GMP.sh -p $(pwd) -r hg38 -i Sample01-cleanup -s Sample01`
#For Slurm operation, please refer to “Basic operation of Taiwania III”
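
The Slurm usage line does not show a partition, while the notes ask users to confirm theirs. The sketch below combines the two by passing sbatch's standard --partition option. It is an assumption that the partition should be supplied on the command line (the module script may already set one), and the helper name pick_partition, the "general"/"biomedical" labels, and the placeholder project ID and e-mail address are hypothetical. The command is printed, not submitted.

```shell
#!/usr/bin/env bash
# Sketch only: choose the partition recommended in the notes and print
# the sbatch command (dry run). pick_partition and the user-type labels
# are illustrative assumptions, not part of the module.
pick_partition() {
    case "$1" in
        general)    printf '%s\n' ct224 ;;
        biomedical) printf '%s\n' ngs96G ;;
        *)          printf 'unknown user type: %s\n' "$1" >&2; return 1 ;;
    esac
}

partition="$(pick_partition general)"
projectID="${projectID:-MY_PROJECT}"   # placeholder project ID
email="${email:-user@example.com}"     # placeholder address
printf '%s\n' "sbatch -A ${projectID} --partition=${partition} --mail-user=${email} --export=projDir=$(pwd)/,refGenome=hg38,inFile=Sample01-cleanup,sampleName=Sample01 modules/GATK_GMP.sh"
```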

Usage :

The following explains the usage of module parameters :

Parameter	Description	Remark
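GATK_GMP.sh	Module of genome mapping	The module must be stored in the [modules] folder
projDir	Path to the analysis project folder (see the project folder structure description)	The script must be run from the analysis project folder; $(pwd) returns the user's current path
refGenome	Reference genome database used for the analysis	Currently supports the GATK-hg38, GATK-b37 and GATK-hg19 genome databases
inFile	Name of the file to analyze. Data format (input) : .fastq or .fastq.gz. Data path : processed/	Example: inFile = Sample01-cleanup reads the following from processed/ : 1: Sample01-cleanup_R1.fastq.gz 2: Sample01-cleanup_R2.fastq.gz
sampleName	Name of the output files. Data format (output) : .bam, .bai, _recal.table and _sorted.bam. Data path : processed/	Example: sampleName = Sample01 generates the following in processed/ : 1: Sample01_sorted.bam 2: Sample01_recal.table 3: Sample01_bqsr.bam 4: Sample01_bqsr.bai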
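
The naming rules for inFile and sampleName can be checked before a job is submitted. The sketch below lists the files the module is documented to read and write and warns about missing inputs. The helper names expected_inputs and expected_outputs are hypothetical, and the sketch assumes gzip-compressed input (.fastq.gz); plain .fastq input would need the suffix adjusted.

```shell
#!/usr/bin/env bash
# Sketch only: list the files the module is documented to read and write.
# expected_inputs / expected_outputs are illustrative helper names.
expected_inputs() {
    local proj_dir="$1" in_file="$2"
    printf '%s\n' "${proj_dir}/processed/${in_file}_R1.fastq.gz" \
                  "${proj_dir}/processed/${in_file}_R2.fastq.gz"
}

expected_outputs() {
    local proj_dir="$1" sample="$2" suffix
    for suffix in _sorted.bam _recal.table _bqsr.bam _bqsr.bai; do
        printf '%s\n' "${proj_dir}/processed/${sample}${suffix}"
    done
}

# Warn about missing inputs before submitting a job.
expected_inputs "$(pwd)" Sample01-cleanup | while IFS= read -r f; do
    [ -e "$f" ] || printf 'missing input: %s\n' "$f" >&2
done
# Show where the results are expected to appear.
expected_outputs "$(pwd)" Sample01
```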
