Module-GATK_GMP


DOC_ID : T15-0002

GATK_GMP module : 


DOC_ID : M05-3000
Editor : Anita
Reviewer : Angela

Function :

Map raw reads to the reference genome and create a BAM file for small indel variant calling or structural variant analysis. The module also removes duplicates and adapters to reduce biases introduced during library preparation, and recalibrates base quality scores according to GATK’s algorithm :

  1. Map to reference genome and sort : The first step is performed per read group and consists of mapping each individual read pair to the reference genome, a synthetic single-stranded representation of a common genome sequence that provides a common coordinate framework for all genomic analyses.
    • Convert the paired raw read files (_R1/_R2.fastq) and add the SM (sample name) parameter for downstream somatic variant calling. Read sequences are then extracted, adapter sequences are clipped, and the result is piped to the genome mapping software, bwa mem.
      • fastq ⇒ _sorted.bam
         
  2. Remove duplicates : The second step removes duplicate read pairs, i.e. pairs with identical R1-R2 reads.
    • _sorted.bam ⇒ _dedup.bam 
     
  3. Base (Quality Score) Recalibration : The goal of this procedure is to correct for systematic biases that affect the base quality scores assigned by the sequencer. The first pass calculates the error empirically and finds patterns in how the error varies with basecall features over all bases; the relevant observations are written to a recalibration table. The second pass applies numerical corrections to each individual basecall, based on the patterns identified in the first pass (recorded in the recalibration table), and writes the recalibrated data to a new BAM or CRAM file.
    • _dedup.bam ⇒ _recal.table
    • _dedup.bam ⇒ _bqsr.bam
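
The two recalibration passes in step 3 correspond to two GATK4 tools, BaseRecalibrator and ApplyBQSR. The sketch below only prints the commands (a dry run) so that the file flow is easy to read. The function name `print_bqsr_commands`, the reference FASTA path and the known-sites VCF path are placeholders chosen for illustration; the module's actual invocation is not shown in this document and may pass additional options.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the two BQSR commands for one sample.
# ref_fasta and known_sites are placeholder paths, not the module's real settings.
print_bqsr_commands() {
  local sample="$1" ref_fasta="$2" known_sites="$3"

  # Pass 1: model systematic errors and write the recalibration table.
  echo "gatk BaseRecalibrator -R ${ref_fasta} -I ${sample}_dedup.bam" \
       "--known-sites ${known_sites} -O ${sample}_recal.table"

  # Pass 2: apply the corrections and write the recalibrated BAM.
  echo "gatk ApplyBQSR -R ${ref_fasta} -I ${sample}_dedup.bam" \
       "--bqsr-recal-file ${sample}_recal.table -O ${sample}_bqsr.bam"
}

print_bqsr_commands Sample01 ref.fasta known_sites.vcf.gz
```

Both passes read the same _dedup.bam; only the second pass produces a new alignment file.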

Ref : https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery 

Installation :

All required software is included in the GA environment.

Note :

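►Before running the analysis, use CreateProject.sh to create a project folder; see the "Project standard folder structure" document.

►Before running the module, confirm which compute partition (--partition) applies to you : general-node users are advised to use ct224 ; biomedical-node users are advised to use ngs96G (see Note 1).

►To learn how to use the module, run it with the -h option.

#Note 1 : To confirm your user type, log in to NCHC iService and go to Member Center / Project Management / My Projects. If the project name is "國家生醫數位資料與分析運算雲端服務平台III", you are a biomedical-node user.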

Description :

Tested environment : GApp0.0.0.2

Software version : gatk4=4.1.8.1, bwa=0.7.17, sambamba=0.7.1, samtools=1.10

Usage (Slurm) : Command in Slurm (Taiwania III)
sbatch -A $projectID --mail-user=$email --export='projDir='$(pwd)'/,refGenome=hg38,inFile=Sample01-cleanup,sampleName=Sample01' modules/GATK_GMP.sh

Usage (Linux console) : Command in Linux console
bash modules/GATK_GMP.sh -p $(pwd) -r hg38 -i Sample01-cleanup -s Sample01

#For Slurm operation, please refer to “Basic operation of Taiwania III”
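
To queue several samples, the Slurm command can be wrapped in a loop. This is a minimal sketch that only prints the sbatch lines for review rather than submitting them. `build_sbatch_cmd` is a hypothetical helper, Sample02 and the fallback project ID and email are made-up values, and the sketch assumes every input follows the `<sample>-cleanup` naming used in the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the documented sbatch command for one sample.
build_sbatch_cmd() {
  local project_id="$1" email="$2" proj_dir="$3" ref="$4" sample="$5"
  local exports="projDir=${proj_dir}/,refGenome=${ref},inFile=${sample}-cleanup,sampleName=${sample}"
  printf "sbatch -A %s --mail-user=%s --export='%s' modules/GATK_GMP.sh\n" \
    "$project_id" "$email" "$exports"
}

# Dry run: print one submission line per sample.
# MY_PROJECT and user@example.org are placeholders used when the variables are unset.
for s in Sample01 Sample02; do
  build_sbatch_cmd "${projectID:-MY_PROJECT}" "${email:-user@example.org}" "$(pwd)" hg38 "$s"
done
```

Run it from the project folder so that `$(pwd)` resolves to the project path, as the module expects.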

Usage :

The following explains the usage of module parameters :

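Parameter : GATK_GMP.sh
Description : Module of genome mapping
Remark : The analysis module must be stored in the [modules] folder

Parameter : projDir
Description : Path of the analysis project folder (see the project folder structure)
Remark : The script must be run from the analysis project folder; $(pwd) returns the user's current path

Parameter : refGenome
Description : Reference genome database used for the analysis
Remark : Currently supports the GATK-hg38, GATK-b37 and GATK-hg19 genome databases

Parameter : inFile
Description : Name of the file to analyze
Remark : Data format (input) : *.fastq or *.fastq.gz
Data path : processed/
Example : inFile = Sample01-cleanup reads the following from processed/ :
1: Sample01-cleanup_R1.fastq.gz
2: Sample01-cleanup_R2.fastq.gz

Parameter : sampleName
Description : Name of the output files
Remark : Data format (output) : *.bam, *.bai, *_recal.table and *_sorted.bam
Data path : processed/
Example : sampleName = Sample01 generates the following in processed/ :
1: Sample01_sorted.bam
2: Sample01_recal.table
3: Sample01_bqsr.bam
4: Sample01_bqsr.bai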
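
Because both input file names are derived from inFile, a quick check before submitting can catch a misnamed FASTQ pair. The sketch below assumes gzipped inputs under processed/, as in the example; `check_inputs` is a hypothetical helper and is not part of the module.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: confirm both FASTQ files for inFile exist.
# Assumes the *_R1/_R2.fastq.gz naming shown in the inFile example.
check_inputs() {
  local proj_dir="$1" in_file="$2" missing=0 mate path
  for mate in R1 R2; do
    path="${proj_dir}/processed/${in_file}_${mate}.fastq.gz"
    if [ ! -f "$path" ]; then
      echo "missing: $path" >&2
      missing=1
    fi
  done
  # Return 0 only when both mates were found.
  return "$missing"
}

if check_inputs "$(pwd)" Sample01-cleanup; then
  echo "inputs found"
else
  echo "inputs incomplete"
fi
```

For uncompressed inputs, the `.fastq.gz` suffix in the sketch would need to be changed to `.fastq`.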
