DOC_ID : T15-0002
AN_SnpSift module :
DOC_ID : M59-3000
Editor : Anita
Reviewer : Angela
Function :
SnpSift is a toolbox for filtering and manipulating annotated variant files.
Once your genomic variants have been annotated, you need to filter them to find the “interesting / relevant variants”. Because the data files are large, this is not a trivial task (e.g. you cannot load all the variants into an XLS spreadsheet). SnpSift performs the VCF file manipulation and filtering required at this stage of a data processing pipeline.
SnpSift utilities
SnpSift is a collection of tools to manipulate VCF (variant call format) files.
Some examples of what you can do:
Operation | Meaning |
---|---|
Filter | You can filter using arbitrary expressions, for instance “(QUAL > 30) \| (exists INDEL) \| ( countHet() < 2 )”. Expressions can be quite complex, which allows a lot of flexibility. |
Annotate | You can add ‘ID’ and INFO fields from another “VCF database” (typically the dbSNP database in VCF format). |
CaseControl | You can compare how many variants are in ‘case’ and in ‘control’ groups. Also calculates p-values (Fisher exact test). |
Intervals | Filter variants that intersect with intervals. |
Intervals (intidx) | Filter variants that intersect with intervals. Index the VCF file using memory mapped I/O to speed up the search. This is intended for huge VCF files and a small number of intervals to retrieve. |
Join | Join by generic genomic regions (intersecting or closest). |
RmRefGen | Remove reference genotype (i.e. replace ‘0/0’ genotypes by ‘.’) |
TsTv | Calculate transition to transversion ratio. |
Extract fields | Extract fields from a VCF file to a TXT (tab separated) format. |
Variant type | Adds SNP/MNP/INS/DEL to info field. It also adds “HOM/HET” if there is only one sample. |
GWAS Catalog | Annotate using GWAS Catalog. |
DbNSFP | Annotate using dbNSFP, an integrated database of functional predictions from multiple algorithms (SIFT, Polyphen2, LRT, MutationTaster, PhyloP, GERP++, etc.). |
SplitChr | Split a VCF file by chromosome |
Ref : https://pcingola.github.io/SnpEff/ss_introduction/
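To make the Filter row concrete, here is a minimal sketch of what a `QUAL > 30` filter does to a VCF stream. It uses plain awk on a made-up three-variant file rather than SnpSift itself, so it runs without Java; the file name demo.vcf and its variants are hypothetical. With SnpSift, the equivalent call would typically look like `java -jar SnpSift.jar filter "(QUAL > 30)" demo.vcf > demo.filtered.vcf`.

```shell
set -eu

# Hypothetical three-variant VCF, written to a temporary folder for illustration only.
workdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n'
  printf 'chr1\t200\t.\tC\tT\t10\tPASS\t.\n'
  printf 'chr2\t300\t.\tG\tA\t31\tPASS\t.\n'
} > "$workdir/demo.vcf"

# Keep header lines (starting with '#') and records whose QUAL (column 6) exceeds 30.
awk -F'\t' '/^#/ || $6 > 30' "$workdir/demo.vcf" > "$workdir/demo.filtered.vcf"

# Count the surviving variant records (non-header lines).
kept=$(grep -vc '^#' "$workdir/demo.filtered.vcf")
echo "variants kept: $kept"
```

The awk one-liner only covers a single numeric condition; SnpSift expressions can also combine conditions on INFO fields and genotypes, which is where the dedicated tool becomes worthwhile.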
Installation :
All required software is included in the GA environment.
Note :
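►Before running an analysis, create a project folder with CreateProject.2.0.sh; see the “Project standard folder structure” document.
►Before running the module, confirm which compute partition (--partition) applies to you: general-node users are advised to use ct56; biomedical-node users are advised to use ngs24G (see Note 1).
►To learn how to use the module, run it with the -h option.
#Note 1 : To confirm your user type, log in to NCHC iService and select Member Center / Project Management / My Projects. If the project name is “國家生醫數位資料與分析運算雲端服務平台III”, you are a biomedical-node user.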
Description :
Tested environment | GApp0.0.0.2 |
Software version | SnpSift=/opt/ohpc/Taiwania3/pkg/biology/SnpEff/snpEff_v5.0e/SnpSift.jar (SnpSift 5.0e 2021-03-09) |
Usage(Slurm) | Command in Slurm (Taiwania III): sbatch -A $projectID --mail-user=$email --export='projDir='$(pwd)'/,refGenome=hg38,sampleName=Sample01.ann.vcf,output=Sample01' modules/AN_SnpSift.sh |
Usage(Linux console) | Command in Linux console: bash modules/AN_SnpSift.sh -p $(pwd) -r hg38 -s Sample01.ann.vcf -o Sample01 |
#For Slurm operation, please refer to “Basic operation of Taiwania III” |
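The Slurm command above does not set a partition, although the note recommends ct56 for general-node users and ngs24G for biomedical-node users. As a sketch, assuming the partition is passed with sbatch's standard `--partition` option, the command can be assembled and printed before it is submitted. The `userType` variable and the placeholder values for `projectID` and `email` are hypothetical, and whether the module script already selects a partition internally is not stated here.

```shell
set -eu

# Hypothetical placeholders; substitute your own project ID and e-mail address.
projectID="MY_PROJECT_ID"
email="user@example.org"
userType="general"   # assumed values: "general" or "biomedical"

# Partition names follow the note: ct56 (general nodes), ngs24G (biomedical nodes).
case "$userType" in
  general)    partition="ct56" ;;
  biomedical) partition="ngs24G" ;;
  *) echo "unknown userType: $userType" >&2; exit 1 ;;
esac

# Variables handed to the module through --export, as in the usage line.
exportVars="projDir=$(pwd)/,refGenome=hg38,sampleName=Sample01.ann.vcf,output=Sample01"

# Dry run: build and print the command instead of submitting it.
# (A project path containing spaces would need extra quoting.)
cmd="sbatch -A $projectID --partition=$partition --mail-user=$email --export=$exportVars modules/AN_SnpSift.sh"
echo "$cmd"
```

Printing the command first makes it easy to check the partition and exported variables; once it looks right, the same line can be pasted into the console to submit the job.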
Usage :
The following explains the usage of module parameters :
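Parameter | Description | Remark |
---|---|---|
AN_SnpSift.sh | Module for filtering and manipulating annotated files | The analysis module must be stored in the [modules] folder |
projDir | Path to the analysis project folder (see the project folder structure documentation) | The script must be run from the analysis project folder; $(pwd) returns the user's current path |
sampleName | Input file name. Data format : *.ann.vcf. Data path : report/ | For example, sampleName=Sample01.ann.vcf reads the Sample01.ann.vcf file in the report/ folder |
output | Output file name. Data format : *.ann.nsfp.vcf. Data path : report/ | For example, output=Sample01 generates the Sample01.ann.nsfp.vcf file in the report/ folder |
refGenome | Reference genome database used for the analysis | hg38 and hg19 are currently supported |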
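The path conventions in the table can be checked before a job is submitted. The sketch below derives the input path and the expected output path from `projDir`, `sampleName` and `output` as the table describes. So that it runs on its own, it creates a throwaway project folder containing an empty Sample01.ann.vcf. The `check_input` function is a hypothetical helper and is not part of the module.

```shell
set -eu

# Throwaway project folder standing in for a real one (illustration only).
projDir=$(mktemp -d)
mkdir -p "$projDir/report"
: > "$projDir/report/Sample01.ann.vcf"

sampleName="Sample01.ann.vcf"
output="Sample01"
refGenome="hg38"

# Hypothetical helper: succeed only if the genome is supported,
# the input name has the expected suffix, and the file exists in report/.
check_input() {
  case "$refGenome" in
    hg38|hg19) ;;
    *) echo "unsupported refGenome: $refGenome" >&2; return 1 ;;
  esac
  case "$sampleName" in
    *.ann.vcf) ;;
    *) echo "sampleName should end in .ann.vcf" >&2; return 1 ;;
  esac
  [ -f "$projDir/report/$sampleName" ] || {
    echo "missing input: $projDir/report/$sampleName" >&2
    return 1
  }
}

check_input
inputPath="$projDir/report/$sampleName"
outputPath="$projDir/report/${output}.ann.nsfp.vcf"
echo "input:  $inputPath"
echo "output: $outputPath"
```

In a real project, `projDir` would be set to `$(pwd)` from inside the project folder instead of a temporary directory.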