DOC_ID : T11-0001
Doc_ID: A08-0001GRCh38en104Star275a
Editor: Mira
Reviewer: hsujc
Description
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface and supports multithreaded computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads, and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels.
Build RSEM references using RefSeq, Ensembl, or GENCODE annotations
RefSeq and Ensembl are two frequently used annotations. For human and mouse, GENCODE annotations are also available. Here, we show how to build RSEM references using the Ensembl annotation. It is important to pair each genome version with its compatible GTF file.
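One quick way to catch a genome/annotation mismatch is to compare the sequence names in the two files. The sketch below is only an illustration: `check_gtf_fasta_names` is a hypothetical helper (not part of RSEM or STAR), and it assumes that each FASTA header starts with the sequence name and that the first GTF column holds the same name.

```shell
#!/usr/bin/env bash
# Sketch: report sequence names used in a GTF that are absent from a FASTA.
# check_gtf_fasta_names is a hypothetical helper, not part of RSEM or STAR.
check_gtf_fasta_names() {
    local fasta="$1" gtf="$2"
    # First file (FASTA): remember the first word of every ">" header line.
    # Second file (GTF): skip "#" comment lines and print each unknown
    # sequence name once.
    awk -F'\t' '
        NR == FNR {
            if (/^>/) { split(substr($0, 2), h, /[ \t]/); seen[h[1]] = 1 }
            next
        }
        /^#/ { next }
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf"
}
```

Calling `check_gtf_fasta_names Homo_sapiens.GRCh38.dna.primary_assembly.fa Homo_sapiens.GRCh38.104.gtf` after decompression prints nothing when every GTF sequence name is present in the FASTA. Any printed name suggests the two files may not belong to the same release or naming scheme.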
Source
- URL :
- DNA Sequence : http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/ => Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- GTF : http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/ => Homo_sapiens.GRCh38.104.gtf.gz
- File size :
- Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 860.1 MB (3.08 GB after decompression)
- Homo_sapiens.GRCh38.104.gtf.gz 48.5 MB (1.30 GB after decompression)
Genome assembly version : GRCh38 Release 104
Detail information :
Use STAR version 2.7.5a and RSEM version 1.3.3 to build the index. Make sure the GApp standard analysis environment has been installed.
#Activate standard analysis environment
conda activate GApp
#Move to the Ref folder
cd ~/GA_bundle/Ref/
#Create the folder that will hold the FASTA and GTF files
mkdir -p Homo_sapiens/GRCh38en104Star2.7.5a/
#Move into the folder
cd Homo_sapiens/GRCh38en104Star2.7.5a/
#Download the FASTA
wget ftp://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
#Decompress the file
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
#Download the GTF
wget ftp://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz
#Decompress the file
gunzip Homo_sapiens.GRCh38.104.gtf.gz
#Prepare the index with RSEM and STAR
rsem-prepare-reference \
--gtf ~/GA_bundle/Ref/Homo_sapiens/GRCh38en104Star2.7.5a/Homo_sapiens.GRCh38.104.gtf \
--star \
-p 20 \
~/GA_bundle/Ref/Homo_sapiens/GRCh38en104Star2.7.5a/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
~/GA_bundle/Ref/Homo_sapiens/GRCh38en104Star2.7.5a/GRCh38.104.genome
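After the index build finishes, it may be worth confirming that the expected files were actually written. The following is a minimal sketch, not an official check: `check_rsem_star_index` is a hypothetical helper, and the file names it looks for are a subset of those in the Index table of this document, not taken from the RSEM or STAR documentation.

```shell
#!/usr/bin/env bash
# Sketch: confirm that key reference files exist and are non-empty.
# check_rsem_star_index is a hypothetical helper; the file names are taken
# from the "Index" file list in this document.
check_rsem_star_index() {
    local dir="$1" prefix="$2" missing=0 f
    for f in "$prefix.grp" "$prefix.ti" "$prefix.seq" "$prefix.chrlist" \
             "$prefix.transcripts.fa" "$prefix.idx.fa" "$prefix.n2g.idx.fa" \
             Genome SA SAindex chrName.txt chrLength.txt genomeParameters.txt; do
        # -s is true only for files that exist and have a size above zero.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    # Return 0 when every file was found, 1 otherwise.
    return "$missing"
}
```

For the build above, the call would be `check_rsem_star_index ~/GA_bundle/Ref/Homo_sapiens/GRCh38en104Star2.7.5a GRCh38.104.genome`. A non-zero exit status means at least one listed file is absent or empty.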
Use gtfToGenePred and samtools version 1.10 to create the ref_flat file and the ribosomal intervals file, which provide gene position and rRNA position information.
#Convert the GTF to a ref_flat file with gtfToGenePred
gtfToGenePred -genePredExt -geneNameAsName2 -ignoreGroupsWithoutExons Homo_sapiens.GRCh38.104.gtf /dev/stdout | awk 'BEGIN { OFS="\t" } {print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' > Homo_sapiens.GRCh38.104.gtf.refflat
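The awk step moves column 12 of the genePredExt output (the gene name, because of -geneNameAsName2) to the front and keeps the first ten columns, which gives the 11-column refFlat layout. The sketch below reproduces that reorder on a single made-up record so the effect can be inspected without running gtfToGenePred. The names `genepred_to_refflat` and `example_row` are hypothetical, and the record is illustrative, not real annotation data.

```shell
#!/usr/bin/env bash
# Sketch: the same column reorder as the awk step above, wrapped in a function.
# genepred_to_refflat is a hypothetical name; it reads genePredExt rows on stdin.
genepred_to_refflat() {
    awk 'BEGIN { OFS = "\t" } { print $12, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 }'
}

# A made-up genePredExt record with 15 tab-separated fields:
# name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount,
# exonStarts, exonEnds, score, name2, cdsStartStat, cdsEndStat, exonFrames.
example_row() {
    printf 'TX1\t1\t+\t100\t900\t150\t850\t2\t100,500,\t300,900,\t0\tGENE1\tcmpl\tcmpl\t0,1,\n'
}
```

Running `example_row | genepred_to_refflat` prints one row that starts with GENE1, followed by TX1 and the remaining nine coordinate columns.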
#Extract genome size information from the reference genome
#Step1
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
#Step2
cut -f1,2 Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai > sizes.genome
#Step3
perl -lane 'print "\@SQ\tSN:$F[0]\tLN:$F[1]\tAS:GRCh38"' sizes.genome | grep -v _ >> Homo_sapiens.GRCh38.104.gtf.rRNA.refflat
#Append the rRNA information extracted from the GTF
grep 'gene_biotype "rRNA"' Homo_sapiens.GRCh38.104.gtf | awk '$3 == "gene"' | cut -f1,4,5,7,9 | perl -lane '/gene_id "([^"]+)"/ or die "no gene_id on $."; print join "\t", (@F[0,1,2,3], $1)' | sort -k1V -k2n -k3n >> Homo_sapiens.GRCh38.104.gtf.rRNA.refflat
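The resulting rRNA file consists of @SQ header lines followed by five-column rows (sequence, start, end, strand, gene ID). Both steps append with >>, so rerunning them against an existing file would duplicate its contents. A basic structural check can catch such problems. This is a hedged sketch: `check_interval_list` is a hypothetical helper, not part of samtools or any other tool used here, and it only tests the layout described in this paragraph.

```shell
#!/usr/bin/env bash
# Sketch: basic structural check of the rRNA interval file built above.
# check_interval_list is a hypothetical helper, not part of samtools.
check_interval_list() {
    awk -F'\t' '
        # Header lines: remember every sequence name declared as SN:<name>.
        /^@/ { if ($1 == "@SQ") sq[substr($2, 4)] = 1; next }
        # Body lines: sequence, start, end, strand, name.
        NF != 5                { print "line " NR ": expected 5 columns"; bad = 1; next }
        !($1 in sq)            { print "line " NR ": unknown sequence " $1; bad = 1 }
        ($2 + 0) > ($3 + 0)    { print "line " NR ": start > end"; bad = 1 }
        $4 != "+" && $4 != "-" { print "line " NR ": bad strand " $4; bad = 1 }
        # Exit status 0 when no problem was reported, 1 otherwise.
        END { exit (bad + 0) }
    ' "$1"
}
```

For example, `check_interval_list Homo_sapiens.GRCh38.104.gtf.rRNA.refflat` prints one message per problem found and returns a non-zero status if any row fails a check.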
Statistics
Summary
Assembly | GRCh38.p13 (Genome Reference Consortium Human Build 38), INSDC Assembly GCA_000001405.28, Dec 2013 |
Base Pairs | 3,096,649,726 |
Golden Path Length | 3,096,649,726 |
Assembly provider | Genome Reference Consortium |
Annotation provider | Ensembl |
Annotation method | Full genebuild |
Genebuild started | Jan 2014 |
Genebuild released | Jul 2014 |
Genebuild last updated/patched | Mar 2021 |
Database version | 104.38 |
Gencode version | GENCODE 38 |
Gene counts (Primary assembly)
Coding genes | 20,442 (incl 644 readthrough) |
Non coding genes | 23,982 |
Small non coding genes | 4,865 |
Long non coding genes | 16,896 (incl 307 readthrough) |
Misc non coding genes | 2,221 |
Pseudogenes | 15,228 (incl 6 readthrough) |
Gene transcripts | 237,081 |
Gene counts (Alternative sequence)
Coding genes | 3,053 (incl 26 readthrough) |
Non coding genes | 1,555 |
Small non coding genes | 297 |
Long non coding genes | 1,071 (incl 25 readthrough) |
Misc non coding genes | 187 |
Pseudogenes | 1,799 |
Gene transcripts | 21,638 |
Other
Genscan gene predictions | 51,756 |
Short Variants | 714,562,852 |
Structural variants | 6,768,792 |
Index and modification
Index
Index software | File list |
STAR rsem-prepare-reference | chrLength.txt chrName.txt chrNameLength.txt chrStart.txt exonGeTrInfo.tab exonInfo.tab geneInfo.tab Genome genomeParameters.txt GRCh38.104.genome.chrlist GRCh38.104.genome.grp GRCh38.104.genome.idx.fa GRCh38.104.genome.n2g.idx.fa GRCh38.104.genome.seq GRCh38.104.genome.ti GRCh38.104.genome.transcripts.fa Log.out SA SAindex sjdbInfo.txt sjdbList.fromGTF.out.tab sjdbList.out.tab transcriptInfo.tab |
gtfToGenePred samtools perl | Homo_sapiens.GRCh38.104.gtf.refflat Homo_sapiens.GRCh38.104.gtf.rRNA.refflat Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai sizes.genome |
Bundle files
Type | File list |
NA |