今天小编给大家分享一下R语言 TCGAbiolinks包的参数有哪些的相关知识点,内容详细,逻辑清晰,相信大部分人都还太了解这方面的知识,所以分享这篇文章给大家参考一下,希望大家阅读完这篇文章后有所收获,下面我们一起来了解一下吧。
local({r <- getOption("repos") r["CRAN"] <- "http://mirrors.tuna.tsinghua.edu.cn/CRAN/" options(repos=r)}) if (!requireNamespace("BiocManager", quietly=TRUE)){ install.packages("BiocManager") } options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor") BiocManager::install("TCGAbiolinks") library(TCGAbiolinks)
下载数据分为三步,分别用到TCGAbiolinks包中三个函数:
1)查询数据 GDCquery()
2)下载数据 getResults()
3)保存整理数据 GDCprepare()
以上三步中重点介绍第一个GDCquery()使用方法,其参数最多12个,而且每个参数可设置的选项也非常多,剩下两个函数,使用相对简单了。以下为使用方法和参数说明:
GDCquery(project, data.category, data.type, workflow.type, legacy = FALSE, access, platform, file.type, barcode, data.format, experimental.strategy, sample.type)
简单的使用举例:
query <- GDCquery(project = "TCGA-ACC", data.category = "Copy Number Variation", data.type = "Copy Number Segment")
1.project
可以通过getGDCprojects()$project_id,获取TCGA中最新的不同癌种的项目号,更新项目信息对应癌症名称:https://www.亿速云.com/article/1061
> getGDCprojects()$project_id [1] "TCGA-MESO" "TCGA-READ" "TCGA-SARC" [4] "TCGA-ACC" "TCGA-LGG" "TCGA-THCA" [7] "TARGET-CCSK" "TARGET-NBL" "BEATAML1.0-CRENOLANIB" [10] "TARGET-AML" "TCGA-SKCM" "TCGA-CHOL" [13] "TCGA-KIRC" "TCGA-BRCA" "VAREPOP-APOLLO" [16] "HCMI-CMDC" "ORGANOID-PANCREATIC" "TCGA-GBM" [19] "TCGA-OV" "FM-AD" "TCGA-UCEC" [22] "TARGET-ALL-P3" "CGCI-BLGSP" "TARGET-ALL-P2" [25] "TCGA-LAML" "TCGA-DLBC" "TCGA-KICH" [28] "TCGA-THYM" "TCGA-UVM" "TCGA-PRAD" [31] "TCGA-LUSC" "TCGA-TGCT" "CPTAC-3" [34] "BEATAML1.0-COHORT" "TCGA-STAD" "TCGA-LIHC" [37] "TCGA-COAD" "TARGET-OS" "TARGET-RT" [40] "CTSP-DLBCL1" "TCGA-HNSC" "TCGA-ESCA" [43] "TCGA-CESC" "TCGA-PCPG" "TCGA-KIRP" [46] "TCGA-UCS" "TCGA-PAAD" "TCGA-LUAD" [49] "TARGET-WT" "MMRF-COMMPASS" "TCGA-BLCA" [52] "NCICCR-DLBCL" "TARGET-ALL-P1"
2.data.category
可以使用TCGAbiolinks:::getProjectSummary(project)查看project中有哪些数据类型,如查询"TCGA-ACC",有7种数据类型,case_count为病人数,file_count为对应的文件数。下载表达谱,可以设置data.category="Transcriptome Profiling":
> TCGAbiolinks:::getProjectSummary("TCGA-ACC") $data_categories case_count file_count data_category 1 80 397 Transcriptome Profiling 2 92 361 Copy Number Variation 3 92 744 Simple Nucleotide Variation 4 80 80 DNA Methylation 5 92 105 Clinical 6 92 352 Sequencing Reads 7 92 517 Biospecimen $case_count [1] 92 $file_count [1] 2556 $file_size [1] 3.920606e+12
3.data.type
这个参数受到上一个参数的影响,不同的data.category,会有不同的data.type,如下表所示:
如果下载表达数据,常用的设置如下: #下载rna-seq转录组的表达数据 data.type = "Gene Expresion Quantification" #下载miRNA表达数据数据 data.type = "miRNA Expression Quantification" #下载Copy Number Variation数据 data.type = "Copy Number Segment"
4.workflow.type
这个参数受到上两个参数的影响,不同的data.category和不同的data.type,会有不同的workflow.type
5 legacy
这个参数主要是设置TCGA数据有两不同入口可以下载,GDC Legacy Archive 和 GDC Data Portal,以下是官方的解释两种数据Legacy or Harmonized区别:大致意思为:Legacy 数据hg19和hg18为参考基因组(老数据)而且已经不再更新了,Harmonized数据以hg38为参考基因组的数据(新数据),现在一般选择Harmonized。
Different sources: Legacy vs Harmonized There are two available sources to download GDC data using TCGAbiolinks:
GDC Legacy Archive : provides access to an unmodified copy of data that was previously stored in CGHub and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC), in which uses as references GRCh47 (hg19) and GRCh46 (hg18).
GDC harmonized database: data available was harmonized against GRCh48 (hg38) using GDC Bioinformatics Pipelines which provides methods to the standardization of biospecimen and clinical data.
Harmonized data options (legacy = FALSE) Legacy archive data options (legacy = TRUE)
不同的的数据(新老Legacy or Harmonized),里面存储的数据会有差异,会影响前面data.category、 data.type 、 前面三个参数可以设置的值如下:
6 access
Filter by access type. Possible values: controlled, open,筛选数据是否开放,这个一般不用设置,不开放的数据也没必要了,所以都设置成:access=“open" |
7.platform
涉及到数据来源的平台,如芯片数据,甲基化数据等等平台的筛选,一般不做设置,除非要筛选特定平台的数据:
Example: | ||
CGH- 1x1M_G4447A | IlluminaGA_RNASeqV2 | |
AgilentG4502A_07 | IlluminaGA_mRNA_DGE | |
Human1MDuo | HumanMethylation450 | |
HG-CGH-415K_G4124A | IlluminaGA_miRNASeq | |
HumanHap550 | IlluminaHiSeq_miRNASeq | |
ABI | H-miRNA_8x15K | |
HG-CGH-244A | SOLiD_DNASeq | |
IlluminaDNAMethylation_OMA003_CPI | IlluminaGA_DNASeq_automated | |
IlluminaDNAMethylation_OMA002_CPI | HG-U133_Plus_2 | |
HuEx- 1_0-st-v2 | Mixed_DNASeq | |
H-miRNA_8x15Kv2 | IlluminaGA_DNASeq_curated | |
MDA_RPPA_Core | IlluminaHiSeq_TotalRNASeqV2 | |
HT_HG-U133A | IlluminaHiSeq_DNASeq_automated | |
diagnostic_images | microsat_i | |
IlluminaHiSeq_RNASeq | SOLiD_DNASeq_curated | |
IlluminaHiSeq_DNASeqC | Mixed_DNASeq_curated | |
IlluminaGA_RNASeq | IlluminaGA_DNASeq_Cont_automated | |
IlluminaGA_DNASeq | IlluminaHiSeq_WGBS | |
pathology_reports | IlluminaHiSeq_DNASeq_Cont_automated | |
Genome_Wide_SNP_6 | bio | |
tissue_images | Mixed_DNASeq_automated | |
HumanMethylation27 | Mixed_DNASeq_Cont_curated | |
IlluminaHiSeq_RNASeqV2 | Mixed_DNASeq_Cont |
8 file.type
这个参数不用设置
9 barcode
A list of barcodes to filter the files to download,可以指定要下载的样品,例如:
barcode =c"TCGA-14-0736-02A-01R-2005-01""TCGA-06-0211-02A-02R-2005-01"
10 data.format
可以设置的选项为不同格式的文件: ("VCF", "TXT", "BAM","SVS","BCR XML","BCR SSF XML", "TSV", "BCR Auxiliary XML", "BCR OMF XML", "BCR Biotab", "MAF", "BCR PPS XML", "XLSX"),通常情况下不用设置,默认就行;
11 experimental.strategy
用于过滤不同的实验方法得到的数据:
Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array.
Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, WXS,CGH array, VALIDATION, Gene expression array,WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq
12 sample.type
对样本的类型进行过滤,例如,原发癌组织,复发癌等等;
学习完成了所有的参数,这里也有举例使用:
query <- GDCquery(project = "TCGA-ACC", data.category = "Copy Number Variation", data.type = "Copy Number Segment") ## Not run: query <- GDCquery(project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "miRNA Expression Quantification", workflow.type = "BCGSC miRNA Profiling", barcode = c("TARGET-20-PARUDL-03A-01R","TARGET-20-PASRRB-03A-01R")) query <- GDCquery(project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts", barcode = c("TARGET-20-PADZCG-04A-01R","TARGET-20-PARJCR-09A-01R")) query <- GDCquery(project = "TCGA-ACC", data.category = "Copy Number Variation", data.type = "Masked Copy Number Segment", sample.type = c("Primary solid Tumor")) query.met <- GDCquery(project = c("TCGA-GBM","TCGA-LGG"), legacy = TRUE, data.category = "DNA methylation", platform = "Illumina Human Methylation 450") query <- GDCquery(project = "TCGA-ACC", data.category = "Copy number variation", legacy = TRUE, file.type = "hg19.seg", barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01"))
上面的GDCquery()命令完成之后我们就可以用GDCdownload()函数下载数据了,如果数据很多,如果中间中断可以重复运行GDCdownload()函数继续下载,直到所有的数据下载完成,使用举例如下:
query <-GDCquery(project = "TCGA-GBM", data.category = "Gene expression", data.type = "Gene expression quantification", platform = "Illumina HiSeq", file.type = "normalized_results", experimental.strategy = "RNA-Seq", barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"), legacy = TRUE)GDCdownload(query, method = "client", files.per.chunk = 10, directory="D:/data")
具体参数说明如下,主要设置的参数:
method如果设置为client 需要将gdc-client软件所在的路径添加到环境变量中,参考:gdc-client下载TCGA数据;
query,为GDCquery查询的结果,
files.per.chunk = 10,设置同时下载的数量,如果网速慢建议设置的小一些,
directory="D:/data" 数据存储的路径;
GDCprepare可以自动的帮我们获得基因表达数据:
data <- GDCprepare(query = query, save = TRUE, directory = "D:/data", #注意和GDCdownload设置的路径一致GDCprepare才可以找到下载的数据然后去处理。 save.filename = "GBM.RData") #存储一下,方便下载直接读取
获得了data数据之后,就可以往下数据挖掘了
以上就是“R语言 TCGAbiolinks包的参数有哪些”这篇文章的所有内容,感谢各位的阅读!相信大家阅读完这篇文章都有很大的收获,小编每天都会为大家更新不同的知识,如果还想学习更多的知识,请关注亿速云行业资讯频道。
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。