This vignette provides a task-oriented description of BioDataome package, examples of user interaction with BioDataome and some examples of analysing BioDataome datasets. BioDataome is a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology. The processed data can be accessed via the BioDataome website in .csv format and the BioDataome package via github. BioDataome package contains all the functions used to download, preprocess and annotate gene expression and methylation microarray data from Gene Expression Omnibus, as well as RNASeq data from recount.
BioDataome 0.0.0.9000
BioDataome package is on GitHub. To install an R package from GitHub you first need to install the devtools package.
and then load devtools package.
library(devtools)
To install BioDataome package, type:
install_github("mensxmachina/BioDataome")
To load BioDataome package, type:
library(BioDataome)
BioDataome website and BioDataome package currently hold datasets from five different microarray technologies: four gene expression (GPL570, GPL96, GPL6244 and GPL1261) and the GPL13534 Human Methylation BeadChip from GEO and the GPL11154 high-throughput sequencing technology from recount. Therefore, all functions that take as an input argument a technology id, work only with one of the above technology ids.
In BioDataome users may select certain criteria from the drop-down lists and form a complex query. For example, users may be interested in downloading all parkinson’s disease datasets of the GPL570 platform. When users select parkinson’s disease from the disease drop-down menu and GPL570 from the technology drop-down menu, BioDataome datasets are filtered out to match these criteria and a list with all the metadata related to the results of their query is formed. Subsequently, users may click on the download button to download the list, that includes a download link for each dataset.
#Download the filtered list matching specific criteria
#If you downloaded the metadata csv file in your currect directory you can load it with the command:
#metadata<-read.csv(file=file.path(getwd(),"metadata.csv"), sep=",", header=TRUE, stringsAsFactors = F)
#If you have installed and loaded BioDataome package, and example metadata file can be loaded as:
metadata<-BioDataome::metadata
#Let's filter out further the results by selecting datasets with sample size greater or equal to 25 and those with no common samples with any other dataset.
subsetDsets<-metadata[which( (metadata$samples>=25) & (metadata$duplicates==" ") ),]
dataList<-lapply(subsetDsets$file,read.csv)
#first five rows and columns of the data
dataList[[1]][1:5,1:5]
#> samples X1007_s_at X1053_at X117_at X121_at
#> 1 GSM1192691 2.3591 0.4203 0.0916 0.7031
#> 2 GSM1192692 2.6960 0.2670 0.0850 0.7351
#> 3 GSM1192693 2.3694 0.4736 0.1740 0.7155
#> 4 GSM1192694 2.6472 0.3323 0.0681 0.6548
#> 5 GSM1192695 2.1304 0.5331 0.0724 0.6802
In BioDataome we have downloaded, preprocessed and annotated several thousands of omics data. All raw data have been retreived from Gene Expression Omnibus and recount. Here we provide a workflow of all the steps we followed to download, preprocess and annotate data for BioDataome either from GEO or from recount.
In BioDataome package curateGSE
function runs all necessary steps to download and preprocess a dataset from GEO. It returns a list, the first element of which is the sample phenotype metadata and the second the preprocessed data. curateGSE
function is build upon several other BioDataome functions. First, it calls GSEmetadata
for creating the sample phenotype metadata, including the output of the controlSamples
, which discovers which samples are possibly used as controls, and then calls downloadRaw
to download all raw data. Subsequently, depending on the user specified techology (whether the user designated one of the gene expression technologies curated by BioDataome or the methylation technology) curateGSE
calls preprocessGEO
or preprocessGEOMethylation
respectively, to create a numeric matrix of the preprocessed data. Keep in mind that preprocessing CEL files takes a long time, even for small datasets.
#curate dataset GSE10026 measured with technology GPL570.
GSE10026<-curateGSE("GSE10026","GPL570",getwd())
#first five rows and columns of the metadata
GSE10026[[1]][1:5,1:5]
#first five rows and columns of the preprocessed data
GSE10026[[2]][1:5,1:5]
In BioDataome package curateRecountRNASeq
function runs all necessary steps to download and preprocess a dataset from recount. It returns a list, the first element of which is the sample phenotype metadata and the second the preprocessed data.
#curate dataset SRP032775.
SRP032775<-curateRecountRNASeq("SRP032775",getwd())
#first five rows and columns of the metadata
SRP032775[[1]][1:5,1:5]
#> samples class title geo_accession status
#> SRR1026870 GSM1260479 unknown 005a GSM1260479 Public on Nov 08 2013
#> SRR1026871 GSM1260479 unknown 005a GSM1260479 Public on Nov 08 2013
#> SRR1026872 GSM1260480 unknown 005b GSM1260480 Public on Nov 08 2013
#> SRR1026873 GSM1260481 unknown 018a GSM1260481 Public on Nov 08 2013
#> SRR1026874 GSM1260482 unknown 018b GSM1260482 Public on Nov 08 2013
#first five rows and columns of the preprocessed data
SRP032775[[2]][1:5,1:5]
#> SRR1026870 SRR1026871 SRR1026872 SRR1026873 SRR1026874
#> ENSG00000000003.14 6.268570 5.945202 5.807542 6.284326 5.185733
#> ENSG00000000005.5 5.185733 5.185733 5.185733 5.185733 5.185733
#> ENSG00000000419.12 8.435561 8.424106 8.265254 8.707014 5.185733
#> ENSG00000000457.13 9.578777 9.746133 9.496131 9.634553 5.185733
#> ENSG00000000460.16 8.647349 8.643005 8.248831 8.801825 5.185733
In RNASeq data we often end up with variables of zero variance across samples, as a result of zero counts in all samples. In most types of analysis, especially differential gene expression, one approach is to exclude genes with zero counts in all the replicates and conditions from the analysis, considering that they are obviously not expressed, (http://homer.salk.edu/homer/basicTutorial/rnaseqR.html). Also, for downstream analysis, it is typical to filter genes with a total read count smaller than a given threshold in any of the experimental conditions. In either way, non-zero variance of gene counts across samples is guaranteed. In BioDataome however, we are preparing datasets for further downstream and meta-analysis, thus a common variable set must be ensured. We followed the same preprocessing pipeline on all downloaded datasets and we show here all the steps that we followed, as well as their effect on the data. To visualize this effect, we followed the steps proposed in: (http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html)
Recount provides the coverage counts instead of typical read count matrices. The coverage counts for each disjoint exon are the sum of the base-pair coverage. The gene coverage count is the sum of the disjoint exons coverage counts. The recount function scale_counts()
computes the scaled read counts for a target library size of 40 million reads and we use this in our preprocessing pipeline, as proposed in recount quick start guide. We also estimate size factors to account for differences in sequencing depth and use the varianceStabilizingTransformation
to account for heteroscedasticity, as indicated in RNA-seq gene-level analysis.
#download dataset SRP050971
downloadRecount("SRP050971")
#> [1] "http://duffel.rail.bio/recount/SRP050971/rse_gene.Rdata"
load(file.path("SRP050971", 'rse_gene.Rdata'))
# Scale counts by taking into account the total coverage per sample
rse <- recount::scale_counts(rse_gene)
#access count matrix
data<-SummarizedExperiment::assay(rse)
#access phenotype information
pheno<-SummarizedExperiment::colData(rse)
#construct a DESeqDataSet object for further analysis
dds<-DESeq2::DESeqDataSetFromMatrix(data,pheno, ~ 1)
#Estimate the size factors for the count data
dds <- DESeq2::estimateSizeFactors(dds)
To assess the scaling effect, we plot the densities of counts for the different samples. We expect overlapping densities since most of the genes are affected by the experimental conditions.
source("https://bioconductor.org/biocLite.R")
if (!require("geneplotter")) biocLite("geneplotter")
library('geneplotter')
multidensity(SummarizedExperiment::assay(dds),
xlab="mean counts", xlim=c(0, 1000),
ylab="Density", legend=F, main="")
#account for heteroscedasticity
vsd <- DESeq2::varianceStabilizingTransformation(dds, blind=FALSE)
dataNorm<-SummarizedExperiment::assay(vsd)
#show the effect of the variance stabilizing transformation, by plotting the first sample against the second
lims <- c(-2, 20)
plot(SummarizedExperiment::assay(dds)[,1:2],
pch=16, main="before VST")
lims <- c(-2, 20)
plot(dataNorm[,1:2],
pch=16, main="after VST")
To assess the overall similarity between samples we plot a heatmap of sample-to-sample distances using the transformed values.
#prerequisites for plotting colouring a heatmap
if (!require("RColorBrewer")) install.packages("RColorBrewer")
library(RColorBrewer)
if (!require("pheatmap")) install.packages("pheatmap")
library(pheatmap)
distsRL <- dist(t(dataNorm))
mat <- as.matrix(distsRL)
condition<-SummarizedExperiment::colData(vsd)$title
condition<-strsplit(condition,"_")
condition<-sapply(condition,"[[",2)
rownames(mat) <- condition
colnames(mat) <- SummarizedExperiment::colData(vsd)$sample
hmcol <- colorRampPalette(brewer.pal(9, "Blues"))(255)
pheatmap(mat,fontsize_row=8)
To annotate datasets hosted in BioDataome, we programmatically retrieve text-mined results from PubTator through RESTful API. PubTator supports PubMed ID queries. When a datasets is not accompanied by its PubMed ID, we use GSEtoDiseaseGEO
, function that is based on GEO queries consisting of dataset accession ID and all disease terms from the disease ontology (D-O).
In most cases, either PubTator or GSEtoDiseaseGEO
return a collection of disease terms for each dataset. In this case, we first map all disease terms to the disease ontology first children nodes, i.e. bacterial infectious disease, immune system disease, cancer, etc, and keep disease terms that belong to the most common node(s).
To assign Disease-Ontology terms to a GEO study we call GSEtoDisease
which implements the rationale described above.
#Assign Disease-Ontology terms to study "GSE10245"
diseases<-GSEtoDisease("GSE10006")
diseases
#> [1] "chronic obstructive pulmonary disease"
For the RNASeq datasets from the recount, we first map the accession ids found in recount to GEO accession ids.
#Assign Disease-Ontology terms to study "SRP032775"
gse<-recountIDtoGSE("SRP032775")
diseases<-GSEtoDisease(gse)
diseases
#> [1] "malaria"
One of the most common analysis of microarray data is differential gene expression analysis. Therefore, we show here an example of how an R user can exploit BioDataome to accelerate the analysis of a single dataset. Since BioDataome hosts already processed data, analysts can avoid all the time-consuming steps of downloading and preprocessing data and focus on their downlstream analysis. In this example we follow microarray analysis steps from the Biomedical Data Science course of Rafael Irizarry and Michael Love.
#download GSE8671 preprocessed dataset from BioDataome in .rda
d<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE8671.Rda")))
#download GSE8671 preprocessed dataset from BioDataome in .rda
pheno<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE8671_Annot.Rda")))
#dimensions of dataset (rows: probes, columns:samples)
dim(d)
#> [1] 54675 64
#install and load limma package
source("https://bioconductor.org/biocLite.R")
if (!require("limma")) biocLite("limma")
library('limma')
#Fit a linear model for each gene in the expression data given the design matrix specified in column class of the phenotype data
fit <- lmFit(d, design=model.matrix(~ pheno$class))
#calculate the differential expression by empirical Bayes shrinkage
fit <- eBayes(fit)
#A table with the top 10 most statistically significant differentially expressed genes between the groups sorted by adjusted p-value:
tt <- topTable(fit, coef=2)
#create a heatmap of those top 10 highly significant genes.
heatmap(d[match(rownames(tt),row.names(d)),], labCol = FALSE)
#install and load necessary packages
if (!require("hgu133plus2.db")) biocLite("hgu133plus2.db")
if (!require("annotate")) biocLite("annotate")
library("annotate")
library('hgu133plus2.db')
Symbol <- getSYMBOL(row.names(d), "hgu133plus2.db")
#Find gene symbols for the top 10 differentially expressed genes
top10genes<-Symbol[match(rownames(tt),row.names(d))]
top10genes
#> 212942_s_at 204719_at 222696_at 206784_at 206134_at 207502_at
#> "CEMIP" "ABCA8" "AXIN2" "AQP8" "ADAMDEC1" "GUCA2B"
#> 204697_s_at 225622_at 208399_s_at 218412_s_at
#> "CHGA" "PAG1" "EDN3" "GTF2IRD1"
To infer knowledge about the differentially expressed genes we perform enrichment analysis, namely we compare them to annotated gene sets representing prior biological knowledge. We use Enrichr and the respective R package to check whether a differentially expressed input set of genes significantly overlaps with annotated gene sets. We choose Gene Ontology terms and Reactome and KEGG biological pathways as annotated gene set libraries for this example.
#Perform enrichment analysis for the top 10 differentially expressed genes
#install and download enrichR package
if (!require("enrichR")) install.packages("enrichR")
library('enrichR')
#select annotated gene set libraries
dbs <- c("GO_Molecular_Function_2017", "GO_Cellular_Component_2017", "GO_Biological_Process_2017",
"Reactome_2016","KEGG_2016")
#Perform enrichment analysis
enriched <- enrichr(top10genes, dbs)
#> Uploading data to Enrichr... Done.
#> Querying GO_Molecular_Function_2017... Done.
#> Querying GO_Cellular_Component_2017... Done.
#> Querying GO_Biological_Process_2017... Done.
#> Querying Reactome_2016... Done.
#> Querying KEGG_2016... Done.
#> Parsing results... Done.
#sort Gene Ontology molecular function terms by adjusted p-values
GOMF<-enriched[[1]]$Term[order(enriched[[1]]$Adjusted.P.value)]
#show top-5
head(GOMF, n=5)
#> [1] "xenobiotic-transporting ATPase activity (GO:0008559)"
#> [2] "clathrin heavy chain binding (GO:0032050)"
#> [3] "hyalurononglucosaminidase activity (GO:0004415)"
#> [4] "transmembrane receptor protein tyrosine kinase adaptor activity (GO:0005068)"
#> [5] "water channel activity (GO:0015250)"
#sort genes by the number of Gene Ontology molecular function terms
GeneMembershipGOMF<-sort(table(enriched[[1]]$Genes),decreasing = T)
GeneMembershipGOMF
#>
#> GTF2IRD1 CEMIP PAG1 ABCA8 ADAMDEC1 AXIN2 AQP8 EDN3
#> 4 3 3 2 2 2 1 1
#sort Gene Ontology cellular component terms by adjusted p-values
GOCC<-enriched[[2]]$Term[order(enriched[[2]]$Adjusted.P.value)]
#show top-5
head(GOCC, n=5)
#> [1] "clathrin-coated endocytic vesicle (GO:0045334)"
#> [2] "beta-catenin destruction complex (GO:0030877)"
#> [3] "secretory granule (GO:0030141)"
#> [4] "membrane raft (GO:0045121)"
#> [5] "integral component of plasma membrane (GO:0005887)"
#sort genes by the number of Gene Ontology cellular component terms
GeneMembershipCC<-sort(table(enriched[[2]]$Genes),decreasing = T)
GeneMembershipCC
#>
#> AXIN2 CEMIP CHGA ABCA8 AQP8;PAG1 PAG1
#> 2 2 2 1 1 1
#sort Gene Ontology biological process terms by adjusted p-values
GOBC<-enriched[[3]]$Term[order(enriched[[3]]$Adjusted.P.value)]
#show top-5
head(GOBC, n=5)
#> [1] "multicellular organism development (GO:0007275)"
#> [2] "inositol phosphate-mediated signaling (GO:0048016)"
#> [3] "vasoconstriction (GO:0042310)"
#> [4] "positive regulation of cAMP metabolic process (GO:0030816)"
#> [5] "regulation of the force of heart contraction (GO:0002026)"
#sort genes by the number of Gene Ontology biological process terms
GeneMembershipBC<-sort(table(enriched[[2]]$Genes),decreasing = T)
GeneMembershipBC
#>
#> AXIN2 CEMIP CHGA ABCA8 AQP8;PAG1 PAG1
#> 2 2 2 1 1 1
#sort Reactome pathway terms by adjusted p-values
Reactome<-enriched[[4]]$Term[order(enriched[[4]]$Adjusted.P.value)]
#show top-5
head(Reactome, n=5)
#> [1] "Hyaluronan biosynthesis and export_Homo sapiens_R-HSA-2142850"
#> [2] "Repression of WNT target genes_Homo sapiens_R-HSA-4641265"
#> [3] "Hyaluronan metabolism_Homo sapiens_R-HSA-2142845"
#> [4] "Binding of TCF/LEF:CTNNB1 to target gene promoters_Homo sapiens_R-HSA-4411364"
#> [5] "Passive transport by Aquaporins_Homo sapiens_R-HSA-432047"
#sort genes by the number of Reactome pathway terms
GeneMembershipReactome<-sort(table(enriched[[4]]$Genes),decreasing = T)
GeneMembershipReactome
#>
#> AXIN2 EDN3 PAG1 CEMIP
#> 9 7 6 5
#> AQP8 ABCA8 AQP8;ABCA8 EDN3;AXIN2;PAG1
#> 4 1 1 1
#sort KEGG pathway terms by adjusted p-values
KEGG<-enriched[[5]]$Term[order(enriched[[5]]$Adjusted.P.value)]
#show top-5
head(KEGG,n=5)
#> [1] "Endometrial cancer_Homo sapiens_hsa05213"
#> [2] "Basal transcription factors_Homo sapiens_hsa03022"
#> [3] "ABC transporters_Homo sapiens_hsa02010"
#> [4] "Colorectal cancer_Homo sapiens_hsa05210"
#> [5] "Basal cell carcinoma_Homo sapiens_hsa05217"
#sort genes by the number of GKEGG pathway terms
GeneMembershipKEGG<-sort(table(enriched[[5]]$Genes),decreasing = T)
GeneMembershipKEGG
#>
#> AXIN2 GTF2IRD1 ABCA8 AQP8
#> 7 3 1 1