Download, preprocess, annotate and analyze omics data sets

Kleanthi Lakiotaki1*

1Computer Science Department, Univeristy of Crete

2017-12-01

Abstract

This vignette provides a task-oriented description of BioDataome package, examples of user interaction with BioDataome and some examples of analysing BioDataome datasets. BioDataome is a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology. The processed data can be accessed via the BioDataome website in .csv format and the BioDataome package via github. BioDataome package contains all the functions used to download, preprocess and annotate gene expression and methylation microarray data from Gene Expression Omnibus, as well as RNASeq data from recount.

Package

BioDataome 0.0.0.9000

1 Getting started

BioDataome package is on GitHub. To install an R package from GitHub you first need to install the devtools package.

and then load devtools package.

library(devtools)

To install BioDataome package, type:

install_github("mensxmachina/BioDataome")

To load BioDataome package, type:

library(BioDataome)

1.1 \(\color{red}{Important \quad Note}\)

BioDataome website and BioDataome package currently hold datasets from five different microarray technologies: four gene expression (GPL570, GPL96, GPL6244 and GPL1261) and the GPL13534 Human Methylation BeadChip from GEO and the GPL11154 high-throughput sequencing technology from recount. Therefore, all functions that take as an input argument a technology id, work only with one of the above technology ids.

2 Download multiple BioDataome datasets matching a query

In BioDataome users may select certain criteria from the drop-down lists and form a complex query. For example, users may be interested in downloading all parkinson’s disease datasets of the GPL570 platform. When users select parkinson’s disease from the disease drop-down menu and GPL570 from the technology drop-down menu, BioDataome datasets are filtered out to match these criteria and a list with all the metadata related to the results of their query is formed. Subsequently, users may click on the download button to download the list, that includes a download link for each dataset.

#Download the filtered list matching specific criteria
#If you downloaded the metadata csv file in your currect directory you can load it with the command:

#metadata<-read.csv(file=file.path(getwd(),"metadata.csv"), sep=",", header=TRUE, stringsAsFactors = F)

#If you have installed and loaded BioDataome package, and example metadata file can be loaded as:
metadata<-BioDataome::metadata

#Let's filter out further the results by selecting datasets with sample size greater or equal to 25 and those with no common samples with any other dataset.
subsetDsets<-metadata[which( (metadata$samples>=25) & (metadata$duplicates==" ") ),]
dataList<-lapply(subsetDsets$file,read.csv)

#first five rows and columns of the data
dataList[[1]][1:5,1:5]
#>      samples X1007_s_at X1053_at X117_at X121_at
#> 1 GSM1192691     2.3591   0.4203  0.0916  0.7031
#> 2 GSM1192692     2.6960   0.2670  0.0850  0.7351
#> 3 GSM1192693     2.3694   0.4736  0.1740  0.7155
#> 4 GSM1192694     2.6472   0.3323  0.0681  0.6548
#> 5 GSM1192695     2.1304   0.5331  0.0724  0.6802

3 Download, preprocess, annotate and analyze omics data sets for BioDataome

In BioDataome we have downloaded, preprocessed and annotated several thousands of omics data. All raw data have been retreived from Gene Expression Omnibus and recount. Here we provide a workflow of all the steps we followed to download, preprocess and annotate data for BioDataome either from GEO or from recount.

3.1 Download, preprocess and annotate GEO datasets

In BioDataome package curateGSE function runs all necessary steps to download and preprocess a dataset from GEO. It returns a list, the first element of which is the sample phenotype metadata and the second the preprocessed data. curateGSE function is build upon several other BioDataome functions. First, it calls GSEmetadata for creating the sample phenotype metadata, including the output of the controlSamples, which discovers which samples are possibly used as controls, and then calls downloadRaw to download all raw data. Subsequently, depending on the user specified techology (whether the user designated one of the gene expression technologies curated by BioDataome or the methylation technology) curateGSE calls preprocessGEO or preprocessGEOMethylation respectively, to create a numeric matrix of the preprocessed data. Keep in mind that preprocessing CEL files takes a long time, even for small datasets.

#curate dataset GSE10026 measured with technology GPL570.

GSE10026<-curateGSE("GSE10026","GPL570",getwd())

#first five rows and columns of the metadata
GSE10026[[1]][1:5,1:5]

#first five rows and columns of the preprocessed data
GSE10026[[2]][1:5,1:5]

3.2 Download, preprocess and annotate recount RNASeq datasets

In BioDataome package curateRecountRNASeq function runs all necessary steps to download and preprocess a dataset from recount. It returns a list, the first element of which is the sample phenotype metadata and the second the preprocessed data.

#curate dataset SRP032775.

SRP032775<-curateRecountRNASeq("SRP032775",getwd())

#first five rows and columns of the metadata
SRP032775[[1]][1:5,1:5]
#>               samples   class title geo_accession                status
#> SRR1026870 GSM1260479 unknown  005a    GSM1260479 Public on Nov 08 2013
#> SRR1026871 GSM1260479 unknown  005a    GSM1260479 Public on Nov 08 2013
#> SRR1026872 GSM1260480 unknown  005b    GSM1260480 Public on Nov 08 2013
#> SRR1026873 GSM1260481 unknown  018a    GSM1260481 Public on Nov 08 2013
#> SRR1026874 GSM1260482 unknown  018b    GSM1260482 Public on Nov 08 2013

#first five rows and columns of the preprocessed data
SRP032775[[2]][1:5,1:5]
#>                    SRR1026870 SRR1026871 SRR1026872 SRR1026873 SRR1026874
#> ENSG00000000003.14   6.268570   5.945202   5.807542   6.284326   5.185733
#> ENSG00000000005.5    5.185733   5.185733   5.185733   5.185733   5.185733
#> ENSG00000000419.12   8.435561   8.424106   8.265254   8.707014   5.185733
#> ENSG00000000457.13   9.578777   9.746133   9.496131   9.634553   5.185733
#> ENSG00000000460.16   8.647349   8.643005   8.248831   8.801825   5.185733

In RNASeq data we often end up with variables of zero variance across samples, as a result of zero counts in all samples. In most types of analysis, especially differential gene expression, one approach is to exclude genes with zero counts in all the replicates and conditions from the analysis, considering that they are obviously not expressed, (http://homer.salk.edu/homer/basicTutorial/rnaseqR.html). Also, for downstream analysis, it is typical to filter genes with a total read count smaller than a given threshold in any of the experimental conditions. In either way, non-zero variance of gene counts across samples is guaranteed. In BioDataome however, we are preparing datasets for further downstream and meta-analysis, thus a common variable set must be ensured. We followed the same preprocessing pipeline on all downloaded datasets and we show here all the steps that we followed, as well as their effect on the data. To visualize this effect, we followed the steps proposed in: (http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html)

Recount provides the coverage counts instead of typical read count matrices. The coverage counts for each disjoint exon are the sum of the base-pair coverage. The gene coverage count is the sum of the disjoint exons coverage counts. The recount function scale_counts() computes the scaled read counts for a target library size of 40 million reads and we use this in our preprocessing pipeline, as proposed in recount quick start guide. We also estimate size factors to account for differences in sequencing depth and use the varianceStabilizingTransformation to account for heteroscedasticity, as indicated in RNA-seq gene-level analysis.

#download dataset SRP050971

downloadRecount("SRP050971")
#> [1] "http://duffel.rail.bio/recount/SRP050971/rse_gene.Rdata"
load(file.path("SRP050971", 'rse_gene.Rdata'))
# Scale counts by taking into account the total coverage per sample
rse <- recount::scale_counts(rse_gene)
#access count matrix
data<-SummarizedExperiment::assay(rse)
#access phenotype information
pheno<-SummarizedExperiment::colData(rse)
#construct a DESeqDataSet object for further analysis
dds<-DESeq2::DESeqDataSetFromMatrix(data,pheno, ~ 1)
#Estimate the size factors for the count data
dds <- DESeq2::estimateSizeFactors(dds)

To assess the scaling effect, we plot the densities of counts for the different samples. We expect overlapping densities since most of the genes are affected by the experimental conditions.


source("https://bioconductor.org/biocLite.R")
if (!require("geneplotter")) biocLite("geneplotter")
library('geneplotter')

multidensity(SummarizedExperiment::assay(dds),
              xlab="mean counts", xlim=c(0, 1000),
              ylab="Density", legend=F, main="")


#account for heteroscedasticity
vsd <- DESeq2::varianceStabilizingTransformation(dds, blind=FALSE)
dataNorm<-SummarizedExperiment::assay(vsd)
#show the effect of the variance stabilizing transformation, by plotting the first sample against the second
lims <- c(-2, 20)
plot(SummarizedExperiment::assay(dds)[,1:2],
     pch=16, main="before VST")


lims <- c(-2, 20)
plot(dataNorm[,1:2],
       pch=16, main="after VST")

To assess the overall similarity between samples we plot a heatmap of sample-to-sample distances using the transformed values.

#prerequisites for plotting colouring a heatmap
if (!require("RColorBrewer")) install.packages("RColorBrewer")
library(RColorBrewer)
if (!require("pheatmap")) install.packages("pheatmap")
library(pheatmap)

distsRL <- dist(t(dataNorm))
mat <- as.matrix(distsRL)
condition<-SummarizedExperiment::colData(vsd)$title
condition<-strsplit(condition,"_")
condition<-sapply(condition,"[[",2)

rownames(mat) <-  condition
colnames(mat) <-  SummarizedExperiment::colData(vsd)$sample

hmcol <- colorRampPalette(brewer.pal(9, "Blues"))(255)

pheatmap(mat,fontsize_row=8)

3.3 Find datasets that share common samples

To build BioDataome website we have also identified duplicate samples by comparing all pairwise combinations of the preprocessed datasets. For each dataset, we report all other datasets that share at least one common sample. In BioDataome package we included two functions, compareDsets and compareDsetList for a user to either compare two datasets to find if they share samples, or to compare one datasets of interest to a list of datasets.

BioDataome website provides all data and metadata files in csv format for better data exchange. However, in BioDatome package a user can also load data in .rda format. For example, compareDsets and compareDsetList both accept either .csv or .rda files. In the following examples we demonstrate compareDsets with loading .rda files and compareDsetList with .csv file.

#download two preprocessed datasets from BioDataome in .rda
d1<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE86013.Rda")))
d2<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE86015.Rda")))
# compare two preprocessed datasets
commons<-compareDsets(d1,d2)
#Number of samples that d1 and d2 share
commons
#> [1] 4

Since BioDataome datasets are large we use fread from package data.table to read .csv datasets faster. In contrast to compareDsets, compareDsetList accepts paths to data files instead of data matrices.

#the path to GSE8601 dataset in BioDataome
x<-"http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE86013.csv"

Let’s suppose we would like to compare GSE8601 to datasets GSE86015, GSE9008 and GSE9119. All datasets should have been measured with the same technology, i.e. GPL570.

#create a character vector of the paths in BioDataome
y<-c("GSE86015.csv","GSE9008.csv","GSE9119.csv")
y<-paste0("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/",y)
#find with which of the three datasets our dataset of interest shares samples
commonGSEs<-compareDsetList(x,y)
commonGSEs
#> [1] "GSE86015"

3.4 Annotate datasets with Disease-Ontology terms

To annotate datasets hosted in BioDataome, we programmatically retrieve text-mined results from PubTator through RESTful API. PubTator supports PubMed ID queries. When a datasets is not accompanied by its PubMed ID, we use GSEtoDiseaseGEO, function that is based on GEO queries consisting of dataset accession ID and all disease terms from the disease ontology (D-O).

In most cases, either PubTator or GSEtoDiseaseGEO return a collection of disease terms for each dataset. In this case, we first map all disease terms to the disease ontology first children nodes, i.e. bacterial infectious disease, immune system disease, cancer, etc, and keep disease terms that belong to the most common node(s).

To assign Disease-Ontology terms to a GEO study we call GSEtoDisease which implements the rationale described above.

#Assign Disease-Ontology terms to study "GSE10245"
diseases<-GSEtoDisease("GSE10006")
diseases
#> [1] "chronic obstructive pulmonary disease"

For the RNASeq datasets from the recount, we first map the accession ids found in recount to GEO accession ids.

#Assign Disease-Ontology terms to study "SRP032775"
gse<-recountIDtoGSE("SRP032775")
diseases<-GSEtoDisease(gse)
diseases
#> [1] "malaria"

4 Differential expression analysis of microarray data

One of the most common analysis of microarray data is differential gene expression analysis. Therefore, we show here an example of how an R user can exploit BioDataome to accelerate the analysis of a single dataset. Since BioDataome hosts already processed data, analysts can avoid all the time-consuming steps of downloading and preprocessing data and focus on their downlstream analysis. In this example we follow microarray analysis steps from the Biomedical Data Science course of Rafael Irizarry and Michael Love.

#download GSE8671 preprocessed dataset from BioDataome in .rda
d<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE8671.Rda")))

#download GSE8671 preprocessed dataset from BioDataome in .rda
pheno<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE8671_Annot.Rda")))
#dimensions of dataset (rows: probes, columns:samples)
dim(d)
#> [1] 54675    64

#install and load limma package
source("https://bioconductor.org/biocLite.R")
if (!require("limma")) biocLite("limma")
library('limma')

#Fit a linear model for each gene in the expression data given the design matrix specified in column class of the phenotype data
fit <- lmFit(d, design=model.matrix(~ pheno$class))
#calculate the differential expression by empirical Bayes shrinkage
fit <- eBayes(fit)
#A table with the top 10 most statistically significant differentially expressed genes between the groups sorted by adjusted p-value:
tt <- topTable(fit, coef=2)

#create a heatmap of those top 10 highly significant genes. 
heatmap(d[match(rownames(tt),row.names(d)),], labCol = FALSE)

5 Map probe set ids to gene symbols.

#install and load necessary packages
if (!require("hgu133plus2.db")) biocLite("hgu133plus2.db")
if (!require("annotate")) biocLite("annotate")

library("annotate")
library('hgu133plus2.db')

Symbol <- getSYMBOL(row.names(d), "hgu133plus2.db")
#Find gene symbols for the top 10 differentially expressed genes 
top10genes<-Symbol[match(rownames(tt),row.names(d))]
top10genes
#> 212942_s_at   204719_at   222696_at   206784_at   206134_at   207502_at 
#>     "CEMIP"     "ABCA8"     "AXIN2"      "AQP8"  "ADAMDEC1"    "GUCA2B" 
#> 204697_s_at   225622_at 208399_s_at 218412_s_at 
#>      "CHGA"      "PAG1"      "EDN3"  "GTF2IRD1"

6 Enrichment analysis

To infer knowledge about the differentially expressed genes we perform enrichment analysis, namely we compare them to annotated gene sets representing prior biological knowledge. We use Enrichr and the respective R package to check whether a differentially expressed input set of genes significantly overlaps with annotated gene sets. We choose Gene Ontology terms and Reactome and KEGG biological pathways as annotated gene set libraries for this example.

#Perform enrichment analysis for the top 10 differentially expressed genes 

#install and download enrichR package
if (!require("enrichR")) install.packages("enrichR")
library('enrichR')

#select annotated gene set libraries
dbs <- c("GO_Molecular_Function_2017", "GO_Cellular_Component_2017", "GO_Biological_Process_2017",
         "Reactome_2016","KEGG_2016")
#Perform enrichment analysis
enriched <- enrichr(top10genes, dbs)
#> Uploading data to Enrichr... Done.
#>   Querying GO_Molecular_Function_2017... Done.
#>   Querying GO_Cellular_Component_2017... Done.
#>   Querying GO_Biological_Process_2017... Done.
#>   Querying Reactome_2016... Done.
#>   Querying KEGG_2016... Done.
#> Parsing results... Done.

#sort Gene Ontology molecular function terms by adjusted p-values
GOMF<-enriched[[1]]$Term[order(enriched[[1]]$Adjusted.P.value)]
#show top-5 
head(GOMF, n=5)
#> [1] "xenobiotic-transporting ATPase activity (GO:0008559)"                        
#> [2] "clathrin heavy chain binding (GO:0032050)"                                   
#> [3] "hyalurononglucosaminidase activity (GO:0004415)"                             
#> [4] "transmembrane receptor protein tyrosine kinase adaptor activity (GO:0005068)"
#> [5] "water channel activity (GO:0015250)"
#sort genes by the number of Gene Ontology molecular function terms
GeneMembershipGOMF<-sort(table(enriched[[1]]$Genes),decreasing = T)
GeneMembershipGOMF
#> 
#> GTF2IRD1    CEMIP     PAG1    ABCA8 ADAMDEC1    AXIN2     AQP8     EDN3 
#>        4        3        3        2        2        2        1        1

#sort Gene Ontology cellular component terms by adjusted p-values
GOCC<-enriched[[2]]$Term[order(enriched[[2]]$Adjusted.P.value)]
#show top-5 
head(GOCC, n=5)
#> [1] "clathrin-coated endocytic vesicle (GO:0045334)"    
#> [2] "beta-catenin destruction complex (GO:0030877)"     
#> [3] "secretory granule (GO:0030141)"                    
#> [4] "membrane raft (GO:0045121)"                        
#> [5] "integral component of plasma membrane (GO:0005887)"
#sort genes by the number of Gene Ontology cellular component terms
GeneMembershipCC<-sort(table(enriched[[2]]$Genes),decreasing = T)
GeneMembershipCC
#> 
#>     AXIN2     CEMIP      CHGA     ABCA8 AQP8;PAG1      PAG1 
#>         2         2         2         1         1         1

#sort Gene Ontology biological process terms by adjusted p-values
GOBC<-enriched[[3]]$Term[order(enriched[[3]]$Adjusted.P.value)]
#show top-5 
head(GOBC, n=5)
#> [1] "multicellular organism development (GO:0007275)"           
#> [2] "inositol phosphate-mediated signaling (GO:0048016)"        
#> [3] "vasoconstriction (GO:0042310)"                             
#> [4] "positive regulation of cAMP metabolic process (GO:0030816)"
#> [5] "regulation of the force of heart contraction (GO:0002026)"
#sort genes by the number of Gene Ontology biological process terms
GeneMembershipBC<-sort(table(enriched[[2]]$Genes),decreasing = T)
GeneMembershipBC
#> 
#>     AXIN2     CEMIP      CHGA     ABCA8 AQP8;PAG1      PAG1 
#>         2         2         2         1         1         1

#sort Reactome pathway terms by adjusted p-values
Reactome<-enriched[[4]]$Term[order(enriched[[4]]$Adjusted.P.value)]
#show top-5 
head(Reactome, n=5)
#> [1] "Hyaluronan biosynthesis and export_Homo sapiens_R-HSA-2142850"                
#> [2] "Repression of WNT target genes_Homo sapiens_R-HSA-4641265"                    
#> [3] "Hyaluronan metabolism_Homo sapiens_R-HSA-2142845"                             
#> [4] "Binding of TCF/LEF:CTNNB1 to target gene promoters_Homo sapiens_R-HSA-4411364"
#> [5] "Passive transport by Aquaporins_Homo sapiens_R-HSA-432047"
#sort genes by the number of Reactome pathway terms
GeneMembershipReactome<-sort(table(enriched[[4]]$Genes),decreasing = T)
GeneMembershipReactome
#> 
#>           AXIN2            EDN3            PAG1           CEMIP 
#>               9               7               6               5 
#>            AQP8           ABCA8      AQP8;ABCA8 EDN3;AXIN2;PAG1 
#>               4               1               1               1

#sort KEGG pathway terms by adjusted p-values
KEGG<-enriched[[5]]$Term[order(enriched[[5]]$Adjusted.P.value)]
#show top-5 
head(KEGG,n=5)
#> [1] "Endometrial cancer_Homo sapiens_hsa05213"         
#> [2] "Basal transcription factors_Homo sapiens_hsa03022"
#> [3] "ABC transporters_Homo sapiens_hsa02010"           
#> [4] "Colorectal cancer_Homo sapiens_hsa05210"          
#> [5] "Basal cell carcinoma_Homo sapiens_hsa05217"
#sort genes by the number of GKEGG pathway terms
GeneMembershipKEGG<-sort(table(enriched[[5]]$Genes),decreasing = T)
GeneMembershipKEGG
#> 
#>    AXIN2 GTF2IRD1    ABCA8     AQP8 
#>        7        3        1        1