Supervised gene module discovery with Grand Forest

This vignette demonstrates how to use Grand Forest to find a gene module associated with survival in non-small cell lung cancer.

Network preparation

First we need to prepare the gene-gene interaction network. We can obtain a protein-protein interaction network from BioGRID using the simpIntLists package from Bioconductor.

library(data.table)
library(tidyverse)
library(simpIntLists)

data("HumanBioGRIDInteractionEntrezId")

# convert edge lists into two-column data frame
edges <- lapply(HumanBioGRIDInteractionEntrezId, function(x) {
  data.frame(
    source = as.character(x$name),
    target = as.character(x$interactors),
    stringsAsFactors = FALSE
  )
})
edges <- rbindlist(edges)

head(edges)
#>    source target
#> 1:   6416   2318
#> 2:   6416 192176
#> 3:   6416   9043
#> 4:   6416   5599
#> 5:   6416   5871
#> 6:   6416   5609

The resulting data frame should contain two columns with gene identifiers as character strings. In this example we will be using Entrez Gene IDs, but any other identifier can be used, provided the same type is used in the experimental data as well.

Expression data preparation

Next we download a gene expression data set from non-small cell lung cancer patients. The dataset was extracted from GSE30219. Besides gene expression values, we also obtain survival information for each patient. The os_time-column contains the follow-up time for each patient in months, and the os_event indicates whether an event has happened (0 = no event, 1 = death). Each remaining column corresponds to a gene.

# Download lung cancer expression data set with survival times.
D <- read_csv("https://grandforest.compbio.sdu.dk/files/survival_example.csv.gz")

print(D[1:8,1:8])
#> # A tibble: 8 x 8
#>   os_time os_event `780` `100616237` `5982` `3310` `7849` `2978`
#>     <int>    <int> <dbl>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1     221        0 10.2        10.2    7.16   6.73   5.40   4.33
#> 2      11        1  8.66        8.66   8.27   6.27   5.15   5.28
#> 3      57        1 10.3        10.3    8.68   5.77   4.97   7.79
#> 4      43        1 10.9        10.9    8.05   6.32   5.02   4.57
#> 5      95        1 10.8        10.8    8.33   6.22   5.48   4.52
#> 6      34        1 10.6        10.6    7.64   7.21   4.95   4.14
#> 7      12        0  9.85        9.85   7.27   6.22   5.12   4.46
#> 8      33        1 10.2        10.2    8.77   6.56   5.02   4.77

Training the Grand Forest model

A Grand Forest model is trained on the expression data using the grandforest function. We supply the expression data and network with the data and graph_data arguments. We also indicate which columns are reponse variables by setting the dependent.variable.name and status.variable.name arguments.

library(grandforest)

model <- grandforest(
  data = D,
  graph_data = edges,
  dependent.variable.name = "os_time",
  status.variable.name = "os_event",
  num.trees=500 # 10000 trees recommended for real analysis
)

print(model)
#> Grand Forest result
#> 
#> Call:
#>  grandforest(data = D, graph_data = edges, dependent.variable.name = "os_time",      status.variable.name = "os_event", num.trees = 500) 
#> 
#> Type:                             Survival 
#> Number of trees:                  500 
#> Sample size:                      268 
#> Number of independent variables:  22600 
#> Mtry:                             151 
#> Target node size:                 3 
#> Variable importance mode:         impurity 
#> Splitrule:                        logrank 
#> Number of unique death times:     137 
#> OOB prediction error:             0.3645463

In this example we only train 500 decision trees but in a real analysis we recommend using at least 10000 trees for optimal results.

Once we have trained a model we can obtain gene importance estimates using the importance-method. We can use this to obtain a table of the 25 most important genes:

library(org.Hs.eg.db) # for mapping Entrez IDs to gene names

top25 <- importance(model) %>%
  sort(decreasing=TRUE) %>%
  head(25) %>%
  as_data_frame %>%
  rownames_to_column(var="gene") %>%
  mutate(label=mapIds(org.Hs.eg.db, gene, "SYMBOL", "ENTREZID"))

print(top25)
#> # A tibble: 25 x 3
#>    gene  value label 
#>    <chr> <dbl> <chr> 
#>  1 7316  2.30  UBC   
#>  2 7157  1.15  TP53  
#>  3 672   0.533 BRCA1 
#>  4 1956  0.517 EGFR  
#>  5 6714  0.482 SRC   
#>  6 3065  0.463 HDAC1 
#>  7 2885  0.442 GRB2  
#>  8 1499  0.438 CTNNB1
#>  9 5781  0.413 PTPN11
#> 10 2534  0.371 FYN   
#> # ... with 15 more rows

ggplot(top25, aes(reorder(label, -value), value)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(x="gene", y="importance")

Extract and visualize gene module

We can also visualize the gene module as a network. Here we extract the subnetwork induced by the 25 genes and visualize the network using geomnet.

library(geomnet)

subnetwork <- filter(edges, source %in% top25$gene & target %in% top25$gene)

net.df <- fortify(as.edgedf(subnetwork), top25)

ggplot(net.df, aes(from_id=from_id, to_id=to_id)) +
  geom_net(aes(colour=importance, label=label),
    layout.alg = "circle", directed=FALSE,
    colour = "lightblue", size = 15,
    labelon = TRUE, labelcolour="black", vjust = 0.5, fontsize=3
  ) +
  theme_net()

Session info

#> R version 3.4.4 (2018-03-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.4 LTS
#> 
#> Matrix products: default
#> BLAS: /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] geomnet_0.2.0        bindrcpp_0.2.2       org.Hs.eg.db_3.5.0  
#>  [4] AnnotationDbi_1.40.0 IRanges_2.12.0       S4Vectors_0.16.0    
#>  [7] Biobase_2.38.0       BiocGenerics_0.24.0  grandforest_0.1     
#> [10] simpIntLists_1.14.0  forcats_0.3.0        stringr_1.3.1       
#> [13] dplyr_0.7.5          purrr_0.2.4          readr_1.1.1         
#> [16] tidyr_0.8.1          tibble_1.4.2         ggplot2_2.2.1       
#> [19] tidyverse_1.2.1      data.table_1.11.4   
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.3.1           bit64_0.9-7          jsonlite_1.5        
#>  [4] viridisLite_0.3.0    network_1.13.0.1     modelr_0.1.2        
#>  [7] assertthat_0.2.0     blob_1.1.1           cellranger_1.1.0    
#> [10] yaml_2.1.19          pillar_1.2.2         RSQLite_2.1.1       
#> [13] backports_1.1.2      lattice_0.20-35      glue_1.2.0          
#> [16] digest_0.6.15        rvest_0.3.2          colorspace_1.3-2    
#> [19] htmltools_0.3.6      Matrix_1.2-14        plyr_1.8.4          
#> [22] psych_1.8.4          pkgconfig_2.0.1      broom_0.4.4         
#> [25] haven_1.1.1          scales_0.5.0         sna_2.4             
#> [28] lazyeval_0.2.1       cli_1.0.0            mnormt_1.5-5        
#> [31] statnet.common_4.0.0 magrittr_1.5         crayon_1.3.4        
#> [34] readxl_1.1.0         memoise_1.1.0        evaluate_0.10.1     
#> [37] nlme_3.1-137         xml2_1.2.0           foreign_0.8-70      
#> [40] tools_3.4.4          hms_0.4.2            plotly_4.7.1        
#> [43] munsell_0.4.3        compiler_3.4.4       rlang_0.2.0         
#> [46] grid_3.4.4           rstudioapi_0.7       htmlwidgets_1.2     
#> [49] labeling_0.3         rmarkdown_1.9        gtable_0.2.0        
#> [52] DBI_1.0.0            curl_3.2             reshape2_1.4.3      
#> [55] R6_2.2.2             lubridate_1.7.4      knitr_1.20          
#> [58] bit_1.1-13           utf8_1.1.3           bindr_0.1.1         
#> [61] rprojroot_1.3-2      stringi_1.2.2        Rcpp_0.12.17        
#> [64] tidyselect_0.2.4