Unsupervised endophenotyping with Grand Forest

Simon J. Larsen

2018-06-13

This vignette demonstrates how to perform an unsupervised analysis with Grand Forest to search for de novo endophenotypes and gene modules in non-small cell lung cancer.

Network preparation

First we need to prepare the gene-gene interaction network. We can obtain a protein-protein interaction network from BioGRID using the simpIntLists package from Bioconductor.

library(data.table)
library(tidyverse)
library(simpIntLists)

data("HumanBioGRIDInteractionEntrezId")

# convert edge lists into two-column data frame
edges <- lapply(HumanBioGRIDInteractionEntrezId, function(x) {
  data.frame(
    source = as.character(x$name),
    target = as.character(x$interactors),
    stringsAsFactors = FALSE
  )
})
edges <- rbindlist(edges)

head(edges)
#>    source target
#> 1:   6416   2318
#> 2:   6416 192176
#> 3:   6416   9043
#> 4:   6416   5599
#> 5:   6416   5871
#> 6:   6416   5609

The resulting data frame should contain two columns with gene identifiers as character strings. In this example we will be using Entrez Gene IDs, but any other identifier can be used, provided the same type is used in the experimental data as well.

Expression data preparation

Next we download a gene expression data set from non-small cell lung cancer patients. The dataset was extracted from GSE30219. The first two columns contain survival information for each patient. Since we will be doing unsupervised analysis, we will simply remove these columns. The resulting data frame contains a column for each gene with gene expression values for each patient.

# Download lung cancer expression data set with survival times.
D <- read_csv("https://grandforest.compbio.sdu.dk/files/survival_example.csv.gz")

# remove survival information and keep for latter
survival <- D[,1:2]
D <- D[,-(1:2)]

print(D[1:8,1:8])
#> # A tibble: 8 x 8
#>   `780` `100616237` `5982` `3310` `7849` `2978` `7318` `100847079`
#>   <dbl>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>       <dbl>
#> 1 10.2        10.2    7.16   6.73   5.40   4.33   8.02        8.02
#> 2  8.66        8.66   8.27   6.27   5.15   5.28   7.00        7.00
#> 3 10.3        10.3    8.68   5.77   4.97   7.79   7.18        7.18
#> 4 10.9        10.9    8.05   6.32   5.02   4.57   8.90        8.90
#> 5 10.8        10.8    8.33   6.22   5.48   4.52   6.42        6.42
#> 6 10.6        10.6    7.64   7.21   4.95   4.14   8.12        8.12
#> 7  9.85        9.85   7.27   6.22   5.12   4.46   7.46        7.46
#> 8 10.2        10.2    8.77   6.56   5.02   4.77   6.64        6.64

Training the Grand Forest model

We can train an unsupervised model using the grandforest_unsupervised helper function. The function will generate a background distribution from the input data and train a Grand Forest model to recognize the foreground from the background.

library(grandforest)

model <- grandforest_unsupervised(
  data = D,
  graph_data = edges,
  num.trees = 1000
)

In this example we only train 1000 decision trees but in a real analysis we recommend using at least 10000 trees for optimal results.

Once we have trained a model we can obtain gene importance estimates using the importance-method. We can use this to obtain a table of the 25 most important genes:

library(org.Hs.eg.db) # for mapping Entrez IDs to gene names

top25 <- importance(model) %>%
  sort(decreasing=TRUE) %>%
  head(25) %>%
  as_data_frame %>%
  rownames_to_column(var="gene") %>%
  mutate(label=mapIds(org.Hs.eg.db, gene, "SYMBOL", "ENTREZID"))

print(top25)
#> # A tibble: 25 x 3
#>    gene  value label
#>    <chr> <dbl> <chr>
#>  1 7316  1.33  UBC  
#>  2 7157  1.02  TP53 
#>  3 672   0.883 BRCA1
#>  4 1956  0.675 EGFR 
#>  5 3065  0.647 HDAC1
#>  6 367   0.642 AR   
#>  7 2885  0.556 GRB2 
#>  8 6667  0.477 SP1  
#>  9 6714  0.473 SRC  
#> 10 2099  0.471 ESR1 
#> # ... with 15 more rows
ggplot(top25, aes(reorder(label, -value), value)) +
  geom_bar(stat="identity") +
  theme_classic() + theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(x="gene", y="importance")

Extract and visualize gene module

We can also visualize the gene module as a network. Here we extract the subnetwork induced by the 25 genes and visualize the network using geomnet.

library(geomnet)

subnetwork <- filter(edges, source %in% top25$gene & target %in% top25$gene)

net.df <- fortify(as.edgedf(subnetwork), top25)

ggplot(net.df, aes(from_id=from_id, to_id=to_id)) +
  geom_net(aes(colour=importance, label=label),
    layout.alg = "circle", directed=FALSE,
    colour = "lightblue", size = 15,
    labelon = TRUE, labelcolour="black", vjust = 0.5, fontsize=3
  ) +
  theme_net()

Clustering patients

We can stratify the patients into new endophenotypes by clustering them using the genes in the gene module we obtained with Grand Forest.

We first extract the 25 most important genes and scale/center the expression values. Then we cluster the patients into two groups using k-means clustering.

# Extract module features and scale/center values.
D.scaled <- scale(D[,top25$gene], center=TRUE, scale=TRUE)
colnames(D.scaled) <- top25$label

# Cluster into two groups using k-means
cl <- kmeans(D.scaled, centers=2, nstart=20)
print(cl$cluster)
#>   [1] 2 2 1 1 1 2 2 2 1 1 1 2 2 2 2 2 2 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1 1 2 2
#>  [36] 2 2 2 2 2 2 2 1 1 2 2 1 1 2 1 1 1 2 2 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1
#>  [71] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 2 1 1 2 2 1 2 2 2 2 1 1 1 2
#> [106] 1 1 2 2 1 2 1 2 2 2 1 2 1 2 2 1 2 1 2 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
#> [141] 2 1 1 1 2 1 1 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
#> [176] 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [211] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [246] 2 2 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 2 1 2

We can visualize the clustering as a heatmap to check which genes appear to be up- and downregulated in each group.

library(ComplexHeatmap)
Heatmap(D.scaled, split = cl$cluster, name = "expression")

We can also use the survival information we removed earlier to compare the survival curves for the two endophenotypes we identified. Here we observe that the stratification we identified is highly significantly associated to overall survival.

library(survival)
library(survminer)
cl.survival <- data.frame(survival, cluster=cl$cluster)
ggsurvplot(survfit(Surv(os_time, os_event)~cluster, data=cl.survival), pval=TRUE)$plot

Session info

#> R version 3.4.4 (2018-03-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.4 LTS
#> 
#> Matrix products: default
#> BLAS: /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#>  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
#>  [8] datasets  methods   base     
#> 
#> other attached packages:
#>  [1] survminer_0.4.2       ggpubr_0.1.6          magrittr_1.5         
#>  [4] survival_2.42-3       ComplexHeatmap_1.17.1 geomnet_0.2.0        
#>  [7] bindrcpp_0.2.2        org.Hs.eg.db_3.5.0    AnnotationDbi_1.40.0 
#> [10] IRanges_2.12.0        S4Vectors_0.16.0      Biobase_2.38.0       
#> [13] BiocGenerics_0.24.0   grandforest_0.1       simpIntLists_1.14.0  
#> [16] forcats_0.3.0         stringr_1.3.1         dplyr_0.7.5          
#> [19] purrr_0.2.4           readr_1.1.1           tidyr_0.8.1          
#> [22] tibble_1.4.2          ggplot2_2.2.1         tidyverse_1.2.1      
#> [25] data.table_1.11.4    
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-137         cmprsk_2.2-7         lubridate_1.7.4     
#>  [4] bit64_0.9-7          RColorBrewer_1.1-2   httr_1.3.1          
#>  [7] rprojroot_1.3-2      tools_3.4.4          backports_1.1.2     
#> [10] utf8_1.1.3           R6_2.2.2             DBI_1.0.0           
#> [13] lazyeval_0.2.1       colorspace_1.3-2     GetoptLong_0.1.6    
#> [16] tidyselect_0.2.4     gridExtra_2.3        mnormt_1.5-5        
#> [19] bit_1.1-13           curl_3.2             compiler_3.4.4      
#> [22] cli_1.0.0            rvest_0.3.2          xml2_1.2.0          
#> [25] network_1.13.0.1     plotly_4.7.1         labeling_0.3        
#> [28] scales_0.5.0         survMisc_0.5.4       psych_1.8.4         
#> [31] digest_0.6.15        foreign_0.8-70       rmarkdown_1.9       
#> [34] pkgconfig_2.0.1      htmltools_0.3.6      htmlwidgets_1.2     
#> [37] rlang_0.2.0          GlobalOptions_0.0.13 readxl_1.1.0        
#> [40] rstudioapi_0.7       RSQLite_2.1.1        shape_1.4.4         
#> [43] bindr_0.1.1          zoo_1.8-1            jsonlite_1.5        
#> [46] statnet.common_4.0.0 Matrix_1.2-14        Rcpp_0.12.17        
#> [49] munsell_0.4.3        stringi_1.2.2        yaml_2.1.19         
#> [52] plyr_1.8.4           blob_1.1.1           crayon_1.3.4        
#> [55] lattice_0.20-35      haven_1.1.1          splines_3.4.4       
#> [58] circlize_0.4.3       hms_0.4.2            sna_2.4             
#> [61] knitr_1.20           pillar_1.2.2         rjson_0.2.19        
#> [64] reshape2_1.4.3       glue_1.2.0           evaluate_0.10.1     
#> [67] modelr_0.1.2         cellranger_1.1.0     gtable_0.2.0        
#> [70] km.ci_0.5-2          assertthat_0.2.0     xtable_1.8-2        
#> [73] broom_0.4.4          viridisLite_0.3.0    KMsurv_0.1-5        
#> [76] memoise_1.1.0