bootsPLS: bootstrap subsamplings of sPLS-DA for classification and signature identification

We illustrate the bootsPLS package using data included in the package. The data come from Mesenchymal Stem Cells (MSC) and were used in the paper:

Rohart F. et al. (2016). A Molecular Classification of Human Mesenchymal Stromal Cells. PeerJ. DOI: 10.7717/peerj.1845

library(bootsPLS)
## Loading required package: mixOmics
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.3.1
## 
## Visit http://www.mixOmics.org for more details about our methods.
## Any bug reports or comments? Notify us at mixomics at math.univ-toulouse.fr or https://bitbucket.org/klecao/package-mixomics/issues
## 
## Thank you for using mixOmics!
## 
## Loaded bootsPLS 1.1.1
set.seed(13) # for reproducibility of this illustration
data(MSC)
# we shuffle the observations (initially all MSC samples come first, followed by all Non-MSC samples)
ind.random = sample(1:nrow(MSC$X))
X = MSC$X[ind.random, ]
Y = MSC$Y[ind.random]
dim(X)
## [1] 150 200
summary(Y)
##     MSC Non-MSC 
##      50     100

bootsPLS

This function performs sPLS-DA on many subsamplings of X,Y in order to assess the stability of the selected variables, which makes it possible to build a final model using only the most stable ones. The function returns a bootsPLS object in which the variables selected over the many replications are recorded; other outputs are also available.
“bootsPLS” can be run several times, for example on different clusters. In that case the save.file argument is useful, as the outputs are saved into an Rdata file, making it easier to combine several runs later on, as sketched below.
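
As a hypothetical sketch, two batches could be run and saved separately before being combined; the filenames below are illustrative, not outputs of this vignette.

# hypothetical sketch: run two batches separately (e.g. on two machines)
# and save each output into its own Rdata file (illustrative filenames)
# bootsPLS(X=X, Y=Y, ncomp=3, many=10, showProgress=FALSE, save.file="boot.batch1.Rdata")
# bootsPLS(X=X, Y=Y, ncomp=3, many=10, showProgress=FALSE, save.file="boot.batch2.Rdata")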

The default distance used to classify samples during the Cross-Validation process is “max.dist”; “centroids.dist” and “mahalanobis.dist” are also available. It is important to be consistent with the distance used: if “max.dist” is used for “bootsPLS”, it must also be used for the “prediction” and “CI.prediction” functions.
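
As a hedged sketch, a run based on the centroid distance could look as follows; the dist argument name is an assumption on our part, so check ?bootsPLS for the exact interface.

# hypothetical sketch: use the centroid distance during Cross-Validation
# (the `dist` argument name is an assumption, see ?bootsPLS)
# boot.cent = bootsPLS(X=X, Y=Y, ncomp=3, many=5, dist="centroids.dist", showProgress=FALSE)
# ... and read the matching slot downstream, e.g. pred$predicted.learn$centroids.dist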

# we first run the function for only 1 iteration (many=1) and we show the progress of the algorithm
boot1=bootsPLS(X=X,Y=Y,ncomp=3,many=1,showProgress=TRUE, nrepeat=5)
## Warning in bootsPLS(X = X, Y = Y, ncomp = 3, many = 1, showProgress = TRUE, : Zero- or near-zero variance predictors have been discarded.
##  See $nzv for problematic predictors.
# we now run the function for 24 iterations, in parallel on 2 cpus, without showing the progress
boot2=bootsPLS(X=X,Y=Y,ncomp=3,many=24,cpus=2,showProgress=FALSE)
## Warning in bootsPLS(X = X, Y = Y, ncomp = 3, many = 24, cpus = 2, showProgress = FALSE): Zero- or near-zero variance predictors have been discarded.
##  See $nzv for problematic predictors.

compile.bootsPLS.object

This function combines a list of bootsPLS objects. The input is either a list of bootsPLS objects, a vector of filenames combined with a path, or path/pattern arguments.

boot=compile.bootsPLS.object(bootsPLS.list=list(boot1,boot2))
## object 1 
## object 2
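
If the outputs had been saved to Rdata files through save.file, they could instead be combined from disk via the path/pattern arguments; a sketch with illustrative names:

# combine saved outputs from disk instead of from memory (illustrative pattern)
# boot = compile.bootsPLS.object(path=".", pattern="boot.batch")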

plot.bootsPLS

Plot the frequency of selection for each variable. Note that different options are available, e.g. to change the shapes or colors of the points.

# plot all the variables
plot(boot)

# only plot variables that were selected more than 30% of the time on at least one component
#plot(boot,light=0.3,legend.position=FALSE)

# remove the variable names
plot(boot,light=0.3,name.var=FALSE)

component.selection

Function to decide on the “optimal” number of components.

out=component.selection(boot)
## Number of chosen components: 1
ncomp.opt=out$opt

We can plot the process, where the solid line is the p-value threshold. If, say, component 2 is above the threshold line but component 3 is below it, then a model with 2 components should be used.

plot(out)

variable.selection

Function to decide on the “optimal” number of variables per component. See the reference paper (and its supplemental information) for more details. In a nutshell, the process assesses whether adding variables significantly improves the prediction performance. Variables are added by decreasing frequency of selection; e.g. the model starts with all variables selected 100% of the time, then the frequency threshold is lowered until another variable is added (this can be a set of variables if two or more variables have the same frequency of selection).
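
As a purely hypothetical sketch of what drives this ordering, the selection frequencies recorded in the compiled object could be inspected directly; the $frequency slot name is an assumption, so check str(boot) for the actual structure.

# hypothetical: list the most frequently selected variables on component 1,
# assuming a frequency matrix with one row per component ($frequency is an assumed name)
# head(sort(boot$frequency[1, ], decreasing = TRUE))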

# we run the process for all the components in the model (without first choosing the optimal number of components)
out=variable.selection(boot)
## Component 1 under progress.
## Calculating subsamplings for the top 9 variables on component 1 
## Calculating subsamplings for the top 11 variables on component 1 
## Calculating subsamplings for the top 12 variables on component 1 
## Calculating subsamplings for the top 13 variables on component 1 
## Calculating subsamplings for the top 14 variables on component 1 
## Calculating subsamplings for the top 15 variables on component 1 
## Calculating subsamplings for the top 16 variables on component 1 
## Calculating subsamplings for the top 17 variables on component 1 
## Calculating subsamplings for the top 19 variables on component 1 
## Calculating subsamplings for the top 20 variables on component 1 
## Calculating subsamplings for the top 22 variables on component 1 
## Calculating subsamplings for the top 24 variables on component 1 
## Calculating subsamplings for the top 25 variables on component 1 
## Calculating subsamplings for the top 27 variables on component 1 
## Calculating subsamplings for the top 29 variables on component 1 
## Calculating subsamplings for the top 30 variables on component 1 
## Calculating subsamplings for the top 32 variables on component 1 
## Calculating subsamplings for the top 40 variables on component 1 
## Calculating subsamplings for the top 46 variables on component 1 
## Calculating subsamplings for the top 175 variables on component 1
## 
## Number of chosen variables on component 1: 12
## Component 2 under progress.
## Calculating subsamplings for the top 1 variables on component 2 
## Calculating subsamplings for the top 2 variables on component 2 
## Calculating subsamplings for the top 3 variables on component 2 
## Calculating subsamplings for the top 4 variables on component 2 
## Calculating subsamplings for the top 5 variables on component 2 
## Calculating subsamplings for the top 6 variables on component 2 
## Calculating subsamplings for the top 9 variables on component 2 
## Calculating subsamplings for the top 15 variables on component 2 
## Calculating subsamplings for the top 24 variables on component 2 
## Calculating subsamplings for the top 44 variables on component 2 
## Calculating subsamplings for the top 175 variables on component 2
## 
## Number of chosen variables on component 2: 1
## Component 3 under progress.
## Calculating subsamplings for the top 1 variables on component 3 
## Calculating subsamplings for the top 2 variables on component 3 
## Calculating subsamplings for the top 3 variables on component 3 
## Calculating subsamplings for the top 7 variables on component 3 
## Calculating subsamplings for the top 10 variables on component 3 
## Calculating subsamplings for the top 12 variables on component 3 
## Calculating subsamplings for the top 22 variables on component 3 
## Calculating subsamplings for the top 48 variables on component 3 
## Calculating subsamplings for the top 94 variables on component 3 
## Calculating subsamplings for the top 175 variables on component 3
## 
## Number of chosen variables on component 3: 1

The signature is

print(out$signature)
## $comp.1
##  [1] "ENSG00000125351" "ENSG00000228981" "ENSG00000184995"
##  [4] "ENSG00000198822" "ENSG00000164106" "ENSG00000165349"
##  [7] "ENSG00000133980" "ENSG00000040531" "ENSG00000005243"
## [10] "ENSG00000129354" "ENSG00000087077" "ENSG00000182814"
## 
## $comp.2
## [1] "ENSG00000087077"
## 
## $comp.3
## [1] "ENSG00000175197"

A dot above the significance line indicates a set of parameters that significantly improves the prediction performance. The last dot above the line (from left to right) gives the signature, on a per-component basis.

plot(out)

fit.model

This function can select the number of components and the number of variables (auto.tune=TRUE), and then fits the model based on these parameters.

with automatic tuning

fit=fit.model(boot,auto.tune=TRUE, showProgress=FALSE)
# plot(fit$component.selection)
# plot(fit$variable.selection)

without automatic tuning

Here we force the number of components to be 2.

fit=fit.model(boot,ncomp=2, showProgress=FALSE)
signature=fit$data$signature # we chose a model with 2 components
print(signature)
## $comp.1
##  [1] "ENSG00000125351" "ENSG00000228981" "ENSG00000184995"
##  [4] "ENSG00000198822" "ENSG00000164106" "ENSG00000165349"
##  [7] "ENSG00000133980" "ENSG00000040531" "ENSG00000005243"
## [10] "ENSG00000129354" "ENSG00000087077" "ENSG00000182814"
## 
## $comp.2
## [1] "ENSG00000087077"

We can also use the “fit.model” function on a specific signature.

fit=fit.model(boot,ncomp=2,signature=signature)

plotIndiv

We can use the plotIndiv function from the mixOmics package.

plotIndiv(fit) 

plotIndiv(fit,ind.names=FALSE) 

Prediction

We look at the predicted values for one of the three distances. Note that you should use the same distance as the one used during the “bootsPLS” call. Here we use the default “max.dist” distance.

pred=prediction(fit)
head(pred$predicted.learn$max.dist)
##              comp 1    comp 2   
## GSM568483    "Non-MSC" "Non-MSC"
## GSM698439    "MSC"     "MSC"    
## GSM378817    "Non-MSC" "Non-MSC"
## 5935420012_H "MSC"     "MSC"    
## GSM372149    "Non-MSC" "Non-MSC"
## GSM1277645   "MSC"     "MSC"
# head(pred$predicted.learn$centroids.dist)
# head(pred$predicted.learn$mahalanobis.dist)

We compare them to the real values of Y and get a confusion matrix, as well as a Balanced Error Rate (BER).

conf = mixOmics::get.confusion_matrix(truth = fit$Y, predicted = pred$predicted.learn$max.dist[,2])
conf
##         predicted.as.MSC predicted.as.Non-MSC
## MSC                   50                    0
## Non-MSC                9                   91
mixOmics::get.BER(conf)
## [1] 0.045
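
The BER is the average of the per-class error rates, so it can be verified by hand from the confusion matrix above, using the row and column names as printed:

# BER by hand: mean of the per-class error rates,
# here (0/50 + 9/100)/2 = 0.045
mean(c(conf["MSC", "predicted.as.Non-MSC"] / sum(conf["MSC", ]),
       conf["Non-MSC", "predicted.as.MSC"] / sum(conf["Non-MSC", ])))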

Now we can predict on another dataset, if available. Here we use the learning set, hence the predictions in $predicted.learn and $predicted.test should be identical.

pred=prediction(fit,X.test=X)
head(pred$Y.hat.test[,,'comp.2'])
##                      MSC     Non-MSC
## GSM568483     0.44257770  0.55742230
## GSM698439     1.09348624 -0.09348624
## GSM378817    -0.04513959  1.04513959
## 5935420012_H  0.76714357  0.23285643
## GSM372149     0.36380714  0.63619286
## GSM1277645    0.74543194  0.25456806
head(pred$predicted.test$max.dist)
##              comp 1    comp 2   
## GSM568483    "Non-MSC" "Non-MSC"
## GSM698439    "MSC"     "MSC"    
## GSM378817    "Non-MSC" "Non-MSC"
## 5935420012_H "MSC"     "MSC"    
## GSM372149    "Non-MSC" "Non-MSC"
## GSM1277645   "MSC"     "MSC"
# head(pred$predicted.test$centroids.dist)
# head(pred$predicted.test$mahalanobis.dist)

CI.prediction

Function to calculate Confidence Intervals (CI) based on the predicted values. It fits a predictive model on each of the “many” subsamplings and uses these predicted values to calculate a CI. The function outputs a CI (lwr and upr values) for each component (here 3), each sample (here 10 samples) and each outcome category (here MSC and Non-MSC).
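
Conceptually, a percentile interval over the per-subsampling predictions could be formed as in the generic sketch below; pred.mat is a hypothetical samples-by-subsamplings matrix of predicted values, not an actual output of the package.

# generic sketch, not the package's exact implementation:
# pred.mat: one row per test sample, one column per subsampling
# ci = t(apply(pred.mat, 1, quantile, probs = c(0.025, 0.975)))
# colnames(ci) = c("lwr", "upr")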

CI=CI.prediction(fit,X.test=X[1:10,])
print(CI$CI)
## $comp.1
## $comp.1$MSC
##                                              lwr         upr
## GSM568483                             0.38564826  0.45480045
## GSM698439                             1.04500340  1.10777355
## GSM378817                            -0.16054241 -0.11652231
## 5935420012_H                          0.67857394  0.73953743
## GSM372149                             0.30873657  0.35969154
## GSM1277645                            0.64839093  0.71730182
## GSM540412                             0.24250583  0.29512492
## GSM609794_5202764005792184120204.D09 -0.03587477  0.04497949
## GSM898031                            -0.08061319 -0.02726905
## 5935420017_J                          0.77268768  0.84540096
## 
## $comp.1$`Non-MSC`
##                                             lwr        upr
## GSM568483                             0.5451996  0.6143517
## GSM698439                            -0.1077735 -0.0450034
## GSM378817                             1.1165223  1.1605424
## 5935420012_H                          0.2604626  0.3214261
## GSM372149                             0.6403085  0.6912634
## GSM1277645                            0.2826982  0.3516091
## GSM540412                             0.7048751  0.7574942
## GSM609794_5202764005792184120204.D09  0.9550205  1.0358748
## GSM898031                             1.0272691  1.0806132
## 5935420017_J                          0.1545990  0.2273123
## 
## 
## $comp.2
## $comp.2$MSC
##                                              lwr          upr
## GSM568483                             0.41314664  0.482020451
## GSM698439                             1.06129399  1.124713608
## GSM378817                            -0.06642636 -0.025844092
## 5935420012_H                          0.73452656  0.796931439
## GSM372149                             0.34311045  0.390940229
## GSM1277645                            0.70801422  0.781671376
## GSM540412                             0.25465053  0.309910665
## GSM609794_5202764005792184120204.D09 -0.26841175 -0.179978658
## GSM898031                            -0.04228441  0.005075762
## 5935420017_J                          0.83856700  0.906129977
## 
## $comp.2$`Non-MSC`
##                                              lwr         upr
## GSM568483                             0.51797955  0.58685336
## GSM698439                            -0.12471361 -0.06129399
## GSM378817                             1.02584409  1.06642636
## 5935420012_H                          0.20306856  0.26547344
## GSM372149                             0.60905977  0.65688955
## GSM1277645                            0.21832862  0.29198578
## GSM540412                             0.69008934  0.74534947
## GSM609794_5202764005792184120204.D09  1.17997866  1.26841175
## GSM898031                             0.99492424  1.04228441
## 5935420017_J                          0.09387002  0.16143300
# show the predicted values and their confidence intervals based on the subsamplings
head(cbind(pred$Y.hat.test[1:10,1,'comp.2'],CI$CI$"comp.2"$"MSC"))
##                                  lwr         upr
## GSM568483     0.44257770  0.41314664  0.48202045
## GSM698439     1.09348624  1.06129399  1.12471361
## GSM378817    -0.04513959 -0.06642636 -0.02584409
## 5935420012_H  0.76714357  0.73452656  0.79693144
## GSM372149     0.36380714  0.34311045  0.39094023
## GSM1277645    0.74543194  0.70801422  0.78167138
head(cbind(pred$Y.hat.test[1:10,2,'comp.2'],CI$CI$"comp.2"$"Non-MSC"))
##                                 lwr         upr
## GSM568483     0.55742230  0.5179795  0.58685336
## GSM698439    -0.09348624 -0.1247136 -0.06129399
## GSM378817     1.04513959  1.0258441  1.06642636
## 5935420012_H  0.23285643  0.2030686  0.26547344
## GSM372149     0.63619286  0.6090598  0.65688955
## GSM1277645    0.25456806  0.2183286  0.29198578

plot.predictCI

We can plot the Confidence Intervals for a specific level and a specific component. A dashed line is plotted by default but should only be considered when the distance used to predict the class of samples is “max.dist”. In the simple case of a two-level outcome, the 0.5 line represents the threshold between the levels, and any sample whose Confidence Interval overlaps 0.5 does not have a clear predicted class; a short check flagging such samples is sketched after the plots below.

Note that this function can be called directly on the output of “CI.prediction”, or on the $out.CI output from the “prediction” function.

pred=prediction(fit,X.test=X, CI=TRUE)
plot(pred$out.CI, ncomp=2)

# we color each confidence interval by the predicted class
plot(pred$out.CI, ncomp=2, col = color.mixo(factor(pred$predicted.test$"max.dist"[,2]))) #second component

#plot(CI) #from previous section
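
As a small check, the samples without a clear predicted class on component 2 can be flagged by testing whether their interval straddles 0.5, using the lwr/upr columns of the CI object computed in the CI.prediction section:

# flag samples whose CI for the MSC level overlaps the 0.5 threshold
ci.msc = CI$CI$comp.2$MSC
ambiguous = ci.msc[, "lwr"] < 0.5 & ci.msc[, "upr"] > 0.5
rownames(ci.msc)[ambiguous]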

sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bootsPLS_1.1.1  mixOmics_6.3.1  ggplot2_2.2.1   lattice_0.20-35
## [5] MASS_7.3-47    
## 
## loaded via a namespace (and not attached):
##  [1] purrr_0.2.4        reshape2_1.4.2     colorspace_1.3-2  
##  [4] snow_0.4-2         miniUI_0.1.1       htmltools_0.3.6   
##  [7] yaml_2.1.14        rlang_0.1.4        glue_1.2.0        
## [10] RColorBrewer_1.1-2 bindrcpp_0.2       matrixStats_0.52.2
## [13] plyr_1.8.4         questionr_0.6.2    bindr_0.1         
## [16] stringr_1.2.0      munsell_0.4.3      gtable_0.2.0      
## [19] htmlwidgets_0.9    evaluate_0.10.1    labeling_0.3      
## [22] knitr_1.17         rmdformats_0.3.3   httpuv_1.3.5      
## [25] parallel_3.4.3     highr_0.6          rARPACK_0.11-0    
## [28] Rcpp_0.12.13       xtable_1.8-2       corpcor_1.6.9     
## [31] scales_0.5.0       backports_1.1.1    jsonlite_1.5      
## [34] mime_0.5           RSpectra_0.12-0    gridExtra_2.3     
## [37] ellipse_0.3-8      digest_0.6.12      stringi_1.1.5     
## [40] bookdown_0.5       dplyr_0.7.4        shiny_1.0.5       
## [43] grid_3.4.3         rprojroot_1.2      tools_3.4.3       
## [46] magrittr_1.5       rgl_0.98.1         lazyeval_0.2.1    
## [49] tibble_1.3.4       tidyr_0.7.2        pkgconfig_2.0.1   
## [52] Matrix_1.2-12      assertthat_0.2.0   rmarkdown_1.6     
## [55] rstudioapi_0.7     R6_2.2.2           igraph_1.1.2      
## [58] compiler_3.4.3

F. Rohart

14 December 2017