---
title: 'bootsPLS: bootstrap subsamplings of sPLS-DA for classification and signature identification'
author: "F. Rohart"
date: "14 December 2017"
output:
  rmdformats::readthedown:
    code_folding: show
    highlight: pygments
---

We illustrate the bootsPLS package using data included in the package. The data come from Mesenchymal Stem Cells (MSC) and were used in the paper: *A Molecular Classification of Human Mesenchymal Stromal Cells*, **F. Rohart *et al.*** 2016. PeerJ, DOI 10.7717/peerj.1845.

```{r, message=TRUE, echo=TRUE}
library(bootsPLS)
set.seed(13) # for reproducibility of this illustration
data(MSC)

# we shuffle the observations (initially all MSC samples are followed by all Non-MSC samples)
ind.random = sample(1:nrow(MSC$X))
X = MSC$X[ind.random, ]
Y = MSC$Y[ind.random]

dim(X)
summary(Y)
```

# bootsPLS

This function performs sPLS-DA on many subsamplings of X and Y in order to assess the stability of the selected variables, which allows a final model to be built using only the most stable variables. The function returns a bootsPLS object that records the variables selected over the replications, among other outputs.

`bootsPLS` can be run several times, for example on different clusters. In that case, the `save.file` argument is useful as the outputs are saved into an Rdata file, which makes it easier to combine several runs.

The default distance used to classify samples during the Cross-Validation process is "max.dist". Also available are "centroids.dist" and "mahalanobis.dist". It is important to be consistent with the distance used: if "max.dist" is used for `bootsPLS`, it must also be used for the `prediction` and `CI.prediction` functions.

```{r, echo=TRUE, fig.width=11, fig.height=7, results="hide"}
# we first run the function for only 1 iteration (many=1) and show the progress of the algorithm
boot1 = bootsPLS(X = X, Y = Y, ncomp = 3, many = 1, showProgress = TRUE, nrepeat = 5)

# we now run the function for 24 iterations on 2 cpus, without showing the progress
boot2 = bootsPLS(X = X, Y = Y, ncomp = 3, many = 24, cpus = 2, showProgress = FALSE)
```

# compile.bootsPLS.object

This function combines several bootsPLS objects. The input is either a list of bootsPLS objects, a vector of file names combined with a path, or path/pattern arguments.

```{r, echo=TRUE, fig.width=11, fig.height=7}
boot = compile.bootsPLS.object(bootsPLS.list = list(boot1, boot2))
```

# plot.bootsPLS

Plot the frequency of selection for each variable. Note that different options are available, e.g. to change the shapes or colors of the points.

```{r, echo=TRUE, fig.width=11, fig.height=7}
# plot all the variables
plot(boot)

# only plot variables that were selected more than 30% of the time on at least one component
#plot(boot, light = 0.3, legend.position = FALSE)

# remove the variable names
plot(boot, light = 0.3, name.var = FALSE)
```

# component.selection

Function to decide on the "optimal" number of components.

```{r, echo=TRUE, fig.width=11, fig.height=7}
out = component.selection(boot)
ncomp.opt = out$opt
```

We can plot the testing process, where the solid line is the p-value threshold. If, say, component 2 is above the threshold line but component 3 is below it, then a model with 2 components should be used.

```{r, echo=TRUE, fig.width=11, fig.height=7}
plot(out)
```

# variable.selection

Function to decide on the "optimal" number of variables per component. See the reference paper (and supplemental information) for more details. In a nutshell, the process assesses whether adding variables significantly improves the prediction performance. The variables are added in order of decreasing selection frequency: the model starts with the variables selected 100% of the time, then the frequency threshold is lowered until another variable is added (this can be a set of variables if two or more variables share the same frequency of selection).
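To make this nested construction concrete, here is a minimal base-R sketch with made-up selection frequencies (the gene names and values below are purely illustrative; the actual frequencies are recorded in the bootsPLS object over the subsamplings):

```{r, echo=TRUE}
# purely illustrative frequencies; the real ones come from the bootsPLS subsamplings
freq = c(geneA = 1.00, geneB = 1.00, geneC = 0.85, geneD = 0.85, geneE = 0.60)

# lowering the frequency threshold produces nested candidate signatures,
# from the most stable variables to the least stable
thresholds = sort(unique(freq), decreasing = TRUE)
lapply(thresholds, function(t) names(freq)[freq >= t])
```

Each candidate set is then assessed to check whether the added variables significantly improve the prediction performance.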
```{r, echo=TRUE, fig.width=11, fig.height=7}
# we run the process for all the components in the model (without choosing the optimal number of components)
out = variable.selection(boot)
```

The signature is:

```{r}
print(out$signature)
```

A dot above the significance line indicates a set of parameters that significantly improves the prediction performance. The last dot above the line (from left to right) gives the signature, on a per-component basis.

```{r, echo=TRUE, fig.width=11, fig.height=7}
plot(out)
```

# fit.model

This function can select the number of components and the number of variables (auto.tune=TRUE) and then fits the model based on these parameters.

## with automatic tuning

```{r, echo=TRUE, fig.width=11, fig.height=7}
fit = fit.model(boot, auto.tune = TRUE, showProgress = FALSE)
# plot(fit$component.selection)
# plot(fit$variable.selection)
```

## without automatic tuning

We here force the number of components to be 2.

```{r, echo=TRUE, fig.width=11, fig.height=7}
fit = fit.model(boot, ncomp = 2, showProgress = FALSE)
signature = fit$data$signature # we chose a model with 2 components
print(signature)
```

We can also use the `fit.model` function on a specific signature.

```{r, echo=TRUE, fig.width=11, fig.height=7}
fit = fit.model(boot, ncomp = 2, signature = signature)
```

## plotIndiv

We can use the `plotIndiv` function from the mixOmics package.

```{r, echo=TRUE, fig.width=11, fig.height=7}
plotIndiv(fit)
plotIndiv(fit, ind.names = FALSE)
```

# Prediction

We look at the predicted values for one of the three distances. Note that you should use the same distance as the one used during the `bootsPLS` call. Here we use the default distance, "max.dist".

```{r, echo=TRUE, fig.width=11, fig.height=7}
pred = prediction(fit)
head(pred$predicted.learn$max.dist)
# head(pred$predicted.learn$centroids.dist)
# head(pred$predicted.learn$mahalanobis.dist)
```

We compare them to the real values of Y and get a confusion matrix, as well as a Balanced Error Rate (BER).

```{r, echo=TRUE, fig.width=11, fig.height=7}
conf = mixOmics::get.confusion_matrix(truth = fit$Y, predicted = pred$predicted.learn$max.dist[, 2])
conf
mixOmics::get.BER(conf)
```

Now we can predict on another dataset, if available. Here we use the learning set, hence the predictions should be the same in `$predicted.learn` and `$predicted.test`.

```{r, echo=TRUE, fig.width=11, fig.height=7}
pred = prediction(fit, X.test = X)
head(pred$Y.hat.test[, , 'comp.2'])
head(pred$predicted.test$max.dist)
# head(pred$predicted.test$centroids.dist)
# head(pred$predicted.test$mahalanobis.dist)
```

# CI.prediction

Function to calculate Confidence Intervals based on the predicted values. It fits a predictive model on each of the `many` subsamplings and uses the resulting predicted values to calculate a CI. The function outputs a CI (lwr and upr values) for each component (here 3), each sample (here 10 samples) and each outcome category (here MSC and Non-MSC).
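As a rough illustration of the underlying idea (not necessarily how the package computes its intervals), a confidence interval for one sample and one outcome category can be obtained from the empirical quantiles of the predictions gathered over the subsamplings; the values below are made up:

```{r, echo=TRUE}
# made-up predicted values for a single sample across 100 hypothetical subsamplings
pred.subsamplings = rnorm(100, mean = 0.7, sd = 0.1)

# a 95% interval from the empirical quantiles (lwr and upr)
quantile(pred.subsamplings, probs = c(0.025, 0.975))
```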
```{r, echo=TRUE, fig.width=11, fig.height=7}
CI = CI.prediction(fit, X.test = X[1:10, ])
print(CI$CI)
```

```{r, echo=TRUE, fig.width=11, fig.height=7}
# show the predicted values and their confidence intervals based on the subsamplings
head(cbind(pred$Y.hat.test[1:10, 1, 'comp.2'], CI$CI$"comp.2"$"MSC"))
head(cbind(pred$Y.hat.test[1:10, 2, 'comp.2'], CI$CI$"comp.2"$"Non-MSC"))
```

# plot.predictCI

We can plot the Confidence Intervals for a specific level and a specific component. A dashed line is plotted by default but should only be considered when the distance used to predict the class of the samples is "max.dist". In the simple case of a two-level outcome, the 0.5 line represents the threshold between the levels, and any sample whose Confidence Interval overlaps 0.5 does not have a clear predicted class. Note that this function can be applied directly to the output of `CI.prediction` or to the `$out.CI` output of the `prediction` function.

```{r, echo=TRUE, fig.width=11, fig.height=7}
pred = prediction(fit, X.test = X, CI = TRUE)
plot(pred$out.CI, ncomp = 2)

# we color each confidence interval by the predicted class on the second component
plot(pred$out.CI, ncomp = 2, col = color.mixo(factor(pred$predicted.test$"max.dist"[, 2])))

#plot(CI) # from the previous section
```

# SessionInfo()

```{r, echo=FALSE}
sessionInfo()
```