---
title: 'bootsPLS: bootstrap subsamplings of sPLS-DA for classification and signature identification'
author: "F. Rohart"
date: "14 December 2017"
output:
  rmdformats::readthedown:
    code_folding: show
    highlight: pygments
---

We illustrate the bootsPLS package using data included in the package. The data come from Mesenchymal Stem Cells (MSC) and were used in the paper: *A Molecular Classification of Human Mesenchymal Stromal Cells*, **F. Rohart *et al.*** 2016. PeerJ, DOI 10.7717/peerj.1845.

```{r, message=TRUE, echo=TRUE}
library(bootsPLS)
set.seed(13) # for reproducibility of this illustration
data(MSC)

# we shuffle the observations (initially all MSC samples are followed by all Non-MSC samples)
ind.random = sample(1:nrow(MSC$X))
X = MSC$X[ind.random, ]
Y = MSC$Y[ind.random]

dim(X)
summary(Y)
```

# bootsPLS

This function performs sPLS-DA on many subsamplings of X and Y in order to assess the stability of the selected variables, which allows a final model to be built using only the most stable variables. The function returns a bootsPLS object that records the variables selected over the replications, among other outputs.

`bootsPLS` can be run several times, for example on different clusters. In that case, the `save.file` argument is useful as the outputs are saved into an Rdata file, which makes it easier to combine several runs.

The default distance used to classify samples during the Cross-Validation process is "max.dist". Also available are "centroids.dist" and "mahalanobis.dist". It is important to be consistent with the distance used: if "max.dist" is used for `bootsPLS`, it must also be used for the `prediction` and `CI.prediction` functions.

```{r, echo=TRUE, fig.width=11, fig.height=7, results="hide"}
# we first run the function for only 1 iteration (many=1) and show the progress of the algorithm
boot1 = bootsPLS(X = X, Y = Y, ncomp = 3, many = 1, showProgress = TRUE, nrepeat = 5)

# we now run the function for 24 iterations on 2 cpus, without showing the progress
boot2 = bootsPLS(X = X, Y = Y, ncomp = 3, many = 24, cpus = 2, showProgress = FALSE)
```

# compile.bootsPLS.object

This function combines several bootsPLS objects. The input is either a list of bootsPLS objects, a vector of file names combined with a path, or path/pattern arguments.

```{r, echo=TRUE, fig.width=11, fig.height=7}
boot = compile.bootsPLS.object(bootsPLS.list = list(boot1, boot2))
```

# plot.bootsPLS

Plot the frequency of selection for each variable. Note that different options are available, e.g. to change the shapes or colors of the points.

```{r, echo=TRUE, fig.width=11, fig.height=7}
# plot all the variables
plot(boot)

# only plot variables that were selected more than 30% of the time on at least one component
#plot(boot, light = 0.3, legend.position = FALSE)

# remove the variable names
plot(boot, light = 0.3, name.var = FALSE)
```

# component.selection

Function to decide on the "optimal" number of components.

```{r, echo=TRUE, fig.width=11, fig.height=7}
out = component.selection(boot)
ncomp.opt = out$opt
```

We can plot the testing process, where the solid line is the p-value threshold. If, say, component 2 is above the threshold line but component 3 is below it, then a model with 2 components should be used.

```{r, echo=TRUE, fig.width=11, fig.height=7}
plot(out)
```

# variable.selection

Function to decide on the "optimal" number of variables per component. See the reference paper (and supplemental information) for more details. In a nutshell, the process assesses whether adding variables significantly improves the prediction performance. The variables are added in order of decreasing selection frequency: the model starts with the variables selected 100% of the time, then the frequency threshold is lowered until another variable is added (this can be a set of variables if two or more variables share the same frequency of selection).
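To make this nested construction concrete, here is a minimal base-R sketch with made-up selection frequencies (the gene names and values below are purely illustrative; the actual frequencies are recorded in the bootsPLS object over the subsamplings):

```{r, echo=TRUE}
# purely illustrative frequencies; the real ones come from the bootsPLS subsamplings
freq = c(geneA = 1.00, geneB = 1.00, geneC = 0.85, geneD = 0.85, geneE = 0.60)

# lowering the frequency threshold produces nested candidate signatures,
# from the most stable variables to the least stable
thresholds = sort(unique(freq), decreasing = TRUE)
lapply(thresholds, function(t) names(freq)[freq >= t])
```

Each candidate set is then assessed to check whether the added variables significantly improve the prediction performance.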
```{r, echo=TRUE, fig.width=11, fig.height=7}
# we run the process for all the components in the model (without choosing the optimal number of components)
out = variable.selection(boot)
```

The signature is:

```{r}
print(out$signature)
```

A dot above the significance line indicates a set of parameters that significantly improves the prediction performance. The last dot above the line (from left to right) gives the signature, on a per-component basis.

```{r, echo=TRUE, fig.width=11, fig.height=7}
plot(out)
```

# fit.model

This function can select the number of components and the number of variables (auto.tune=TRUE) and then fits the model based on these parameters.

## with automatic tuning

```{r, echo=TRUE, fig.width=11, fig.height=7}
fit = fit.model(boot, auto.tune = TRUE, showProgress = FALSE)
# plot(fit$component.selection)
# plot(fit$variable.selection)
```

## without automatic tuning

We here force the number of components to be 2.

```{r, echo=TRUE, fig.width=11, fig.height=7}
fit = fit.model(boot, ncomp = 2, showProgress = FALSE)
signature = fit$data$signature # we chose a model with 2 components
print(signature)
```

We can also use the `fit.model` function on a specific signature.

```{r, echo=TRUE, fig.width=11, fig.height=7}
fit = fit.model(boot, ncomp = 2, signature = signature)
```

## plotIndiv

We can use the `plotIndiv` function from the mixOmics package.

```{r, echo=TRUE, fig.width=11, fig.height=7}
plotIndiv(fit)
plotIndiv(fit, ind.names = FALSE)
```

# Prediction

We look at the predicted values for one of the three distances. Note that you should use the same distance as the one used during the `bootsPLS` call. Here we use the default distance, "max.dist".

```{r, echo=TRUE, fig.width=11, fig.height=7}
pred = prediction(fit)
head(pred$predicted.learn$max.dist)
# head(pred$predicted.learn$centroids.dist)
# head(pred$predicted.learn$mahalanobis.dist)
```

We compare them to the real values of Y and get a confusion matrix, as well as a Balanced Error Rate (BER).

```{r, echo=TRUE, fig.width=11, fig.height=7}
conf = mixOmics::get.confusion_matrix(truth = fit$Y, predicted = pred$predicted.learn$max.dist[, 2])
conf
mixOmics::get.BER(conf)
```

Now we can predict on another dataset, if available. Here we use the learning set, hence the predictions should be the same in `$predicted.learn` and `$predicted.test`.

```{r, echo=TRUE, fig.width=11, fig.height=7}
pred = prediction(fit, X.test = X)
head(pred$Y.hat.test[, , 'comp.2'])
head(pred$predicted.test$max.dist)
# head(pred$predicted.test$centroids.dist)
# head(pred$predicted.test$mahalanobis.dist)
```

# CI.prediction

Function to calculate Confidence Intervals based on the predicted values. It fits a predictive model on each of the `many` subsamplings and uses the resulting predicted values to calculate a CI. The function outputs a CI (lwr and upr values) for each component (here 3), each sample (here 10 samples) and each outcome category (here MSC and Non-MSC).
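As a rough illustration of the underlying idea (not necessarily how the package computes its intervals), a confidence interval for one sample and one outcome category can be obtained from the empirical quantiles of the predictions gathered over the subsamplings; the values below are made up:

```{r, echo=TRUE}
# made-up predicted values for a single sample across 100 hypothetical subsamplings
pred.subsamplings = rnorm(100, mean = 0.7, sd = 0.1)

# a 95% interval from the empirical quantiles (lwr and upr)
quantile(pred.subsamplings, probs = c(0.025, 0.975))
```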
```{r, echo=TRUE, fig.width=11, fig.height=7}
CI = CI.prediction(fit, X.test = X[1:10, ])
print(CI$CI)
```

```{r, echo=TRUE, fig.width=11, fig.height=7}
# show the predicted values and their confidence intervals based on the subsamplings
head(cbind(pred$Y.hat.test[1:10, 1, 'comp.2'], CI$CI$"comp.2"$"MSC"))
head(cbind(pred$Y.hat.test[1:10, 2, 'comp.2'], CI$CI$"comp.2"$"Non-MSC"))
```

# plot.predictCI

We can plot the Confidence Intervals for a specific level and a specific component. A dashed line is plotted by default but should only be considered when the distance used to predict the class of the samples is "max.dist". In the simple case of a two-level outcome, the 0.5 line represents the threshold between the levels, and any sample whose Confidence Interval overlaps 0.5 does not have a clear predicted class. Note that this function can be applied directly to the output of `CI.prediction` or to the `$out.CI` output of the `prediction` function.

```{r, echo=TRUE, fig.width=11, fig.height=7}
pred = prediction(fit, X.test = X, CI = TRUE)
plot(pred$out.CI, ncomp = 2)

# we color each confidence interval by the predicted class on the second component
plot(pred$out.CI, ncomp = 2, col = color.mixo(factor(pred$predicted.test$"max.dist"[, 2])))

#plot(CI) # from the previous section
```

# SessionInfo()

```{r, echo=FALSE}
sessionInfo()
```