Model applicability domain evaluation with ensemble partial least squares.

Usage

enpls.ad(
  x,
  y,
  xtest,
  ytest,
  maxcomp = NULL,
  cvfolds = 5L,
  space = c("sample", "variable"),
  method = c("mc", "boot"),
  reptimes = 500L,
  ratio = 0.8,
  parallel = 1L
)

Arguments

x

Predictor matrix of the training set.

y

Response vector of the training set.

xtest

List, with the i-th component being the i-th test set's predictor matrix (see example code below).

ytest

List, with the i-th component being the i-th test set's response vector (see example code below).

maxcomp

Maximum number of components included within each model. If not specified, the maximum number possible will be used, taking cross-validation and the case where n is smaller than p into account (see the sketch after this argument list).

cvfolds

Number of cross-validation folds used in each model for automatic parameter selection. Default is 5.

space

Space in which to apply the resampling method. Can be the sample space ("sample") or the variable space ("variable").

method

Resampling method. "mc" (Monte-Carlo resampling) or "boot" (bootstrapping). Default is "mc".

reptimes

Number of models to build with Monte-Carlo resampling or bootstrapping.

ratio

Sampling ratio used when method = "mc".

parallel

Integer. Number of CPU cores to use. Default is 1 (not parallelized).
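
When maxcomp is left unspecified, the cap on the number of components is chosen automatically. The helper below is an illustrative sketch of one way such a cap could be derived, not the package's internal rule: it limits the components by the number of variables and by the samples available in each cross-validation training fold.

# Illustrative sketch only (not the package's internal rule)
choose_maxcomp <- function(x, cvfolds = 5L) {
  n.cv.train <- floor(nrow(x) * (cvfolds - 1L) / cvfolds)
  min(ncol(x), n.cv.train - 1L)
}
choose_maxcomp(matrix(rnorm(100 * 300), 100, 300), cvfolds = 5L)  # n < p case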

Value

A list containing:

  • tr.error.mean - absolute mean prediction error for training set

  • tr.error.median - absolute median prediction error for training set

  • tr.error.sd - prediction error sd for training set

  • tr.error.matrix - raw prediction error matrix for training set

  • te.error.mean - list of absolute mean prediction error for test set(s)

  • te.error.median - list of absolute median prediction error for test set(s)

  • te.error.sd - list of prediction error sd for test set(s)

  • te.error.matrix - list of raw prediction error matrix for test set(s)
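
The components are accessed by name. For example, assuming ad is an object returned by enpls.ad(), as created in the Examples below:

# Assuming `ad` is the object returned by enpls.ad() in the Examples below
summary(ad$tr.error.mean)  # mean absolute errors over the training samples
ad$te.error.sd[[1]]        # prediction error SDs for the first test set
str(ad$te.error.matrix)    # raw prediction error matrices for the test sets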

Note

Note that when space = "variable", method can only be "mc", since bootstrapping in the variable space would create duplicated variables, which causes problems.
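
A small illustration of the difference (a sketch, not the package's internal code): sampling variable indices with replacement, as "boot" would, can select the same column more than once, whereas Monte-Carlo sampling without replacement cannot.

set.seed(42)
p <- 10
idx.boot <- sample(p, p, replace = TRUE)  # bootstrap: duplicates possible
idx.mc <- sample(p, floor(p * 0.8))       # Monte-Carlo: columns stay unique
anyDuplicated(idx.boot) > 0               # typically TRUE
anyDuplicated(idx.mc) > 0                 # always FALSE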

Author

Nan Xiao <https://nanx.me>

Examples

data("alkanes")
x <- alkanes$x
y <- alkanes$y

# training set
x.tr <- x[1:100, ]
y.tr <- y[1:100]

# two test sets
x.te <- list(
  "test.1" = x[101:150, ],
  "test.2" = x[151:207, ]
)
y.te <- list(
  "test.1" = y[101:150],
  "test.2" = y[151:207]
)

set.seed(42)
ad <- enpls.ad(
  x.tr, y.tr, x.te, y.te,
  space = "variable", method = "mc",
  ratio = 0.9, reptimes = 50
)
print(ad)
#> Model Applicability Domain Evaluation by ENPLS
#> ---
#> Absolute mean prediction error for each training set sample:
#>   [1]  1.143535290  0.266577478  0.075668338  1.131416799  0.103337151
#>   [6]  1.062594738  0.023209713  0.700521215  0.468235064  0.673075458
#>  [11]  0.089540802  0.401803647  3.489442389  0.627893821  0.222504312
#>  [16]  3.221940312  0.894591039  0.096840929  0.792751494  1.236641601
#>  [21]  0.001584416  0.339331094  0.609968357  0.286792550  0.471706657
#>  [26]  0.691666512  0.219584091  0.687592096  0.737439448  0.692377554
#>  [31]  1.577144836  0.747749279  1.237863334  0.791600110  0.830273570
#>  [36]  1.187375782  0.416252943  1.386693400  1.209712475  1.110794824
#>  [41]  2.140243438  2.399984105  1.428537318  1.055821644  1.517716920
#>  [46]  1.590342977  0.706991956  0.914251843  1.973206057  1.939137967
#>  [51]  1.671019536  0.091573195  3.972970145  0.955507259  0.415111831
#>  [56]  0.931788087  2.508476277  2.759110197  1.600821991  0.404102398
#>  [61]  0.963920649  3.568594663 14.075284215  5.965384961  0.849943296
#>  [66]  1.141524697  2.287166025  1.359689933  1.504079464  2.344010556
#>  [71]  0.629486701  1.049468037  1.268353928  2.135248556  1.616434750
#>  [76]  2.119979067  0.269447046  1.830802524  1.784506205  0.637978496
#>  [81]  1.039257790  0.413567656  5.704702300  0.307698959  2.670416866
#>  [86]  0.691503674  2.185178349  2.001917127  1.529464596  0.374889846
#>  [91]  1.283773190  3.774151143  2.857702789  4.220121496  6.504020855
#>  [96]  1.762052847  3.572747930  1.679601429  1.776315457  1.140765375
#> ---
#> Prediction error SD for each training set sample:
#>   [1] 0.6632510 1.0747613 0.6228398 0.3909717 0.6276894 0.2926764 0.4835952
#>   [8] 0.2839815 0.3790893 0.2795853 0.2936881 0.2066487 0.7131719 0.3041280
#>  [15] 0.1152717 0.5529949 0.2850841 0.1779433 0.4157889 0.3811892 0.2819642
#>  [22] 0.2720929 0.4005652 0.3226532 0.2209190 0.4454591 0.3876323 0.2040831
#>  [29] 0.4670555 0.4136547 0.3290377 0.1708938 0.4155863 0.1579950 0.1666879
#>  [36] 0.2073140 0.5295909 0.4394888 0.3281213 0.2408321 0.4505811 0.1982922
#>  [43] 0.2434650 0.6775080 0.5017698 0.4212760 0.5761228 0.3315790 0.2303946
#>  [50] 0.2397685 0.3025108 0.1862496 0.3459049 0.3907732 0.2002375 0.3334323
#>  [57] 0.2800619 0.2489747 0.4342413 0.3122039 0.3108747 0.1818030 0.3497313
#>  [64] 0.1975296 0.2514031 0.5769985 0.2638713 0.2774563 0.3553371 0.3684134
#>  [71] 0.2015989 0.2458771 0.2710750 0.2248169 0.3510436 0.4010946 0.5246710
#>  [78] 0.5445661 0.2722141 0.3716945 0.3971595 0.1601688 0.2975253 0.2694639
#>  [85] 0.1819710 0.3799211 0.1644366 0.4944335 0.1960038 0.3795414 0.3421108
#>  [92] 0.3548105 0.6303994 0.5065287 0.2099194 0.1725128 0.2470902 0.4008990
#>  [99] 0.1955731 0.4749818
#> ---
#> Absolute mean prediction error for each test set sample:
#> [[1]]
#>  [1]  1.65850329  0.38377988  1.67378051  0.05017774  4.39395510  0.10970090
#>  [7]  0.80312305  2.00112429  3.30352901  2.39702103  2.49110746  2.57201965
#> [13]  3.40522658  1.66881500  0.18343101  2.68753476  3.38755741  0.45423958
#> [19]  2.05966181 11.37567054 12.46448491 10.29298124 13.73648078  7.69634489
#> [25]  9.60840997 61.52340940 12.87208410 11.51112678 11.74810709  6.98371976
#> [31]  3.21827472 12.00306032 12.10550647 12.84024755  4.05105088 12.90499545
#> [37] 11.90523171  4.25191528  2.50046794 12.77903509  6.03664168 11.53383212
#> [43]  4.61455483  2.44420160 12.70236566  7.43903336  3.71605764  2.86329295
#> [49]  7.64329131  7.18551395
#> 
#> [[2]]
#>  [1]  3.4543176  2.4871241 11.3411211 10.5407448  0.6383966  2.5794987
#>  [7]  1.3770894  7.8395977  1.2465770  0.4117513 36.6052600 31.2661905
#> [13] 35.1483035 31.4823204 35.0510448 29.4380334 40.1687529 34.5458973
#> [19] 23.2587832 77.0500127 28.5465786 23.0528668 25.4344415 34.8137632
#> [25] 30.6215780 22.2266004 19.7494606  0.2107381  0.8672154  3.3958982
#> [31]  2.7987705  2.3278965  1.7741967  0.7847005  6.0851973  1.5202727
#> [37]  5.0502711 46.4425419  0.7740034  1.3427991  3.9044839  1.0792567
#> [43]  4.9727341  0.7154980  3.8805581 15.7410126  3.2326191 14.0260989
#> [49]  2.5844449  6.2848282 12.2295047 10.8379296 11.2892066 12.5826622
#> [55]  7.0478476 12.8170478 11.4202239
#> 
#> ---
#> Prediction error SD for each test set sample:
#> [[1]]
#>  [1]   0.3435652   0.4202025   0.2808564   0.4494669   0.5071351   0.6865988
#>  [7]   0.2336136   0.4024350   0.4419844   0.6074236   0.4874105   0.6118569
#> [13]   0.5662671   0.7046519   0.5319585   0.6922911   0.6478133   0.7873733
#> [19]   0.9220921   9.1907515   8.9865306   8.7179049   8.6820109   7.3214560
#> [25]   6.3265450 100.7454688   8.0020688   8.0038258   7.6711559   8.0594401
#> [31]   7.9284787   7.0233280   7.5094411   7.8319137   7.8441271   7.8257605
#> [37]   7.4775638   7.7852585   7.7400842   6.8966159   7.6805341   7.6719323
#> [43]   7.7588242   7.7021805   7.6900458   7.6252742   7.6487169   7.5851825
#> [49]   7.6648408   7.4873894
#> 
#> [[2]]
#>  [1]   7.6684654   7.5798389   7.6259424   7.2935751   7.9845870   7.5090642
#>  [7]   7.5267845   6.6601347   7.4769873   7.5602014  23.1031563  19.9672830
#> [13]  20.0716499  21.5910269  20.7678312  21.0867461  21.3929124  21.3106238
#> [19]  21.2005159  88.4988937  21.1606533  21.1062933  20.6057930  20.9342275
#> [25]  20.9589177  20.7578514  20.7028630   0.2224852   0.6708101   0.5494225
#> [31]   0.3928201   0.5017335   0.3570617   0.3469072   0.5381546   0.1064181
#> [37]   0.3604482 105.9905919   0.2603238   0.3310628   0.2514998   0.1795206
#> [43]   0.3312007   0.3384145   0.7288403   8.9270759   7.4587752   8.2109699
#> [49]   7.9356333   7.9803569   7.9508740   6.9278026   7.8746692   7.7807236
#> [55]   7.9334114   7.9810786   7.7136865
#> 
plot(ad)

# the interactive plot requires an HTML viewer
if (FALSE) {
plot(ad, type = "interactive")
}
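
One possible follow-up (an illustrative sketch, not part of the enpls API) is to flag test samples whose mean absolute prediction error is large relative to the training-set errors, for example above mean + 3 SD:

# Illustrative only: the cutoff and rule are assumptions, not package defaults
cutoff <- mean(ad$tr.error.mean) + 3 * sd(ad$tr.error.mean)
lapply(ad$te.error.mean, function(err) which(err > cutoff))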