The function converts a graph into a collection of nodewise-based models:
each mediator or sink variable can be expressed as a function of its parents.
Based on the assumed type of relationship, i.e. linear or non-linear, SEMml()
fits an ML model to each node (variable) with non-zero incoming connectivity.
The model fitting is performed equation-by-equation (r = 1, ..., R),
where R is the number of mediators and sink nodes.
SEMml(
graph,
data,
outcome = NULL,
algo = "sem",
thr = NULL,
nboot = 0,
ncores = 2,
verbose = FALSE,
...
)
An igraph object.
A matrix with rows corresponding to subjects, and columns to graph nodes (variables).
A character vector (as.factor) of labels for a categorical outcome (target). If NULL (default), the categorical outcome (target) will not be considered.
ML method used for nodewise-network predictions. Six algorithms can be specified (see the call sketch after this argument list):

algo = "sem" (default) for a linear SEM, see SEMrun.

algo = "tree" for a CART model, see rpart.

algo = "rf" for a random forest model, see ranger.

algo = "xgb" for an XGBoost model, see xgboost.

algo = "nn" for a small neural network model (1 hidden layer and 10 nodes), see nnet.

algo = "dnn" for a large neural network model (1 hidden layer and 1000 nodes), see dnn.
A numeric value in [0, 1] indicating the threshold to apply to the variable importance values to color the graph. If thr = NULL (default), the threshold is set to thr = 0.5*max(abs(variable importance values)).
Number of bootstrap samples that will be used to compute cheap (lower, upper) CIs for all input variable weights. By default, nboot = 0 (no bootstrapping).
Number of CPU cores (default = 2).
A logical value. If FALSE (default), the processed graph will not be plotted to screen.
Currently ignored.
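As a quick orientation, a minimal call combining these arguments might look as follows (a sketch using the alsData example dataset shipped with SEMgraph; see the Examples section for full runs):

ig <- alsData$graph
data <- transformData(alsData$exprs)$data
res <- SEMml(ig, data, outcome = NULL, algo = "rf",
             thr = NULL, nboot = 0, ncores = 2, verbose = FALSE)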
An S3 object of class "ML" is returned. It is a list of 5 objects:
"fit", a list of ML model objects, including: the estimated covariance matrix (Sigma), the estimated model errors (Psi), the fitting indices (fitIdx), and the parameterEstimates, i.e., the variable importance measures (VarImp).
"gest", the data.frame of variable importances (parameterEstimates) of outcome levels, if outcome != NULL.
"model", a list of all the fitted non-linear nodewise-based models (tree, rf, xgb, nn or dnn).
"graph", the induced DAG of the input graph mapped on data variables. The DAG with colored edge/nodes based on the variable importance measures, i.e., if abs(VarImp) > thr will be highlighted in red (VarImp > 0) or blue (VarImp < 0). If the outcome vector is given, nodes with variable importances summed over the outcome levels, i.e. sum(VarImp[outcome levels])) > thr, will be highlighted in pink.
"data", input data subset mapping graph nodes.
Using the default algo = "sem", the usual output of a linear nodewise-based
SEM (see SEMrun with algo = "cggm") will be returned.
By mapping data onto the input graph, SEMml() creates
a set of nodewise-based models based on the directed links, i.e.,
edges pointing in the same direction between two nodes
in the input graph that are causally relevant to each other.
The mediator or sink variables can be characterized in detail as
functions of their parents. An ML model (sem, tree, rf, xgb, nn, dnn)
can then be fitted to each variable with non-zero inbound connectivity.
With R representing the number of mediators and sink nodes in the
network, the model fitting process is performed equation-by-equation
for r = 1, ..., R.
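Conceptually, the nodewise decomposition can be sketched with plain igraph and lm() calls (an illustration of the idea only, assuming all graph nodes appear as columns of data; SEMml() replaces lm() with the chosen ML algorithm):

library(igraph)
fits <- list()
for (v in V(ig)$name) {
  pa <- names(neighbors(ig, v, mode = "in"))  # parents of node v
  if (length(pa) == 0) next                   # skip source nodes
  y <- data[, v]
  X <- data[, pa, drop = FALSE]
  fits[[v]] <- lm(y ~ X)                      # one equation per mediator/sink node
}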
If nboot != 0, the function will implement the cheap bootstrapping proposed by Lam (2022) to generate uncertainties, i.e. 90% confidence intervals (CIs), for ML parameters. Bootstrapping can be enabled by setting a small number (1 to 10) of bootstrap samples. Note, however, that the computation can be time-consuming for massive MLs, even with cheap bootstrapping!
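For example, cheap bootstrap CIs with a small number of resamples might be requested as follows (a sketch; the (lower, upper) columns of parameterEstimates() are then filled):

res_boot <- SEMml(ig, data, algo = "rf", nboot = 5)
parameterEstimates(res_boot$fit)  # VarImp with bootstrap (lower, upper) CIs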
Grassi M., Palluzzi F., and Tarantino B. (2022). SEMgraph: An R Package for Causal Network Analysis of High-Throughput Data with Structural Equation Models. Bioinformatics, 38 (20), 4829–4830 <https://doi.org/10.1093/bioinformatics/btac567>
Breiman L., Friedman J.H., Olshen R.A., and Stone C.J. (1984). Classification and Regression Trees. Chapman and Hall/CRC.
Breiman L. (2001). Random Forests, Machine Learning 45(1), 5-32.
Chen T., and Guestrin C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Ripley B.D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Lam H. (2022). Cheap bootstrap for input uncertainty quantification. WSC '22: Proceedings of the Winter Simulation Conference, 2318-2329.
# \donttest{
# Load Amyotrophic Lateral Sclerosis (ALS) data
ig <- alsData$graph
data <- alsData$exprs
data <- transformData(data)$data
#> Conducting the nonparanormal transformation via shrunkun ECDF...done.
group <- alsData$group
# ...with train-test (0.5-0.5) samples
set.seed(123)
train <- sample(1:nrow(data), 0.5 * nrow(data))
start <- Sys.time()
# ... tree
res1 <- SEMml(ig, data[train, ], algo = "tree")
#> Running SEM model via ML...
#> done.
#>
#> TREE solver ended normally after 23 iterations
#>
#> logL:-45.080145 srmr:0.201877
# ... rf
res2 <- SEMml(ig, data[train, ], algo = "rf")
#> Running SEM model via ML...
#> done.
#>
#> RF solver ended normally after 23 iterations
#>
#> logL:-33.16687 srmr:0.086188
# ... xgb
res3 <- SEMml(ig, data[train, ], algo = "xgb")
#> Running SEM model via ML...
#> done.
#>
#> XGB solver ended normally after 23 iterations
#>
#> logL:70.10035 srmr:0.001439
# ... nn
res4 <- SEMml(ig, data[train, ], algo = "nn")
#> Running SEM model via ML...
#> done.
#>
#> NN solver ended normally after 23 iterations
#>
#> logL:-33.873642 srmr:0.135954
end <- Sys.time()
print(end - start)
#> Time difference of 2.329741 secs
# Visualization of the colored DAG for algo = "nn"
gplot(res4$graph, l = "dot", main = "nn")
# Comparison of fitting indices (in train data)
res1$fit$fitIdx #tree
#> logL amse rmse srmr
#> -45.0801450 0.6403463 0.8002164 0.2018772
res2$fit$fitIdx #rf
#> logL amse rmse srmr
#> -33.16686964 0.23129978 0.48093636 0.08618768
res3$fit$fitIdx #xgb
#> logL amse rmse srmr
#> 7.010035e+01 1.865892e-04 1.365977e-02 1.439282e-03
res4$fit$fitIdx #nn
#> logL amse rmse srmr
#> -33.8736420 0.3859230 0.6212270 0.1359541
# Comparison of parameter estimates (in train data)
parameterEstimates(res1$fit) #tree
#> lhs op rhs VarImp lower upper
#> 1 10452 ~ 6647 16.112 0 0
#> 43 84134 ~ 6647 24.516 0 0
#> 38 596 ~ 6647 10.728 0 0
#> 17 4747 ~ 6647 19.529 0 0
#> 41 79139 ~ 6647 25.164 0 0
#> 27 5530 ~ 6647 27.370 0 0
#> 28 5532 ~ 6647 32.388 0 0
#> 29 5533 ~ 6647 20.453 0 0
#> 30 5534 ~ 6647 30.462 0 0
#> 31 5535 ~ 6647 25.923 0 0
#> 44 842 ~ 54205 32.106 0 0
#> 23 54205 ~ 581 35.206 0 0
#> 24 54205 ~ 572 6.647 0 0
#> 25 54205 ~ 596 6.135 0 0
#> 26 54205 ~ 598 15.418 0 0
#> 45 842 ~ 317 6.470 0 0
#> 42 836 ~ 842 28.768 0 0
#> 4 1616 ~ 7132 14.271 0 0
#> 5 1616 ~ 7133 19.620 0 0
#> 6 4217 ~ 1616 12.718 0 0
#> 36 5606 ~ 4217 11.445 0 0
#> 37 5608 ~ 4217 35.143 0 0
#> 2 1432 ~ 5606 15.674 0 0
#> 32 5600 ~ 5606 12.193 0 0
#> 34 5603 ~ 5606 14.378 0 0
#> 39 6300 ~ 5606 11.595 0 0
#> 3 1432 ~ 5608 22.355 0 0
#> 33 5600 ~ 5608 19.935 0 0
#> 35 5603 ~ 5608 13.768 0 0
#> 40 6300 ~ 5608 25.313 0 0
#> 18 4747 ~ 1432 22.556 0 0
#> 7 4741 ~ 1432 24.848 0 0
#> 12 4744 ~ 1432 19.416 0 0
#> 19 4747 ~ 5600 11.291 0 0
#> 8 4741 ~ 5600 7.237 0 0
#> 13 4744 ~ 5600 16.367 0 0
#> 20 4747 ~ 5603 13.228 0 0
#> 9 4741 ~ 5603 13.976 0 0
#> 14 4744 ~ 5603 17.849 0 0
#> 21 4747 ~ 6300 3.397 0 0
#> 10 4741 ~ 6300 2.549 0 0
#> 15 4744 ~ 6300 10.920 0 0
#> 22 4747 ~ 5630 8.191 0 0
#> 11 4741 ~ 5630 9.045 0 0
#> 16 4744 ~ 5630 10.093 0 0
parameterEstimates(res2$fit) #rf
#> lhs op rhs VarImp lower upper
#> 1 10452 ~ 6647 69.045 0 0
#> 43 84134 ~ 6647 69.784 0 0
#> 38 596 ~ 6647 65.912 0 0
#> 17 4747 ~ 6647 17.437 0 0
#> 41 79139 ~ 6647 69.236 0 0
#> 27 5530 ~ 6647 68.541 0 0
#> 28 5532 ~ 6647 67.560 0 0
#> 29 5533 ~ 6647 64.931 0 0
#> 30 5534 ~ 6647 69.706 0 0
#> 31 5535 ~ 6647 67.608 0 0
#> 44 842 ~ 54205 41.088 0 0
#> 23 54205 ~ 581 33.786 0 0
#> 24 54205 ~ 572 11.875 0 0
#> 25 54205 ~ 596 8.051 0 0
#> 26 54205 ~ 598 18.897 0 0
#> 45 842 ~ 317 29.358 0 0
#> 42 836 ~ 842 68.118 0 0
#> 4 1616 ~ 7132 32.931 0 0
#> 5 1616 ~ 7133 33.630 0 0
#> 6 4217 ~ 1616 66.157 0 0
#> 36 5606 ~ 4217 67.708 0 0
#> 37 5608 ~ 4217 69.613 0 0
#> 2 1432 ~ 5606 29.902 0 0
#> 32 5600 ~ 5606 31.934 0 0
#> 34 5603 ~ 5606 35.574 0 0
#> 39 6300 ~ 5606 29.589 0 0
#> 3 1432 ~ 5608 39.435 0 0
#> 33 5600 ~ 5608 35.348 0 0
#> 35 5603 ~ 5608 31.411 0 0
#> 40 6300 ~ 5608 38.616 0 0
#> 18 4747 ~ 1432 13.906 0 0
#> 7 4741 ~ 1432 20.510 0 0
#> 12 4744 ~ 1432 17.933 0 0
#> 19 4747 ~ 5600 10.738 0 0
#> 8 4741 ~ 5600 13.765 0 0
#> 13 4744 ~ 5600 17.691 0 0
#> 20 4747 ~ 5603 14.071 0 0
#> 9 4741 ~ 5603 15.839 0 0
#> 14 4744 ~ 5603 13.586 0 0
#> 21 4747 ~ 6300 6.476 0 0
#> 10 4741 ~ 6300 7.967 0 0
#> 15 4744 ~ 6300 8.983 0 0
#> 22 4747 ~ 5630 11.374 0 0
#> 11 4741 ~ 5630 14.927 0 0
#> 16 4744 ~ 5630 14.499 0 0
parameterEstimates(res3$fit) #xgb
#> lhs op rhs VarImp lower upper
#> 1 10452 ~ 6647 1.000 0 0
#> 43 84134 ~ 6647 1.000 0 0
#> 38 596 ~ 6647 1.000 0 0
#> 17 4747 ~ 6647 0.263 0 0
#> 41 79139 ~ 6647 1.000 0 0
#> 27 5530 ~ 6647 1.000 0 0
#> 28 5532 ~ 6647 1.000 0 0
#> 29 5533 ~ 6647 1.000 0 0
#> 30 5534 ~ 6647 1.000 0 0
#> 31 5535 ~ 6647 1.000 0 0
#> 44 842 ~ 54205 0.694 0 0
#> 23 54205 ~ 581 0.632 0 0
#> 24 54205 ~ 572 0.104 0 0
#> 25 54205 ~ 596 0.048 0 0
#> 26 54205 ~ 598 0.216 0 0
#> 45 842 ~ 317 0.306 0 0
#> 42 836 ~ 842 1.000 0 0
#> 4 1616 ~ 7132 0.430 0 0
#> 5 1616 ~ 7133 0.570 0 0
#> 6 4217 ~ 1616 1.000 0 0
#> 36 5606 ~ 4217 1.000 0 0
#> 37 5608 ~ 4217 1.000 0 0
#> 2 1432 ~ 5606 0.527 0 0
#> 32 5600 ~ 5606 0.409 0 0
#> 34 5603 ~ 5606 0.521 0 0
#> 39 6300 ~ 5606 0.381 0 0
#> 3 1432 ~ 5608 0.473 0 0
#> 33 5600 ~ 5608 0.591 0 0
#> 35 5603 ~ 5608 0.479 0 0
#> 40 6300 ~ 5608 0.619 0 0
#> 18 4747 ~ 1432 0.300 0 0
#> 7 4741 ~ 1432 0.397 0 0
#> 12 4744 ~ 1432 0.266 0 0
#> 19 4747 ~ 5600 0.092 0 0
#> 8 4741 ~ 5600 0.142 0 0
#> 13 4744 ~ 5600 0.349 0 0
#> 20 4747 ~ 5603 0.194 0 0
#> 9 4741 ~ 5603 0.264 0 0
#> 14 4744 ~ 5603 0.202 0 0
#> 21 4747 ~ 6300 0.036 0 0
#> 10 4741 ~ 6300 0.028 0 0
#> 15 4744 ~ 6300 0.047 0 0
#> 22 4747 ~ 5630 0.115 0 0
#> 11 4741 ~ 5630 0.168 0 0
#> 16 4744 ~ 5630 0.136 0 0
parameterEstimates(res4$fit) #nn
#> lhs op rhs VarImp lower upper
#> 1 10452 ~ 6647 -130.584 0 0
#> 43 84134 ~ 6647 490.219 0 0
#> 38 596 ~ 6647 75.989 0 0
#> 17 4747 ~ 6647 4.713 0 0
#> 41 79139 ~ 6647 148.552 0 0
#> 27 5530 ~ 6647 636.768 0 0
#> 28 5532 ~ 6647 -53.667 0 0
#> 29 5533 ~ 6647 114.084 0 0
#> 30 5534 ~ 6647 -488.401 0 0
#> 31 5535 ~ 6647 -280.155 0 0
#> 44 842 ~ 54205 -13.376 0 0
#> 23 54205 ~ 581 -31.073 0 0
#> 24 54205 ~ 572 10.238 0 0
#> 25 54205 ~ 596 -38.787 0 0
#> 26 54205 ~ 598 3.948 0 0
#> 45 842 ~ 317 23.398 0 0
#> 42 836 ~ 842 85.317 0 0
#> 4 1616 ~ 7132 206.959 0 0
#> 5 1616 ~ 7133 5.475 0 0
#> 6 4217 ~ 1616 326.475 0 0
#> 36 5606 ~ 4217 271.178 0 0
#> 37 5608 ~ 4217 -23.244 0 0
#> 2 1432 ~ 5606 3.231 0 0
#> 32 5600 ~ 5606 11.823 0 0
#> 34 5603 ~ 5606 106.083 0 0
#> 39 6300 ~ 5606 94.168 0 0
#> 3 1432 ~ 5608 30.960 0 0
#> 33 5600 ~ 5608 36.585 0 0
#> 35 5603 ~ 5608 95.357 0 0
#> 40 6300 ~ 5608 1.114 0 0
#> 18 4747 ~ 1432 1.397 0 0
#> 7 4741 ~ 1432 18.983 0 0
#> 12 4744 ~ 1432 -48.739 0 0
#> 19 4747 ~ 5600 14.978 0 0
#> 8 4741 ~ 5600 29.250 0 0
#> 13 4744 ~ 5600 -44.499 0 0
#> 20 4747 ~ 5603 -16.879 0 0
#> 9 4741 ~ 5603 0.224 0 0
#> 14 4744 ~ 5603 6.527 0 0
#> 21 4747 ~ 6300 -4.203 0 0
#> 10 4741 ~ 6300 15.670 0 0
#> 15 4744 ~ 6300 30.669 0 0
#> 22 4747 ~ 5630 -3.214 0 0
#> 11 4741 ~ 5630 -8.594 0 0
#> 16 4744 ~ 5630 65.776 0 0
# Comparison of VarImp (in train data)
table(E(res1$graph)$color) #tree
#>
#> gray50 red2
#> 25 20
table(E(res2$graph)$color) #rf
#>
#> gray50 red2
#> 27 18
table(E(res3$graph)$color) #xgb
#>
#> gray50 red2
#> 25 20
table(E(res4$graph)$color) #nn
#>
#> gray50 red2 royalblue3
#> 41 3 1
# Comparison of AMSE, R2, SRMR (in test data)
print(predict(res1, data[-train, ])$PE) #tree
#> amse r2 srmr
#> 0.8208979 0.1791021 0.2269379
print(predict(res2, data[-train, ])$PE) #rf
#> amse r2 srmr
#> 0.8260251 0.1739749 0.1760094
print(predict(res3, data[-train, ])$PE) #xgb
#> amse r2 srmr
#> 0.8005846 0.1994154 0.1543352
print(predict(res4, data[-train, ])$PE) #nn
#> amse r2 srmr
#> NaN NaN 0.1844971
# ...with a categorical (as.factor) outcome
outcome <- factor(ifelse(group == 0, "control", "case")); table(outcome)
#> outcome
#> case control
#> 139 21
res5 <- SEMml(ig, data[train, ], outcome[train], algo = "tree")
#> Running SEM model via ML...
#> done.
#>
#> TREE solver ended normally after 25 iterations
#>
#> logL:-48.72171 srmr:0.196861
gplot(res5$graph)
table(E(res5$graph)$color)
#>
#> gray50 red2
#> 25 20
table(V(res5$graph)$color)
#>
#> pink white
#> 2 11
pred <- predict(res5, data[-train, ], outcome[-train], verbose=TRUE)
#> amse r2 srmr
#> 0.7783295 0.2216705 0.2253818
yhat <- pred$Yhat[, levels(outcome)]; head(yhat)
#> case control
#> 1 0.4773726 -0.4773726
#> 2 0.4773726 -0.4773726
#> 3 0.4773726 -0.4773726
#> 4 0.4773726 -0.4773726
#> 5 0.4773726 -0.4773726
#> 6 -0.6137648 0.6137648
yobs <- outcome[-train]; head(yobs)
#> [1] case case case case case case
#> Levels: case control
classificationReport(yobs, yhat, verbose=TRUE)$stats
#> pred
#> yobs case control
#> case 57 17
#> control 2 4
#>
#> precision recall f1 accuracy mcc support
#> case 0.9661017 0.7702703 0.8571429 0.7625 0.261562 74
#> control 0.1904762 0.6666667 0.2962963 0.7625 0.261562 6
#> macro avg 0.5782889 0.7184685 0.5767196 0.7625 0.261562 80
#> weighted avg 0.9079298 0.7625000 0.8150794 0.7625 0.261562 80
#> support_prop
#> case 0.925
#> control 0.075
#> macro avg 1.000
#> weighted avg 1.000
# }