The function converts a graph into a collection of
nodewise-based models: each mediator or sink variable can be expressed as
a function of its parents. Based on the assumed type of relationship,
i.e. linear or non-linear, SEMml() fits an ML model to each
node (variable) with non-zero incoming connectivity.
The model fitting is performed equation-by-equation (r = 1, ..., R),
where R is the number of mediators and sink nodes.
Usage
SEMml(
graph,
data,
outcome = NULL,
algo = "sem",
thr = NULL,
nboot = 0,
ncores = 2,
verbose = FALSE,
...
)
Arguments
- graph
An igraph object.
- data
A matrix with rows corresponding to subjects, and columns to graph nodes (variables).
- outcome
A character vector (as.factor) of labels for a categorical output (target). If NULL (default), the categorical output (target) is not considered.
- algo
ML method used for nodewise-based predictions. Four algorithms can be specified: "sem" (default), "tree", "rf", or "xgb".
- thr
A numeric value [0-1] indicating the threshold to apply to the variable importance values to color the graph. If thr = NULL (default), the threshold is set to thr = 0.5*max(abs(variable importance values)).
- nboot
Number of bootstrap samples used to compute cheap (lower, upper) CIs for all input variable weights. By default, nboot = 0 (no bootstrapping).
- ncores
Number of CPU cores (default = 2).
- verbose
A logical value. If FALSE (default), the processed graph will not be plotted to screen.
- ...
Currently ignored.
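A minimal call sketch, assuming ig is an igraph DAG and X is a subjects-by-nodes data matrix whose column names match the graph nodes (ig and X are placeholders, not package objects):

res<- SEMml(graph=ig, data=X, algo="rf", nboot=3, ncores=2)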
Value
An S3 object of class "ML" is returned. It is a list of 5 objects:
- "fit", a list of ML model objects, including: the estimated covariance matrix (Sigma), the estimated model errors (Psi), the fitting indices (fitIdx), and the parameterEstimates, i.e., the variable importance measures (VarImp).
- "gest", the data.frame of variable importances (parameterEstimates) for the outcome levels, if outcome != NULL.
- "model", a list of all the fitted nodewise-based models (sem, tree, rf, or xgb).
- "graph", the induced DAG of the input graph mapped on the data variables, with edges and nodes colored by the variable importance measures: edges with abs(VarImp) > thr are highlighted in red (VarImp > 0) or blue (VarImp < 0). If the outcome vector is given, nodes whose variable importance summed over the outcome levels, i.e. sum(VarImp[outcome levels]) > thr, are highlighted in pink.
- "data", the input data subset mapped onto the graph nodes.
With the default algo = "sem", the usual output of a linear nodewise-based
SEM (see SEMrun with algo = "cggm") is returned.
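A short sketch of inspecting the returned object (res as in the call sketch above):

res$fit$fitIdx               # fitting indices (logL, amse, rmse, srmr)
parameterEstimates(res$fit)  # variable importances with (lower, upper) bounds
gplot(res$graph)             # induced DAG colored by variable importance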
Details
By mapping data onto the input graph, SEMml() creates
a set of nodewise models based on the directed links, i.e.,
edges pointing in the same direction between pairs of nodes
in the input graph that are causally relevant to each other.
The mediator or sink variables are defined as functions of their parents.
Then, an ML model (sem, tree, rf, xgb) is fitted to each variable with
non-zero inbound connectivity. The model fitting process is performed
equation-by-equation (r = 1, ..., R), where R represents the number of
mediators and sink nodes in the input graph.
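As an illustration of this decomposition, here is a minimal sketch that fits one regression per node with incoming edges, using lm() as a stand-in for the chosen ML learner (nodewise_fit, ig, and X are hypothetical names, not part of the package):

library(igraph)
nodewise_fit <- function(ig, X) {
  fits <- list()
  for (v in V(ig)$name) {
    pa <- names(neighbors(ig, v, mode="in"))  # parents of node v
    if (length(pa) == 0) next                 # source nodes get no equation
    fits[[v]] <- lm(X[, v] ~ X[, pa])         # one equation per mediator/sink
  }
  fits                                        # R fitted models, r = 1,...,R
}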
If nboot != 0, the function implements the cheap bootstrapping proposed by
Lam (2022) to generate uncertainties (i.e., bootstrap 90% CIs) for the ML
parameters. Bootstrapping can be enabled by setting a small number (1 to 10) of
bootstrap samples. Note, however, that the computation can be time-consuming for
massive ML models, even with cheap bootstrapping!
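For intuition, a minimal sketch of a cheap bootstrap CI for a generic statistic follows; cheap_boot_ci, x, and stat are hypothetical stand-ins, and the t quantile with B degrees of freedom follows Lam (2022):

cheap_boot_ci <- function(x, stat, B=5, alpha=0.10) {
  est  <- stat(x)                             # full-sample estimate
  reps <- replicate(B, stat(x[sample(nrow(x), replace=TRUE), , drop=FALSE]))
  s    <- sqrt(mean((reps - est)^2))          # spread around the estimate
  tq   <- qt(1 - alpha/2, df=B)               # t quantile, B degrees of freedom
  c(lower=est - tq*s, upper=est + tq*s)       # 90% CI when alpha = 0.10
}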
References
Grassi M., Palluzzi F., and Tarantino B. (2022). SEMgraph: An R Package for Causal Network Analysis of High-Throughput Data with Structural Equation Models. Bioinformatics, 38 (20), 4829–4830 <https://doi.org/10.1093/bioinformatics/btac567>
Breiman L., Friedman J.H., Olshen R.A., and Stone C.J. (1984). Classification and Regression Trees. Chapman and Hall/CRC.
Breiman L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
Chen T., and Guestrin C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Lam, H. (2022). Cheap bootstrap for input uncertainty quantification. WSC '22: Proceedings of the Winter Simulation Conference, 2318-2329.
Author
Mario Grassi mario.grassi@unipv.it
Examples
# \donttest{
# Load Amyotrophic Lateral Sclerosis (ALS) data
ig<- alsData$graph
data<- alsData$exprs
data<- transformData(data)$data
#> Conducting the nonparanormal transformation via shrunkun ECDF...done.
group<- alsData$group
#...with train-test (0.5-0.5) samples
set.seed(123)
train<- sample(1:nrow(data), 0.5*nrow(data))
start<- Sys.time()
# ... tree
res1<- SEMml(ig, data[train, ], algo="tree")
#> Running SEM model via ML...
#> done.
#>
#> TREE solver ended normally after 23 iterations
#>
#> logL:-45.080145 srmr:0.201877
# ... rf
res2<- SEMml(ig, data[train, ], algo="rf")
#> Running SEM model via ML...
#> done.
#>
#> RF solver ended normally after 23 iterations
#>
#> logL:-33.16687 srmr:0.086188
# ... xgb
res3<- SEMml(ig, data[train, ], algo="xgb")
#> Running SEM model via ML...
#> done.
#>
#> XGB solver ended normally after 23 iterations
#>
#> logL:70.10035 srmr:0.001439
# ... sem
res4<- SEMml(ig, data[train, ], algo="sem")
#> Running SEM model via ML...
#> done.
#>
#> SEM solver ended normally after 23 iterations
#>
#> logL:-48.441286 srmr:0.306438
end<- Sys.time()
print(end-start)
#> Time difference of 6.972608 secs
# Visualization of the colored DAG for algo="sem"
gplot(res4$graph, l="dot", main="sem")
#Comparison of fitting indices (in train data)
res1$fit$fitIdx #tree
#> logL amse rmse srmr
#> -45.0801450 0.6403463 0.8002164 0.2018772
res2$fit$fitIdx #rf
#> logL amse rmse srmr
#> -33.16686964 0.23129978 0.48093636 0.08618768
res3$fit$fitIdx #xgb
#> logL amse rmse srmr
#> 7.010035e+01 1.865892e-04 1.365977e-02 1.439282e-03
res4$fit$fitIdx #sem
#> logL amse rmse srmr
#> -48.4412858 0.8574470 0.9259843 0.3064378
#Comparison of parameter estimates (in train data)
parameterEstimates(res1$fit) #tree
#> lhs op rhs VarImp lower upper
#> 1 10452 ~ 6647 16.112 0 0
#> 43 84134 ~ 6647 24.516 0 0
#> 38 596 ~ 6647 10.728 0 0
#> 17 4747 ~ 6647 19.529 0 0
#> 41 79139 ~ 6647 25.164 0 0
#> 27 5530 ~ 6647 27.370 0 0
#> 28 5532 ~ 6647 32.388 0 0
#> 29 5533 ~ 6647 20.453 0 0
#> 30 5534 ~ 6647 30.462 0 0
#> 31 5535 ~ 6647 25.923 0 0
#> 44 842 ~ 54205 32.106 0 0
#> 23 54205 ~ 581 35.206 0 0
#> 24 54205 ~ 572 6.647 0 0
#> 25 54205 ~ 596 6.135 0 0
#> 26 54205 ~ 598 15.418 0 0
#> 45 842 ~ 317 6.470 0 0
#> 42 836 ~ 842 28.768 0 0
#> 4 1616 ~ 7132 14.271 0 0
#> 5 1616 ~ 7133 19.620 0 0
#> 6 4217 ~ 1616 12.718 0 0
#> 36 5606 ~ 4217 11.445 0 0
#> 37 5608 ~ 4217 35.143 0 0
#> 2 1432 ~ 5606 15.674 0 0
#> 32 5600 ~ 5606 12.193 0 0
#> 34 5603 ~ 5606 14.378 0 0
#> 39 6300 ~ 5606 11.595 0 0
#> 3 1432 ~ 5608 22.355 0 0
#> 33 5600 ~ 5608 19.935 0 0
#> 35 5603 ~ 5608 13.768 0 0
#> 40 6300 ~ 5608 25.313 0 0
#> 18 4747 ~ 1432 22.556 0 0
#> 7 4741 ~ 1432 24.848 0 0
#> 12 4744 ~ 1432 19.416 0 0
#> 19 4747 ~ 5600 11.291 0 0
#> 8 4741 ~ 5600 7.237 0 0
#> 13 4744 ~ 5600 16.367 0 0
#> 20 4747 ~ 5603 13.228 0 0
#> 9 4741 ~ 5603 13.976 0 0
#> 14 4744 ~ 5603 17.849 0 0
#> 21 4747 ~ 6300 3.397 0 0
#> 10 4741 ~ 6300 2.549 0 0
#> 15 4744 ~ 6300 10.920 0 0
#> 22 4747 ~ 5630 8.191 0 0
#> 11 4741 ~ 5630 9.045 0 0
#> 16 4744 ~ 5630 10.093 0 0
parameterEstimates(res2$fit) #rf
#> lhs op rhs VarImp lower upper
#> 1 10452 ~ 6647 0.097 0 0
#> 43 84134 ~ 6647 0.445 0 0
#> 38 596 ~ 6647 0.117 0 0
#> 17 4747 ~ 6647 0.254 0 0
#> 41 79139 ~ 6647 0.665 0 0
#> 27 5530 ~ 6647 0.338 0 0
#> 28 5532 ~ 6647 0.480 0 0
#> 29 5533 ~ 6647 0.317 0 0
#> 30 5534 ~ 6647 0.502 0 0
#> 31 5535 ~ 6647 0.383 0 0
#> 44 842 ~ 54205 0.420 0 0
#> 23 54205 ~ 581 0.569 0 0
#> 24 54205 ~ 572 0.053 0 0
#> 25 54205 ~ 596 -0.009 0 0
#> 26 54205 ~ 598 0.214 0 0
#> 45 842 ~ 317 0.133 0 0
#> 42 836 ~ 842 0.479 0 0
#> 4 1616 ~ 7132 0.060 0 0
#> 5 1616 ~ 7133 0.055 0 0
#> 6 4217 ~ 1616 0.026 0 0
#> 36 5606 ~ 4217 -0.143 0 0
#> 37 5608 ~ 4217 0.697 0 0
#> 2 1432 ~ 5606 0.132 0 0
#> 32 5600 ~ 5606 -0.037 0 0
#> 34 5603 ~ 5606 0.181 0 0
#> 39 6300 ~ 5606 0.014 0 0
#> 3 1432 ~ 5608 0.456 0 0
#> 33 5600 ~ 5608 0.110 0 0
#> 35 5603 ~ 5608 0.043 0 0
#> 40 6300 ~ 5608 0.225 0 0
#> 18 4747 ~ 1432 0.086 0 0
#> 7 4741 ~ 1432 0.250 0 0
#> 12 4744 ~ 1432 0.180 0 0
#> 19 4747 ~ 5600 0.098 0 0
#> 8 4741 ~ 5600 0.093 0 0
#> 13 4744 ~ 5600 0.167 0 0
#> 20 4747 ~ 5603 0.105 0 0
#> 9 4741 ~ 5603 0.109 0 0
#> 14 4744 ~ 5603 0.085 0 0
#> 21 4747 ~ 6300 0.025 0 0
#> 10 4741 ~ 6300 -0.025 0 0
#> 15 4744 ~ 6300 0.025 0 0
#> 22 4747 ~ 5630 0.033 0 0
#> 11 4741 ~ 5630 0.050 0 0
#> 16 4744 ~ 5630 0.054 0 0
parameterEstimates(res3$fit) #xgb
#> lhs op rhs VarImp lower upper
#> 1 10452 ~ 6647 1.000 0 0
#> 43 84134 ~ 6647 1.000 0 0
#> 38 596 ~ 6647 1.000 0 0
#> 17 4747 ~ 6647 0.263 0 0
#> 41 79139 ~ 6647 1.000 0 0
#> 27 5530 ~ 6647 1.000 0 0
#> 28 5532 ~ 6647 1.000 0 0
#> 29 5533 ~ 6647 1.000 0 0
#> 30 5534 ~ 6647 1.000 0 0
#> 31 5535 ~ 6647 1.000 0 0
#> 44 842 ~ 54205 0.694 0 0
#> 23 54205 ~ 581 0.632 0 0
#> 24 54205 ~ 572 0.104 0 0
#> 25 54205 ~ 596 0.048 0 0
#> 26 54205 ~ 598 0.216 0 0
#> 45 842 ~ 317 0.306 0 0
#> 42 836 ~ 842 1.000 0 0
#> 4 1616 ~ 7132 0.430 0 0
#> 5 1616 ~ 7133 0.570 0 0
#> 6 4217 ~ 1616 1.000 0 0
#> 36 5606 ~ 4217 1.000 0 0
#> 37 5608 ~ 4217 1.000 0 0
#> 2 1432 ~ 5606 0.527 0 0
#> 32 5600 ~ 5606 0.409 0 0
#> 34 5603 ~ 5606 0.521 0 0
#> 39 6300 ~ 5606 0.381 0 0
#> 3 1432 ~ 5608 0.473 0 0
#> 33 5600 ~ 5608 0.591 0 0
#> 35 5603 ~ 5608 0.479 0 0
#> 40 6300 ~ 5608 0.619 0 0
#> 18 4747 ~ 1432 0.300 0 0
#> 7 4741 ~ 1432 0.397 0 0
#> 12 4744 ~ 1432 0.266 0 0
#> 19 4747 ~ 5600 0.092 0 0
#> 8 4741 ~ 5600 0.142 0 0
#> 13 4744 ~ 5600 0.349 0 0
#> 20 4747 ~ 5603 0.194 0 0
#> 9 4741 ~ 5603 0.264 0 0
#> 14 4744 ~ 5603 0.202 0 0
#> 21 4747 ~ 6300 0.036 0 0
#> 10 4741 ~ 6300 0.028 0 0
#> 15 4744 ~ 6300 0.047 0 0
#> 22 4747 ~ 5630 0.115 0 0
#> 11 4741 ~ 5630 0.168 0 0
#> 16 4744 ~ 5630 0.136 0 0
parameterEstimates(res4$fit) #sem
#> lhs op rhs VarImp lower upper
#> 1 10452 ~ 6647 0.411 0 0
#> 2 84134 ~ 6647 2.776 0 0
#> 8 596 ~ 6647 -0.653 0 0
#> 24 4747 ~ 6647 5.331 0 0
#> 40 79139 ~ 6647 4.441 0 0
#> 41 5530 ~ 6647 3.094 0 0
#> 42 5532 ~ 6647 4.277 0 0
#> 43 5533 ~ 6647 -2.575 0 0
#> 44 5534 ~ 6647 3.120 0 0
#> 45 5535 ~ 6647 -4.107 0 0
#> 9 842 ~ 54205 -6.545 0 0
#> 3 54205 ~ 581 -4.883 0 0
#> 4 54205 ~ 572 0.161 0 0
#> 5 54205 ~ 596 -0.284 0 0
#> 6 54205 ~ 598 -1.702 0 0
#> 10 842 ~ 317 -2.667 0 0
#> 7 836 ~ 842 -4.841 0 0
#> 11 1616 ~ 7132 1.662 0 0
#> 12 1616 ~ 7133 0.367 0 0
#> 13 4217 ~ 1616 -1.689 0 0
#> 14 5606 ~ 4217 -0.828 0 0
#> 15 5608 ~ 4217 5.921 0 0
#> 16 1432 ~ 5606 3.182 0 0
#> 18 5600 ~ 5606 1.493 0 0
#> 20 5603 ~ 5606 2.255 0 0
#> 22 6300 ~ 5606 1.168 0 0
#> 17 1432 ~ 5608 5.564 0 0
#> 19 5600 ~ 5608 -2.371 0 0
#> 21 5603 ~ 5608 0.181 0 0
#> 23 6300 ~ 5608 -3.871 0 0
#> 25 4747 ~ 1432 2.995 0 0
#> 30 4741 ~ 1432 2.889 0 0
#> 35 4744 ~ 1432 2.522 0 0
#> 26 4747 ~ 5600 0.252 0 0
#> 31 4741 ~ 5600 -0.730 0 0
#> 36 4744 ~ 5600 -0.738 0 0
#> 27 4747 ~ 5603 2.135 0 0
#> 32 4741 ~ 5603 0.870 0 0
#> 37 4744 ~ 5603 0.588 0 0
#> 28 4747 ~ 6300 1.023 0 0
#> 33 4741 ~ 6300 0.891 0 0
#> 38 4744 ~ 6300 0.572 0 0
#> 29 4747 ~ 5630 -1.275 0 0
#> 34 4741 ~ 5630 -1.892 0 0
#> 39 4744 ~ 5630 -2.107 0 0
#Comparison of VarImp (in train data)
table(E(res1$graph)$color) #tree
#>
#> gray50 red2
#> 25 20
table(E(res2$graph)$color) #rf
#>
#> gray50 red2
#> 35 10
table(E(res3$graph)$color) #xgb
#>
#> gray50 red2
#> 25 20
table(E(res4$graph)$color) #sem
#>
#> gray50 red2 royalblue3
#> 35 5 5
#Comparison of AMSE, R2, SRMR (in test data)
print(predict(res1, data[-train, ])$PE) #tree
#> amse r2 srmr
#> 0.93770362 0.06229638 0.22693793
print(predict(res2, data[-train, ])$PE) #rf
#> amse r2 srmr
#> 0.92435872 0.07564128 0.17600941
print(predict(res3, data[-train, ])$PE) #xgb
#> amse r2 srmr
#> 0.97398929 0.02601071 0.15433519
print(predict(res4, data[-train, ])$PE) #sem
#> amse r2 srmr
#> 0.8572186 0.1427814 0.2972509
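# Optional (a sketch, not run here): cheap bootstrap 90% CIs for the
# variable importances; a small nboot (1-10) fills the (lower, upper)
# columns of parameterEstimates(), at the cost of longer run times.
# res2b<- SEMml(ig, data[train, ], algo="rf", nboot=3)
# parameterEstimates(res2b$fit)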
#...with a categorical (as.factor) outcome
outcome <- factor(ifelse(group == 0, "control", "case")); table(outcome)
#> outcome
#> case control
#> 139 21
res5 <- SEMml(ig, data[train, ], outcome[train], algo="tree")
#> Running SEM model via ML...
#> done.
#>
#> TREE solver ended normally after 25 iterations
#>
#> logL:-48.72171 srmr:0.196861
gplot(res5$graph)
table(E(res5$graph)$color)
#>
#> gray50 red2
#> 25 20
table(V(res5$graph)$color)
#>
#> pink white
#> 2 11
pred <- predict(res5, data[-train, ], outcome[-train], verbose=TRUE)
#> amse r2 srmr
#> 0.91133182 0.08866818 0.22538176
yhat <- pred$Yhat[ ,levels(outcome)]; head(yhat)
#> case control
#> 1 0.4773726 -0.4773726
#> 2 0.4773726 -0.4773726
#> 3 0.4773726 -0.4773726
#> 4 0.4773726 -0.4773726
#> 5 0.4773726 -0.4773726
#> 6 -0.6137648 0.6137648
yobs <- outcome[-train]; head(yobs)
#> [1] case case case case case case
#> Levels: case control
classificationReport(yobs, yhat, verbose=TRUE)$stats
#> pred
#> yobs case control
#> case 57 17
#> control 2 4
#>
#> precision recall f1 accuracy mcc support
#> case 0.9661017 0.7702703 0.8571429 0.7625 0.261562 74
#> control 0.1904762 0.6666667 0.2962963 0.7625 0.261562 6
#> macro avg 0.5782889 0.7184685 0.5767196 0.7625 0.261562 80
#> weighted avg 0.9079298 0.7625000 0.8150794 0.7625 0.261562 80
#> support_prop
#> case 0.925
#> control 0.075
#> macro avg 1.000
#> weighted avg 1.000
# }