MACHINE LEARNING VIA H2O: ENSEMBLE DECISION TREES

With the new h2oml suite, use machine learning via H2O to uncover insights from data when traditional statistical models fall short. Use ensemble decision trees—gradient boosting machine (GBM) and random forest—to perform classification or regression. Tune hyperparameters, use validation or cross-validation (CV), evaluate model performance, explain predictions, and more.

 

Stata users have long relied on linear regression, logistic regression, and other traditional statistical models to uncover insights from their data. In many applications, however, relationships between responses and predictors are complex and nonlinear, and these classical methods may fall short in capturing more intricate data patterns.

  • What if your predictors interact in ways linear (or logistic) regression models cannot capture?
  • What if your model’s accuracy plateaus despite careful variable selection?
  • What if you need models that are robust to missing data and multicollinearity while also generalizing well beyond the scope of your observed data?
  • And best of all, what if you could achieve all of the above without sacrificing the explainability of predictions for predictive power?

 

This is where GBMs and random forests revolutionize the way you analyze your data in Stata. With seamless access to H2O’s machine learning algorithms from within Stata, you can now harness the power of high-performance predictive models without leaving your familiar Stata environment. Simply use commands with intuitive Stata syntax to train sophisticated ensemble learning models that outperform traditional techniques. Think machine learning is a “black box”? Not anymore! With tools such as Shapley additive explanations (SHAP) values, partial dependence plots (PDPs), and variable importance rankings, GBM and random forest provide powerful predictions while maintaining explainability—no tradeoffs needed.

 

Let’s quickly look at one possible workflow below. Also see the summary of all commands and features in Commands for Stata integration with H2O machine learning and more examples in Let’s see it work.

 

  1. Setup and data preparation

Initialize an H2O cluster, and import your current Stata dataset into an H2O frame named data:

 

 

Split data into the training (80%) and validation (20%) frames:

 

 

  2. Reference (baseline) model

Fit a GBM model to a binary response using default hyperparameters:

 

 

Or use 3-fold CV instead of a validation frame:

 

 

Store the reference (baseline) model:

 

 

  3. User-specified hyperparameters and tuning

Specify different values for any of the 11 hyperparameters. For example, specify 200 trees and a learning rate of 0.2:

 

 

Store the model with user-specified hyperparameters:

 

 

Perform hyperparameter tuning on the number of trees and learning rate:

 

 

Store the best-tuned model:

 

 

Customize tuning settings by specifying accuracy as the tuning metric and using the random grid-search method:

 

 

  4. Evaluate model performance

Check for overfitting or underfitting using score history plots:

 

 

Assess CV performance:

 

 

View grid summary for configurations of hyperparameters ranked by model performance:

 

 

Compare the top 10 models using various metrics:

 

 

Manually select a specific model:

 

 

Evaluate model metrics:

 

 

  5. Compare different methods

Suppose you have repeated steps 2 through 4 to select the best random forest model rf_tuned. You may compare the GBM model gbm_tuned with the random forest model rf_tuned as follows:

 


  6. Make predictions

Use the best model to make predictions for the data frame data:

 


  7. Explain model results

Assess variable importance:

 


Produce a PDP and an individual conditional expectation (ICE) plot:

 


Analyze SHAP values for predictor contributions:

 


Commands for Stata integration with H2O machine learning

 

Supervised learning
h2oml gbregress           Gradient boosting regression
h2oml gbbinclass          Gradient boosting binary classification
h2oml gbmulticlass        Gradient boosting multiclass classification
h2oml rfregress           Random forest regression
h2oml rfbinclass          Random forest binary classification
h2oml rfmulticlass        Random forest multiclass classification

Estimation results and postestimation frame
h2omlest                  Catalog H2O estimation results
h2omlpostestframe         Specify frame for postestimation analysis

Tuning and estimation summaries
h2omlestat metrics        Display performance metrics
h2omlgof                  Goodness of fit for machine learning methods
h2omlestat cvsummary      Display CV summary
h2omlestat gridsummary    Display grid-search summary
h2omlexplore              Explore models after grid search
h2omlselect               Select model after grid search
h2omlgraph scorehistory   Produce score history plot

Performance after binary classification
h2omlestat threshmetric   Display threshold-based metrics
h2omlestat confmatrix     Display confusion matrix
h2omlgraph prcurve        Produce precision–recall curve plot
h2omlgraph roc            Produce receiver operating characteristic (ROC) curve plot

Performance after regression
h2omlgraph rvfplot        Produce residual-versus-fitted plot
h2omlgraph rvpplot        Produce residual-versus-predictor plot

Predictions
h2omlpredict              Prediction of continuous responses, probabilities, and classes

Machine learning explainability
h2omlgraph varimp         Produce variable importance plot
h2omlgraph permimp        Produce permutation importance plot
h2omlgraph pdp            Produce partial dependence plot
h2omlgraph ice            Produce individual conditional expectation plot
h2omlgraph shapvalues     Produce SHAP values plot for individual observations
h2omlgraph shapsummary    Produce SHAP beeswarm plot

Save decision tree
h2omltree                 Save decision tree DOT file and display rule set

 


EXAMPLE DATASET: TELCO CUSTOMER CHURN DATASET

To demonstrate the h2oml suite, we focus on a GBM for binary classification. Other models such as random forest or GBM for regression have similar syntax and workflow. We’ll analyze data from a fictional telecommunications company called Telco, which offers home phone and internet services in California. The dataset, provided by IBM, contains information on 7,043 customers with 26 variables. Our main objective is to develop a predictive model that can identify which customers are at risk of churning—meaning they might terminate their services with Telco. Our binary response, churn, indicates whether a customer discontinued service in the past month or remained with Telco.

 

The predictors (features) include customer demographics (such as gender and age); account details (such as contract type and tenure length); and service subscriptions (such as internet, phone, and security service).

 

We want to predict customer behavior patterns that signal potential churn, which could help Telco take proactive steps to retain valuable customers.

 

PREPARE YOUR DATA FOR H2O MACHINE LEARNING IN STATA

We first load our dataset and initiate an H2O cluster.

 

 

We then put the current dataset into an H2O frame, churn, and make it the current H2O frame.

 

 

We split our current frame into training and testing frames with 80% of observations in the training frame. Later on, we will use CV on the training frame during estimation to control for overfitting.

 

 

For convenience, we create a global macro, predictors, to store the names of predictors.

 

 

REFERENCE GBM MODEL

We fit a GBM model with 3-fold stratified CV (SCV) using default values for other settings and h2orseed(19) for reproducibility. SCV is particularly useful for classification tasks with imbalanced classes (levels of churn).

 

 

The header provides model details, showing that h2oml gbbinclass uses Bernoulli loss (other loss functions are available for regression with h2oml gbregress). The training frame contains 5,643 observations, and we use 3-fold SCV on it. The hyperparameters section (Model parameters) reports both user-specified and actual values used by the algorithm for hyperparameters, which may differ if early stopping is specified.
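As a side note on the 3-fold SCV described above, the following Python sketch illustrates the idea of stratification: each fold receives roughly the same share of each class. It is a deterministic toy version written for illustration only; the function name stratified_folds and the small label list are hypothetical, and H2O's actual fold assignment is randomized and may differ in detail.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign each observation a fold index 0..k-1, dealing out each class round-robin."""
    folds = [None] * len(labels)
    seen = defaultdict(int)  # number of observations of each class seen so far
    for i, y in enumerate(labels):
        folds[i] = seen[y] % k
        seen[y] += 1
    return folds

# Hypothetical imbalanced response: 9 "No" and 3 "Yes"
labels = ["No"] * 9 + ["Yes"] * 3
folds = stratified_folds(labels, k=3)

# Count the classes that landed in each fold
per_fold = [Counter(y for y, f in zip(labels, folds) if f == j) for j in range(3)]
for j, counts in enumerate(per_fold):
    print(j, dict(counts))
```

With these toy labels, every fold ends up with the same 3-to-1 class ratio as the full data, which is what makes the per-fold metrics comparable when classes are imbalanced.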

 

The metric summary table presents binary classification performance metrics for the training and CV data. We will rely on the area under the precision–recall curve (AUCPR) as the metric of choice because we have imbalanced classes. AUCPR ranges from 0 to 1, with 1 meaning perfect performance. Although CV metrics are our main focus, we also check training metrics to confirm that any overfitting is only slight and that the model is not underfitting. A positive difference between training AUCPR and CV AUCPR (0.8024 − 0.6585 = 0.1439) is expected, but a large gap may suggest overfitting, meaning the model may not generalize well to new data.
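The gap between training and CV performance quoted above is simple arithmetic. A minimal Python sketch, using only the two AUCPR values reported in the text (the variable names are ours, not part of h2oml):

```python
train_aucpr = 0.8024  # training AUCPR reported in the text
cv_aucpr = 0.6585     # cross-validation AUCPR reported in the text

# Positive gap: the model fits the training data better than held-out folds
gap = train_aucpr - cv_aucpr
print(f"train - CV AUCPR gap: {gap:.4f}")
```

How large a gap counts as worrying is a judgment call that depends on the application; the sketch only reproduces the subtraction.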

 

We store the reference estimation results for later comparison:

 

 

Assessing metric variability across the three folds ensures model performance is not tied to a specific data split. High variation in CV metrics may indicate poor generalization to new data. This can be done by using the h2omlestat cvsummary command; see example 2 of [H2OML] h2oml and [H2OML] h2omlestat cvsummary.
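To make the idea of fold-to-fold variability concrete, here is a small Python sketch that summarizes a metric across three folds. The per-fold AUCPR values are hypothetical and are not taken from this example's output; the sketch only shows the kind of summary (mean and standard deviation) one would inspect.

```python
from statistics import mean, stdev

# Hypothetical per-fold AUCPR values (not from the article's output)
fold_aucpr = [0.65, 0.67, 0.66]

avg = mean(fold_aucpr)
sd = stdev(fold_aucpr)   # sample standard deviation across folds
ratio = sd / avg         # spread relative to the mean

print(f"mean={avg:.4f} sd={sd:.4f} sd/mean={ratio:.3f}")
```

A standard deviation that is small relative to the mean suggests the metric does not depend much on which fold was held out.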

 

MODEL SELECTION AND HYPERPARAMETER TUNING

Our baseline model’s CV AUCPR was 0.6585. To improve model performance, we will tune hyperparameters. GBM has 11 tunable parameters, but we will tune only ntrees() (number of trees) and predsamprate() (predictor sampling rate) within a small search space for simplicity. Hyperparameter tuning is an iterative procedure, and this example illustrates only the tuning specification.

 

 

The header displays tuning details, including the tuning method (Cartesian), tuning metric (AUCPR), and grid-search ranges for hyperparameters. The selected values, 100 for ntrees() and 0.15 for predsamprate(), correspond to the best-performing model. The rest of the output presents the hyperparameter values and the metric summary for the optimal model.
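A Cartesian grid search evaluates every combination of the candidate values. The Python sketch below enumerates such a grid; the candidate lists are hypothetical (the actual search ranges used in this example are not reproduced here), although they include the selected values of 100 trees and a predictor sampling rate of 0.15.

```python
from itertools import product

# Hypothetical candidate values for the two tuned hyperparameters
ntrees_grid = [20, 50, 100]
predsamprate_grid = [0.15, 0.3, 0.5, 1.0]

# Cartesian search: every (ntrees, predsamprate) pair is one configuration
grid = list(product(ntrees_grid, predsamprate_grid))

print(len(grid), "configurations")
for ntrees, rate in grid[:3]:
    print(ntrees, rate)
```

Because the number of configurations is the product of the list lengths, the grid grows quickly as more hyperparameters or values are added, which is one reason a random grid search can be preferable for larger search spaces.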

 

By tuning, we increased the CV AUCPR from 0.6585 to 0.6739. The improvement is small because we explored only a small portion of the hyperparameter space in this example.

 

Let’s store this best-tuned model for later use:

 

 

We may obtain the grid-search summary by using the h2omlestat gridsummary command. This command lists the configurations of the hyperparameters we are tuning ranked by AUCPR.

 

 

We may wish to compare the first two models based on other metrics by using the h2omlexplore command:

 

 

If we prefer a model other than the best-tuned one (perhaps a model with slightly worse performance but fewer trees), we can select it via the h2omlselect command; see [H2OML] h2omlselect.

 

Let’s compare the best model, gbm_tuned, with the reference model from the previous section, gbm_default, based on other metrics by using the h2omlgof command.

 

 

The output shows training results followed by CV results. Looking at the CV results, we see that tuning improved performance across all metrics: lower log loss, mean class error, MSE, and RMSE and higher AUC, AUCPR, and Gini coefficient. This indicates that the tuned model has better model performance.

 

We may also refine the list of predictors based on variable importance:

 

 

Based on the above graph, we may decide to drop the predictor onlinebackup.

 

METHOD SELECTION: GBM VERSUS RANDOM FOREST

Suppose we trained a random forest for binary classification using h2oml rfbinclass, following the same steps as before. For simplicity, suppose that after hyperparameter tuning, the best model is

 

 

Now we compare the tuned random forest method (rf_tuned) with the tuned GBM method (gbm_tuned) using the test frame (the testing frame created during the frame split). We use the h2omlpostestframe command to specify the name of the frame, test in our case, to be used by all subsequent postestimation commands for computations for both the rf_tuned and the gbm_tuned results:

 

 

© Copyright 1996–2026 StataCorp LLC. All rights reserved.

Instead of listing the performance metrics on the test frame separately for each method using h2omlestat metrics, we use h2omlgof to show the results side by side:

 

 

GBM outperforms random forest because it has a higher AUCPR, making it the preferred method. We can further compare the methods using ROC curves, where a curve closer to the upper-left corner indicates better performance. See [H2OML] h2omlgraph roc.

 

 

Based on the ROC results, as we expected, the GBM method slightly outperforms the random forest method.

 

Another popular approach to compare classification predictions between different methods and models is by using a confusion matrix, which reports the numbers of correctly and incorrectly predicted outcomes. See [H2OML] h2omlestat confmatrix and example 4 of [H2OML] h2oml.

 

PREDICTION ON NEW DATA

Suppose the company collected new data stored in newchurn.dta. It wants to predict the probability of churn for these new customers based on the GBM model gbm_tuned. Let’s read the new dataset as an H2O frame newchurn.

 

 

We use h2omlpredict to predict churn probabilities and classes. By default, it predicts classes (Yes or No); to get probabilities, specify the pr option. In the previous section, we set test as the postestimation frame via h2omlpostestframe. So, by default, h2omlpredict will use the test frame for predictions. To obtain predictions for the new dataset, specify frame(newchurn). Below, we predict both classes and probabilities using the GBM model gbm_tuned.

 

 

The generated variables for the classes (churnhat) and class probabilities (churnprob1 and churnprob2) are stored in the newchurn frame because we specified frame(newchurn). Let’s list the predicted classes and probabilities.

 

 

For example, churnprob2 (abbreviated to churnp~2) shows a 22% chance of churn for the first customer and 78% for the second. The predicted class (Yes or No) in churnhat is assigned based on whether churnprob2 exceeds the default F1-optimal threshold of 0.2378. This value is obtained using h2omlestat threshmetric, which displays thresholds that optimize various metrics. To use a custom threshold, specify it with the threshold() option in h2omlpredict.
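The thresholding rule described above can be sketched in a few lines of Python. This is an illustration rather than what h2omlpredict does internally: the function name predict_class is ours, the probabilities are the rounded values quoted in the text plus one hypothetical value (0.30), and treating a probability exactly equal to the threshold as No is an assumption.

```python
F1_THRESHOLD = 0.2378  # default F1-optimal threshold quoted in the text

def predict_class(prob_yes, threshold=F1_THRESHOLD):
    """Return "Yes" when the churn probability exceeds the threshold, else "No"."""
    return "Yes" if prob_yes > threshold else "No"

# Rounded churn probabilities for the first two customers, as quoted in the text
probs = [0.22, 0.78]
classes = [predict_class(p) for p in probs]
print(classes)  # ['No', 'Yes']

# A hypothetical customer with probability 0.30 flips class under a custom threshold
print(predict_class(0.30), predict_class(0.30, threshold=0.5))  # Yes No
```

The second call shows why the choice of threshold matters: the same predicted probability can map to different classes depending on the cutoff.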

 

EXPLAINING PREDICTIONS

One of the key challenges in machine learning is understanding why a model makes specific predictions. Explainability ensures that predictions are not only accurate but also interpretable and justifiable.

 

Global methods describe the average behavior of a machine learning model. Examples include the following:

  • Variable importance
  • Global surrogate models (simple interpretable models approximating machine learning predictions)
  • PDPs

 

Local methods explain individual predictions by approximating the model’s behavior for a single observation. Examples include the following:

  • ICE curves
  • SHAP values

 

GLOBAL EXPLAINABILITY METHODS

We have already seen an example of a variable importance graph; therefore, we focus on building a surrogate model here. We start by restoring the best GBM model (gbm_tuned) trained earlier.

 


Next we use the GBM model to predict churn for the entire frame, rather than just the testing frame, by specifying the frame(churn) option:

 


To improve explainability, we build a global surrogate model using, for example, a classification tree to approximate the predictions churnhat from model gbm_tuned.

 

First, we switch the working frame to the full churn frame:

 


Then, for illustration, we use a single decision tree (ntrees(1)) with maximum depth of 3 (maxdepth(3)) as a global surrogate model:

 

 

This classification tree serves as a simplified approximation of the complex GBM model, making the predictions more interpretable while retaining useful insights.

 

A classification tree is easier to interpret visually. Using the dotsaving() option of the h2omltree command, we can generate a DOT file that can be plotted using Graphviz for better visualization; see https://graphviz.org, [H2OML] DOT extension, and [H2OML] h2omltree.

 

 

The terminal node values represent the probability of a customer not churning (No). For example, customers with one- or two-year contracts who either have no internet service or use online backup and security have the highest probability (0.997) of staying with the company. This means their churn probability is only 0.003 (1 − 0.997), making them the least likely to leave.

 

Next we analyze how important predictors (identified by h2omlgraph varimp earlier) influence churn. We use PDPs, a global explainability method that shows the marginal effect of selected predictors on predictions. We first restore the GBM model’s estimation results to ensure that subsequent postestimation commands apply to the best GBM model (gbm_tuned).

 


We use h2omlpostestframe with the notest option to set the churn frame as the active frame for postestimation analysis without treating it as a testing frame.

 


We then generate PDPs for key predictors:

 

 

The PDP pattern (red line in the plot) agrees with the results from the surrogate tree. For instance, the probability of churning (shown on the y axis) decreases for customers with a one- or two-year contract (contract) and for customers who use the company’s services longer (tenuremonths); see [H2OML] h2omlgraph pdp.

 

LOCAL EXPLAINABILITY METHODS

For local explainability, we use SHAP values, which estimate each predictor’s contribution to an individual prediction. SHAP values help explain why specific customers are predicted to churn, making machine learning decisions more transparent.

 

We now use h2omlgraph shapvalues to produce SHAP values for observation 19 (a female customer who used a month-to-month contract service for 9 months and has both the observed churn and predicted churnhat values of Yes) for the 10 predictors with the largest SHAP importance.

 

 

Blue bars indicate predictors that increase churn probability, whereas red bars indicate those that reduce it. The SHAP values agree with the previous findings. In particular, a month-to-month contract, a small value of tenuremonths, and not using online security services contribute positively to this particular customer’s churning. Note that SHAP values for binary classification are reported on the logit scale. For instance, the raw prediction f(x) = 0.2063 is the logit of the predicted probability that observation 19 will churn. You must use the inverse logit transformation to interpret these values as probabilities; see [H2OML] h2omlgraph shapvalues.
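The inverse logit transformation mentioned above is p = 1 / (1 + exp(−x)). A minimal Python sketch applying it to the raw prediction quoted in the text (the function name invlogit is ours):

```python
import math

def invlogit(x):
    """Map a value on the logit scale to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

raw = 0.2063           # raw prediction f(x) for observation 19, from the text
prob = invlogit(raw)
print(f"{prob:.4f}")   # 0.5514
```

So a raw prediction of 0.2063 on the logit scale corresponds to a predicted churn probability of about 0.55. Individual SHAP contributions add up on the logit scale, so they should be summed first and transformed afterward rather than converted one by one.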

 

The SHAP summary (beeswarm) plot visualizes SHAP values across all observations, showing both predictor importance and influence on the response. For illustration, we plot the top 4 most important predictors:

 

 

Predictors on the y axis are ranked by SHAP importance (largest absolute SHAP values first). Smaller normalized predictor values of contract (month to month versus one year versus two years), shown in blue, are associated with positive SHAP values, meaning that shorter contracts make customers more likely to leave (increase the probability of churning); see [H2OML] h2omlgraph shapsummary.

 

SHUTTING DOWN THE H2O CLUSTER

After you are done with your analysis, disconnect from the H2O cluster by using

 


This closes the Stata-H2O session but keeps the cluster running in the background. You can reconnect later with

 


To fully shut down the cluster and delete all resources, use