P-hat Ensemble Experiment

A simulated experiment to evaluate the validity of inferences drawn from ensembling different model predictions in the test set.

See the source document above for the full post.

What is a “P-hat model” or “P-hat ensemble”?

The method evaluated in this post tests whether a set of features adds signal about an outcome above and beyond the signal in a baseline feature set. The general approach is to fit a tuned machine learning model on the baseline feature set, fit a second tuned ML model on the union of the baseline and candidate feature sets, output each model’s predicted probabilities in the test set (the P-hats), and fit a linear model regressing the outcome on those P-hat variables. If the linear model places a significant coefficient on the larger model’s predictions, this is taken to indicate that those predictions contain new information about the outcome, and one might conclude that the benefit comes from “new” signal in the additional features.
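A minimal sketch of this procedure on toy data, assuming scikit-learn for the tuned models and statsmodels for the test-set regression; the simulated data, feature splits, and choice of a cross-validated logistic regression are illustrative assumptions, not the original analysis.

```python
# Sketch of the P-hat ensemble test described above (illustrative, not the original code).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: X1 is the baseline feature set, X2 is the candidate "new" feature set.
n = 2000
X1 = rng.normal(size=(n, 5))
X2 = rng.normal(size=(n, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-X1[:, 0])))

X1_tr, X1_te, X2_tr, X2_te, y_tr, y_te = train_test_split(
    X1, X2, y, test_size=0.5, random_state=0
)

# Model A: tuned (regularized) model on the baseline features only.
model_a = LogisticRegressionCV(cv=5).fit(X1_tr, y_tr)
# Model B: tuned model on the union of baseline and candidate features.
model_b = LogisticRegressionCV(cv=5).fit(np.hstack([X1_tr, X2_tr]), y_tr)

# Test-set predicted probabilities (the "P-hats").
p_hat_a = model_a.predict_proba(X1_te)[:, 1]
p_hat_b = model_b.predict_proba(np.hstack([X1_te, X2_te]))[:, 1]

# Linear model regressing the outcome on both sets of P-hats.
design = sm.add_constant(np.column_stack([p_hat_a, p_hat_b]))
phat_model = sm.OLS(y_te, design).fit()

# A significant coefficient on p_hat_b (x2 in the summary) is read as evidence
# that the larger model's predictions add "new" signal -- the inference this
# post's simulations call into question.
print(phat_model.summary())
```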

What simulations/experiments are used to test the validity of this procedure?

We consider two additional sources of variability not accounted for by the standard errors in the P-hat model: noise in the sample and noise in the features. The two experiments are roughly as follows (a rough code sketch of the simulation loop appears after the list):

  1. Bootstrapping: Tune/fit many regularized models on bootstrapped samples of the training set and repeatedly use the P-hat modeling method to test for “new” signal contained in different predictors.
  2. Adding Noise Features: Generate features that contain no additional signal about the outcome many times, tune/fit a regularized model, and repeatedly use the P-hat modeling method to test whether the generated features contain “new” signal about the outcome.
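A sketch of the second experiment under the same illustrative setup as above; the bootstrap experiment follows the same loop but refits both models on bootstrap resamples of the training set instead of generating noise features. The data sizes, number of repetitions, and model choices here are assumptions, not the original configuration.

```python
# Sketch of the noise-feature experiment (experiment 2); illustrative only.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)

# Fixed data with real signal in X1 only.
n = 2000
X1 = rng.normal(size=(n, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-X1[:, 0])))
train, test = slice(0, n // 2), slice(n // 2, n)

# Baseline model is fit once on the real features.
model_a = LogisticRegressionCV(cv=5).fit(X1[train], y[train])
p_hat_a = model_a.predict_proba(X1[test])[:, 1]

significant = []
for _ in range(200):
    # Generate features that contain no additional signal about the outcome.
    X_noise = rng.normal(size=(n, 5))
    X_full = np.hstack([X1, X_noise])

    # Tune/fit a regularized model on baseline + noise features.
    model_b = LogisticRegressionCV(cv=5).fit(X_full[train], y[train])
    p_hat_b = model_b.predict_proba(X_full[test])[:, 1]

    # P-hat model: regress the outcome on both sets of test-set predictions.
    design = sm.add_constant(np.column_stack([p_hat_a, p_hat_b]))
    fit = sm.OLS(y[test], design).fit()

    # Record whether the coefficient on the noise-augmented model's P-hats
    # is "significant" at the 5% level.
    significant.append(fit.pvalues[2] < 0.05)

# If the inference were well calibrated, this rate should be close to 5%.
print("Rejection rate:", np.mean(significant))
```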

Key Findings:

We find that variability in the model-fitting process, coming both from the underlying sample and from noise in the feature set, leads a linear ensemble of the predicted probabilities to place significant coefficients on multiple sets of predictions more than 5% of the time. As a result, a significant coefficient on a set of predictions does not justify concluding that the features behind those predictions carry additional signal about the outcome. Instead, the regularization benefit of taking a weighted average of multiple predictor models in the test set appears to be meaningfully large, even when the individual models are tuned to avoid overfitting. One takeaway might be to take more aggressive steps to reduce model variance, but the bar may simply be too high to avoid making invalid inferences with this P-hat modeling procedure. See the source document for the full results.

Logan Crowl
Data Scientist

Data Scientist at the University of Chicago’s Crime Lab New York and Center for Applied AI