1 Introduction

In decision modeling and prediction tasks with many features, it is common to consider the amount of signal coming from a particular feature or set of features. For example, if we were looking to model judge decision-making using court records, we might wonder which types of information from the court records are relevant to understanding the final decisions. Suppose we have demographic information on each defendant as well as their conviction history and are looking to determine the amount of signal about judicial decisions coming from demographics above and beyond the signal coming from conviction history. Assuming we have relatively few total features, we might fit a linear model, so that we can estimate the association between demographic variables and the judge’s decision, holding conviction history constant.

But suppose we have a large number of variables for both demographics and criminal history, and each variable is extremely granular (for example, we might have many variables for each type of prior conviction at varying time windows). We now want to apply regularization to the full set of features to avoid overfitting, so we split our data into train and test sets and then tune and fit two LASSO models: one on just the conviction history features and one on the conviction history features plus the demographic variables. Next, we output the predicted probabilities from each model in the test set and measure performance using the area under the receiver operating characteristic curve (ROC AUC). In theory, the difference in ROC AUC between the two models is the added signal coming from the demographic information.
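As a rough sketch of that comparison (a hypothetical setup, not the post’s actual code, with assumed arrays `X_history`, `X_demo`, and binary outcome `y`; the LASSO here is an L1-penalized logistic regression tuned by cross validation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def lasso_test_auc(X, y, idx_train, idx_test):
    """Tune an L1-penalized logistic regression by CV and score it on the test split."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(Cs=25, cv=10, penalty="l1", solver="saga",
                             scoring="roc_auc", max_iter=5000),
    )
    model.fit(X[idx_train], y[idx_train])
    return roc_auc_score(y[idx_test], model.predict_proba(X[idx_test])[:, 1])

def added_signal(X_history, X_demo, y, seed=0):
    """Naive 'added signal' from demographics: the difference in test-set ROC AUC."""
    idx_train, idx_test = train_test_split(np.arange(len(y)), test_size=0.3, random_state=seed)
    auc_small = lasso_test_auc(X_history, y, idx_train, idx_test)
    auc_large = lasso_test_auc(np.hstack([X_history, X_demo]), y, idx_train, idx_test)
    return auc_large - auc_small
```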

This type of modeling exercise is attractive because it allows us to utilize any type of machine learning model, so we can extract as much predictive information as possible and work with data types like images that aren’t accessible to traditional statistical models. Unfortunately, it can be difficult to gauge what constitutes a meaningful jump in predictive signal. For example, factors like race are known to have meaningful real-world impacts on outcomes, but frequently increase performance metrics like ROC AUC by seemingly small amounts. (In the prediction task described below, race by itself gets us an ROC AUC of just 0.509, where 0.5 is achieved through random guessing.) As a result, we need a more sensitive method for determining what constitutes significant new signal, preferably one with a solid theoretical basis for inference. We find a potential solution in an old friend: linear regression.

The approach that this post evaluates is loosely called “P-hat modeling” or “P-hat ensembling” because it takes the predicted probabilities from a set of different predictors and ensembles them in the test set using a linear model. That linear model is then used to identify which underlying predictors contain meaningful additional information about the desired outcome. In our previous example, we would regress the judge’s decision on two P-hat variables in the test set: the predicted probabilities from the model trained just on conviction history and the predicted probabilities from the model trained on both conviction history and demographic information. A significant coefficient on the predicted probabilities from the larger model provides some evidence that the demographic information is adding meaningful signal above and beyond what is contained in the conviction history features. The experiments presented in this post use simulation to evaluate whether or not that type of inference is appropriate. If we can regularly achieve significant coefficients in these P-hat models by only adding noise to the underlying predictors, that would indicate that the value of a new predictor is more than just the signal its features contain about the outcome.

The motivation for this evaluation comes from a pair of papers by Mentch & Zhou (2020) and Kobak, Lomond & Sanchez (2020), as well as guidance from Gregory Stoddard. The first paper details simulated and real-world results where adding randomly generated noise features to a dataset can improve the performance of random forest models. The second paper describes how, under realistic conditions, an optimal linear model trained on a high-dimensional dataset (\(n \ll p\)) often does better with no ridge penalty or a negative penalty than with a positive one. Both papers challenge the appealing and intuitive explanation that the predictive value gained from including additional predictors comes from the signal those variables provide about the outcome. Instead, it seems the regularizing effect of adding noise to the modeling process may be responsible for a sizable portion of the predictive value derived from including additional features in a model. This post seeks to provide evidence on the degree to which regularization and variability in the model fitting process contribute to the results of these “P-hat models.”

2 Defining the “P-Hat Model”

“P-hat Model” or “P-hat ensemble” describes the procedure of fitting multiple predictors on a training set, extracting the predicted probabilities (\(\hat{p}\) or “P-hat”) for the test set, and regressing the outcome of interest on the P-hat variables in the test set. After fitting this linear model, inferences about the predictive value of the underlying models are frequently made using the following conditions:

  1. Does the linear model place a “significant” slope coefficient on a P-hat variable (p-value < 0.05)?
  2. Does the adjusted R-Squared for the linear model increase when you include a particular P-hat variable or a set of P-hat variables?

Generally, the significance or lack of significance of a coefficient associated with a particular P-hat set is used to determine whether that underlying predictor is providing signal about the outcome not captured by the other predictor models. Similarly, if the adjusted R-squared increases after adding a new P-hat variable, that would indicate that those predicted probabilities help explain a larger share of the variability in the outcome than the original set alone adjusting for the number of predictors. Therefore, an increase in adjusted R-squared is also frequently interpreted as an indication that the underlying predictor contains meaningful new information about the outcome.
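A minimal sketch of that ensemble step, assuming a test-set matrix `phats` whose columns are the predicted probabilities from the underlying models and a binary outcome `y_test` (hypothetical variable names; statsmodels is used here for the p-values and adjusted R-squared):

```python
import numpy as np
import statsmodels.api as sm

def phat_ensemble(phats, y_test, alpha=0.05):
    """Linear probability model of the outcome on the P-hat columns (condition 1)."""
    X = sm.add_constant(np.asarray(phats))
    fit = sm.OLS(np.asarray(y_test, dtype=float), X).fit()
    significant = fit.pvalues[1:] < alpha          # non-intercept P-hat slopes
    return fit, significant

def adj_r2_gain(phats, y_test, j=0):
    """Condition 2: change in adjusted R-squared from adding the other P-hat columns to column j."""
    full, _ = phat_ensemble(phats, y_test)
    single, _ = phat_ensemble(np.asarray(phats)[:, [j]], y_test)
    return full.rsquared_adj - single.rsquared_adj
```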

With respect to conditions 1 and 2, it is worth noting that the adjusted R-squared for a model increases (decreases) exactly when the F-statistic for testing that the new coefficients are all zero is greater (less) than one. When only one coefficient is added, the adjusted R-squared increases or decreases based on whether the t-statistic for that variable is above or below 1 in magnitude (proof). As a result, there is a fundamental relationship between the “significance” of a coefficient (p-value associated with the t-statistic < 0.05) and its impact on the adjusted R-squared of the model.
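For reference, with \(n\) test-set observations and \(p\) slope coefficients, the standard definition is

\[
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},
\]

and adding \(q\) new P-hat variables increases \(\bar{R}^2\) exactly when the F-statistic for testing that their \(q\) coefficients are all zero exceeds 1 (for \(q = 1\), when the t-statistic exceeds 1 in magnitude).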

In this exercise, we will explore the frequency with which one can trigger conditions 1 or 2 by just adding noise to the underlying model fitting process. We will put the procedure to the test using two different experiments: first by bootstrapping our sample and second by adding randomly-generated noise features. These experiments are specifically designed to test the importance of two sources of variability that the P-hat ensembles do not consider. The standard errors estimated by the P-hat ensembles take into account the variability from the sample size in the test set, but they do not consider the variability coming from the underlying model fitting procedure, including the variability from the specific sample in the training set and the noise in the set of features. If those sources of variability are relatively unimportant to the outputted probabilities compared to the true underlying signal, then conditions (1) or (2) are likely appropriate for concluding that a new set of features contains additional signal about the outcome. If, however, conditions (1) or (2) are frequently met by just adding noise to the fitting process, it would indicate that the P-hat ensemble approach is not appropriate for drawing conclusions about the signal contained in the underlying feature sets. Instead, there is likely some inherent value in ensembling multiple P-hats in the test set for the sake of regularizing any one set of predictions and this effect bleeds into the inferred value of the underlying predictors.

3 The Data

For the following experiments, we begin with US census data on employment in Iowa from 2018. The dataset itself comes from the Folktables package in Python (and associated paper), a set of datasets derived from the US census for the explicit purpose of benchmarking machine learning algorithms. There are five main prediction tasks defined by the package, and we focus on predicting whether or not an individual is employed (restricted to those between 16 and 90 years old). We subset to records from Iowa in 2018 for no particular reason other than it provides a reasonable number of real-world records (26,365) to use in a repeated ML pipeline and a realistic number of features (16). We then split this dataset into a training and testing set (70%-30%) for the purpose of model evaluation. The table below shows a summary of the full dataset:

Summary Statistics of Iowa ACS Employment Data from 2018
16-90 year olds
Outcome of Interest
Number of individuals 26,365
Percent employed 59.9%
Demographics
Percent female 50.9%
Average age 49.9
Percent white 94.6%
Percent Black 1.8%
Percent Asian 1.5%
Personal Information
Percent married 55.3%
Percent born in US 95.7%
Percent not a US citizen 2.2%
Percent active or former military 98.4%
Percent with associate's degree (max) 11.4%
Percent bachelor's degree (max) 16.2%
Percent with graduate degree 6.6%
Percent who moved in past year 13.3%
Familial Background
Percent with multiple ancestries 28.3%
Percent foreign born 96.2%
Percent grandparents living with grandchildren 1.3%
Disability Information
Percent disabled 17.1%
Percent with difficulty hearing 5.7%
Percent with vision difficulty 2.7%
Percent with cognitive difficulty 6.3%
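For reference, a minimal sketch of pulling and splitting this task, assuming Folktables’ documented ACSDataSource/ACSEmployment interface:

```python
# Sketch of loading the 2018 Iowa ACS employment task with Folktables.
from folktables import ACSDataSource, ACSEmployment
from sklearn.model_selection import train_test_split

data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
ia_data = data_source.get_data(states=["IA"], download=True)

# The ACSEmployment task restricts to 16-90 year olds and labels whether each person is employed.
features, label, _group = ACSEmployment.df_to_numpy(ia_data)

# 70%-30% train-test split used throughout the experiments.
X_train, X_test, y_train, y_test = train_test_split(
    features, label, test_size=0.3, random_state=0
)
```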

In general, we should note that the specific dataset used for this experiment is not the main focus of this post and many different datasets would have sufficed. Using a real-world dataset is important because it contains realistic variability, correlation structures, and signal-to-noise ratios, but we are not interested in drawing any specific inferences about employment in Iowa. In fact, it is worth noting that this dataset was not selected for any particular characteristics it possessed besides number of rows and columns. The results of these experiments may depend on the underlying dataset used and the size of that dataset, but we are simply picking a dataset that’s approximately the same size as previous applications of the P-hat modeling procedure to serve as a potential counterexample to the validity of the method.

4 Bootstrapping/Bagging Experiment

The first experiment for testing the robustness of the P-hat modeling procedure uses the underlying noise from the training set sample. We bootstrap the training set 500 times, fit LASSO models tuned using 10-fold cross validation to determine an optimal lambda penalty, output the predicted probabilities in the test set, and randomly select some number of the P-hat sets from the different bootstrapped predictors to ensemble in a linear model. We repeat this random draw and ensemble step 250 times to get a sense of the frequency with which conditions 1 and 2 are triggered. We also perform this experiment with three different feature sets – one small, one medium, and one large. The small feature set includes just demographic variables; the medium feature set includes demographics, familial background variables, and disability variables; and the large feature set includes demographics, familial background, disability information, and personal information variables. All categorical variables are replaced with dummies and all numerical variables are centered and scaled. Here are the steps:

  1. Draw a sample with replacement from the training set of the same size as the training set (bootstrap).
  2. Select 10 cross-validation folds from that bootstrapped training set.
  3. Tune the LASSO lambda penalty across a 25-value grid.
  4. Select the penalty with the highest cross-validated ROC AUC.
  5. Fit a LASSO model with the preferred penalty on the full bootstrapped training sample.
  6. Output the predictions in the test set.
  7. Repeat steps 1-6 500 times.
  8. Randomly select \(k\) P-hat variables.
  9. Fit a linear probability model regressing employment status on the selected P-hat sets in the test set.
  10. Record the number of significant non-intercept coefficients.
  11. Measure the difference in adjusted R-squared between the linear model with all selected P-hat variables and a linear model with just one P-hat variable.
  12. Repeat steps 8-11 250 times for each \(k = 2, 3, 5, 10, 20, 40, 80\).
  13. Repeat steps 1-12 with a small feature set, a medium feature set, and a large feature set.
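Steps 1 through 11 can be condensed into two helpers, sketched below under assumed inputs (numpy arrays `X_train`, `y_train`, `X_test`, `y_test`; the LASSO is an L1-penalized logistic regression). The full experiment repeats the ensemble draw 250 times for each \(k\) and feature set, reusing the same 500 bootstrap prediction columns.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def fit_bootstrap_phats(X_train, y_train, X_test, n_boot=500):
    """Steps 1-7: bootstrap the training set, tune/fit a LASSO, predict on the test set."""
    n = len(y_train)
    phats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                       # step 1: bootstrap sample
        model = make_pipeline(
            StandardScaler(),                                  # center and scale
            LogisticRegressionCV(Cs=25, cv=10, penalty="l1",   # steps 2-4: 10-fold CV over
                                 solver="saga",                # a 25-value penalty grid,
                                 scoring="roc_auc",            # selected by ROC AUC
                                 max_iter=5000),
        )
        model.fit(X_train[idx], y_train[idx])                  # step 5: refit on the bootstrap sample
        phats.append(model.predict_proba(X_test)[:, 1])        # step 6: test-set predictions
    return np.column_stack(phats)                              # step 7: one column per bootstrap

def ensemble_draw(phats, y_test, k, alpha=0.05):
    """Steps 8-11: draw k P-hat sets, fit the linear ensemble, record conditions 1 and 2."""
    cols = rng.choice(phats.shape[1], size=k, replace=False)   # step 8
    y = y_test.astype(float)
    full = sm.OLS(y, sm.add_constant(phats[:, cols])).fit()    # step 9: linear probability model
    single = sm.OLS(y, sm.add_constant(phats[:, cols[:1]])).fit()
    n_significant = int((full.pvalues[1:] < alpha).sum())      # step 10
    adj_r2_diff = full.rsquared_adj - single.rsquared_adj      # step 11
    return n_significant, adj_r2_diff
```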

The procedure described above is a form of bagging similar to the process used in random forest models. Given the success of bagging in other contexts, it would be unsurprising to see our ensembled P-hat model outperform an individual predictor. Still, since the samples are bootstrapped from the same dataset and all models use the same feature set, each model is estimating the same underlying signal relating the predictors to the outcome. As a result, triggering conditions 1 or 2 would indicate that a particular set of P-hats does a better job of explaining the variation in the outcome within the test set, but it does not indicate that there is “new” signal about the outcome in the feature set. Instead, it would indicate that the variation in the sample used to train the model sometimes leads to predictions that better mirror the test set by chance and that there is a benefit to reducing model variance by averaging several individual predictors.

As an example, consider fitting two random forest models, each with 500 trees, on the same training set and then ensembling their predictions in the test set. This procedure is essentially identical to fitting a single random forest model with 1,000 trees. However, if the linear ensemble model indicated that both prediction sets were significant, it would be confusing to say that the second random forest model discovered new signal about the outcome, since the value of giving both sets of predictions a significant coefficient is the same as the value of moving up to a 1,000-tree random forest: it provides a reduction in variance that helps avoid overfitting.

4.1 Performance of the Bootstrap Predictors

Here is the performance of our bootstrap predictors in the test set using 4 different performance metrics. You can see that performance changes significantly depending on the size of the underlying feature set, but relatively little between bootstrap samples. This is what we would expect, since every predictor should be estimating the same underlying signal relating the features to the employment outcome. We also plot the performance of an average of the 500 different predicted probabilities for each row in gold. The performance of this average is the equivalent of traditional bagging (“Bootstrap AGGregatING”) or an ensemble that places equal weight on all 500 sets of predictions. As is the case in many prediction problems, the bagged predictor outperforms the individual predictors on average.
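As a reference for the tabs below, a sketch of how the four metrics might be computed for a single P-hat set and for the bagged average, assuming the `phats` matrix and `y_test` from the sketch above (average precision is used here for the precision-recall AUC):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, log_loss,
                             mean_squared_error, roc_auc_score)

def test_set_metrics(y_test, p_hat):
    """The four test-set metrics for one vector of predicted probabilities."""
    return {
        "roc_auc": roc_auc_score(y_test, p_hat),
        "pr_auc": average_precision_score(y_test, p_hat),    # precision-recall AUC proxy
        "log_loss": log_loss(y_test, p_hat),                  # mean log-loss
        "rmse": float(np.sqrt(mean_squared_error(y_test, p_hat))),
    }

# Per-predictor metrics vs. the bagged (equal-weight) average of all 500 P-hat sets:
# per_model = [test_set_metrics(y_test, phats[:, j]) for j in range(phats.shape[1])]
# bagged    = test_set_metrics(y_test, phats.mean(axis=1))
```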

4.1.1 ROC AUC

4.1.2 Prec-Recall AUC

4.1.3 Mean Log-Loss

4.1.4 RMSE

4.2 P-Hat Ensemble Model

Now, we randomly draw \(k\) P-hat sets from a given feature set condition and ensemble them in the test set. We repeat this procedure 250 times for \(k = 2, 3, 5, 10, 20, 40, 80\).

We’re interested in how the variability in the training set impacts the model fitting process and, thus, the P-hat ensemble model. If noise in the training set is relatively unimportant, we would generally expect the P-hat ensembles to indicate that, conditional on one set of predictions, the additional bootstrapped predictors are insignificant. To check this, we look at how frequently our linear ensemble models place significant coefficients on more than one of the P-hat slopes:

Percentage of P-Hat Models with 2 or more Significant P-Hat Coefficients
# of P-Hats in Ensemble 9 Model Features 18 Model Features 71 Model Features
2 P-Hats 22.8% 18.8% 80.4%
3 P-Hats 43.6% 39.6% 75.2%
5 P-Hats 80.8% 73.2% 60.8%
10 P-Hats 99.6% 96.4% 78.8%
20 P-Hats 100.0% 100.0% 96.8%
40 P-Hats 97.2% 100.0% 100.0%
80 P-Hats 100.0% 100.0% 100.0%

From the table above, we see that for all conditions, at least 15% of our iterations produce multiple significant P-hat slopes, despite the fact that all P-hats contain the same underlying signal. It’s unsurprising that the proportion increases as we ensemble more P-hats in our linear model, but it is interesting to see that the proportion is larger when the underlying predictors use the small feature set than when they use the medium feature set, and larger again when they use the largest feature set. Some of this behavior likely comes from the higher level of variability in the model fitting process when using larger feature sets. It’s likely that many of the bootstrapped models with 71 features genuinely did find different relationships between the variables and the outcome and thus are capturing different signal, but the differences in the signal come from the model fitting process, not the underlying data.

The increased frequency of having 2 or more significant coefficients for the small feature set models relative to the medium feature set models is more confusing. It’s possible that the features in the medium feature set but not the small one (familial background and disability) provide fairly consistent signal about the outcome and thus reduce the variance in the model fitting process. It’s also possible that these ensembles are either placing nearly all of the weight on one of the P-hat variables or that they are having trouble separating the contributions of the different P-hat variables and giving them all small, insignificant coefficients. In that case, we would expect to see the proportion of ensembles with 0 significant coefficients increase as we add more P-hat variables.

Percentage of P-Hat Models with 0 Significant P-Hat Coefficients
# of P-Hats in Ensemble 9 Model Features 18 Model Features 71 Model Features
2 P-Hats 11.2% 3.2% 0.0%
3 P-Hats 17.6% 7.2% 0.0%
5 P-Hats 7.6% 6.0% 0.8%
10 P-Hats 0.4% 0.8% 2.8%
20 P-Hats 0.0% 0.0% 0.4%
40 P-Hats 0.8% 0.0% 0.0%
80 P-Hats 0.0% 0.0% 0.0%

The above behavior occurs as a result of multicollinearity. Since our predicted probabilities are generated from models trained using the same features and on datasets with the same underlying signal, we would expect the predictions to be correlated. This is particularly true for our smaller feature set models because they have lower variance in the model fitting process and thus more correlation between the predictions. As a result, our ensemble models are likely having difficulty separating out the contributions of the different P-hat variables, and thus estimating larger standard errors and smaller slopes for our different P-hat sets.
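A quick diagnostic for this story (a sketch, not part of the original procedure) is to look at the pairwise correlations between P-hat columns and the variance inflation factors in the ensemble design:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def phat_collinearity(phats):
    """Pairwise correlation between P-hat sets and VIF for each P-hat column."""
    corr = np.corrcoef(phats, rowvar=False)
    X = sm.add_constant(np.asarray(phats))
    vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
    return corr, vifs
```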

Since these models are often also used to determine which underlying predictors do not provide signal about the outcome, the presence of multicollinearity poses a problem. For example, if a P-hat ensemble uses the predicted probabilities from five different models and only one coefficient is significant, one might conclude that the four other models are not providing any additional signal beyond what is captured by the first model. But these results suggest that it’s possible the P-hats from those four models all contain the same meaningful signal and the ensemble is splitting the weights and obscuring the value of the underlying models.

That issue could theoretically be addressed by removing several of the correlated P-hat sets, using a joint F-test, or looking at the proportion of the variability in the outcome explained by the ensemble. Still, the interpretation of the coefficients becomes more difficult in a P-hat model where multicollinearity is present, since the results are more sensitive to changes in a small fraction of observations. Additionally, we should expect multicollinearity to be present frequently in the types of exercises where P-hat ensembling is used: the tactic is employed to test the overlap in signal between feature sets where one is frequently a subset of the other and there is reason to believe the additional features could be redundant.
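A sketch of the joint F-test version of that fix, comparing the ensemble with and without a block of added P-hat columns via statsmodels’ nested-model comparison (hypothetical array names):

```python
import numpy as np
import statsmodels.api as sm

def joint_f_test(phats_base, phats_new, y_test):
    """Test whether a block of added P-hat columns jointly improves the ensemble."""
    y = np.asarray(y_test, dtype=float)
    restricted = sm.OLS(y, sm.add_constant(phats_base)).fit()
    full = sm.OLS(y, sm.add_constant(np.column_stack([phats_base, phats_new]))).fit()
    f_stat, p_value, df_diff = full.compare_f_test(restricted)
    return f_stat, p_value, df_diff
```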

4.2.1 Adjusted R-Squared

As discussed in the previous section, the interpretation of the significant coefficients in our ensemble model can be difficult when trying to determine which underlying predictors provide additional signal about the outcome. If we are worried about multicollinearity though, we might decide to look at the adjusted R-squared of an ensemble before and after adding a P-hat variable to determine the value of that underlying model. Below, we do just that, looking at the difference in adjusted R-squared between our ensembled model and a linear model fit with just one of the randomly selected P-hats:

From the plots above, we see that the adjusted R-squared increases when we ensemble multiple P-hats for virtually every iteration and every condition. This result essentially amounts to a confirmation that bagging works; the outcome is better explained by multiple bootstrapped predictors than by any single set of predictions.

Although this finding isn’t surprising, it highlights how the results of the P-hat procedure are sensitive to variability in the underlying sample. While we don’t often compare predictors trained on different training sets, some common fitting procedures introduce similar variability. For example, two convolutional neural networks trained on the same dataset using dropout and random image augmentation will produce different predictions based on the noise introduced by those training procedures. If we ensembled the predictions of those two CNNs in the test set, it seems likely, based on the above results, that both prediction sets would receive significant slopes, in spite of the fact that they are tapping into the same signal.

Note: it appears that the variance in the difference in adjusted R-squared is larger for a few of the experimental conditions. Specifically, the variability in the differences is larger when 10 P-hat variables are drawn from predictors fit on 9 model features, when 20 P-hat variables are drawn from predictors fit on 18 model features, and when 80 P-hat variables are drawn from predictors fit on 71 model features. It is possible there is some relationship between the number of model features in the underlying predictor model and the number of features in the P-hat ensemble, but an exploration of that is outside of the scope of this post at present.

5 Noise Feature Experiment

Since the P-hat model procedure is most often used to test the value of models trained using different feature sets, our second experiment adds noise to the columns instead of the rows, similar to the procedure described in Mentch & Zhou (2020).

We start with a real feature set of demographic information (race, sex, and age). Then, we randomly generate noise features to add to our real predictors. Noise features are built from \(Z \sim N(0,1)\) draws with varying levels of correlation to the continuous age variable. Our final noise features are \(N = r \cdot X + \sqrt{1-r^2}\cdot Z\), where \(X\) is the age variable, for \(r\) = 0, 0.2, 0.7, and 0.95. That way, our added noise features are sometimes correlated with the other predictors, but always conditionally independent of the outcome given age. We vary both the number of added noise features and the correlation, and we end up with 500 iterations of each condition.
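A sketch of that construction, assuming `age` is the continuous age column; standardizing it first makes the mixing formula give a population correlation of roughly \(r\) between each noise column and age:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noise_features(age, q, r):
    """q noise columns N = r * X + sqrt(1 - r^2) * Z, where X is standardized age."""
    x = np.asarray((age - age.mean()) / age.std())
    z = rng.standard_normal((len(x), q))            # Z ~ N(0, 1), independent of the outcome
    return r * x[:, None] + np.sqrt(1.0 - r**2) * z

# e.g. 25 noise columns with correlation ~0.7 to age:
# noise = make_noise_features(age_column, q=25, r=0.7)
```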

After we’ve generated our noise features for the train and test set, we tune and fit a LASSO model predicting employment status using the true predictors and the noise features. We output the probabilities from those predictor models in the test set, randomly sample from that set of predictions, and ensemble those P-hats in the test set along with the P-hats from the model not using any noise features. Our results should roughly mirror the process of testing whether the noise features provide any additional signal about the outcome on top of the set of real predictors. We know from construction that these features do not contain any additional signal about the outcome, so our P-hat model procedures should only rarely place significant coefficients on the P-hat variables from the noise feature predictors if that type of inference is valid.

5.1 Performance of the Noisy Predictors

Below you can see the performance of the predicted probabilities (P-hat sets) in the test set before being ensembled. Note that as the number of added noise features increases, so too does the variability of model performance: adding more noise features (and features less correlated with the real predictors) increases the risk of overfitting to that noise in the training set and hurting out-of-sample predictive performance. While using a tuned LASSO model should limit this issue (the distribution of performance metrics for all model conditions is fairly narrow overall), there is still an increased chance that some noise features have a relationship with the outcome in the training set that isn’t mirrored in the test set, hurting test set performance. Notice, however, that the maximum performance stays consistent no matter the number of added noise features, since those features do not contain any new signal about the outcome. Furthermore, the performance of the model with no noise features added (plotted as the horizontal line) is higher than that of most of the models where noise features were added. Additionally, the performance of the averaged predictions (plotted in gold) is consistently higher than the average performance of the individual predictions and sometimes even better than the no-noise predictor, again indicating the value of regularization.

5.1.1 ROC AUC

5.1.2 Prec-Recall AUC

5.1.3 Mean Log-Loss

5.1.4 RMSE

5.2 P-Hat Ensemble Model

Next, we fit a model on just the non-noise predictors and output the predicted probabilities of employment in the test set. We then ensemble this P-hat variable with some number of randomly selected P-hat variables output from our noise feature models, as sketched below. We vary the number of P-hat variables selected, but always draw from the pool of P-hats sharing a common number of added noise features (\(q\)) and correlation (\(r\)). We repeat this random draw and ensemble step 250 times for each condition.
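A sketch of this step, reusing the `phat_ensemble` helper from the Section 2 sketch (hypothetical names: `phat_base` holds the no-noise model’s test-set predictions and `noise_phats` holds the predictions from models sharing a given \(q\) and \(r\)):

```python
import numpy as np

def noise_ensemble_draw(phat_base, noise_phats, y_test, k, rng, alpha=0.05):
    """Ensemble the baseline P-hat with k randomly drawn noise-model P-hats."""
    cols = rng.choice(noise_phats.shape[1], size=k, replace=False)
    design = np.column_stack([phat_base, noise_phats[:, cols]])
    full, significant = phat_ensemble(design, y_test, alpha=alpha)       # Section 2 sketch
    base_only, _ = phat_ensemble(np.asarray(phat_base).reshape(-1, 1), y_test)
    n_sig_noise = int(significant[1:].sum())                             # condition 1, noise P-hats only
    adj_r2_gain = full.rsquared_adj - base_only.rsquared_adj             # condition 2
    return n_sig_noise, adj_r2_gain
```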

Our first question is how often condition 1 is triggered. Since the added noise features contain no additional signal about the outcome once you have the original set of predictors, we should expect the proportion of ensemble models that return significant coefficients on multiple P-hat variables to be small.

Percentage of P-Hat Models with 2 or more Significant P-Hat Coefficients
# of P-Hats in Ensemble 5 Added Noise Features 10 Added Noise Features 25 Added Noise Features 50 Added Noise Features 75 Added Noise Features 100 Added Noise Features 150 Added Noise Features
r = 0
2 P-Hats 2.4% 3.6% 2.0% 4.0% 7.6% 5.2% 2.0%
3 P-Hats 4.4% 4.0% 5.6% 9.6% 11.2% 12.8% 9.2%
5 P-Hats 4.8% 4.0% 9.2% 12.8% 15.6% 20.0% 14.0%
10 P-Hats 8.0% 10.0% 18.0% 35.6% 37.2% 32.0% 29.2%
20 P-Hats 24.0% 20.8% 32.4% 63.6% 56.8% 35.2% 54.0%
40 P-Hats 64.0% 46.0% 59.2% 88.8% 75.2% 68.0% 74.4%
80 P-Hats 91.6% 84.8% 87.6% 98.4% 96.8% 94.0% 90.0%
r = 0.2
2 P-Hats 2.0% 2.0% 2.0% 1.2% 2.4% 2.0% 2.8%
3 P-Hats 0.8% 2.8% 5.2% 2.8% 3.2% 4.0% 5.6%
5 P-Hats 2.8% 6.0% 6.0% 6.8% 8.8% 7.6% 8.0%
10 P-Hats 3.6% 15.6% 15.2% 10.4% 12.0% 14.4% 14.0%
20 P-Hats 14.8% 40.0% 31.6% 29.6% 21.6% 29.2% 26.4%
40 P-Hats 45.2% 80.4% 67.2% 54.0% 47.6% 64.8% 61.6%
80 P-Hats 83.2% 98.4% 93.6% 90.8% 85.2% 93.6% 94.4%
r = 0.7
2 P-Hats 5.6% 4.0% 0.8% 2.8% 2.4% 2.0% 1.6%
3 P-Hats 6.0% 4.0% 1.2% 4.8% 4.4% 5.6% 2.8%
5 P-Hats 6.8% 4.4% 2.8% 4.4% 6.0% 4.8% 2.0%
10 P-Hats 18.0% 11.2% 2.0% 8.8% 12.8% 13.6% 5.6%
20 P-Hats 42.4% 34.0% 8.8% 32.4% 34.0% 36.8% 19.6%
40 P-Hats 76.4% 70.4% 34.8% 64.0% 76.0% 76.4% 52.0%
80 P-Hats 98.0% 96.4% 74.4% 95.6% 97.6% 98.0% 88.0%
r = 0.95
2 P-Hats 15.6% 15.2% 14.0% 14.8% 7.6% 6.0% 4.4%
3 P-Hats 29.6% 21.6% 22.4% 17.2% 14.4% 9.6% 8.8%
5 P-Hats 36.8% 34.8% 32.0% 34.4% 29.6% 15.6% 15.2%
10 P-Hats 62.4% 68.8% 57.6% 64.0% 48.8% 40.0% 22.8%
20 P-Hats 88.4% 92.4% 92.0% 88.8% 80.0% 67.6% 55.2%
40 P-Hats 99.2% 100.0% 99.2% 100.0% 98.0% 93.2% 91.6%
80 P-Hats 100.0% 100.0% 100.0% 100.0% 100.0% 99.2% 99.2%

We see from the above table that condition 1 of our P-hat model test is rarely triggered when a small number of zero-correlation noise features are added to the feature set and we only ensemble a small number of P-hat sets. If the standard errors in our P-hat ensembles are properly sized, we would expect the second P-hat variable to be significant in the 2 P-hat condition approximately 5% of the time, and that appears roughly correct when the noise features are generated with little correlation to the real features. Unfortunately, when our noise features are highly correlated with age (\(r = 0.95\)), adding even a small number of them frequently leads the ensemble to treat multiple sets of predictions as contributing relevant information about the outcome. When \(r = 0.95\) and \(5 \le q \le 50\), at least one of the P-hat variables from the added noise feature models is significant roughly 15% of the time or more, and that proportion only increases as more P-hat sets are included.

In some sense, the relatively low proportion of false positives when \(r = 0, 0.2,\) and \(0.7\) is encouraging and affirms the validity of the method in these contexts. Unfortunately, those contexts do not mirror the real-world uses of this P-hat modeling procedure as closely as the high-correlation conditions. Typically, we use the P-hat ensembling method when we think a new set of features does not add additional signal after accounting for the signal from the original features, not when we think the new features may not add any signal whatsoever. In other words, the P-hat model technique is generally used when it is believed that a new set of features may have significant overlap in its signal about the outcome with other features, so our null hypothesis is that these features only add noise conditional on our original set of features. This situation most closely mirrors our high-correlation conditions, where we see higher than expected rates of false positives.

In these high-correlation conditions, we also expect our predicted probabilities to be more correlated. Let’s look at the proportion of the P-hat ensembles with zero significant coefficients on P-hat variables:

Percentage of P-Hat Models with 0 Significant P-Hat Coefficients
# of P-Hats in Ensemble 5 Added Noise Features 10 Added Noise Features 25 Added Noise Features 50 Added Noise Features 75 Added Noise Features 100 Added Noise Features 150 Added Noise Features
r = 0
2 P-Hats 67.6% 46.4% 18.0% 7.2% 5.2% 0.4% 0.0%
3 P-Hats 80.8% 72.0% 31.6% 13.2% 10.4% 6.8% 0.4%
5 P-Hats 78.8% 78.4% 41.6% 15.6% 12.4% 3.2% 0.0%
10 P-Hats 56.4% 62.4% 43.6% 6.0% 10.8% 12.8% 0.8%
20 P-Hats 38.8% 41.2% 34.8% 5.2% 13.2% 21.6% 2.8%
40 P-Hats 9.6% 21.2% 13.2% 0.4% 7.2% 8.8% 6.4%
80 P-Hats 2.4% 2.0% 0.8% 0.0% 0.8% 1.2% 0.8%
r = 0.2
2 P-Hats 78.0% 63.6% 31.2% 14.4% 22.4% 14.4% 23.2%
3 P-Hats 88.4% 76.8% 53.6% 41.2% 53.6% 47.6% 51.6%
5 P-Hats 84.8% 70.0% 62.0% 61.6% 63.2% 62.4% 62.4%
10 P-Hats 78.0% 53.2% 51.2% 55.2% 56.0% 55.2% 60.8%
20 P-Hats 50.8% 24.8% 26.8% 32.8% 39.6% 32.8% 39.2%
40 P-Hats 20.0% 6.8% 13.2% 14.0% 22.8% 11.2% 10.4%
80 P-Hats 4.8% 0.8% 0.8% 2.8% 2.8% 2.0% 0.0%
r = 0.7
2 P-Hats 80.4% 75.6% 78.0% 80.8% 80.4% 77.2% 84.8%
3 P-Hats 81.2% 87.2% 90.8% 86.8% 77.6% 75.6% 86.0%
5 P-Hats 74.4% 80.0% 82.8% 77.6% 72.4% 77.2% 86.4%
10 P-Hats 47.2% 56.0% 75.6% 63.2% 53.6% 56.0% 70.0%
20 P-Hats 25.6% 33.2% 57.6% 32.8% 29.2% 26.8% 44.8%
40 P-Hats 5.6% 8.4% 28.0% 9.6% 6.8% 8.8% 16.8%
80 P-Hats 0.0% 0.0% 7.6% 0.4% 0.0% 0.8% 2.0%
r = 0.95
2 P-Hats 82.0% 84.8% 85.2% 83.2% 87.6% 89.6% 92.4%
3 P-Hats 56.4% 64.0% 66.0% 78.4% 74.8% 79.2% 80.8%
5 P-Hats 45.2% 42.0% 46.0% 50.0% 54.0% 67.6% 72.0%
10 P-Hats 13.6% 10.4% 18.8% 21.6% 30.0% 40.8% 51.6%
20 P-Hats 3.2% 2.0% 1.2% 1.6% 3.6% 12.8% 18.8%
40 P-Hats 0.0% 0.0% 0.0% 0.0% 0.4% 1.2% 3.6%
80 P-Hats 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.4%

In general, adding more noise features should reduce the correlation between the P-hats, while increasing the correlation between the real features and the noise features should increase it. As a result, one would expect that a small number of correlated noise features would produce predictions that are highly correlated with the predictions from our real model and thus suffer more often from multicollinearity.

As expected, we see in the table above that when \(r\) and \(q\) are small and we are ensembling 10 or fewer P-hats, there are no significant coefficients in our linear model the majority of the time. Likewise, when \(r = 0.95\) and we ensemble a small number of P-hats (2, 3, or 5), the percentage of ensembles with 0 significant coefficients remains substantial no matter how many noise variables are added.

It appears that a major difficulty with using the P-hat model procedure to determine which feature sets contain meaningful additional signal is that adding features with no additional signal will produce correlated predictions and introduce multicollinearity into the ensemble. Therefore, if considering multiple correlated models at a time, one might incorrectly conclude that none of them contain additional signal.

5.2.1 Adjusted R-Squared

One solution for avoiding false negatives is to consider the change in adjusted R-squared when a set of P-hats is added to the ensemble. Shown below are the distributions of the differences in adjusted R-squared between a given P-hat ensemble and a linear model just using the baseline P-hats coming from the predictor model without any noise features.

Luckily, we see that the difference in the adjusted R-squared remains close to zero on average for ensembles with 10 or fewer P-hat variables across most conditions. As should be expected, ensembling large numbers of P-hats in the test set often increases the adjusted R-squared. As seen before, the models trained on noise features with high levels of correlation with the actual features seem to add value, with increases in the adjusted R-squared for all P-hat model sizes on average. Overall though, the increases are all relatively small (< 0.001).

5.2.1.1 r = 0

5.2.1.2 r = 0.2

5.2.1.3 r = 0.7

5.2.1.4 r = 0.95

5.2.2 Average Change in Adj. R-squared

6 Discussion/Conclusion

From these two experiments, it appears that the “P-hat model test” is sensitive to variability in the sample and feature set. The issues, though, are not with the linear model itself, but with the inferences that can safely be drawn from it. As advertised, the linear ensemble of P-hats provides information about the usefulness of a set of predictions for understanding the outcome in the test set. If it places a significant coefficient on a P-hat variable, that set of predictions is likely providing helpful new information about the outcome. Those inferences, however, are based on the assumption that the X’s are fixed and the Y’s are random variables. The standard errors are properly sized for the variability of the outcomes in the test set as a function of the given predictions, but they do not capture the variability of the P-hat variables themselves, which comes from the variation in the training set and the model fitting procedure. In the simplest possible terms, the p-values in the P-hat model speak to whether this particular, fixed set of predictions has a true slope of zero in the test set; they say nothing about whether the underlying function relating X to Y that is being estimated in the training set produces predictions with a true slope of zero in the test set. In order to make the inference that a significant coefficient indicates signal in the feature set the underlying model was trained on, we would need our standard errors to take into account the variability from the full procedure. The above experiments have shown that at least two sources of that variability are meaningful enough to make that inference regularly invalid.

From the first experiment, we see that variability in the training sample regularly leads to a benefit from ensembling several bootstrapped predictors in the test set. This type of result is, in some ways, unsurprising given the well-known success of various ensemble methods for prediction tasks (like random forests). These procedures work because they produce smoother models that are less prone to overfitting and more likely to perform well out-of-sample. But that same benefit poses issues for drawing conclusions about the individual value of any bootstrapped predictor. Even if those predictors are tuned using 10-fold cross validation to be properly regularized, we see that this is frequently not enough, with different predictors capturing different pieces of the true signal. This is particularly true when the feature set is large and there is thus more variance in the model fitting procedure.

In some senses, the results of the second experiment are more encouraging. Part of the motivation for this exercise was to consider the value that added features provide through regularization, even when they are fundamentally non-informative about the outcome. We see from the distribution of performance metrics that our actual model with no noise features added generally outperforms the models with noise features added across all conditions. However, these results don’t guarantee there isn’t value in including non-informative noise features for other datasets or modeling regimes. In Mentch & Zhou (2020), including randomly generated noise features was most useful in datasets with a low signal-to-noise ratio. In that context, noise features are more likely to be useful as a regularizing tool to avoid overfitting, particularly in high-capacity models.

In terms of the P-hat modeling test, the second experiment is more of a mixed bag. When true noise features are added to the feature set (r = 0), the P-hat model generally finds little value in placing significant coefficients on multiple sets of predictions, as one would hope. When the added noise features are highly correlated with one of the real predictors, however, the P-hat ensemble regularly sees either multiple P-hat variables as significant or none as significant. This result likely indicates that the value of additional noise features is highly sensitive to the way in which the noise is generated relative to the noise in the dataset. For this specific dataset and its signal-to-noise ratio, it appears a relatively small amount of noise is valuable because it can provide a small boost through regularization. Adding larger amounts of noise, either to the actual features themselves or through adding more total features, seems to lead to underfitting, thus reducing the value of the P-hats from the models fit with the additional noise. Notably, this result still makes it difficult to draw inferences from the P-hat ensemble, because it is difficult to know whether the noise in a set of additional features has the size or structure that leads to a sizable regularization benefit. Even if the benefit is small or infrequent overall, it’s difficult to rule out the possibility that the coefficient on a P-hat variable from a predictor trained on new features is significant because the noise in those features happened to regularize the fit by just the right amount.

One of the takeaways from these exercises is to take additional steps to reduce the variability in the model fitting process and avoid overfitting. The first experiment shows that bagging can be a helpful strategy to reduce the variance of the model coming from the underlying training sample. The second experiment shows that even without considering the variability in the particular sample, the amount of noise in the feature set needs to be accounted for. When the number of features being used is relatively large, reducing model variance is particularly important for proper inference so as to prevent the added features from implicitly providing value through regularization as shown in Kobak, Lomond & Sanchez (2020). If models are lower variance and properly regularized before having their predictions ensembled in the test set, it should reduce the likelihood of false positives, but it will increase the likelihood of false negatives as predictions become more correlated. Therefore, one should either remove P-hat sets that are highly correlated with predictions already in the ensemble, use a joint F-test or the change in adjusted R-squared for testing the importance of multiple added P-hat variables, or work to only add one P-hat set at a time.

Unfortunately, the reduction in model variance required to keep false positive rates below 5% may simply be too high a bar to clear. Since out-of-sample performance is always expected to be less than or equal to in-sample performance, almost all models built and evaluated on a particular train-test split will be at least slightly overfit. Part of the appeal of the P-hat ensembling procedure is that it is more sensitive to the value of new predictions, but that sensitivity may make it nearly impossible to reduce the model variance enough to make the desired inferences. Typically, a LASSO model trained on an 18,000-row dataset and tuned using 10-fold cross validation and a 25-value penalty path might be considered sufficiently low variance for most prediction tasks, but we’ve seen in these experiments that models produced using this procedure can still regularly benefit from being averaged in the test set. We could take steps to reduce the model variance further, like bagging all models and tuning more heavily, but in many applications this type of procedure is prohibitively time intensive and may not be sufficient. For example, models like convolutional neural networks trained using stochastic gradient descent have high variance in the model fitting process, and training/tuning hundreds or thousands of them on bootstrapped samples would take an enormous amount of time. Furthermore, the P-hat modeling approach is still susceptible to the randomness of the train-test split. Even if all possible steps were taken to reduce model variance and extract the appropriate signal, the specific noise in the test set may still make it regularly valuable to place significant coefficients on multiple, slightly different predictors. The P-hat linear model is fit in the test set, so it has the actual answers. That may be too powerful an advantage to overcome.