A Parameter Encoder Neural Network (PENN) is an explainable machine learning technique that solves two problems associated with traditional XAI algorithms:
- It permits the calculation of local parameter distributions. Parameter distributions are often more interesting than feature contributions, particularly in economic and financial applications, since the parameters disentangle the effect from the observation (the contribution can roughly be defined as the demeaned product of effect and observation; see the sketch after this list).
- It solves a problem of biased contributions that is inherent to many traditional XAI algorithms. Precisely in the setting where neural networks are most powerful, that of interactive, dependent processes, traditional XAI can be biased because it attributes an effect to each feature independently.
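To make the distinction concrete, here is a minimal sketch of how a contribution couples effect and observation. The arrays x_sample and beta_sample are made up for illustration and are not part of the PENN code:
import numpy as np

# Made-up local effects (parameters) and observations for a small sample
rng = np.random.default_rng(0)
x_sample = rng.normal(size=(5, 3))      # 5 observations, 3 features
beta_sample = rng.normal(size=(5, 3))   # local effect of each feature

# The parameter isolates the effect; the contribution is (roughly)
# the demeaned product of effect and observation
phi_sample = beta_sample * x_sample
phi_sample -= phi_sample.mean(axis=0)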
At the end of the tutorial, I will have estimated the following highly nonlinear parameter functions for a simulated regression with three variables:
For more details on the mathematical background of the parameter encoder neural network, see the paper. An R version of this tutorial can be found here.
We will use a nonlinear simulated data set with k = 3 features contained in the array x and a continuous target y given by y = x[1]*beta[1] + x[2]*beta[2] + x[3]*beta[3] + error. Here beta[1] has the shape of a sine curve, beta[2] has no effect on the output (i.e. this is simply a correlated nuisance term), and beta[3] has a threshold shape with 3 different regimes:
import numpy as np
# For reproducibility
np.random.seed(42)
k = 3
n = 1000
# Correlated features
Sigma = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
x = np.random.multivariate_normal([0.0]*k, Sigma, n)
eps = np.random.normal(size=n)
# Local parameters: beta[1] is a sine curve, beta[2] stays zero (nuisance),
# beta[3] is a threshold function with three regimes
betas = np.zeros((n, k))
betas[:, 0] = np.sin(x[:, 0]) * 5.0
betas[x[:, 2] > 0.75, 2] = 5.0
betas[x[:, 2] < -0.75, 2] = -5.0
y = (x * betas).sum(axis=1) + eps
The following code chunks construct a PENN model using keras to estimate beta. The separate functions are combined into a PENN class in PENN.py. I begin by loading the necessary modules below:
# Load backend functions
from keras import backend as b
# Neural net building blocks
from keras.layers import Dense, Input, Lambda, Multiply, Add
from keras.regularizers import l2
from keras.optimizers import Adam
from keras import Model
# Necessary for the calculation of the beta-prior
from scipy.spatial import distance_matrix
import tensorflow as tf
# For the loss function to work, we need to switch off eager execution
tf.compat.v1.disable_eager_execution()
# For reproducibility
tf.random.set_seed(42)
Next, I construct the inference network with 2 layers and 10 hidden nodes in each layer. I use a sigmoid activation function. The inference network is completed by the output nodes for the parameter distributions, mu and sigma. The PENN uses variational inference to obtain posteriors of the local parameters. Assuming normally distributed local parameters, mu and sigma parameterize the local posteriors. A prediction is generated by sampling from the posterior:
def build(k, n, mc_draws=100, size=10, l2_penalty=0.001):
    # 1. Model inputs
    input_inference_nn = Input(k, name='input_inference_nn')
    input_model = Input(k, name='input_model')
    input_knn_prior = Input(batch_shape=(n, n), name='input_knn_prior')
    input_mc = Input(tensor=b.random_normal((n, mc_draws, k)), name='input_mc')
    inputs = [input_inference_nn,
              input_model,
              input_knn_prior,
              input_mc]
    # 2. Inference Network
    encoder_layer_1 = Dense(size,
                            activation='sigmoid',
                            kernel_regularizer=l2(l2_penalty))(input_inference_nn)
    encoder_layer_2 = Dense(size,
                            activation='sigmoid',
                            kernel_regularizer=l2(l2_penalty))(encoder_layer_1)
    # ---- Parameter layers
    mu = Dense(k, kernel_regularizer=l2(l2_penalty), name='mu')(encoder_layer_2)
    sigma_squared = Dense(k, activation='exponential',
                          kernel_regularizer=l2(l2_penalty))(encoder_layer_2)
    sigma = Lambda(lambda i: b.sqrt(i), name='sigma')(sigma_squared)
    # 3. Posterior sample
    sample = Multiply()([sigma, input_mc])
    sample = Add()([sample, mu])
    # 4. Generate predictions
    output = Multiply()([sample, input_model])
    output = Lambda(lambda i: b.sum(i, axis=2, keepdims=True), output_shape=(n, mc_draws, 1))(output)
    # 5. Build model
    model = Model(inputs, output)
    return model
With the model function defined, I can build the PENN model, as well as supporting models (used only for inference) that extract the parameters of the posterior:
model = build(k, n)
mu_model = Model(model.inputs, model.get_layer('mu').output)
sigma_model = Model(model.inputs, model.get_layer('sigma').output)
The most important component of the PENN is the loss function, which we define next. It consists of two elements: the mean squared error of the local linear model, and a Kullback-Leibler penalty that pulls each local posterior toward a KNN-based prior built from the posteriors of neighboring observations, which enforces stability in the parameter distributions:
def loss(y, y_pred):
    # Mean squared error of the sampled local linear predictions
    mse = b.mean(b.square(y_pred - y))
    # Posterior parameters of the current batch
    mu_ = model.get_layer('mu').output
    sigma_ = model.get_layer('sigma').output
    # KNN-prior: average the posterior moments over neighboring observations
    input_knn_prior_ = model.inputs[2]
    prior_mu = b.dot(input_knn_prior_, mu_)
    prior_sigma = b.dot(input_knn_prior_, sigma_) + b.dot(input_knn_prior_, b.square(mu_ - prior_mu))
    # Kullback-Leibler term between the local posterior and the KNN-prior
    # (note the sign convention: the term is subtracted below)
    kl = b.mean(b.mean((b.log(b.sqrt(sigma_)) -
                        b.log(b.sqrt(prior_sigma))) -
                       ((sigma_ + b.square(mu_ - prior_mu)) / (2 * prior_sigma)) + 0.5, axis=1))
    return mse - kl * lam
Finally, the KNN-prior requires a distance matrix, and we need to set hyperparameters:
# Weight of the Kullback-Leibler penalty
lam = 4
# Fraction of the distance range used to set the neighborhood radius
gam = 0.04
knn_prior = distance_matrix(x, x)
# Radius: minimum nonzero pairwise distance plus a fraction of the distance range
gam = knn_prior[knn_prior > 0.0].min() + gam * (
    knn_prior[knn_prior > 0.0].max() - knn_prior[knn_prior > 0.0].min()
)
# Binary neighborhood indicator: 1 if the distance is below the radius, 0 otherwise
knn_prior /= gam
idx = knn_prior < 1.0
knn_prior[idx] = 1.0
knn_prior[~idx] = 0.0
# Row-normalize so that each row contains averaging weights over the neighbors
knn_prior = (knn_prior.T / knn_prior.sum(axis=1)).T
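As an optional sanity check, not part of the tutorial code itself, each row of knn_prior should now contain non-negative averaging weights that sum to one, and we can see how many neighbours each observation has on average:
assert np.allclose(knn_prior.sum(axis=1), 1.0)
print('average number of neighbours per observation:', (knn_prior > 0).sum(axis=1).mean())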
With all the components in place, we can compile and fit the model:
model.compile(loss=loss, optimizer=Adam(learning_rate=0.05, clipnorm=1, clipvalue=0.5))
data = {
'input_inference_nn': x,
'input_model': x,
'input_knn_prior': knn_prior,
'input_mc': np.zeros((n, 100, k))
}
# Note that y needs to be expanded over the sampling dimension (we are sampling 100 draws)
y_expanded = np.repeat(y[:, np.newaxis, np.newaxis], 100, axis=1)
model.fit(data, y_expanded, batch_size=n, epochs=1000, verbose=0)
## <keras.callbacks.History object at 0x7f52684b4df0>
We can extract the estimated parameters using the inference models:
mu = mu_model.predict(data, batch_size=n)
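The posterior standard deviations can be extracted analogously via sigma_model; this optional line simply follows the same pattern:
sigma = sigma_model.predict(data, batch_size=n)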
Let’s plot the posterior means against the values of x:
…and the true parameters of the simulation against x:
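A minimal matplotlib sketch that produces both sets of panels, plotting the estimated posterior means mu and the true betas against each feature, could look as follows (this is a sketch, not the original plotting code):
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, k, figsize=(12, 6), sharex='col')
for j in range(k):
    # Top row: estimated posterior means
    axes[0, j].scatter(x[:, j], mu[:, j], s=5)
    axes[0, j].set_title(f'mu[{j + 1}] vs x[{j + 1}]')
    # Bottom row: true simulated parameters
    axes[1, j].scatter(x[:, j], betas[:, j], s=5)
    axes[1, j].set_title(f'beta[{j + 1}] vs x[{j + 1}]')
plt.tight_layout()
plt.show()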
That looks pretty good!
An interesting question is how the explanations obtained from the PENN model compare with SHAP values. SHAP values are contributions, which we can obtain from the estimated parameters:
# Contributions: demeaned product of estimated effect and observation
phi = mu * x
phi = phi - phi.mean(axis=0)
Plotting the contributions:
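A minimal sketch of the plotting code (again, a sketch rather than the original code) could be:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, k, figsize=(12, 4))
for j in range(k):
    axes[j].scatter(x[:, j], phi[:, j], s=5)
    axes[j].set_title(f'phi[{j + 1}] vs x[{j + 1}]')
plt.tight_layout()
plt.show()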
Now, I fit a random forest regressor on the data using sklearn:
from sklearn.ensemble import RandomForestRegressor
mod = RandomForestRegressor(n_estimators=100)
mod.fit(x, y)
## RandomForestRegressor()
The feature contributions can be obtained using shap.TreeExplainer:
import shap
expl = shap.TreeExplainer(mod)
phi_shap = expl.shap_values(x, y)
The plot below displays the Random Forest and PENN contributions, which match closely, as is to be expected when the dependence between features is modest:
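A minimal sketch of how the two sets of contributions could be overlaid for each feature (not the original plotting code):
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, k, figsize=(12, 4))
for j in range(k):
    axes[j].scatter(x[:, j], phi_shap[:, j], s=5, label='SHAP (Random Forest)')
    axes[j].scatter(x[:, j], phi[:, j], s=5, label='PENN')
    axes[j].set_title(f'Contributions for x[{j + 1}]')
    axes[j].legend()
plt.tight_layout()
plt.show()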