Key Driver Analysis (KDA) is a statistical technique used to determine the factors (or “drivers”) that most significantly impact a particular outcome or dependent variable. It is commonly used in fields like marketing, customer satisfaction, product development, and human resources to understand what influences key outcomes such as customer satisfaction, employee engagement, or product success.
This post implements several measures of variable importance, interpreted as a key driver analysis, relating perceptions of a payment card to customer satisfaction with that card. This involves calculating Pearson and polychoric correlations, standardized regression coefficients, Shapley values for a linear regression, Johnson's relative weights, the mean decrease in the Gini coefficient from a random forest, and XGBoost feature importance.
Data Overview
First, let's take a quick look at the data:
```python
import pandas as pd

data = pd.read_csv('data_for_drivers_analysis.csv')
data
```
| | brand | id | satisfaction | trust | build | differs | easy | appealing | rewarding | popular | service | impact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 98 | 3 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 179 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 197 | 3 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 3 | 1 | 317 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 4 | 1 | 356 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2548 | 10 | 17800 | 5 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 2549 | 10 | 17808 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 2550 | 10 | 17893 | 5 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2551 | 10 | 17984 | 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2552 | 10 | 18073 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

2553 rows × 12 columns
```python
# Calculate summary statistics for the dataset
summary_statistics = data.describe()

# Display the summary statistics
summary_statistics
```
| | brand | id | satisfaction | trust | build | differs | easy | appealing | rewarding | popular | service | impact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 |
| mean | 4.857423 | 8931.480611 | 3.386604 | 0.549550 | 0.461810 | 0.334508 | 0.536232 | 0.451234 | 0.451234 | 0.536232 | 0.467293 | 0.330983 |
| std | 2.830096 | 5114.287849 | 1.172006 | 0.497636 | 0.498637 | 0.471911 | 0.498783 | 0.497714 | 0.497714 | 0.498783 | 0.499027 | 0.470659 |
| min | 1.000000 | 88.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 3.000000 | 4310.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 4.000000 | 8924.000000 | 4.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 75% | 6.000000 | 13545.000000 | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| max | 10.000000 | 18088.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
As we can see, there are 2,553 rows of data. The nine binary perception columns record whether each respondent agrees that the card:
Is offered by a brand I trust
Helps build credit quickly
Is different from other cards
Is easy to use
Has appealing benefits or rewards
Rewards me for responsible usage
Is used by a lot of people
Provides outstanding customer service
Makes a difference in my life
Summary of Metrics
To implement measures of variable importance, interpreted as a key driver analysis, we compute the following metrics:
Pearson Correlation: Measures the linear relationship between each perception and satisfaction.
Polychoric Correlation: Estimates the correlation between two theorized normally distributed continuous latent variables from observed ordinal variables.
Standardized Coefficient: The coefficients from a linear regression model after the variables are standardized to a common scale.
Shapley Values: Measure the contribution of each feature to the model’s prediction averaged over all possible combinations of features.
Johnson’s Epsilon: Reflects the relative importance of predictors adjusted for multicollinearity.
Mean Decrease in Gini Coefficient: Reflects the importance of each feature in reducing impurity in a Random Forest model.
XGBoost: Feature importance from the XGBoost model, based on how useful each feature is in reducing the objective function’s error.
First, let's prepare the predictors and the outcome variable for the models:
```python
# Assign X as the individual perception variables in the regression
X_list = ['trust', 'build', 'differs', 'easy', 'appealing', 'rewarding', 'popular', 'service', 'impact']
X = data[X_list]

# Assign y as the dependent variable in the regression
y = data['satisfaction']
```
Pearson Correlations
The Pearson correlation coefficient (\({r}\)) measures the strength and direction of the linear relationship between two variables. It is widely used in statistics and data analysis to assess whether and how strongly pairs of variables are related. Its value ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship. The formula for calculating it is: \[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
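As a minimal sketch (using the X and y objects defined above, and expressing the correlations as shares that sum to 100%, as the other metrics in this post are reported), the Pearson correlations can be computed directly with pandas:

```python
# A minimal sketch: Pearson correlation of each perception with satisfaction,
# normalized so the shares sum to 100%
pearson_corr = X.corrwith(y)
pearson_share = pearson_corr / pearson_corr.sum()

pearson_pct = (pearson_share * 100).round(1).astype(str) + '%'
pearson_df = pd.DataFrame({'Perception': pearson_pct.index, 'Pearson Correlation': pearson_pct.values})
pearson_df
```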
Polychoric Correlations
Polychoric correlation is used to estimate the correlation between two theorized normally distributed continuous latent variables from two observed ordinal variables. This type of correlation is particularly useful when dealing with ordinal data, where the variables are measured on an ordinal scale.
The polychoric correlation (\({\rho}\)) can be estimated using: \[
\rho = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} (o_{ij} - e_{ij})^2 / e_{ij}}{\sqrt{\sum_{i=1}^{n} (o_{i+} - e_{i+})^2 / e_{i+} \sum_{j=1}^{m} (o_{+j} - e_{+j})^2 / e_{+j}}}
\]
Where:
\({o_{ij}}\) is the observed frequency for the cell in row \({i}\) and column \({j}\).
\({e_{ij}}\) is the expected frequency under the assumption of independence.
\({o_{i+}}\) and \({o_{+j}}\) are the marginal totals for row \({i}\) and column \({j}\).
\({e_{i+}}\) and \({e_{+j}}\) are the expected marginal totals for row \({i}\) and column \({j}\).
To compute this, we call the polycor library from R via rpy2 and wrap it as a Python-callable function:
```python
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

# Activate the pandas2ri conversion
pandas2ri.activate()

# Initialize a list to store the results
correlation = []

# Define the R code for the polychoric_corr function
r_code = """
library(polycor)
polychoric_corr <- function(x, y) {
  result <- polychor(x, y)
  return(result)
}
"""

# Run the R code
robjects.r(r_code)

# Get the polychoric_corr function
polychoric_corr = robjects.globalenv['polychoric_corr']

for col in X_list:
    r_corr = polychoric_corr(y, data[col])
    correlation.append(r_corr[0])

# Normalize correlations
total = sum(correlation)
correlation = [value / total for value in correlation]

# Convert correlations to a pandas DataFrame
poly_corr_df = pd.DataFrame({'Perception': X_list, 'Polychoric Correlation': correlation})

# Reformat the column
poly_corr_df['Polychoric Correlation'] = (poly_corr_df['Polychoric Correlation'] * 100).round(1).astype(str) + '%'
poly_corr_df
```
| | Perception | Polychoric Correlation |
|---|---|---|
| 0 | trust | 12.9% |
| 1 | build | 9.9% |
| 2 | differs | 10.0% |
| 3 | easy | 10.9% |
| 4 | appealing | 10.6% |
| 5 | rewarding | 10.1% |
| 6 | popular | 8.9% |
| 7 | service | 13.0% |
| 8 | impact | 13.8% |
Standardized Regression Coefficients
The standardized regression coefficient for predictor \({X_j}\) in a multiple regression model is: \[
\beta_j = b_j \cdot \frac{\sigma_{X_j}}{\sigma_Y}
\]
where \({b_j}\) is the unstandardized regression coefficient, \({\sigma_{X_j}}\) is the standard deviation of \({X_j}\) , and \({\sigma_Y}\) is the standard deviation of the outcome variable \({Y}\).
These are the coefficients obtained from a regression model after standardizing the variables (i.e., converting them to a common scale). This allows for comparison of the relative importance of each predictor.
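As a quick illustration of the formula (a minimal sketch, assuming the X and y defined earlier; the post's own code below works with the raw coefficients and normalizes them to shares):

```python
from sklearn.linear_model import LinearRegression

# A minimal sketch: beta_j = b_j * sd(X_j) / sd(Y), applied to the raw coefficients
lr = LinearRegression().fit(X, y)
standardized_betas = lr.coef_ * X.std().values / y.std()
print(pd.Series(standardized_betas, index=X.columns).round(3))
```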
```python
from sklearn.linear_model import LinearRegression

# Fitting the linear regression model
model = LinearRegression()
model.fit(X, y)

# Obtaining standardized coefficients
coefficients = model.coef_

# Standardized coefficients as a DataFrame for better manipulation
coefficients_df = pd.DataFrame({'Perception': X_list, 'Standardized Coefficient': coefficients})

# Normalize the coefficients
coefficients_df['Standardized Coefficient'] /= coefficients_df['Standardized Coefficient'].sum()

# Convert to percentage format
coefficients_df['Standardized Coefficient'] = (
    coefficients_df['Standardized Coefficient'] * 100
).round(1).astype(str) + '%'
coefficients_df
```
| | Perception | Standardized Coefficient |
|---|---|---|
| 0 | trust | 24.8% |
| 1 | build | 4.3% |
| 2 | differs | 6.3% |
| 3 | easy | 4.7% |
| 4 | appealing | 7.3% |
| 5 | rewarding | 1.1% |
| 6 | popular | 3.6% |
| 7 | service | 18.9% |
| 8 | impact | 29.1% |
Shapley Values
Shapley values, originating from cooperative game theory, are a method to fairly distribute the “payout” among players based on their contribution to the total payout. In the context of machine learning and regression models, Shapley values are used to quantify the contribution of each feature to the prediction of a model.
For linear regression, Shapley values provide a way to understand the importance and contribution of each predictor variable to the prediction for each instance. This is done by considering all possible combinations of features and calculating the marginal contribution of each feature.
The Shapley value for a feature \({j}\) is given by: \[
\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{j\}) - v(S) \right]
\]
Where:
\({N}\) is the set of all features
\({S}\) is a subset of \({N}\) excluding \({j}\)
\({v(S)}\) is the value function (e.g., model performance) for subset \({S}\).
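For a linear model with (assumed) independent features, the Shapley value of feature \({j}\) for a single observation simplifies to \({b_j (x_j - \bar{x}_j)}\), which is essentially the shortcut that shap's LinearExplainer uses under its independence assumption. A minimal sketch of that identity, assuming the LinearRegression model fitted above on X and y:

```python
# A minimal sketch: for a linear model with independent features,
# phi_j = b_j * (x_j - mean(x_j)) for each observation
manual_shap = (X - X.mean()) * model.coef_

# Mean absolute contribution per feature; should closely match the shap output below
print(manual_shap.abs().mean().round(4))
```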
```python
import shap

model = LinearRegression()
model.fit(X, y)

# Calculate Shapley values using the shap library
explainer = shap.LinearExplainer(model, X)
shap_values = explainer(X)

# Get the mean absolute Shapley values for each feature
shap_values_mean = pd.DataFrame(shap_values.values, columns=X.columns).abs().mean()
shap_values_mean /= shap_values_mean.sum()
shap_values_mean = (shap_values_mean * 100).round(1).astype(str) + '%'

shap_values_df = pd.DataFrame({'Perception': shap_values_mean.index, 'Shapley Values': shap_values_mean.values})
shap_values_df
```
| | Perception | Shapley Values |
|---|---|---|
| 0 | trust | 26.7% |
| 1 | build | 4.5% |
| 2 | differs | 5.6% |
| 3 | easy | 5.1% |
| 4 | appealing | 7.6% |
| 5 | rewarding | 1.1% |
| 6 | popular | 3.8% |
| 7 | service | 19.9% |
| 8 | impact | 25.5% |
Johnson's Relative Weights
Johnson's relative weights decompose the model's \({R^2}\) into weights for each predictor, reflecting their relative contribution to the explained variance in the outcome.
The relative weight for a predictor \({X_j}\) is calculated as: \[
RW_j = \sum_{k=1}^{p} \left( \frac{\lambda_{jk}^2 \cdot \text{Var}(Z_k)}{\text{Var}(Y)} \right)
\]
where:
\({\lambda_{jk}}\) is the loading of predictor \({j}\) on the \({k}\)-th principal component
\({\text{Var}(Z_k)}\) is the variance of the \({k}\)-th principal component
\({\text{Var}(Y)}\) is the variance of the outcome variable.
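For intuition, here is a rough numpy sketch of one common formulation of relative weights, which eigen-decomposes the predictor correlation matrix to form the orthogonal components \({Z_k}\). This is only an illustration under that formulation; the post itself relies on the relativeImp package below:

```python
import numpy as np

# A rough sketch: transform X to orthogonal components Z, regress y on Z,
# then map the explained variance back to the original predictors
Rxx = np.corrcoef(X.values, rowvar=False)                             # predictor correlations
rxy = np.array([np.corrcoef(X[col], y)[0, 1] for col in X.columns])  # predictor-outcome correlations

evals, evecs = np.linalg.eigh(Rxx)
Lam = evecs @ np.diag(np.sqrt(evals)) @ evecs.T   # loadings of predictors on the orthogonal Z_k

beta = np.linalg.solve(Lam, rxy)                  # standardized coefficients of y on Z
raw = (Lam ** 2) @ (beta ** 2)                    # each predictor's share of R^2
print(pd.Series(raw / raw.sum(), index=X.columns).round(3))
```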
```python
from relativeImp import relativeImp

# Perform relative weights analysis
johnsons_eps = relativeImp(data, outcomeName='satisfaction', driverNames=X_list)

# Drop the 'rawRelaImpt' column
johnsons_eps = johnsons_eps.drop('rawRelaImpt', axis=1)

# Rename the columns
johnsons_eps = johnsons_eps.rename(columns={'driver': 'Perception', 'normRelaImpt': "Johnson's Epsilon"})

# Reformat the columns
johnsons_eps["Johnson's Epsilon"] = johnsons_eps["Johnson's Epsilon"].round(1).astype('str') + "%"
johnsons_eps
```
| | Perception | Johnson's Epsilon |
|---|---|---|
| 0 | trust | 19.8% |
| 1 | build | 6.6% |
| 2 | differs | 7.0% |
| 3 | easy | 8.2% |
| 4 | appealing | 8.3% |
| 5 | rewarding | 6.0% |
| 6 | popular | 5.4% |
| 7 | service | 16.6% |
| 8 | impact | 22.0% |
Mean Decrease in Gini Coefficient
In Random Forests, the Gini importance (or Mean Decrease in Gini) is calculated based on the average decrease in impurity (Gini impurity) brought by each feature to the nodes in the trees.
The Mean Decrease in Gini for a feature \({j}\) is: \[
\text{MDG}_j = \frac{1}{T} \sum_{t=1}^{T} \left( \Delta Gini(t, j) \right)
\]
Where:
\({T}\) is the total number of trees
\({\Delta Gini(t, j)}\) is the decrease in Gini impurity for tree \({t}\) due to feature \({j}\).
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Fit Random Forest model
rf_model = RandomForestClassifier(n_estimators=50, max_depth=8)
rf_model.fit(X, y)

# Get Mean Decrease in Gini Coefficient (feature importances)
rf_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
# rf_importances /= rf_importances.sum()

rf_importances_df = pd.DataFrame({'Perception': rf_importances.index, 'Mean Decrease in Gini Coefficient': rf_importances.values})
rf_importances_df['Mean Decrease in Gini Coefficient'] = (
    rf_importances_df['Mean Decrease in Gini Coefficient'] * 100
).round(1).astype(str) + '%'

# Display the feature importances
rf_importances_df
```
| | Perception | Mean Decrease in Gini Coefficient |
|---|---|---|
| 0 | trust | 10.5% |
| 1 | build | 11.4% |
| 2 | differs | 10.8% |
| 3 | easy | 10.6% |
| 4 | appealing | 11.6% |
| 5 | rewarding | 11.9% |
| 6 | popular | 12.1% |
| 7 | service | 10.9% |
| 8 | impact | 10.2% |
XGBoost Feature Importance
XGBoost provides three types of feature importance: weight (frequency of a feature in trees), cover (average coverage of the feature), and gain (average gain brought by a feature to the branches it is on).
For example, the weight importance of feature \({j}\) can be written as: \[
\text{Importance}_j = \sum_{t=1}^{T} \sum_{n \in \text{nodes}(t)} I(\text{node } n \text{ splits on feature } j)
\]
where \({T}\) is the total number of trees, \({n}\) ranges over the nodes in tree \({t}\), and \({I}\) is an indicator function that is 1 if node \({n}\) uses feature \({j}\), otherwise 0.
```python
import xgboost as xgb

# Shift the classes in the target variable to start from 0
y_shifted = y - 1

# Train the XGBoost model with the shifted target variable
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X, y_shifted)

# Get feature importances from the XGBoost model
xgb_importances = pd.Series(xgb_model.feature_importances_, index=X.columns)

xgb_importances_df = pd.DataFrame({'Perception': xgb_importances.index, 'XGBoost': xgb_importances.values})
xgb_importances_df['XGBoost'] = (
    xgb_importances_df['XGBoost'] * 100
).round(1).astype(str) + '%'

# Display the feature importances
xgb_importances_df
```
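The feature_importances_ attribute above reports a single importance type; as a minimal sketch, the three types described earlier can be queried from the fitted booster directly (assuming the xgb_model just trained):

```python
# A minimal sketch: retrieve weight, cover, and gain importances from the fitted booster
booster = xgb_model.get_booster()
for imp_type in ('weight', 'cover', 'gain'):
    print(imp_type, booster.get_score(importance_type=imp_type))
```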
For example, the perception that the card makes a difference in the customer's life (impact) shows very high importance across the metrics: Pearson Correlation (13.2%), Standardized Coefficient (29.1%), Shapley Values (25.5%), Johnson's Epsilon (22.0%), XGBoost (15.8%). It is consistently one of the most important factors.
Summary
Top Drivers of Satisfaction
Trust in the brand is the most consistently high-rated factor across multiple metrics, indicating that customer trust in the brand has a profound and consistent impact on overall satisfaction. This suggests that fostering and maintaining trust is crucial for any brand aiming to improve customer satisfaction. Outstanding customer service also shows high importance across all metrics, underlining that excellent customer service is vital for keeping customers satisfied. Additionally, the perception that the product or service makes a difference in customers’ lives is consistently one of the top factors, showing very high importance across most metrics. This highlights that customers highly value how the product or service impacts their lives positively.
Moderate Drivers of Satisfaction
Appealing benefits or rewards hold moderately high importance across various metrics, indicating that attractive benefits and rewards significantly contribute to customer satisfaction. Ease of use is another important factor, particularly in non-linear models, suggesting that while ease of use is significant, it is not the top factor. The perception that the product helps build credit quickly shows moderate importance, especially in tree-based models like Random Forest, making it an important but secondary factor. Differentiation from other cards is moderately important across most metrics, indicating that being different from other cards is valued by customers but is not the most critical factor.
Lower Drivers of Satisfaction
Rewards for responsible usage generally show lower importance in linear models but have some significance in non-linear models. This indicates that while these rewards are valued, they are less critical compared to other factors. The perception that the product is used by a lot of people generally has lower importance but shows higher importance in non-linear models like Random Forest and XGBoost. This suggests that while widespread usage is a less critical factor overall, it still holds some significance in certain contexts.