Key Driver Analysis (KDA) is a statistical technique used to determine the factors (or “drivers”) that most significantly impact a particular outcome or dependent variable. It is commonly used in fields like marketing, customer satisfaction, product development, and human resources to understand what influences key outcomes such as customer satisfaction, employee engagement, or product success.
This post implements several measures of variable importance, interpreted as a key driver analysis, relating perceptions of a payment card to customer satisfaction with that card. This involves calculating Pearson and polychoric correlations, standardized regression coefficients, Shapley values for a linear regression, Johnson's relative weights, the mean decrease in the Gini coefficient from a random forest, and XGBoost feature importance.
Data Overview
First, let's take a quick look at the data:
```python
import pandas as pd

data = pd.read_csv('data_for_drivers_analysis.csv')
data
```
| | brand | id | satisfaction | trust | build | differs | easy | appealing | rewarding | popular | service | impact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 98 | 3 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 179 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 197 | 3 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| 3 | 1 | 317 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 4 | 1 | 356 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2548 | 10 | 17800 | 5 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 2549 | 10 | 17808 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 2550 | 10 | 17893 | 5 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2551 | 10 | 17984 | 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2552 | 10 | 18073 | 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

2553 rows × 12 columns
```python
# Calculate summary statistics for the dataset
summary_statistics = data.describe()

# Display the summary statistics
summary_statistics
```
| | brand | id | satisfaction | trust | build | differs | easy | appealing | rewarding | popular | service | impact |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 | 2553.000000 |
| mean | 4.857423 | 8931.480611 | 3.386604 | 0.549550 | 0.461810 | 0.334508 | 0.536232 | 0.451234 | 0.451234 | 0.536232 | 0.467293 | 0.330983 |
| std | 2.830096 | 5114.287849 | 1.172006 | 0.497636 | 0.498637 | 0.471911 | 0.498783 | 0.497714 | 0.497714 | 0.498783 | 0.499027 | 0.470659 |
| min | 1.000000 | 88.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 3.000000 | 4310.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 4.000000 | 8924.000000 | 4.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 75% | 6.000000 | 13545.000000 | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| max | 10.000000 | 18088.000000 | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
As we can see, there are 2,553 rows of data. The nine binary perception columns record whether each respondent agrees that the card:
Is offered by a brand I trust
Helps build credit quickly
Is different from other cards
Is easy to use
Has appealing benefits or rewards
Rewards me for responsible usage
Is used by a lot of people
Provides outstanding customer service
Makes a difference in my life
Summary of Metrics
To implement measures of variable importance, interpreted as a key driver analysis, we compute the following metrics:
Pearson Correlation: Measures the linear relationship between each perception and satisfaction.
Polychoric Correlation: Estimates the correlation between two theorized normally distributed continuous latent variables from observed ordinal variables.
Standardized Coefficient: The coefficients from a linear regression model after the variables are standardized to a common scale.
Shapley Values: Measure the contribution of each feature to the model’s prediction averaged over all possible combinations of features.
Johnson’s Epsilon: Reflects the relative importance of predictors adjusted for multicollinearity.
Mean Decrease in Gini Coefficient: Reflects the importance of each feature in reducing impurity in a Random Forest model.
XGBoost: Feature importance from the XGBoost model, based on how useful each feature is in reducing the objective function’s error.
First, let's prepare the predictors and the outcome variable for the models:
```python
# Assign X as the individual perception variables in the regression
X_list = ['trust', 'build', 'differs', 'easy', 'appealing', 'rewarding', 'popular', 'service', 'impact']
X = data[X_list]

# Assign y as the dependent variable in the regression
y = data['satisfaction']
```
Pearson Correlations
The Pearson correlation coefficient (\({r}\)) measures the strength and direction of the linear relationship between two variables. It is widely used in statistics and data analysis to assess whether and how strongly pairs of variables are related. Its value ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship. The formula for calculating it is: \[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
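As a minimal sketch (using the X and y objects defined above, and expressing the correlations as shares that sum to 100%, as the other metrics in this post are reported), the Pearson correlations can be computed directly with pandas:

```python
# A minimal sketch: Pearson correlation of each perception with satisfaction,
# normalized so the shares sum to 100%
pearson_corr = X.corrwith(y)
pearson_share = pearson_corr / pearson_corr.sum()

pearson_pct = (pearson_share * 100).round(1).astype(str) + '%'
pearson_df = pd.DataFrame({'Perception': pearson_pct.index, 'Pearson Correlation': pearson_pct.values})
pearson_df
```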
Polychoric Correlations
Polychoric correlation is used to estimate the correlation between two theorized normally distributed continuous latent variables from two observed ordinal variables. This type of correlation is particularly useful when dealing with ordinal data, where the variables are measured on an ordinal scale.
The polychoric correlation (\({\rho}\)) can be estimated using: \[
\rho = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} (o_{ij} - e_{ij})^2 / e_{ij}}{\sqrt{\sum_{i=1}^{n} (o_{i+} - e_{i+})^2 / e_{i+} \sum_{j=1}^{m} (o_{+j} - e_{+j})^2 / e_{+j}}}
\]
Where:
\({o_{ij}}\) is the observed frequency for the cell in row \({i}\) and column \({j}\).
\({e_{ij}}\) is the expected frequency under the assumption of independence.
\({o_{i+}}\) and \({o_{+j}}\) are the marginal totals for row \({i}\) and column \({j}\).
\({e_{i+}}\) and \({e_{+j}}\) are the expected marginal totals for row \({i}\) and column \({j}\).
To compute this, we call the polycor library from R via rpy2 and wrap it as a Python-callable function:
```python
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

# Activate the pandas2ri conversion
pandas2ri.activate()

# Initialize a list to store the results
correlation = []

# Define the R code for the polychoric_corr function
r_code = """
library(polycor)
polychoric_corr <- function(x, y) {
  result <- polychor(x, y)
  return(result)
}
"""

# Run the R code
robjects.r(r_code)

# Get the polychoric_corr function
polychoric_corr = robjects.globalenv['polychoric_corr']

for col in X_list:
    r_corr = polychoric_corr(y, data[col])
    correlation.append(r_corr[0])

# Normalize correlations
total = sum(correlation)
correlation = [value / total for value in correlation]

# Convert correlations to a pandas DataFrame
poly_corr_df = pd.DataFrame({'Perception': X_list, 'Polychoric Correlation': correlation})

# Reformat the column
poly_corr_df['Polychoric Correlation'] = (poly_corr_df['Polychoric Correlation'] * 100).round(1).astype(str) + '%'
poly_corr_df
```
| | Perception | Polychoric Correlation |
|---|---|---|
| 0 | trust | 12.9% |
| 1 | build | 9.9% |
| 2 | differs | 10.0% |
| 3 | easy | 10.9% |
| 4 | appealing | 10.6% |
| 5 | rewarding | 10.1% |
| 6 | popular | 8.9% |
| 7 | service | 13.0% |
| 8 | impact | 13.8% |
Standardized Regression Coefficients
The standardized regression coefficient for predictor \({X_j}\) in a multiple regression model is: \[
\beta_j = b_j \cdot \frac{\sigma_{X_j}}{\sigma_Y}
\]
where \({b_j}\) is the unstandardized regression coefficient, \({\sigma_{X_j}}\) is the standard deviation of \({X_j}\) , and \({\sigma_Y}\) is the standard deviation of the outcome variable \({Y}\).
These are the coefficients obtained from a regression model after standardizing the variables (i.e., converting them to a common scale). This allows for comparison of the relative importance of each predictor.
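As a quick illustration of the formula (a minimal sketch, assuming the X and y defined earlier; the post's own code below works with the raw coefficients and normalizes them to shares):

```python
from sklearn.linear_model import LinearRegression

# A minimal sketch: beta_j = b_j * sd(X_j) / sd(Y), applied to the raw coefficients
lr = LinearRegression().fit(X, y)
standardized_betas = lr.coef_ * X.std().values / y.std()
print(pd.Series(standardized_betas, index=X.columns).round(3))
```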
```python
from sklearn.linear_model import LinearRegression

# Fitting the linear regression model
model = LinearRegression()
model.fit(X, y)

# Obtaining standardized coefficients
coefficients = model.coef_

# Standardized coefficients as a DataFrame for better manipulation
coefficients_df = pd.DataFrame({'Perception': X_list, 'Standardized Coefficient': coefficients})

# Normalize the coefficients
coefficients_df['Standardized Coefficient'] /= coefficients_df['Standardized Coefficient'].sum()

# Convert to percentage format
coefficients_df['Standardized Coefficient'] = (
    coefficients_df['Standardized Coefficient'] * 100
).round(1).astype(str) + '%'
coefficients_df
```
| | Perception | Standardized Coefficient |
|---|---|---|
| 0 | trust | 24.8% |
| 1 | build | 4.3% |
| 2 | differs | 6.3% |
| 3 | easy | 4.7% |
| 4 | appealing | 7.3% |
| 5 | rewarding | 1.1% |
| 6 | popular | 3.6% |
| 7 | service | 18.9% |
| 8 | impact | 29.1% |
Shapley Values
Shapley values, originating from cooperative game theory, are a method to fairly distribute the “payout” among players based on their contribution to the total payout. In the context of machine learning and regression models, Shapley values are used to quantify the contribution of each feature to the prediction of a model.
For linear regression, Shapley values provide a way to understand the importance and contribution of each predictor variable to the prediction for each instance. This is done by considering all possible combinations of features and calculating the marginal contribution of each feature.
The Shapley value for a feature \({j}\) is given by: \[
\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{j\}) - v(S) \right]
\]
Where:
\({N}\) is the set of all features
\({S}\) is a subset of \({N}\) excluding \({j}\)
\({v(S)}\) is the value function (e.g., model performance) for subset \({S}\).
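For a linear model with (assumed) independent features, the Shapley value of feature \({j}\) for a single observation simplifies to \({b_j (x_j - \bar{x}_j)}\), which is essentially the shortcut that shap's LinearExplainer uses under its independence assumption. A minimal sketch of that identity, assuming the LinearRegression model fitted above on X and y:

```python
# A minimal sketch: for a linear model with independent features,
# phi_j = b_j * (x_j - mean(x_j)) for each observation
manual_shap = (X - X.mean()) * model.coef_

# Mean absolute contribution per feature; should closely match the shap output below
print(manual_shap.abs().mean().round(4))
```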
```python
import shap

model = LinearRegression()
model.fit(X, y)

# Calculate Shapley values using the shap library
explainer = shap.LinearExplainer(model, X)
shap_values = explainer(X)

# Get the mean absolute Shapley values for each feature
shap_values_mean = pd.DataFrame(shap_values.values, columns=X.columns).abs().mean()
shap_values_mean /= shap_values_mean.sum()
shap_values_mean = (shap_values_mean * 100).round(1).astype(str) + '%'

shap_values_df = pd.DataFrame({'Perception': shap_values_mean.index, 'Shapley Values': shap_values_mean.values})
shap_values_df
```
| | Perception | Shapley Values |
|---|---|---|
| 0 | trust | 26.7% |
| 1 | build | 4.5% |
| 2 | differs | 5.6% |
| 3 | easy | 5.1% |
| 4 | appealing | 7.6% |
| 5 | rewarding | 1.1% |
| 6 | popular | 3.8% |
| 7 | service | 19.9% |
| 8 | impact | 25.5% |
Johnson's Relative Weights
Johnson's relative weights decompose the model's \({R^2}\) into weights for each predictor, reflecting their relative contribution to the explained variance in the outcome.
The relative weight for a predictor \({X_j}\) is calculated as: \[
RW_j = \sum_{k=1}^{p} \left( \frac{\lambda_{jk}^2 \cdot \text{Var}(Z_k)}{\text{Var}(Y)} \right)
\]
where:
\({\lambda_{jk}}\) is the loading of predictor \({j}\) on the \({k}\)-th principal component
\({\text{Var}(Z_k)}\) is the variance of the \({k}\)-th principal component
\({\text{Var}(Y)}\) is the variance of the outcome variable.
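For intuition, here is a rough numpy sketch of one common formulation of relative weights, which eigen-decomposes the predictor correlation matrix to form the orthogonal components \({Z_k}\). This is only an illustration under that formulation; the post itself relies on the relativeImp package below:

```python
import numpy as np

# A rough sketch: transform X to orthogonal components Z, regress y on Z,
# then map the explained variance back to the original predictors
Rxx = np.corrcoef(X.values, rowvar=False)                             # predictor correlations
rxy = np.array([np.corrcoef(X[col], y)[0, 1] for col in X.columns])  # predictor-outcome correlations

evals, evecs = np.linalg.eigh(Rxx)
Lam = evecs @ np.diag(np.sqrt(evals)) @ evecs.T   # loadings of predictors on the orthogonal Z_k

beta = np.linalg.solve(Lam, rxy)                  # standardized coefficients of y on Z
raw = (Lam ** 2) @ (beta ** 2)                    # each predictor's share of R^2
print(pd.Series(raw / raw.sum(), index=X.columns).round(3))
```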
```python
from relativeImp import relativeImp

# Perform relative weights analysis
johnsons_eps = relativeImp(data, outcomeName='satisfaction', driverNames=X_list)

# Drop the 'rawRelaImpt' column
johnsons_eps = johnsons_eps.drop('rawRelaImpt', axis=1)

# Rename the columns
johnsons_eps = johnsons_eps.rename(columns={'driver': 'Perception', 'normRelaImpt': "Johnson's Epsilon"})

# Reformat the columns
johnsons_eps["Johnson's Epsilon"] = johnsons_eps["Johnson's Epsilon"].round(1).astype('str') + "%"
johnsons_eps
```
| | Perception | Johnson's Epsilon |
|---|---|---|
| 0 | trust | 19.8% |
| 1 | build | 6.6% |
| 2 | differs | 7.0% |
| 3 | easy | 8.2% |
| 4 | appealing | 8.3% |
| 5 | rewarding | 6.0% |
| 6 | popular | 5.4% |
| 7 | service | 16.6% |
| 8 | impact | 22.0% |
Mean Decrease in Gini Coefficient
In Random Forests, the Gini importance (or Mean Decrease in Gini) is calculated based on the average decrease in impurity (Gini impurity) brought by each feature to the nodes in the trees.
The Mean Decrease in Gini for a feature \({j}\) is: \[
\text{MDG}_j = \frac{1}{T} \sum_{t=1}^{T} \left( \Delta Gini(t, j) \right)
\]
Where:
\({T}\) is the total number of trees
\({\Delta Gini(t, j)}\) is the decrease in Gini impurity for tree \({t}\) due to feature \({j}\).
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Fit Random Forest model
rf_model = RandomForestClassifier(n_estimators=50, max_depth=8)
rf_model.fit(X, y)

# Get Mean Decrease in Gini Coefficient (feature importances)
rf_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
# rf_importances /= rf_importances.sum()

rf_importances_df = pd.DataFrame({'Perception': rf_importances.index, 'Mean Decrease in Gini Coefficient': rf_importances.values})
rf_importances_df['Mean Decrease in Gini Coefficient'] = (
    rf_importances_df['Mean Decrease in Gini Coefficient'] * 100
).round(1).astype(str) + '%'

# Display the feature importances
rf_importances_df
```
| | Perception | Mean Decrease in Gini Coefficient |
|---|---|---|
| 0 | trust | 10.5% |
| 1 | build | 11.4% |
| 2 | differs | 10.8% |
| 3 | easy | 10.6% |
| 4 | appealing | 11.6% |
| 5 | rewarding | 11.9% |
| 6 | popular | 12.1% |
| 7 | service | 10.9% |
| 8 | impact | 10.2% |
XGBoost Feature Importance
XGBoost provides three types of feature importance: weight (frequency of a feature in trees), cover (average coverage of the feature), and gain (average gain brought by a feature to the branches it is on).
For example, the weight importance of feature \({j}\) can be written as: \[
\text{Importance}_j = \sum_{t=1}^{T} \sum_{n \in \text{nodes}(t)} I(\text{node } n \text{ splits on feature } j)
\]
where \({T}\) is the total number of trees, \({n}\) ranges over the nodes in tree \({t}\), and \({I}\) is an indicator function that is 1 if node \({n}\) uses feature \({j}\), otherwise 0.
```python
import xgboost as xgb

# Shift the classes in the target variable to start from 0
y_shifted = y - 1

# Train the XGBoost model with the shifted target variable
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X, y_shifted)

# Get feature importances from the XGBoost model
xgb_importances = pd.Series(xgb_model.feature_importances_, index=X.columns)

xgb_importances_df = pd.DataFrame({'Perception': xgb_importances.index, 'XGBoost': xgb_importances.values})
xgb_importances_df['XGBoost'] = (
    xgb_importances_df['XGBoost'] * 100
).round(1).astype(str) + '%'

# Display the feature importances
xgb_importances_df
```
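The feature_importances_ attribute above reports a single importance type; as a minimal sketch, the three types described earlier can be queried from the fitted booster directly (assuming the xgb_model just trained):

```python
# A minimal sketch: retrieve weight, cover, and gain importances from the fitted booster
booster = xgb_model.get_booster()
for imp_type in ('weight', 'cover', 'gain'):
    print(imp_type, booster.get_score(importance_type=imp_type))
```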
For example, the perception that the card makes a difference in the customer's life (impact) shows very high importance across the metrics: Pearson Correlation (13.2%), Standardized Coefficient (29.1%), Shapley Values (25.5%), Johnson's Epsilon (22.0%), XGBoost (15.8%). It is consistently one of the most important factors.
Summary
Top Drivers of Satisfaction
Trust in the brand is the most consistently high-rated factor across multiple metrics, indicating that customer trust in the brand has a profound and consistent impact on overall satisfaction. This suggests that fostering and maintaining trust is crucial for any brand aiming to improve customer satisfaction. Outstanding customer service also shows high importance across all metrics, underlining that excellent customer service is vital for keeping customers satisfied. Additionally, the perception that the product or service makes a difference in customers’ lives is consistently one of the top factors, showing very high importance across most metrics. This highlights that customers highly value how the product or service impacts their lives positively.
Moderate Drivers of Satisfaction
Appealing benefits or rewards hold moderately high importance across various metrics, indicating that attractive benefits and rewards significantly contribute to customer satisfaction. Ease of use is another important factor, particularly in non-linear models, suggesting that while ease of use is significant, it is not the top factor. The perception that the product helps build credit quickly shows moderate importance, especially in tree-based models like Random Forest, making it an important but secondary factor. Differentiation from other cards is moderately important across most metrics, indicating that being different from other cards is valued by customers but is not the most critical factor.
Lower Drivers of Satisfaction
Rewards for responsible usage generally show lower importance in linear models but have some significance in non-linear models. This indicates that while these rewards are valued, they are less critical compared to other factors. The perception that the product is used by a lot of people generally has lower importance but shows higher importance in non-linear models like Random Forest and XGBoost. This suggests that while widespread usage is a less critical factor overall, it still holds some significance in certain contexts.