Key Drivers Analysis

Author

Duyen Tran

Published

May 30, 2024

Introduction

Key Driver Analysis (KDA) is a statistical technique used to determine the factors (or “drivers”) that most significantly impact a particular outcome or dependent variable. It is commonly used in fields like marketing, customer satisfaction, product development, and human resources to understand what influences key outcomes such as customer satisfaction, employee engagement, or product success.

This post implements several measures of variable importance, interpreted as a key drivers analysis, relating customers’ perceptions of a payment card to their satisfaction with that card. This involves calculating Pearson correlations, polychoric correlations, standardized regression coefficients, Shapley values for a linear regression, Johnson’s relative weights, the mean decrease in the Gini coefficient from a random forest, and XGBoost feature importances.
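All of the code below assumes the imports shown here. This is a minimal setup sketch; package names such as relativeImp (for Johnson’s relative weights) and rpy2 (to call R’s polycor package) are the ones assumed by the snippets that follow.

# Core libraries assumed by the code snippets in this post
import numpy as np
import pandas as pd
from functools import reduce

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
import shap
import xgboost as xgb
from relativeImp import relativeImp

# rpy2 is used to call the R polycor package for the polychoric correlations
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri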

Data Overview

First, let’s take a look at the data:

data = pd.read_csv('data_for_drivers_analysis.csv')
data
brand id satisfaction trust build differs easy appealing rewarding popular service impact
0 1 98 3 1 0 1 1 1 0 0 1 0
1 1 179 5 0 0 0 0 0 0 0 0 0
2 1 197 3 1 0 0 1 1 1 0 1 1
3 1 317 1 0 0 0 0 1 0 1 1 1
4 1 356 4 1 1 1 1 1 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ...
2548 10 17800 5 1 1 0 1 0 1 1 1 1
2549 10 17808 3 1 0 0 1 0 1 1 1 0
2550 10 17893 5 0 1 1 0 0 0 0 0 0
2551 10 17984 3 1 1 0 1 0 1 0 0 0
2552 10 18073 4 0 1 0 1 0 0 0 0 0

2553 rows × 12 columns

# Calculate summary statistics for the dataset
summary_statistics = data.describe()

# Display the summary statistics
summary_statistics
brand id satisfaction trust build differs easy appealing rewarding popular service impact
count 2553.000000 2553.000000 2553.000000 2553.000000 2553.000000 2553.000000 2553.000000 2553.000000 2553.000000 2553.000000 2553.000000 2553.000000
mean 4.857423 8931.480611 3.386604 0.549550 0.461810 0.334508 0.536232 0.451234 0.451234 0.536232 0.467293 0.330983
std 2.830096 5114.287849 1.172006 0.497636 0.498637 0.471911 0.498783 0.497714 0.497714 0.498783 0.499027 0.470659
min 1.000000 88.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.000000 4310.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 4.000000 8924.000000 4.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000
75% 6.000000 13545.000000 4.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
max 10.000000 18088.000000 5.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

As we can see, there are 2,553 responses. Each of the nine binary perception columns (trust, build, differs, easy, appealing, rewarding, popular, service, impact) corresponds to one of the following statements about the payment card:

  1. Is offered by a brand I trust
  2. Helps build credit quickly
  3. Is different from other cards
  4. Is easy to use
  5. Has appealing benefits or rewards
  6. Rewards me for responsible usage
  7. Is used by a lot of people
  8. Provides outstanding customer service
  9. Makes a difference in my life

Summary of Metrics

To implement measures of variable importance, interpreted as a key drivers analysis, we compute the following metrics:

  • Pearson Correlation: Measures the linear relationship between each perception and satisfaction.
  • Polychoric Correlation: Estimates the correlation between two theorized normally distributed continuous latent variables from observed ordinal variables.
  • Standardized Coefficient: The coefficients from a linear regression model standardized to have unit variance.
  • Shapley Values: Measure the contribution of each feature to the model’s prediction averaged over all possible combinations of features.
  • Johnson’s Epsilon: Reflects the relative importance of predictors adjusted for multicollinearity.
  • Mean Decrease in Gini Coefficient: Reflects the importance of each feature in reducing impurity in a Random Forest model.
  • XGBoost: Feature importance from the XGBoost model, based on how useful each feature is in reducing the objective function’s error.

First, let’s prepare the predictors and the outcome variable:

# Assign X as the predictor (perception) variables used throughout
X_list = ['trust', 'build', 'differs', 'easy', 'appealing', 'rewarding', 'popular', 'service', 'impact']

X = data[X_list]

# Assign y as the dependent variable
y = data['satisfaction']

Pearson Correlations

The formula for calculating the Pearson correlation coefficient 𝑟 is:

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

The Pearson correlation coefficient (r) measures the strength and direction of the linear relationship between two variables. It is widely used in statistics and data analysis to assess whether and how strongly pairs of variables are related. The value of the Pearson correlation coefficient ranges from -1 to 1:

  • 1: A perfect positive linear relationship

  • -1: A perfect negative linear relationship

  • 0: No linear relationship

# Compute the Pearson correlation of each perception with satisfaction
correlation_matrix = X.corrwith(y)

# Normalize the correlations so they sum to 1 (relative importance shares)
correlation_matrix /= correlation_matrix.sum()

# Convert to percentage format
pearson_correlations = (correlation_matrix * 100).round(1).astype(str) + '%'

pearson_correlations_df = pd.DataFrame({
    'Perception': pearson_correlations.index,
    'Pearson_Correlation': pearson_correlations.values
})

pearson_correlations_df
Perception Pearson_Correlation
0 trust 13.3%
1 build 10.0%
2 differs 9.6%
3 easy 11.1%
4 appealing 10.8%
5 rewarding 10.1%
6 popular 8.9%
7 service 13.0%
8 impact 13.2%
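As a quick sanity check, any single share in the table can be reproduced with scipy (using scipy.stats.pearsonr here is an assumption; the post itself relies on pandas’ corrwith):

from scipy.stats import pearsonr

# Raw Pearson correlations between each perception and satisfaction
raw = {col: pearsonr(data[col], data['satisfaction'])[0] for col in X_list}

# Share of the summed correlations for 'trust', matching the table above
print(round(100 * raw['trust'] / sum(raw.values()), 1))  # expected to be close to 13.3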

Polychoric Correlations

Polychoric correlation is used to estimate the correlation between two theorized normally distributed continuous latent variables from two observed ordinal variables. It is particularly useful here because satisfaction is recorded on a 1–5 ordinal scale and the perceptions are 0/1 indicators.

There is no simple closed-form expression for the polychoric correlation. Under the model, each observed ordinal variable is a discretized version of a latent standard normal variable, and the pair of latent variables is assumed to be bivariate normal with correlation \({\rho}\). The correlation is estimated by choosing \({\rho}\) (together with the thresholds) to maximize the likelihood of the observed contingency table:

\[ \hat{\rho} = \arg\max_{\rho} \sum_{i=1}^{n} \sum_{j=1}^{m} o_{ij} \log \pi_{ij}(\rho) \]

Where:

  • \({o_{ij}}\) is the observed frequency for the cell in row \({i}\) and column \({j}\).

  • \({\pi_{ij}(\rho)}\) is the probability of that cell implied by the bivariate normal model with correlation \({\rho}\), evaluated between the estimated thresholds of the two variables.

To estimate these, we call the polycor package in R from Python via rpy2:

# Activate the pandas2ri conversion
pandas2ri.activate()

# Initialize a list to store the results
correlation = []

# Define the R code for the polychoric_corr function
r_code = """
library(polycor)

polychoric_corr <- function(x, y) {
  result <- polychor(x, y)
  return(result)
}
"""

# Run the R code
robjects.r(r_code)

# Get the polychoric_corr function
polychoric_corr = robjects.globalenv['polychoric_corr']

for col in X_list:
    r_corr = polychoric_corr(y, data[col])
    correlation.append(r_corr[0])

# Normalize correlations
total = sum(correlation)
correlation = [value / total for value in correlation]

# Convert correlations to a pandas DataFrame
poly_corr_df = pd.DataFrame({
    'Perception': X_list,
    'Polychoric Correlation': correlation
})

# Reformat the column
poly_corr_df['Polychoric Correlation'] = (poly_corr_df['Polychoric Correlation']* 100).round(1).astype(str) + '%'


poly_corr_df
Perception Polychoric Correlation
0 trust 12.9%
1 build 9.9%
2 differs 10.0%
3 easy 10.9%
4 appealing 10.6%
5 rewarding 10.1%
6 popular 8.9%
7 service 13.0%
8 impact 13.8%
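For readers who prefer to stay in Python, a two-step polychoric estimate can also be sketched directly with scipy: estimate the thresholds from the marginal proportions, then choose the correlation that maximizes the likelihood of the observed contingency table under a bivariate normal model. This is an illustrative sketch under those assumptions, not a drop-in replacement for polycor::polychor.

from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric_two_step(x, y):
    # Contingency table of the two ordinal variables
    table = pd.crosstab(x, y).values.astype(float)

    def thresholds(counts):
        # Cut points from cumulative marginal proportions (clipped instead of +/- infinity)
        cum = np.cumsum(counts) / counts.sum()
        return np.concatenate(([-8.0], norm.ppf(cum[:-1]), [8.0]))

    a = thresholds(table.sum(axis=1))  # row thresholds
    b = thresholds(table.sum(axis=0))  # column thresholds

    def neg_loglik(rho):
        cov = [[1.0, rho], [rho, 1.0]]
        cdf = lambda u, v: multivariate_normal.cdf([u, v], mean=[0.0, 0.0], cov=cov)
        ll = 0.0
        for i in range(table.shape[0]):
            for j in range(table.shape[1]):
                # Probability of cell (i, j) under the bivariate normal model
                p = (cdf(a[i + 1], b[j + 1]) - cdf(a[i], b[j + 1])
                     - cdf(a[i + 1], b[j]) + cdf(a[i], b[j]))
                ll += table[i, j] * np.log(max(p, 1e-12))
        return -ll

    return minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method='bounded').x

# Example: should be close to the polycor estimate for 'trust'
# print(polychoric_two_step(data['satisfaction'], data['trust']))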

Standardized Regression Coefficients

The standardized regression coefficient for predictor \({X_j}\) in a multiple regression model is:

\[ \beta_j = b_j \cdot \frac{\sigma_{X_j}}{\sigma_Y} \]

where \({b_j}\) is the unstandardized regression coefficient, \({\sigma_{X_j}}\) is the standard deviation of \({X_j}\) , and \({\sigma_Y}\) is the standard deviation of the outcome variable \({Y}\).

These are the coefficients obtained from a regression model after standardizing the variables (i.e., converting them to a common scale), which allows the relative importance of each predictor to be compared. In the code below, the predictors are all 0/1 indicators and are therefore already on a comparable scale; the raw coefficients are simply normalized to sum to 100%.

# Fitting the linear regression model
model = LinearRegression()
model.fit(X, y)

# Obtain the regression coefficients (normalized to shares below)
coefficients = model.coef_

# Standardized coefficients as a DataFrame for better manipulation
coefficients_df = pd.DataFrame({
    'Perception': X_list,
    'Standardized Coefficient': coefficients
})
# Normalize the coefficients
coefficients_df['Standardized Coefficient'] /= coefficients_df['Standardized Coefficient'].sum()

# Convert to percentage format
coefficients_df['Standardized Coefficient'] = (
    coefficients_df['Standardized Coefficient'] * 100).round(1).astype(str) + '%'

coefficients_df
Perception Standardized Coefficient
0 trust 24.8%
1 build 4.3%
2 differs 6.3%
3 easy 4.7%
4 appealing 7.3%
5 rewarding 1.1%
6 popular 3.6%
7 service 18.9%
8 impact 29.1%
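The block above leaves the 0/1 predictors on their original scale and normalizes the raw coefficients. A literal implementation of the formula \({\beta_j = b_j \cdot \sigma_{X_j} / \sigma_Y}\) would look like the sketch below; because all of the predictors here are binary indicators with similar standard deviations, it should give broadly similar shares.

# Standardize the raw coefficients: beta_j = b_j * sd(X_j) / sd(y)
std_betas = model.coef_ * X.std(axis=0).values / y.std()

# Express as shares summing to 100%, as in the table above
print(pd.Series(100 * std_betas / std_betas.sum(), index=X_list).round(1))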

Shapley values

Shapley values, originating from cooperative game theory, are a method to fairly distribute the “payout” among players based on their contribution to the total payout. In the context of machine learning and regression models, Shapley values are used to quantify the contribution of each feature to the prediction of a model.

For linear regression, Shapley values provide a way to understand the importance and contribution of each predictor variable to the prediction for each instance. This is done by considering all possible combinations of features and calculating the marginal contribution of each feature.

The Shapley value for a feature \({j}\) is given by: \[ \phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{j\}) - v(S) \right] \]

Where:

  • \({N}\) is the set of all features

  • \({S}\) is a subset of \({N}\) excluding \({j}\)

  • \({v(S)}\) is the value function (e.g., model performance) for subset \({S}\).

import shap

model = LinearRegression()
model.fit(X, y)


# Calculate Shapley values using the shap library
explainer = shap.LinearExplainer(model, X)
shap_values = explainer(X)

# Get the mean absolute Shapley values for each feature
shap_values_mean = pd.DataFrame(shap_values.values, columns=X.columns).abs().mean()

shap_values_mean /= shap_values_mean.sum()

shap_values_mean = (shap_values_mean * 100).round(1).astype(str) + '%'

shap_values_df = pd.DataFrame({
    'Perception': shap_values_mean.index,
    'Shapley Values': shap_values_mean.values
})


shap_values_df
Perception Shapley Values
0 trust 26.7%
1 build 4.5%
2 differs 5.6%
3 easy 5.1%
4 appealing 7.6%
5 rewarding 1.1%
6 popular 3.8%
7 service 19.9%
8 impact 25.5%
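For a linear model with independent features, the Shapley value of feature \({j}\) for observation \({i}\) reduces to \({b_j (x_{ij} - \bar{x}_j)}\), so the shap output can be checked by hand with a small verification sketch:

# Manual Shapley values for a linear model: coefficient times the centered feature value
manual_shap = (X - X.mean()) * model.coef_

# Mean absolute contribution per feature, normalized to shares as above
manual_importance = manual_shap.abs().mean()
print((100 * manual_importance / manual_importance.sum()).round(1))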

Johnson’s Relative Weights

Johnson’s relative weights decompose the model’s \({R^2}\), assigning each predictor a weight that reflects its relative contribution to the explained variance in the outcome even when the predictors are correlated.

The predictors are first transformed into a set of orthogonal variables \({Z_k}\) that are maximally related to the original predictors; the outcome is then regressed on these orthogonal variables. The raw relative weight for predictor \({X_j}\) is:

\[ \varepsilon_j = \sum_{k=1}^{p} \lambda_{jk}^2 \, \beta_k^2 \]

where:

\({\lambda_{jk}}\) is the loading of predictor \({X_j}\) on the \({k}\)-th orthogonal variable \({Z_k}\)

\({\beta_k}\) is the standardized coefficient from regressing the outcome \({Y}\) on \({Z_k}\)

The raw weights sum to \({R^2}\); dividing by this total gives the normalized weights reported below.

# Perform relative weights analysis
johnsons_eps = relativeImp(data, outcomeName= 'satisfaction', driverNames = X_list)

# Drop the 'rawRelaImpt' column
johnsons_eps = johnsons_eps.drop('rawRelaImpt', axis=1)

# Rename the columns
johnsons_eps = johnsons_eps.rename(columns={'driver': 'Perception', 'normRelaImpt': "Johnson's Epsilon"})

# Reformat the columns
johnsons_eps["Johnson's Epsilon"] = johnsons_eps["Johnson's Epsilon"].round(1).astype('str') + "%"


johnsons_eps
Perception Johnson's Epsilon
0 trust 19.8%
1 build 6.6%
2 differs 7.0%
3 easy 8.2%
4 appealing 8.3%
5 rewarding 6.0%
6 popular 5.4%
7 service 16.6%
8 impact 22.0%
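The same weights can be reproduced from first principles with numpy, following Johnson’s (2000) procedure: a compact sketch, assuming standardized variables and the symmetric square root of the predictor correlation matrix as the loading matrix.

# Johnson's relative weights from first principles (sketch)
Z = (X - X.mean()) / X.std()                       # standardized predictors
yz = (y - y.mean()) / y.std()                      # standardized outcome

Rxx = np.corrcoef(Z, rowvar=False)                 # predictor correlation matrix
rxy = np.array([np.corrcoef(Z[c], yz)[0, 1] for c in X_list])

# Lambda = P * Delta^(1/2) * P' holds the loadings of predictors on the orthogonal variables
evals, P = np.linalg.eigh(Rxx)
Lambda = P @ np.diag(np.sqrt(evals)) @ P.T

# Regress y on the orthogonal variables, then apportion the explained variance
beta = np.linalg.solve(Lambda, rxy)
raw_weights = (Lambda ** 2) @ (beta ** 2)
print(pd.Series(100 * raw_weights / raw_weights.sum(), index=X_list).round(1))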

Mean Decrease in Gini Coefficient

In Random Forests, the Gini importance (or Mean Decrease in Gini) is calculated based on the average decrease in impurity (Gini impurity) brought by each feature to the nodes in the trees.

The Mean Decrease in Gini for a feature \({j}\) is: \[ \text{MDG}_j = \frac{1}{T} \sum_{t=1}^{T} \left( \Delta Gini(t, j) \right) \]

Where:

  • \({T}\) is the total number of trees

  • \({\Delta Gini(t, j)}\) is the decrease in Gini impurity for tree \({t}\) due to feature \({j}\).

np.random.seed(42)
# Fit Random Forest model
rf_model = RandomForestClassifier(n_estimators=50, max_depth=8)
rf_model.fit(X, y)

# Get Mean Decrease in Gini Coefficient (feature importances)
rf_importances = pd.Series(rf_model.feature_importances_, index=X.columns)

# Note: feature_importances_ already sums to 1, so no extra normalization is needed

rf_importances_df = pd.DataFrame({
    'Perception': rf_importances.index,
    'Mean Decrease in Gini Coefficient': rf_importances.values
})

rf_importances_df['Mean Decrease in Gini Coefficient'] = (
    rf_importances_df['Mean Decrease in Gini Coefficient']* 100).round(1).astype(str) + '%'

# Display the feature importances
rf_importances_df
Perception Mean Decrease in Gini Coefficient
0 trust 10.5%
1 build 11.4%
2 differs 10.8%
3 easy 10.6%
4 appealing 11.6%
5 rewarding 11.9%
6 popular 12.1%
7 service 10.9%
8 impact 10.2%
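Impurity-based importances reflect how the trees were built rather than predictive contribution directly, so a complementary check is permutation importance, which measures how much the model’s score drops when each perception is shuffled (a sketch using sklearn.inspection; not part of the original analysis):

from sklearn.inspection import permutation_importance

# Drop in accuracy when each perception is randomly permuted, averaged over repeats
perm = permutation_importance(rf_model, X, y, n_repeats=10, random_state=42)
perm_shares = perm.importances_mean / perm.importances_mean.sum()
print(pd.Series((100 * perm_shares).round(1), index=X_list))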

XGBoost Feature Importance

XGBoost provides three types of feature importance: weight (frequency of a feature in trees), cover (average coverage of the feature), and gain (average gain brought by a feature to the branches it is on).

  1. Weight (frequency) \[ \text{Weight}_j = \sum_{t=1}^{T} \sum_{n \in t} I(n = j) \]

where \({T}\) is the total number of trees, \({n}\) are the nodes in tree \({t}\), and \({I}\) is an indicator function that is 1 if the node \({n}\) uses feature \({j}\), otherwise 0.

  2. Cover (Average Cover)

\[ \text{Cover}_j = \sum_{t=1}^{T} \sum_{n \in t} \frac{I(n = j) \cdot \text{cover}(n)}{\sum_{n' \in t} \text{cover}(n')} \]

  3. Gain (Average Gain)

\[ \text{Gain}_j = \sum_{t=1}^{T} \sum_{n \in t} \frac{I(n = j) \cdot \text{gain}(n)}{\sum_{n' \in t} \text{gain}(n')} \]

# Shift the classes in the target variable to start from 0
y_shifted = y - 1

# Train the XGBoost model with the shifted target variable
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X, y_shifted)

# Get feature importances from the XGBoost model
xgb_importances = pd.Series(xgb_model.feature_importances_, index=X.columns)

xgb_importances_df = pd.DataFrame({
    'Perception': xgb_importances.index,
    'XGBoost': xgb_importances.values
})

xgb_importances_df['XGBoost'] = (
    xgb_importances_df['XGBoost']* 100).round(1).astype(str) + '%'

# Display the feature importances
xgb_importances_df
Perception XGBoost
0 trust 15.1%
1 build 9.1%
2 differs 11.0%
3 easy 9.5%
4 appealing 9.9%
5 rewarding 8.8%
6 popular 10.2%
7 service 10.7%
8 impact 15.8%
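The table above comes from the sklearn wrapper’s feature_importances_; the three importance types described earlier can also be pulled directly from the underlying booster:

# Retrieve weight, cover and gain importances from the trained booster
booster = xgb_model.get_booster()
for imp_type in ['weight', 'cover', 'gain']:
    print(imp_type, booster.get_score(importance_type=imp_type))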

Results

Now let’s combine all of the results into a single table:

# Create dataframe list
dataframes = [
    pearson_correlations_df, 
    poly_corr_df, coefficients_df, 
    shap_values_df, 
    johnsons_eps, 
    rf_importances_df, 
    xgb_importances_df] 

# Merge all dataframes on the Perception column
table = reduce(lambda left,right: pd.merge(left,right,on='Perception'), dataframes)

table
Perception Pearson_Correlation Polychoric Correlation Standardized Coefficient Shapley Values Johnson's Epsilon Mean Decrease in Gini Coefficient XGBoost
0 trust 13.3% 12.9% 24.8% 26.7% 19.8% 10.5% 15.1%
1 build 10.0% 9.9% 4.3% 4.5% 6.6% 11.4% 9.1%
2 differs 9.6% 10.0% 6.3% 5.6% 7.0% 10.8% 11.0%
3 easy 11.1% 10.9% 4.7% 5.1% 8.2% 10.6% 9.5%
4 appealing 10.8% 10.6% 7.3% 7.6% 8.3% 11.6% 9.9%
5 rewarding 10.1% 10.1% 1.1% 1.1% 6.0% 11.9% 8.8%
6 popular 8.9% 8.9% 3.6% 3.8% 5.4% 12.1% 10.2%
7 service 13.0% 13.0% 18.9% 19.9% 16.6% 10.9% 10.7%
8 impact 13.2% 13.8% 29.1% 25.5% 22.0% 10.2% 15.8%
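To compare the metrics at a glance, the percentage strings can be converted back to numbers and plotted as a heatmap (a small sketch using matplotlib, which is not otherwise used in this post):

import matplotlib.pyplot as plt

# Convert the percentage strings back to floats for plotting
numeric = table.set_index('Perception').apply(lambda col: col.str.rstrip('%').astype(float))

fig, ax = plt.subplots(figsize=(10, 5))
im = ax.imshow(numeric.values, cmap='Blues')
ax.set_xticks(range(numeric.shape[1]), numeric.columns, rotation=45, ha='right')
ax.set_yticks(range(numeric.shape[0]), numeric.index)
fig.colorbar(im, ax=ax, label='Relative importance (%)')
plt.tight_layout()
plt.show()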

Interpretation by Perception

1. Is offered by a brand I trust:

Highest importance across multiple metrics: Pearson Correlation (13.3%), Standardized Coefficient (24.8%), Shapley Values (26.7%), Johnson’s Epsilon (19.8%), XGBoost (15.1%).

Conclusion: This is consistently identified as a key driver of satisfaction across different methods.

2. Helps build credit quickly:

Moderate importance: Pearson Correlation (10.0%), Standardized Coefficient (4.3%), Shapley Values (4.5%), Johnson’s Epsilon (6.6%), Mean Decrease in Gini (11.4%).

Conclusion: Important, but less so than “trust”.

3. Is different from other cards:

Moderate to lower importance: Pearson Correlation (9.6%), Standardized Coefficient (6.3%), Shapley Values (5.6%), Johnson’s Epsilon (7.0%).

Conclusion: Also important, but not a top driver.

4. Is easy to use:

Moderate importance: Pearson Correlation (11.1%), Standardized Coefficient (4.7%), Shapley Values (5.1%), Johnson’s Epsilon (8.2%).

Conclusion: Important, but consistently less so than “trust”.

5. Has appealing benefits or rewards:

Moderate importance: Pearson Correlation (10.8%), Standardized Coefficient (7.3%), Shapley Values (7.6%), Johnson’s Epsilon (8.3%).

Conclusion: Consistently important across all measures.

6. Rewards me for responsible usage:

Lower importance: Pearson Correlation (10.1%), Standardized Coefficient (1.1%), Shapley Values (1.1%), Johnson’s Epsilon (6.0%).

Conclusion: Generally less important compared to other perceptions.

7. Is used by a lot of people:

Lower to moderate importance: Pearson Correlation (8.9%), Standardized Coefficient (3.6%), Shapley Values (3.8%), Johnson’s Epsilon (5.4%), Mean Decrease in Gini (12.1%).

Conclusion: Less critical than other factors.

8. Provides outstanding customer service:

High importance: Pearson Correlation (13.0%), Standardized Coefficient (18.9%), Shapley Values (19.9%), Johnson’s Epsilon (16.6%).

Conclusion: Another key driver of satisfaction.

9. Makes a difference in my life:

Very high importance: Pearson Correlation (13.2%), Standardized Coefficient (29.1%), Shapley Values (25.5%), Johnson’s Epsilon (22.0%), XGBoost (15.8%).

Conclusion: Consistently one of the most important factors.

Summary

Top Drivers of Satisfaction

Trust in the brand is the most consistently high-rated factor across multiple metrics, indicating that customer trust in the brand has a profound and consistent impact on overall satisfaction. This suggests that fostering and maintaining trust is crucial for any brand aiming to improve customer satisfaction. Outstanding customer service also shows high importance across all metrics, underlining that excellent customer service is vital for keeping customers satisfied. Additionally, the perception that the product or service makes a difference in customers’ lives is consistently one of the top factors, showing very high importance across most metrics. This highlights that customers highly value how the product or service impacts their lives positively.

Moderate Drivers of Satisfaction

Appealing benefits or rewards hold moderately high importance across various metrics, indicating that attractive benefits and rewards significantly contribute to customer satisfaction. Ease of use is another important factor, particularly in non-linear models, suggesting that while ease of use is significant, it is not the top factor. The perception that the product helps build credit quickly shows moderate importance, especially in tree-based models like Random Forest, making it an important but secondary factor. Differentiation from other cards is moderately important across most metrics, indicating that being different from other cards is valued by customers but is not the most critical factor.

Lower Drivers of Satisfaction

Rewards for responsible usage generally show lower importance in linear models but have some significance in non-linear models. This indicates that while these rewards are valued, they are less critical compared to other factors. The perception that the product is used by a lot of people generally has lower importance but shows higher importance in non-linear models like Random Forest and XGBoost. This suggests that while widespread usage is a less critical factor overall, it still holds some significance in certain contexts.