Pentathlon: Next Product to Buy Models

Author

Duyen Tran

Published

May 30, 2024

PROJECT SUMMARY

• Project Goal: Optimize profit by tailoring email campaigns for 7 departments to individual customers based on predictive modeling.

Modeling Approach:

• Logistic Regression: Utilized for its ability to predict the probability of customer engagement based on email content.
• Linear Regression: Employed to predict average order size from each customer as a continuous outcome.
• Random Forest: Chosen for its robustness and ability to capture non-linear patterns and feature interactions.
• Neural Network: Applied to leverage its deep learning capabilities in identifying complex patterns within customer data.
• XGBoost: Selected for its efficiency, model performance, and ability to handle a variety of data types and distributions.

Model Training and Tuning:

• Conducted comprehensive training and hyperparameter tuning for each model to ensure optimal performance.
• Employed methods like cross-validation and grid search, particularly for computationally intensive models like Random Forest (see the sketch at the end of this summary).

Profit Analysis:

• Evaluated the predicted profit increase from each model to identify the most effective approach.
• Analyzed model-specific trade-offs, such as training time (e.g., Random Forest’s extensive training period).

Email Policy Evaluation:

• Assessed the impact of each model on the email policy proposal to ensure alignment with profit optimization goals.
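As a rough illustration of the tuning step mentioned above, the sketch below shows 5-fold cross-validated grid search for a Random Forest classifier. It assumes scikit-learn; the grid values and the X_train/y_train names are hypothetical placeholders, not the exact setup used in this project.

# Minimal sketch of cross-validated grid search (assumes scikit-learn);
# the grid values and data names below are hypothetical placeholders
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_features": [3, 5, 7],        # hypothetical grid
    "min_samples_leaf": [1, 5, 10],   # hypothetical grid
}
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=1234),
    param_grid=param_grid,
    scoring="roc_auc",  # AUC, the evaluation metric used throughout this report
    cv=5,               # 5-fold cross-validation
    n_jobs=-1,
)
# grid.fit(X_train, y_train)  # fit on the training split prepared below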

Data Preparation

import pandas as pd
import pyrsm as rsm
## loading the data - this dataset must NOT be changed
pentathlon_nptb = pd.read_parquet("pentathlon_nptb.parquet")

Pentathlon: Next Product To Buy

The available data is based on the last e-mail sent to each Pentathlon customer. Hence, an observation or row in the data is a “customer-promotional e-mail” pair. The data contains the following basic demographic information available to Pentathlon:

  • “age”: Customer age (coded in 4 buckets: “<30”, “30 to 44”, “45 to 59”, and “>=60”)
  • “female”: Gender identity coded as Female “yes” or “no”
  • “income”: Income in Euros, rounded to the nearest EUR5,000
  • “education”: Percentage of college graduates in the customer’s neighborhood, coded from 0-100
  • “children”: Average number of children in the customer’s neighborhood

The data also contains basic historical information about customer purchases, specifically, a department-specific frequency measure.

  • “freq_endurance-freq_racquet”: Number of purchases in each department in the last year, excluding any purchase in response to the last email.

The key outcome variables are:

  • “buyer”: Did the customer click on the e-mail and complete a purchase within two days of receiving the e-mail (“yes” or “no”)?
  • “total_os”: Total order size (in Euros) conditional on the customer having purchased (buyer == “yes”). This measures spending across all departments, not just the department that sent the message.

Note: In addition to the six message groups, a seventh group of customers received no promotional e-mails for the duration of the test (“control”).

pentathlon_nptb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600000 entries, 0 to 599999
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype   
---  ------            --------------   -----   
 0   custid            600000 non-null  object  
 1   buyer             600000 non-null  category
 2   total_os          600000 non-null  int32   
 3   message           600000 non-null  category
 4   age               600000 non-null  category
 5   female            600000 non-null  category
 6   income            600000 non-null  int32   
 7   education         600000 non-null  int32   
 8   children          600000 non-null  float64 
 9   freq_endurance    600000 non-null  int32   
 10  freq_strength     600000 non-null  int32   
 11  freq_water        600000 non-null  int32   
 12  freq_team         600000 non-null  int32   
 13  freq_backcountry  600000 non-null  int32   
 14  freq_racquet      600000 non-null  int32   
 15  training          600000 non-null  float64 
dtypes: category(4), float64(2), int32(9), object(1)
memory usage: 36.6+ MB
pentathlon_nptb['buyer_yes'] = rsm.ifelse(pentathlon_nptb['buyer'] == 'yes', 1, 0)
pentathlon_nptb['buyer_yes'].value_counts()
buyer_yes
0    585600
1     14400
Name: count, dtype: int64

Check the training and test data to make sure they are split in a 70-30 ratio.

pd.crosstab(pentathlon_nptb["message"], pentathlon_nptb["training"])
training       0.0    1.0
message
backcountry  26179  60425
control      26043  61217
endurance    24773  58083
racquet      26316  60772
strength     25251  59029
team         25942  60850
water        25496  59624
pentathlon_nptb.training.value_counts(normalize=True)
training
1.0    0.7
0.0    0.3
Name: proportion, dtype: float64
pentathlon_nptb.hist(figsize=(10, 10), bins=30)
[4 x 3 grid of histograms: total_os, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet, training, buyer_yes]

Logistic Regression Model

Check cross-tabs between the outcome variable and the key predictor variables.

ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "age")
ct.summary(output = "perc_row")
ct.plot(plots = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, age
Null hyp : There is no association between buyer and age
Alt. hyp : There is an association between buyer and age

Row percentages:

age     < 30 30 to 44 45 to 59   >= 60   Total
buyer                                         
yes    9.42%   40.74%   38.18%  11.65%  100.0%
no     17.7%   31.16%   32.42%  18.72%  100.0%
Total  17.5%   31.39%   32.56%  18.55%  100.0%

Chi-squared: 1038.94 df(3), p.value < .001
0.0% of cells have expected values below 5

ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "children")
ct.summary(output = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, children
Null hyp : There is no association between buyer and children
Alt. hyp : There is an association between buyer and children

Row percentages:

children    0.2    0.3    0.4    0.5    0.6     0.7    0.8     0.9     1.0  \
buyer                                                                        
yes       1.58%  2.16%  3.52%  4.06%  5.13%   7.48%  7.55%  10.71%  10.63%   
no        2.98%   4.2%  6.18%  7.23%   7.9%  10.42%  9.56%  10.81%   9.11%   
Total     2.94%  4.15%  6.12%  7.15%  7.84%  10.35%  9.51%   10.8%   9.15%   

children    1.1  ...   4.5   4.6   4.7    4.8   4.9   5.0   5.1   5.2   5.8  \
buyer            ...                                                          
yes       9.74%  ...  0.0%  0.0%  0.0%  0.01%  0.0%  0.0%  0.0%  0.0%  0.0%   
no        7.18%  ...  0.0%  0.0%  0.0%   0.0%  0.0%  0.0%  0.0%  0.0%  0.0%   
Total     7.24%  ...  0.0%  0.0%  0.0%   0.0%  0.0%  0.0%  0.0%  0.0%  0.0%   

children   Total  
buyer             
yes       100.0%  
no        100.0%  
Total     100.0%  

[3 rows x 53 columns]

Chi-squared: 2028.29 df(51), p.value < .001
700.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "message")
ct.summary(output = "perc_row")
ct.plot(plots = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, message
Null hyp : There is no association between buyer and message
Alt. hyp : There is an association between buyer and message

Row percentages:

message backcountry control endurance racquet strength    team   water   Total
buyer                                                                         
yes          13.97%  13.27%    15.37%  13.77%   14.85%  14.58%  14.19%  100.0%
no            14.4%  14.61%    13.79%  14.49%   14.03%  14.49%   14.2%  100.0%
Total        14.39%  14.58%    13.83%  14.47%   14.05%  14.49%   14.2%  100.0%

Chi-squared: 39.15 df(6), p.value < .001
0.0% of cells have expected values below 5

ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_endurance")
ct.summary(output = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, freq_endurance
Null hyp : There is no association between buyer and freq_endurance
Alt. hyp : There is an association between buyer and freq_endurance

Row percentages:

freq_endurance       0       1       2       3      4      5      6      7  \
buyer                                                                        
yes              24.2%  16.15%  14.66%  12.23%  9.58%  7.76%  5.24%  3.76%   
no              56.16%   22.7%  10.21%   5.33%  2.82%  1.42%  0.72%  0.34%   
Total            55.4%  22.54%  10.32%    5.5%  2.98%  1.58%  0.83%  0.42%   

freq_endurance      8      9     10     11     12     13     14     15   Total  
buyer                                                                           
yes             2.56%  1.72%  0.97%  0.57%  0.34%  0.14%  0.11%  0.02%  100.0%  
no              0.16%  0.08%  0.03%  0.01%  0.01%   0.0%   0.0%   0.0%  100.0%  
Total           0.22%  0.12%  0.05%  0.03%  0.01%  0.01%   0.0%   0.0%  100.0%  

Chi-squared: 21259.53 df(15), p.value < .001
150.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_strength")
ct.summary(output = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, freq_strength
Null hyp : There is no association between buyer and freq_strength
Alt. hyp : There is an association between buyer and freq_strength

Row percentages:

freq_strength       0       1       2      3      4      5      6      7  \
buyer                                                                      
yes             21.4%   9.33%   9.92%  8.99%  8.33%  8.32%  6.73%  6.43%   
no             46.85%   18.9%  11.59%  7.77%  5.32%  3.54%  2.26%  1.51%   
Total          46.24%  18.67%  11.55%   7.8%   5.4%  3.65%  2.37%  1.63%   

freq_strength      8      9  ...     15     16     17     18     19    20  \
buyer                        ...                                            
yes             5.1%  4.09%  ...  0.65%  0.51%  0.23%  0.25%  0.11%  0.1%   
no             0.94%  0.55%  ...  0.02%  0.01%  0.01%   0.0%   0.0%  0.0%   
Total          1.04%  0.64%  ...  0.03%  0.02%  0.01%  0.01%   0.0%  0.0%   

freq_strength     21     22     23   Total  
buyer                                       
yes            0.04%  0.05%  0.01%  100.0%  
no              0.0%   0.0%   0.0%  100.0%  
Total           0.0%   0.0%   0.0%  100.0%  

[3 rows x 25 columns]

Chi-squared: 22032.18 df(23), p.value < .001
300.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_water")
ct.summary(output = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, freq_water
Null hyp : There is no association between buyer and freq_water
Alt. hyp : There is an association between buyer and freq_water

Row percentages:

freq_water       0       1      2      3      4      5      6      7      8  \
buyer                                                                         
yes         64.79%  18.38%   9.3%   4.7%  1.88%  0.66%  0.21%  0.06%  0.01%   
no          92.53%   5.72%  1.33%  0.32%  0.08%  0.01%   0.0%   0.0%   0.0%   
Total       91.87%   6.03%  1.52%  0.42%  0.12%  0.03%  0.01%   0.0%   0.0%   

freq_water   Total  
buyer               
yes         100.0%  
no          100.0%  
Total       100.0%  

Chi-squared: 16794.94 df(8), p.value < .001
125.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_team")
ct.summary(output = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, freq_team
Null hyp : There is no association between buyer and freq_team
Alt. hyp : There is an association between buyer and freq_team

Row percentages:

freq_team       0       1       2      3      4      5      6      7      8  \
buyer                                                                         
yes         39.2%  10.78%   9.88%  9.12%  8.45%   6.8%  4.74%  4.07%  2.71%   
no         59.12%  16.61%  10.14%  6.28%  3.73%  2.08%  1.08%  0.53%  0.24%   
Total      58.64%  16.47%  10.14%  6.35%  3.84%  2.19%  1.17%  0.61%   0.3%   

freq_team      9     10     11     12     13     14     15     16   Total  
buyer                                                                      
yes        1.87%  1.07%  0.62%  0.32%  0.12%   0.2%  0.05%  0.02%  100.0%  
no         0.11%  0.05%  0.02%  0.01%   0.0%   0.0%   0.0%   0.0%  100.0%  
Total      0.16%  0.07%  0.04%  0.02%  0.01%  0.01%   0.0%   0.0%  100.0%  

Chi-squared: 13743.4 df(16), p.value < .001
175.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_backcountry")
ct.summary(output = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, freq_backcountry
Null hyp : There is no association between buyer and freq_backcountry
Alt. hyp : There is an association between buyer and freq_backcountry

Row percentages:

freq_backcountry       0       1       2       3      4      5      6   Total
buyer                                                                        
yes               45.69%  20.77%  18.53%  10.39%  3.85%  0.73%  0.03%  100.0%
no                61.34%  28.61%   8.24%   1.57%  0.22%  0.02%   0.0%  100.0%
Total             60.97%  28.42%   8.49%   1.78%   0.3%  0.03%   0.0%  100.0%

Chi-squared: 11914.73 df(6), p.value < .001
50.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_racquet")
ct.summary(output = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, freq_racquet
Null hyp : There is no association between buyer and freq_racquet
Alt. hyp : There is an association between buyer and freq_racquet

Row percentages:

freq_racquet       0       1       2       3       4      5      6      7  \
buyer                                                                       
yes           23.83%  18.48%   17.7%  13.82%  10.43%  7.43%  4.32%  2.42%   
no            48.75%  28.86%  12.68%   5.76%   2.48%  0.97%  0.34%  0.12%   
Total         48.15%  28.61%   12.8%   5.95%   2.68%  1.13%  0.44%  0.17%   

freq_racquet      8      9     10     11   Total  
buyer                                             
yes           0.97%  0.46%  0.13%  0.02%  100.0%  
no            0.03%  0.01%   0.0%   0.0%  100.0%  
Total         0.05%  0.02%   0.0%   0.0%  100.0%  

Chi-squared: 18774.72 df(11), p.value < .001
100.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "income")
ct.summary(output = "perc_row")

Cross-tabs
Data     : Not provided
Variables: buyer, income
Null hyp : There is no association between buyer and income
Alt. hyp : There is an association between buyer and income

Row percentages:

income     0   5000  10000  15000  20000  25000  30000   35000   40000  \
buyer                                                                    
yes     0.0%  0.01%   0.0%  0.16%  0.26%  0.52%  0.84%    1.3%   2.22%   
no      0.0%  0.03%  0.19%  0.73%  2.13%  4.78%  8.77%  11.78%   14.0%   
Total   0.0%  0.03%  0.18%  0.71%  2.09%  4.68%  8.58%  11.53%  13.72%   

income   45000  ... 160000 165000 170000 175000 180000 185000 190000 195000  \
buyer           ...                                                           
yes      3.83%  ...  0.26%  0.37%  0.19%  0.14%  0.19%  0.14%   0.1%  0.11%   
no      12.71%  ...   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   0.0%   
Total    12.5%  ...  0.01%  0.01%  0.01%  0.01%  0.01%   0.0%   0.0%   0.0%   

income 200000   Total  
buyer                  
yes     0.54%  100.0%  
no      0.01%  100.0%  
Total   0.02%  100.0%  

[3 rows x 42 columns]

Chi-squared: 40347.2 df(40), p.value < .001
400.0% of cells have expected values below 5
pentathlon_nptb[pentathlon_nptb["training"] == 1]
custid buyer total_os message age female income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet training buyer_yes
0 U1 no 0 team 30 to 44 no 55000 19 0.8 0 4 0 4 0 1 1.0 0
2 U13 no 0 endurance 45 to 59 yes 45000 33 0.7 0 0 0 0 2 2 1.0 0
3 U20 no 0 water 45 to 59 yes 25000 24 0.2 0 0 0 0 0 0 1.0 0
5 U28 no 0 strength < 30 yes 25000 18 0.3 0 0 0 0 0 0 1.0 0
8 U59 no 0 strength >= 60 yes 65000 36 1.2 1 1 0 2 0 3 1.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
599989 U3462835 no 0 control 30 to 44 yes 45000 20 0.8 1 0 0 0 0 1 1.0 0
599992 U3462858 no 0 control 45 to 59 yes 40000 16 1.1 0 0 0 0 0 0 1.0 0
599993 U3462877 no 0 backcountry < 30 no 65000 30 1.0 2 3 1 0 1 1 1.0 0
599995 U3462888 no 0 water >= 60 yes 40000 26 0.6 0 0 0 0 0 0 1.0 0
599997 U3462902 no 0 team < 30 yes 55000 32 0.9 0 5 0 2 1 2 1.0 0

420000 rows × 17 columns

Build a logistic regression model

evar = pentathlon_nptb.columns.to_list()
evar = evar[evar.index("message"):]
evar = evar[:evar.index("freq_racquet")+1]
evar
['message',
 'age',
 'female',
 'income',
 'education',
 'children',
 'freq_endurance',
 'freq_strength',
 'freq_water',
 'freq_team',
 'freq_backcountry',
 'freq_racquet']
lr = rsm.model.logistic(
    data = {"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar = "buyer_yes",
    evar = evar,
)
lr.summary()
Logistic regression (GLM)
Data                 : pentathlon_nptb
Response variable    : buyer_yes
Level                : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Null hyp.: There is no effect of x on buyer_yes
Alt. hyp.: There is an effect of x on buyer_yes

                       OR      OR%  coefficient  std.error  z.value p.value     
Intercept           0.000  -100.0%        -8.28      0.064 -128.837  < .001  ***
message[control]    0.906    -9.4%        -0.10      0.042   -2.335    0.02    *
message[endurance]  1.250    25.0%         0.22      0.041    5.462  < .001  ***
message[racquet]    0.993    -0.7%        -0.01      0.042   -0.171   0.864     
message[strength]   1.162    16.2%         0.15      0.041    3.660  < .001  ***
message[team]       1.035     3.5%         0.03      0.041    0.823    0.41     
message[water]      1.056     5.6%         0.05      0.042    1.311    0.19     
age[30 to 44]       2.386   138.6%         0.87      0.039   22.353  < .001  ***
age[45 to 59]       2.186   118.6%         0.78      0.040   19.555  < .001  ***
age[>= 60]          1.201    20.1%         0.18      0.048    3.805  < .001  ***
female[no]          1.353    35.3%         0.30      0.024   12.574  < .001  ***
income              1.000     0.0%         0.00      0.000   22.105  < .001  ***
education           1.037     3.7%         0.04      0.001   32.809  < .001  ***
children            1.618    61.8%         0.48      0.030   16.210  < .001  ***
freq_endurance      1.101    10.1%         0.10      0.005   18.300  < .001  ***
freq_strength       1.115    11.5%         0.11      0.003   33.044  < .001  ***
freq_water          1.178    17.8%         0.16      0.014   12.026  < .001  ***
freq_team           1.050     5.0%         0.05      0.004   10.797  < .001  ***
freq_backcountry    1.188    18.8%         0.17      0.010   17.313  < .001  ***
freq_racquet        1.110    11.0%         0.10      0.007   15.866  < .001  ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared (McFadden): 0.261
Pseudo R-squared (McFadden adjusted): 0.26
Area under the ROC Curve (AUC): 0.884
Log-likelihood: -35159.009, AIC: 70358.018, BIC: 70576.978
Chi-squared: 24788.884, df(19), p.value < 0.001 
Nr obs: 420,000
lr.coef.round(3)
index OR OR% coefficient std.error z.value p.value
0 Intercept 0.000 -99.975 -8.283 0.064 -128.837 0.000 ***
1 message[T.control] 0.906 -9.372 -0.098 0.042 -2.335 0.020 *
2 message[T.endurance] 1.250 24.991 0.223 0.041 5.462 0.000 ***
3 message[T.racquet] 0.993 -0.711 -0.007 0.042 -0.171 0.864
4 message[T.strength] 1.162 16.227 0.150 0.041 3.660 0.000 ***
5 message[T.team] 1.035 3.458 0.034 0.041 0.823 0.410
6 message[T.water] 1.056 5.606 0.055 0.042 1.311 0.190
7 age[T.30 to 44] 2.386 138.587 0.870 0.039 22.353 0.000 ***
8 age[T.45 to 59] 2.186 118.557 0.782 0.040 19.555 0.000 ***
9 age[T.>= 60] 1.201 20.099 0.183 0.048 3.805 0.000 ***
10 female[T.no] 1.353 35.328 0.303 0.024 12.574 0.000 ***
11 income 1.000 0.002 0.000 0.000 22.105 0.000 ***
12 education 1.037 3.736 0.037 0.001 32.809 0.000 ***
13 children 1.618 61.764 0.481 0.030 16.210 0.000 ***
14 freq_endurance 1.101 10.083 0.096 0.005 18.300 0.000 ***
15 freq_strength 1.115 11.550 0.109 0.003 33.044 0.000 ***
16 freq_water 1.178 17.847 0.164 0.014 12.026 0.000 ***
17 freq_team 1.050 4.978 0.049 0.004 10.797 0.000 ***
18 freq_backcountry 1.188 18.795 0.172 0.010 17.313 0.000 ***
19 freq_racquet 1.110 11.008 0.104 0.007 15.866 0.000 ***
lr.summary(main = False, vif = True)

Pseudo R-squared (McFadden): 0.261
Pseudo R-squared (McFadden adjusted): 0.26
Area under the ROC Curve (AUC): 0.884
Log-likelihood: -35159.009, AIC: 70358.018, BIC: 70576.978
Chi-squared: 24788.884, df(19), p.value < 0.001 
Nr obs: 420,000

Variance inflation factors:

                    vif    Rsq
education         3.219  0.689
income            2.814  0.645
freq_endurance    1.686  0.407
freq_racquet      1.603  0.376
freq_strength     1.470  0.320
children          1.423  0.297
freq_team         1.339  0.253
age               1.307  0.235
freq_water        1.268  0.212
freq_backcountry  1.145  0.126
female            1.112  0.101
message           1.001  0.001
lr.plot("vimp")

pentathlon_nptb['pred_lr'] = lr.predict(pentathlon_nptb)['prediction']
dct = {"train" : pentathlon_nptb.query("training == 1"), "test" : pentathlon_nptb.query("training == 0")}
fig = rsm.gains_plot(dct, "buyer", "yes", "pred_lr")

from sklearn import metrics

# prediction on training set
pred = pentathlon_nptb.query("training == 1")['pred_lr']
actual = pentathlon_nptb.query("training == 1")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(3)
0.884
# prediction on test set
pred = pentathlon_nptb.query("training == 0")['pred_lr']
actual = pentathlon_nptb.query("training == 0")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(3)
0.883

We can conclude that:

The AUC for the training set is 0.884, which indicates that the model has a very good ability to distinguish between the two classes in the training data.

The AUC for the test set is 0.883, which is slightly lower but still indicates a very good predictive performance on unseen data.

The fact that the test AUC is close to the training AUC suggests that the model is generalizing well and not overfitting significantly to the training data. A small decrease from training to test set performance is normal because models will usually perform slightly better on the data they were trained on.

1. For each customer, determine the message (i.e., endurance, strength, water, team, backcountry, racquet, or no-message) predicted to lead to the highest probability of purchase

# Check the department of the message variable
pentathlon_nptb["message"].value_counts()
message
control        87260
racquet        87088
team           86792
backcountry    86604
water          85120
strength       84280
endurance      82856
Name: count, dtype: int64
# Create predictions
pentathlon_nptb["p_control_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["p_racquet_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["p_team_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["p_backcountry_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["p_water_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["p_strength_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["p_endurance_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb
custid buyer total_os message age female income education children freq_endurance ... training buyer_yes pred_lr p_control_lr p_racquet_lr p_team_lr p_backcountry_lr p_water_lr p_strength_lr p_endurance_lr
0 U1 no 0 team 30 to 44 no 55000 19 0.8 0 ... 1.0 0 0.013031 0.011433 0.012512 0.013031 0.012601 0.013298 0.014615 0.015700
1 U3 no 0 backcountry 45 to 59 no 35000 22 1.0 0 ... 0.0 0 0.005123 0.004645 0.005086 0.005299 0.005123 0.005408 0.005949 0.006395
2 U13 no 0 endurance 45 to 59 yes 45000 33 0.7 0 ... 1.0 0 0.011948 0.008692 0.009515 0.009910 0.009582 0.010114 0.011120 0.011948
3 U20 no 0 water 45 to 59 yes 25000 24 0.2 0 ... 1.0 0 0.002361 0.002027 0.002220 0.002313 0.002236 0.002361 0.002598 0.002793
4 U25 no 0 racquet >= 60 yes 65000 32 1.1 1 ... 0.0 0 0.011717 0.010706 0.011717 0.012203 0.011800 0.012453 0.013688 0.014705
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
599995 U3462888 no 0 water >= 60 yes 40000 26 0.6 0 ... 1.0 0 0.002182 0.001873 0.002052 0.002138 0.002067 0.002182 0.002401 0.002582
599996 U3462900 no 0 team < 30 no 55000 32 0.9 3 ... 0.0 0 0.007347 0.006442 0.007053 0.007347 0.007103 0.007499 0.008247 0.008863
599997 U3462902 no 0 team < 30 yes 55000 32 0.9 0 ... 1.0 0 0.009142 0.008017 0.008776 0.009142 0.008839 0.009330 0.010258 0.011023
599998 U3462916 no 0 team < 30 no 50000 35 0.6 2 ... 0.0 0 0.006615 0.005799 0.006350 0.006615 0.006395 0.006751 0.007425 0.007981
599999 U3462922 no 0 endurance 30 to 44 yes 50000 25 0.7 1 ... 0.0 0 0.010293 0.007484 0.008194 0.008535 0.008252 0.008710 0.009578 0.010293

600000 rows × 25 columns

We then extend the prediction to the full database and see how the predictions are distributed.

pentathlon_nptb["message_lr"] = pentathlon_nptb[[
    "p_control_lr", 
    "p_endurance_lr", 
    "p_backcountry_lr", 
    "p_racquet_lr", 
    "p_strength_lr", 
    "p_team_lr", 
    "p_water_lr"]].idxmax(axis=1)
pentathlon_nptb["message_lr"].value_counts()
message_lr
p_endurance_lr    600000
Name: count, dtype: int64

After checking the distribution, we see that the model predicts the same best message (endurance) for every single customer. The mistake in the analysis above is that the specified model is not flexible enough to allow customization across customers: message enters only as a main effect, so the predicted log-odds under any two messages differ by the same constant for everyone, and the ranking of messages is therefore identical for all customers.
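A quick way to verify this claim, using the per-message probability columns created above: because the main-effects model shifts every customer's log-odds by the same message constant, any pairwise comparison between two message columns should be one-sided for all 600,000 rows (a minimal check, not part of the original analysis):

# Endurance has the largest message coefficient and control among the smallest,
# so this share should be exactly 1.0 under the main-effects model
(pentathlon_nptb["p_endurance_lr"] > pentathlon_nptb["p_control_lr"]).mean()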

The exploratory cross-tabs above suggest that message effects vary across customer segments, so we want a model that can capture these interaction effects. We build that model below and then use it to predict the next product to buy for the full database.

ivar=[f"{e}:message" for e in evar if e != "message"]
ivar
['age:message',
 'female:message',
 'income:message',
 'education:message',
 'children:message',
 'freq_endurance:message',
 'freq_strength:message',
 'freq_water:message',
 'freq_team:message',
 'freq_backcountry:message',
 'freq_racquet:message']

The model is then:

lr_int = rsm.model.logistic(
    data={"penathlon_nptb": pentathlon_nptb[pentathlon_nptb.training == 1]},
    rvar="buyer",
    lev="yes",
    evar=evar,
    ivar=ivar
)
lr_int.summary()
Logistic regression (GLM)
Data                 : pentathlon_nptb
Response variable    : buyer
Level                : yes
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Null hyp.: There is no effect of x on buyer
Alt. hyp.: There is an effect of x on buyer

                                        OR      OR%  coefficient  std.error  z.value p.value     
Intercept                            0.000  -100.0%        -8.25      0.153  -53.948  < .001  ***
message[control]                     1.115    11.5%         0.11      0.220    0.496    0.62     
message[endurance]                   1.199    19.9%         0.18      0.213    0.850   0.395     
message[racquet]                     1.084     8.4%         0.08      0.217    0.370   0.711     
message[strength]                    0.959    -4.1%        -0.04      0.216   -0.195   0.846     
message[team]                        0.881   -11.9%        -0.13      0.217   -0.581   0.561     
message[water]                       0.903    -9.7%        -0.10      0.219   -0.468    0.64     
age[30 to 44]                        1.979    97.9%         0.68      0.100    6.802  < .001  ***
age[45 to 59]                        2.056   105.6%         0.72      0.102    7.033  < .001  ***
age[>= 60]                           0.992    -0.8%        -0.01      0.127   -0.060   0.952     
female[no]                           1.365    36.5%         0.31      0.064    4.833  < .001  ***
age[30 to 44]:message[control]       1.196    19.6%         0.18      0.146    1.225    0.22     
age[45 to 59]:message[control]       1.016     1.6%         0.02      0.150    0.107   0.914     
age[>= 60]:message[control]          1.155    15.5%         0.14      0.184    0.785   0.432     
age[30 to 44]:message[endurance]     1.302    30.2%         0.26      0.142    1.855   0.064    .
age[45 to 59]:message[endurance]     1.142    14.2%         0.13      0.146    0.907   0.364     
age[>= 60]:message[endurance]        1.473    47.3%         0.39      0.176    2.203   0.028    *
age[30 to 44]:message[racquet]       1.053     5.3%         0.05      0.144    0.358    0.72     
age[45 to 59]:message[racquet]       1.004     0.4%         0.00      0.147    0.030   0.976     
age[>= 60]:message[racquet]          1.269    26.9%         0.24      0.178    1.334   0.182     
age[30 to 44]:message[strength]      1.307    30.7%         0.27      0.142    1.886   0.059    .
age[45 to 59]:message[strength]      0.947    -5.3%        -0.05      0.146   -0.375   0.708     
age[>= 60]:message[strength]         1.084     8.4%         0.08      0.178    0.456   0.649     
age[30 to 44]:message[team]          1.268    26.8%         0.24      0.144    1.654   0.098    .
age[45 to 59]:message[team]          1.111    11.1%         0.10      0.147    0.713   0.476     
age[>= 60]:message[team]             1.164    16.4%         0.15      0.180    0.841     0.4     
age[30 to 44]:message[water]         1.359    35.9%         0.31      0.147    2.081   0.037    *
age[45 to 59]:message[water]         1.264    26.4%         0.23      0.151    1.552   0.121     
age[>= 60]:message[water]            1.362    36.2%         0.31      0.184    1.677   0.094    .
female[no]:message[control]          0.974    -2.6%        -0.03      0.092   -0.283   0.777     
female[no]:message[endurance]        0.884   -11.6%        -0.12      0.089   -1.391   0.164     
female[no]:message[racquet]          1.192    19.2%         0.18      0.092    1.911   0.056    .
female[no]:message[strength]         1.015     1.5%         0.02      0.090    0.168   0.867     
female[no]:message[team]             1.019     1.9%         0.02      0.090    0.206   0.837     
female[no]:message[water]            0.897   -10.3%        -0.11      0.091   -1.190   0.234     
income                               1.000     0.0%         0.00      0.000    8.324  < .001  ***
income:message[control]              1.000     0.0%         0.00      0.000    0.020   0.984     
income:message[endurance]            1.000     0.0%         0.00      0.000    1.688   0.091    .
income:message[racquet]              1.000     0.0%         0.00      0.000    0.191   0.849     
income:message[strength]             1.000    -0.0%        -0.00      0.000   -0.368   0.713     
income:message[team]                 1.000     0.0%         0.00      0.000    0.053   0.958     
income:message[water]                1.000    -0.0%        -0.00      0.000   -0.353   0.724     
education                            1.041     4.1%         0.04      0.003   13.546  < .001  ***
education:message[control]           0.995    -0.5%        -0.01      0.004   -1.273   0.203     
education:message[endurance]         0.993    -0.7%        -0.01      0.004   -1.722   0.085    .
education:message[racquet]           0.992    -0.8%        -0.01      0.004   -1.959    0.05    .
education:message[strength]          1.002     0.2%         0.00      0.004    0.551   0.581     
education:message[team]              1.000    -0.0%        -0.00      0.004   -0.071   0.943     
education:message[water]             0.997    -0.3%        -0.00      0.004   -0.718   0.473     
children                             1.723    72.3%         0.54      0.078    6.982  < .001  ***
children:message[control]            0.838   -16.2%        -0.18      0.114   -1.552   0.121     
children:message[endurance]          0.885   -11.5%        -0.12      0.109   -1.120   0.263     
children:message[racquet]            0.933    -6.7%        -0.07      0.110   -0.629   0.529     
children:message[strength]           1.023     2.3%         0.02      0.110    0.207   0.836     
children:message[team]               0.960    -4.0%        -0.04      0.111   -0.369   0.712     
children:message[water]              0.935    -6.5%        -0.07      0.112   -0.602   0.547     
freq_endurance                       1.081     8.1%         0.08      0.014    5.617  < .001  ***
freq_endurance:message[control]      1.029     2.9%         0.03      0.020    1.442   0.149     
freq_endurance:message[endurance]    0.990    -1.0%        -0.01      0.020   -0.512   0.609     
freq_endurance:message[racquet]      1.016     1.6%         0.02      0.020    0.792   0.428     
freq_endurance:message[strength]     1.016     1.6%         0.02      0.020    0.804   0.421     
freq_endurance:message[team]         1.043     4.3%         0.04      0.019    2.157   0.031    *
freq_endurance:message[water]        1.036     3.6%         0.04      0.020    1.807   0.071    .
freq_strength                        1.114    11.4%         0.11      0.009   12.347  < .001  ***
freq_strength:message[control]       1.001     0.1%         0.00      0.012    0.045   0.964     
freq_strength:message[endurance]     0.991    -0.9%        -0.01      0.012   -0.732   0.464     
freq_strength:message[racquet]       1.002     0.2%         0.00      0.012    0.142   0.887     
freq_strength:message[strength]      0.998    -0.2%        -0.00      0.012   -0.174   0.862     
freq_strength:message[team]          1.003     0.3%         0.00      0.012    0.249   0.803     
freq_strength:message[water]         1.015     1.5%         0.01      0.012    1.162   0.245     
freq_water                           1.139    13.9%         0.13      0.035    3.678  < .001  ***
freq_water:message[control]          1.036     3.6%         0.04      0.051    0.699   0.484     
freq_water:message[endurance]        1.067     6.7%         0.06      0.051    1.272   0.204     
freq_water:message[racquet]          1.028     2.8%         0.03      0.051    0.539    0.59     
freq_water:message[strength]         1.014     1.4%         0.01      0.051    0.268   0.788     
freq_water:message[team]             0.970    -3.0%        -0.03      0.050   -0.602   0.547     
freq_water:message[water]            1.158    15.8%         0.15      0.051    2.859   0.004   **
freq_team                            1.037     3.7%         0.04      0.012    3.069   0.002   **
freq_team:message[control]           1.005     0.5%         0.01      0.017    0.302   0.763     
freq_team:message[endurance]         0.999    -0.1%        -0.00      0.017   -0.057   0.955     
freq_team:message[racquet]           1.025     2.5%         0.02      0.017    1.431   0.152     
freq_team:message[strength]          1.031     3.1%         0.03      0.017    1.788   0.074    .
freq_team:message[team]              0.993    -0.7%        -0.01      0.017   -0.406   0.685     
freq_team:message[water]             1.036     3.6%         0.04      0.017    2.081   0.037    *
freq_backcountry                     1.210    21.0%         0.19      0.026    7.256  < .001  ***
freq_backcountry:message[control]    0.973    -2.7%        -0.03      0.037   -0.727   0.467     
freq_backcountry:message[endurance]  0.992    -0.8%        -0.01      0.037   -0.224   0.823     
freq_backcountry:message[racquet]    0.968    -3.2%        -0.03      0.037   -0.864   0.388     
freq_backcountry:message[strength]   0.982    -1.8%        -0.02      0.037   -0.492   0.623     
freq_backcountry:message[team]       0.976    -2.4%        -0.02      0.037   -0.658   0.511     
freq_backcountry:message[water]      0.981    -1.9%        -0.02      0.038   -0.517   0.605     
freq_racquet                         1.100    10.0%         0.10      0.017    5.456  < .001  ***
freq_racquet:message[control]        1.034     3.4%         0.03      0.025    1.356   0.175     
freq_racquet:message[endurance]      1.034     3.4%         0.03      0.025    1.375   0.169     
freq_racquet:message[racquet]        1.039     3.9%         0.04      0.025    1.523   0.128     
freq_racquet:message[strength]       0.975    -2.5%        -0.03      0.025   -1.023   0.306     
freq_racquet:message[team]           0.992    -0.8%        -0.01      0.024   -0.345    0.73     
freq_racquet:message[water]          0.998    -0.2%        -0.00      0.025   -0.090   0.928     

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared (McFadden): 0.262
Pseudo R-squared (McFadden adjusted): 0.26
Area under the ROC Curve (AUC): 0.884
Log-likelihood: -35102.941, AIC: 70401.883, BIC: 71474.788
Chi-squared: 24901.02, df(97), p.value < 0.001 
Nr obs: 420,000
pentathlon_nptb["pred_lr_int"] = lr_int.predict(pentathlon_nptb)["prediction"]
dct = {"train": pentathlon_nptb.query("training == 1"), "test": pentathlon_nptb.query("training == 0")}
fig = rsm.gains_plot(dct, "buyer", "yes", "pred_lr_int")

fig = rsm.gains_plot(
    pentathlon_nptb,
    "buyer", "yes",
    ["pred_lr", "pred_lr_int"]
)

# prediction on training set
pred = pentathlon_nptb.query("training == 1")['pred_lr_int']
actual = pentathlon_nptb.query("training == 1")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(3)
0.884
# prediction on test set
pred = pentathlon_nptb.query("training == 0")['pred_lr_int']
actual = pentathlon_nptb.query("training == 0")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(3)
0.883

Now, let’s repeat the analysis above with the new model.

# Create predictions
pentathlon_nptb["p_controli_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["p_racqueti_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["p_teami_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["p_backcountryi_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["p_wateri_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["p_strengthi_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["p_endurancei_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb
custid buyer total_os message age female income education children freq_endurance ... p_endurance_lr message_lr pred_lr_int p_controli_lr p_racqueti_lr p_teami_lr p_backcountryi_lr p_wateri_lr p_strengthi_lr p_endurancei_lr
0 U1 no 0 team 30 to 44 no 55000 19 0.8 0 ... 0.015700 p_endurance_lr 0.012008 0.012022 0.014499 0.012008 0.011131 0.012604 0.015452 0.015682
1 U3 no 0 backcountry 45 to 59 no 35000 22 1.0 0 ... 0.006395 p_endurance_lr 0.005556 0.004605 0.005858 0.005279 0.005556 0.004981 0.005475 0.006014
2 U13 no 0 endurance 45 to 59 yes 45000 33 0.7 0 ... 0.011948 p_endurance_lr 0.013884 0.009140 0.008789 0.009533 0.010718 0.009680 0.009343 0.013884
3 U20 no 0 water 45 to 59 yes 25000 24 0.2 0 ... 0.002793 p_endurance_lr 0.002389 0.002252 0.002089 0.002264 0.002339 0.002389 0.002196 0.002970
4 U25 no 0 racquet >= 60 yes 65000 32 1.1 1 ... 0.014705 p_endurance_lr 0.012039 0.010784 0.012039 0.011119 0.011526 0.011437 0.011463 0.019675
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
599995 U3462888 no 0 water >= 60 yes 40000 26 0.6 0 ... 0.002582 p_endurance_lr 0.002042 0.001966 0.002118 0.001946 0.001947 0.002042 0.002089 0.003217
599996 U3462900 no 0 team < 30 no 55000 32 0.9 3 ... 0.008863 p_endurance_lr 0.007692 0.006959 0.008257 0.007692 0.007884 0.005832 0.008117 0.007731
599997 U3462902 no 0 team < 30 yes 55000 32 0.9 0 ... 0.011023 p_endurance_lr 0.008225 0.008535 0.008976 0.008225 0.010062 0.008286 0.009822 0.011345
599998 U3462916 no 0 team < 30 no 50000 35 0.6 2 ... 0.007981 p_endurance_lr 0.006796 0.006378 0.007352 0.006796 0.007164 0.005272 0.007284 0.006971
599999 U3462922 no 0 endurance 30 to 44 yes 50000 25 0.7 1 ... 0.010293 p_endurance_lr 0.011857 0.008527 0.007495 0.008401 0.007500 0.008214 0.009222 0.011857

600000 rows × 34 columns

We then extend the prediction to the full database and see how the predictions are distributed.

repl = {
    "p_controli_lr": "control", 
    "p_endurancei_lr": "endurance", 
    "p_backcountryi_lr": "backcountry", 
    "p_racqueti_lr": "racquet", 
    "p_strengthi_lr": "strength", 
    "p_teami_lr": "team", 
    "p_wateri_lr": "water"}
predictions_lr = [
    "p_controli_lr", 
    "p_endurancei_lr", 
    "p_backcountryi_lr", 
    "p_racqueti_lr", 
    "p_strengthi_lr", 
    "p_teami_lr", 
    "p_wateri_lr"]

Find the highest probability of purchase for each customer and label it with the product that maximizes that probability.

pentathlon_nptb["messagei_lr"] = (
    pentathlon_nptb[predictions_lr]
    .idxmax(axis=1)
    .map(repl)
)
pentathlon_nptb
custid buyer total_os message age female income education children freq_endurance ... message_lr pred_lr_int p_controli_lr p_racqueti_lr p_teami_lr p_backcountryi_lr p_wateri_lr p_strengthi_lr p_endurancei_lr messagei_lr
0 U1 no 0 team 30 to 44 no 55000 19 0.8 0 ... p_endurance_lr 0.012008 0.012022 0.014499 0.012008 0.011131 0.012604 0.015452 0.015682 endurance
1 U3 no 0 backcountry 45 to 59 no 35000 22 1.0 0 ... p_endurance_lr 0.005556 0.004605 0.005858 0.005279 0.005556 0.004981 0.005475 0.006014 endurance
2 U13 no 0 endurance 45 to 59 yes 45000 33 0.7 0 ... p_endurance_lr 0.013884 0.009140 0.008789 0.009533 0.010718 0.009680 0.009343 0.013884 endurance
3 U20 no 0 water 45 to 59 yes 25000 24 0.2 0 ... p_endurance_lr 0.002389 0.002252 0.002089 0.002264 0.002339 0.002389 0.002196 0.002970 endurance
4 U25 no 0 racquet >= 60 yes 65000 32 1.1 1 ... p_endurance_lr 0.012039 0.010784 0.012039 0.011119 0.011526 0.011437 0.011463 0.019675 endurance
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
599995 U3462888 no 0 water >= 60 yes 40000 26 0.6 0 ... p_endurance_lr 0.002042 0.001966 0.002118 0.001946 0.001947 0.002042 0.002089 0.003217 endurance
599996 U3462900 no 0 team < 30 no 55000 32 0.9 3 ... p_endurance_lr 0.007692 0.006959 0.008257 0.007692 0.007884 0.005832 0.008117 0.007731 racquet
599997 U3462902 no 0 team < 30 yes 55000 32 0.9 0 ... p_endurance_lr 0.008225 0.008535 0.008976 0.008225 0.010062 0.008286 0.009822 0.011345 endurance
599998 U3462916 no 0 team < 30 no 50000 35 0.6 2 ... p_endurance_lr 0.006796 0.006378 0.007352 0.006796 0.007164 0.005272 0.007284 0.006971 racquet
599999 U3462922 no 0 endurance 30 to 44 yes 50000 25 0.7 1 ... p_endurance_lr 0.011857 0.008527 0.007495 0.008401 0.007500 0.008214 0.009222 0.011857 endurance

600000 rows × 35 columns

# Find the maximum probability of purchase
pentathlon_nptb["p_max"] = pentathlon_nptb[predictions_lr].max(axis=1)

Approach: First, we build a logistic regression model to predict the next product to buy, using buyer_yes as the response variable and the other variables as predictors. After realizing that this model predicts that every customer should be sent the same product (endurance), we build a second model that can capture the interaction effects between message and the other predictors.

We then extend the predictions to the full database. To predict the probability of purchase under each message, we use the data_cmd argument to score every customer as if they had received that message.

Finally, we choose the product with the highest probability using idxmax, which automatically finds the best message for each customer by returning the label of the column with the maximum predicted probability of buying across all the products. We then create p_max to store the maximum predicted probability of purchase for the best message selected for each customer.
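For reference, the seven per-message predict calls above can also be written as a compact loop; a sketch assuming the same lr_int model and data_cmd interface used earlier:

# Compact equivalent of the seven per-message predict calls (sketch)
for m in ["control", "racquet", "team", "backcountry", "water", "strength", "endurance"]:
    pentathlon_nptb[f"p_{m}i_lr"] = lr_int.predict(
        pentathlon_nptb, data_cmd={"message": m}
    )["prediction"]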

2. For each message, report the percentage of customers for whom that message or no-message maximizes their probability of purchase

Let’s create a crosstab to see which products we should send to each customer.

pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].messagei_lr, columns="count").apply(rsm.format_nr)
col_0          count
messagei_lr
backcountry    1,447
endurance    125,861
racquet       12,299
strength      36,584
team           1,680
water          2,129

Report the percentage of customers for whom each message (or no message) maximizes the probability of purchase.

pentathlon_nptb["messagei_lr"].value_counts(normalize=True)
messagei_lr
endurance      0.699675
strength       0.202202
racquet        0.069037
water          0.011852
team           0.009358
backcountry    0.007875
control        0.000002
Name: proportion, dtype: float64
  • endurance: The distribution suggests that endurance-related messages resonate significantly more with customers than the other messages.

  • strength: While not as dominant as endurance, strength-related messages also play an important role. There may be opportunities to further optimize or tailor these messages to increase their effectiveness.

  • Other categories such as racquet, water, team, and backcountry have much smaller proportions, suggesting that they are less often the best choice for maximizing the probability of purchase.

  • control: The negligible role of control in maximizing purchase probability suggests that any engagement, even if not perfectly optimized, tends to be better than no engagement at all. However, it is also essential to consider the context and the potential costs of sending messages to customers.

Create a table with the average purchase probability if we were to send each product’s message to everyone.

pentathlon_nptb.loc[pentathlon_nptb.training == 0, predictions_lr].agg("mean").sort_values(
    ascending=False).apply(rsm.format_nr, perc=True)
p_endurancei_lr      2.79%
p_strengthi_lr       2.64%
p_wateri_lr          2.44%
p_teami_lr           2.39%
p_backcountryi_lr    2.32%
p_racqueti_lr        2.31%
p_controli_lr        2.14%
dtype: object

3. Expected profit

# Calculate the actual order size
ordersize = pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer_yes == 1)].groupby("message", observed=False)["total_os"].mean()
ordersize
message
backcountry    64.034091
control        49.900598
endurance      55.584893
racquet        56.405620
strength       56.708751
team           56.522449
water          61.957343
Name: total_os, dtype: float64

Create a linear regression model to predict order size.

reg = rsm.model.regress(
    data = {"pentathlon_nptb": pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer == "yes")]},
    rvar ="total_os",
    evar = evar,
    ivar = ivar
)
reg.summary()
Linear regression (OLS)
Data                 : pentathlon_nptb
Response variable    : total_os
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Null hyp.: the effect of x on total_os is zero
Alt. hyp.: the effect of x on total_os is not zero

                                     coefficient  std.error  t.value p.value     
Intercept                                 -9.616      8.976   -1.071   0.284     
message[control]                          17.412     12.953    1.344   0.179     
message[endurance]                         9.106     12.058    0.755    0.45     
message[racquet]                           4.179     12.822    0.326   0.745     
message[strength]                          1.414     12.601    0.112   0.911     
message[team]                             10.566     12.573    0.840   0.401     
message[water]                             6.659     12.905    0.516   0.606     
age[30 to 44]                              4.166      5.701    0.731   0.465     
age[45 to 59]                              6.059      5.712    1.061   0.289     
age[>= 60]                                -0.679      7.066   -0.096   0.923     
female[no]                                 3.578      3.548    1.008   0.313     
age[30 to 44]:message[control]            -6.037      8.297   -0.728   0.467     
age[45 to 59]:message[control]            -3.440      8.437   -0.408   0.683     
age[>= 60]:message[control]                5.342     10.292    0.519   0.604     
age[30 to 44]:message[endurance]          -3.088      8.017   -0.385     0.7     
age[45 to 59]:message[endurance]         -12.434      8.132   -1.529   0.126     
age[>= 60]:message[endurance]              6.071      9.755    0.622   0.534     
age[30 to 44]:message[racquet]            -1.337      8.176   -0.163    0.87     
age[45 to 59]:message[racquet]            -8.117      8.242   -0.985   0.325     
age[>= 60]:message[racquet]               -5.150     10.010   -0.514   0.607     
age[30 to 44]:message[strength]           -0.696      7.975   -0.087    0.93     
age[45 to 59]:message[strength]            0.160      8.076    0.020   0.984     
age[>= 60]:message[strength]               5.451      9.893    0.551   0.582     
age[30 to 44]:message[team]                8.784      8.077    1.088   0.277     
age[45 to 59]:message[team]                6.381      8.161    0.782   0.434     
age[>= 60]:message[team]                   9.798     10.020    0.978   0.328     
age[30 to 44]:message[water]              -7.117      8.285   -0.859    0.39     
age[45 to 59]:message[water]             -12.193      8.378   -1.455   0.146     
age[>= 60]:message[water]                  5.907     10.209    0.579   0.563     
female[no]:message[control]               -4.082      5.058   -0.807    0.42     
female[no]:message[endurance]             -3.209      4.877   -0.658   0.511     
female[no]:message[racquet]               -2.000      5.096   -0.392   0.695     
female[no]:message[strength]              -3.970      4.920   -0.807    0.42     
female[no]:message[team]                  -8.944      4.904   -1.824   0.068    .
female[no]:message[water]                  1.978      4.972    0.398   0.691     
income                                     0.000      0.000    3.970  < .001  ***
income:message[control]                   -0.000      0.000   -1.166   0.244     
income:message[endurance]                 -0.000      0.000   -0.710   0.478     
income:message[racquet]                    0.000      0.000    0.702   0.483     
income:message[strength]                  -0.000      0.000   -1.205   0.228     
income:message[team]                      -0.000      0.000   -1.392   0.164     
income:message[water]                     -0.000      0.000   -1.353   0.176     
education                                  0.773      0.161    4.812  < .001  ***
education:message[control]                -0.370      0.230   -1.613   0.107     
education:message[endurance]              -0.238      0.218   -1.090   0.276     
education:message[racquet]                -0.292      0.228   -1.283   0.199     
education:message[strength]                0.005      0.221    0.025    0.98     
education:message[team]                   -0.337      0.224   -1.505   0.132     
education:message[water]                  -0.088      0.226   -0.390   0.696     
children                                   7.307      4.406    1.659   0.097    .
children:message[control]                  4.666      6.217    0.750   0.453     
children:message[endurance]                1.572      5.994    0.262   0.793     
children:message[racquet]                  5.236      6.255    0.837   0.403     
children:message[strength]                -0.699      6.269   -0.112   0.911     
children:message[team]                     4.814      6.233    0.772    0.44     
children:message[water]                   11.040      6.482    1.703   0.089    .
freq_endurance                            -1.400      0.698   -2.004   0.045    *
freq_endurance:message[control]            1.082      0.995    1.088   0.277     
freq_endurance:message[endurance]         -0.138      0.974   -0.142   0.887     
freq_endurance:message[racquet]           -0.927      1.005   -0.922   0.356     
freq_endurance:message[strength]           0.160      0.981    0.163   0.871     
freq_endurance:message[team]               1.424      0.963    1.479   0.139     
freq_endurance:message[water]             -0.298      0.968   -0.308   0.758     
freq_strength                             -2.181      0.437   -4.988  < .001  ***
freq_strength:message[control]             0.352      0.626    0.563   0.573     
freq_strength:message[endurance]           0.286      0.607    0.471   0.638     
freq_strength:message[racquet]             0.029      0.635    0.045   0.964     
freq_strength:message[strength]            1.337      0.614    2.179   0.029    *
freq_strength:message[team]                0.807      0.615    1.312    0.19     
freq_strength:message[water]               1.055      0.614    1.718   0.086    .
freq_water                                 2.979      1.679    1.774   0.076    .
freq_water:message[control]                0.263      2.417    0.109   0.913     
freq_water:message[endurance]             -1.171      2.375   -0.493   0.622     
freq_water:message[racquet]               -2.418      2.392   -1.011   0.312     
freq_water:message[strength]               0.200      2.369    0.084   0.933     
freq_water:message[team]                  -1.236      2.347   -0.527   0.598     
freq_water:message[water]                 -3.641      2.368   -1.538   0.124     
freq_team                                 -0.075      0.599   -0.125   0.901     
freq_team:message[control]                -0.229      0.847   -0.270   0.787     
freq_team:message[endurance]               2.188      0.829    2.638   0.008   **
freq_team:message[racquet]                -0.563      0.861   -0.654   0.513     
freq_team:message[strength]               -0.657      0.850   -0.774   0.439     
freq_team:message[team]                    0.320      0.829    0.386     0.7     
freq_team:message[water]                  -0.932      0.836   -1.115   0.265     
freq_backcountry                           2.686      1.320    2.035   0.042    *
freq_backcountry:message[control]         -0.962      1.902   -0.506   0.613     
freq_backcountry:message[endurance]        2.745      1.848    1.486   0.137     
freq_backcountry:message[racquet]         -0.198      1.887   -0.105   0.916     
freq_backcountry:message[strength]        -0.097      1.864   -0.052   0.958     
freq_backcountry:message[team]            -0.529      1.858   -0.284   0.776     
freq_backcountry:message[water]            1.071      1.862    0.575   0.565     
freq_racquet                               0.297      0.868    0.343   0.732     
freq_racquet:message[control]             -0.775      1.246   -0.622   0.534     
freq_racquet:message[endurance]           -0.021      1.186   -0.018   0.986     
freq_racquet:message[racquet]              0.944      1.237    0.764   0.445     
freq_racquet:message[strength]             1.148      1.200    0.957   0.339     
freq_racquet:message[team]                 0.199      1.207    0.165   0.869     
freq_racquet:message[water]                1.503      1.214    1.238   0.216     

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.089, Adjusted R-squared: 0.08
F-statistic: 10.021 df(97, 9982), p.value < 0.001
Nr obs: 10,080
reg.coef.round(3)
index coefficient std.error t.value p.value
0 Intercept -9.616 8.976 -1.071 0.284
1 message[T.control] 17.412 12.953 1.344 0.179
2 message[T.endurance] 9.106 12.058 0.755 0.450
3 message[T.racquet] 4.179 12.822 0.326 0.745
4 message[T.strength] 1.414 12.601 0.112 0.911
... ... ... ... ... ... ...
93 freq_racquet:message[T.endurance] -0.021 1.186 -0.018 0.986
94 freq_racquet:message[T.racquet] 0.944 1.237 0.764 0.445
95 freq_racquet:message[T.strength] 1.148 1.200 0.957 0.339
96 freq_racquet:message[T.team] 0.199 1.207 0.165 0.869
97 freq_racquet:message[T.water] 1.503 1.214 1.238 0.216

98 rows × 6 columns

pentathlon_nptb["pos_control_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["pos_racquet_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["pos_team_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["pos_backcountry_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["pos_water_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["pos_strength_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["pos_endurance_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]

Calculate the expected profit for each product

pentathlon_nptb["ep_control_lr"] = pentathlon_nptb["p_control_lr"] * pentathlon_nptb["pos_control_reg"] * 0.4
pentathlon_nptb["ep_racquet_lr"] = pentathlon_nptb["p_racquet_lr"] * pentathlon_nptb["pos_racquet_reg"] * 0.4
pentathlon_nptb["ep_team_lr"] = pentathlon_nptb["p_team_lr"] * pentathlon_nptb["pos_team_reg"] * 0.4
pentathlon_nptb["ep_backcountry_lr"] = pentathlon_nptb["p_backcountry_lr"] * pentathlon_nptb["pos_backcountry_reg"] * 0.4
pentathlon_nptb["ep_water_lr"] = pentathlon_nptb["p_water_lr"] * pentathlon_nptb["pos_water_reg"] * 0.4
pentathlon_nptb["ep_strength_lr"] = pentathlon_nptb["p_strength_lr"] * pentathlon_nptb["pos_strength_reg"] * 0.4
pentathlon_nptb["ep_endurance_lr"] = pentathlon_nptb["p_endurance_lr"] * pentathlon_nptb["pos_endurance_reg"] * 0.4

To determine the message to send that will maximize the expected profit, we can use the following formula:

expected_profit_lr = [
    "ep_control_lr", 
    "ep_endurance_lr", 
    "ep_backcountry_lr", 
    "ep_racquet_lr", 
    "ep_strength_lr", 
    "ep_team_lr", 
    "ep_water_lr"]
repl = {"ep_control_lr": "control", "ep_endurance_lr": "endurance", "ep_backcountry_lr": "backcountry", "ep_racquet_lr": "racquet", "ep_strength_lr": "strength", "ep_team_lr": "team", "ep_water_lr": "water"}
pentathlon_nptb["ep_message_lr"] = (
    pentathlon_nptb[expected_profit_lr]
    .idxmax(axis=1)
    .map(repl)
)

To predict the order size, we use total_os as the response variable and the other variables as predictors, including the interaction between message and the other variables. We then extend the prediction to the full database and examine how the predictions are distributed.

Same as for the previous model, we use data_cmd to predict the order size under each message.

After that, we calculate the expected profit for each product by multiplying the predicted probability of purchase by the predicted order size (and the 40% margin). We then choose the product with the highest expected profit using idxmax, which automatically finds the best message for each customer and labels the category with the maximum expected profit. Finally, ep_max is created to store the maximum expected profit for the best message selected for each customer.
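Formally, the expected profit per customer computed above for each message $m$ is

$$E[\text{profit}_m] = \Pr(\text{buy} \mid m) \times E[\text{order size} \mid m] \times 0.4$$

where $\Pr(\text{buy} \mid m)$ comes from the logistic regression (the p_*_lr columns), $E[\text{order size} \mid m]$ comes from the linear regression (the pos_*_reg columns), and 0.4 is the profit margin applied in the code above.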

4. Report, for each message (endurance, racquet, etc.) and for no-message, the percentage of customers for whom that (no) message maximizes their expected profit.

pentathlon_nptb.ep_message_lr.value_counts(normalize=True)
ep_message_lr
water          0.340890
endurance      0.306053
team           0.206722
backcountry    0.073778
strength       0.066898
racquet        0.004322
control        0.001337
Name: proportion, dtype: float64

5. Expected profit we can obtain, on average, per customer if we customize the message to each customer

pentathlon_nptb["ep_max"] = pentathlon_nptb[expected_profit_lr].max(axis=1)
pentathlon_nptb["ep_max"].mean()
0.6754439327865267

Let's create a crosstab of the best message for customers in the test set.

pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].ep_message_lr, columns="count").apply(rsm.format_nr)
col_0 count
ep_message_lr
backcountry 13,379
control 238
endurance 54,956
racquet 740
strength 12,092
team 37,539
water 61,056

6. Expected profit per e-mailed customer if every customer receives the same message

(
    pentathlon_nptb
    .loc[pentathlon_nptb.training == 0, ["ep_control_lr", "ep_endurance_lr", "ep_backcountry_lr", "ep_racquet_lr", "ep_strength_lr", "ep_team_lr", "ep_water_lr"]]
    .agg("mean")
    .sort_values(ascending=False)
    .apply(rsm.format_nr, sym = "$", dec = 2)
)
ep_endurance_lr      $0.62
ep_water_lr           $0.6
ep_strength_lr        $0.6
ep_backcountry_lr    $0.59
ep_team_lr           $0.54
ep_racquet_lr        $0.53
ep_control_lr        $0.43
dtype: object

7. Expected profit per e-mailed customer if every customer is assigned randomly to one of the messages or the no-message condition?

# probability of purchase under each customer's observed message; since messages
# were randomly assigned in the test, this approximates random assignment
pentathlon_nptb["p_random_lr"] = lr_int.predict(pentathlon_nptb)["prediction"]

# expected avg order size where customer is assigned to a random message
pentathlon_nptb["ordersize_random_reg"] = reg.predict(pentathlon_nptb)["prediction"]

# expected profit where customer is assigned to a random message
pentathlon_nptb["ep_random_lr"] = pentathlon_nptb["p_random_lr"] * pentathlon_nptb["ordersize_random_reg"] * 0.4

# expected profit per customer where customer is assigned to a random message
random_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_random_lr"].mean()
random_profit_per_customer
0.5569904087502335
# expected profit where no-message is sent (control)
pentathlon_nptb["ep_control_lr"]

# expected profit per customer where no-message is sent (control)
control_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_control_lr"].mean()
control_profit_per_customer
0.4339519244630221

8. Profit for 5,000,000 customers

# Profit where each customer is assigned to the message with the highest expected profit
profit_logit = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_max"].agg("mean") * 5000000
profit_logit
3408166.566437483
# Profit where each customer is assigned a random message
random_profit = random_profit_per_customer * 5000000
random_profit
2784952.0437511676
# Profit where no message is sent
control_profit = control_profit_per_customer * 5000000
control_profit
2169759.6223151106
profit_improvement_lr = profit_logit - control_profit
profit_improvement_lr
1238406.9441223722
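Expressed as a lift over the no-message baseline, using the values computed above:

# relative improvement of customized targeting over sending no message
lift = profit_improvement_lr / control_profit
# ≈ 1,238,407 / 2,169,760 ≈ 0.57, i.e., roughly a 57% profit improvement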
profits_dct = {
    "Customize Message": profit_logit,
    "Randomly Assign": random_profit,
    "No Message Sent": control_profit,
}


import seaborn as sns
import matplotlib.pyplot as plt

# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct.items()), columns=['Model', 'Profit'])
plt.figure(figsize=(10, 5))  # Adjust the width and height to your preference
# Plot
sns.set(style="white")
ax = sns.barplot(x="Model", y="Profit", hue="Model", data=df, palette="magma", legend=False)

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45)  # Rotate x labels for better readability
plt.show()

Neural Network Model

pentathlon_nptb[pentathlon_nptb["training"] == 1]
custid buyer total_os message age female income education children freq_endurance ... ep_team_lr ep_backcountry_lr ep_water_lr ep_strength_lr ep_endurance_lr ep_message_lr ep_max p_random_lr ordersize_random_reg ep_random_lr
0 U1 no 0 team 30 to 44 no 55000 19 0.8 0 ... 0.170065 0.152448 0.165323 0.127039 0.216326 endurance 0.216326 0.012008 32.627819 0.156712
2 U13 no 0 endurance 45 to 59 yes 45000 33 0.7 0 ... 0.197624 0.190290 0.186217 0.204221 0.194732 strength 0.204221 0.013884 40.744326 0.226274
3 U20 no 0 water 45 to 59 yes 25000 24 0.2 0 ... 0.028568 0.022970 0.014892 0.024216 0.016298 team 0.028568 0.002389 15.768889 0.015070
5 U28 no 0 strength < 30 yes 25000 18 0.3 0 ... 0.006081 0.005424 0.007153 0.005244 0.008038 endurance 0.008038 0.000929 13.078011 0.004860
8 U59 no 0 strength >= 60 yes 65000 36 1.2 1 ... 0.244028 0.211679 0.291270 0.243492 0.301579 endurance 0.301579 0.011213 47.059571 0.211075
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
599989 U3462835 no 0 control 30 to 44 yes 45000 20 0.8 1 ... 0.081865 0.060368 0.063264 0.057766 0.070770 team 0.081865 0.004871 32.245840 0.062828
599992 U3462858 no 0 control 45 to 59 yes 40000 16 1.1 0 ... 0.056256 0.041872 0.041478 0.040307 0.037076 team 0.056256 0.003052 38.458959 0.046946
599993 U3462877 no 0 backcountry < 30 no 65000 30 1.0 2 ... 0.176618 0.232750 0.283269 0.223465 0.266163 water 0.283269 0.014259 45.079704 0.257119
599995 U3462888 no 0 water >= 60 yes 40000 26 0.6 0 ... 0.030821 0.023935 0.033779 0.028076 0.036240 endurance 0.036240 0.002042 38.698478 0.031609
599997 U3462902 no 0 team < 30 yes 55000 32 0.9 0 ... 0.119600 0.121016 0.170371 0.139849 0.178244 endurance 0.178244 0.008225 32.707711 0.107603

420000 rows × 55 columns

NN = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar='buyer_yes',
    evar= evar,
    hidden_layer_sizes=(1,),  # simple NN: one hidden layer with a single node
    mod_type='classification'
)

NN.summary()
Multi-layer Perceptron (NN)
Data                 : pentathlon_nptb
Response variable    : buyer_yes
Level                : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type           : classification
Nr. of features      : (12, 19)
Nr. of observations  : 420,000
Hidden_layer_sizes   : (1,)
Activation function  : tanh
Solver               : lbfgs
Alpha                : 0.0001
Batch size           : auto
Learning rate        : 0.001
Maximum iterations   : 10000
random_state         : 1234
AUC                  : 0.884

Raw data             :
  message      age female  income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet
     team 30 to 44     no   55000         19       0.8               0              4           0          4                 0             1
endurance 45 to 59    yes   45000         33       0.7               0              0           0          0                 2             2
    water 45 to 59    yes   25000         24       0.2               0              0           0          0                 0             0
 strength     < 30    yes   25000         18       0.3               0              0           0          0                 0             0
 strength    >= 60    yes   65000         36       1.2               1              1           0          2                 0             3

Estimation data      :
   income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet  message_control  message_endurance  message_racquet  message_strength  message_team  message_water  age_30 to 44  age_45 to 59  age_>= 60  female_no
 0.388655  -0.663713 -0.265321       -0.640336       1.100861   -0.261727   1.892088         -0.690593      0.058975            False              False            False             False          True          False          True         False      False       True
-0.184052   0.279378 -0.479844       -0.640336      -0.706563   -0.261727  -0.620020          1.958233      0.882489            False               True            False             False         False          False         False          True      False      False
-1.329467  -0.326895 -1.552460       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False             False         False           True         False          True      False      False
-1.329467  -0.731077 -1.337937       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False              True         False          False         False         False      False      False
 0.961362   0.481469  0.592772        0.060592      -0.254707   -0.261727   0.636034         -0.690593      1.706002            False              False            False              True         False          False         False         False       True      False
NN100 = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar='buyer_yes',
    evar= evar,
    hidden_layer_sizes=(100,),  # one hidden layer with 100 nodes
    mod_type='classification'
)

NN100.summary()
Multi-layer Perceptron (NN)
Data                 : pentathlon_nptb
Response variable    : buyer_yes
Level                : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type           : classification
Nr. of features      : (12, 19)
Nr. of observations  : 420,000
Hidden_layer_sizes   : (100,)
Activation function  : tanh
Solver               : lbfgs
Alpha                : 0.0001
Batch size           : auto
Learning rate        : 0.001
Maximum iterations   : 10000
random_state         : 1234
AUC                  : 0.892

Raw data             :
  message      age female  income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet
     team 30 to 44     no   55000         19       0.8               0              4           0          4                 0             1
endurance 45 to 59    yes   45000         33       0.7               0              0           0          0                 2             2
    water 45 to 59    yes   25000         24       0.2               0              0           0          0                 0             0
 strength     < 30    yes   25000         18       0.3               0              0           0          0                 0             0
 strength    >= 60    yes   65000         36       1.2               1              1           0          2                 0             3

Estimation data      :
   income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet  message_control  message_endurance  message_racquet  message_strength  message_team  message_water  age_30 to 44  age_45 to 59  age_>= 60  female_no
 0.388655  -0.663713 -0.265321       -0.640336       1.100861   -0.261727   1.892088         -0.690593      0.058975            False              False            False             False          True          False          True         False      False       True
-0.184052   0.279378 -0.479844       -0.640336      -0.706563   -0.261727  -0.620020          1.958233      0.882489            False               True            False             False         False          False         False          True      False      False
-1.329467  -0.326895 -1.552460       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False             False         False           True         False          True      False      False
-1.329467  -0.731077 -1.337937       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False              True         False          False         False         False      False      False
 0.961362   0.481469  0.592772        0.060592      -0.254707   -0.261727   0.636034         -0.690593      1.706002            False              False            False              True         False          False         False         False       True      False

Model Tuning

# NN_cv = rsm.load_state("./data/NN_cv.pkl")['NN_cv']

After fine-tuning the model in a separate notebook, we preserved the model's state with save_state. To reintegrate it into the main notebook, load_state would be used, as in the commented line above. However, due to GitHub's file evaluation constraints, the automated tests do not pass when this code runs, so it is commented out. Should you wish to review our tuned model, please refer to the auxiliary notebooks, where you can replicate our steps using the code above.
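For reference, a minimal sketch of the grid search run in the auxiliary notebook; the grid values are illustrative assumptions, and the attribute names (NN.fitted, NN.data_onehot, NN.data.buyer_yes) mirror the random forest grid-search code shown later in this report:

# sketch of the auxiliary tuning run (grid values are assumptions)
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [0.0001, 0.001, 0.01],
    "hidden_layer_sizes": [(1,), (3,), (3, 3), (100,)],
}
NN_cv = GridSearchCV(NN.fitted, param_grid, scoring="roc_auc", cv=5, n_jobs=2)
# NN_cv.fit(NN.data_onehot, NN.data.buyer_yes)  # long-running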

# NN_cv.best_params_
# NN_cv.best_score_.round(3)
NN_best = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar='buyer_yes',
    evar= evar,
    alpha = 0.01,
    hidden_layer_sizes = (3, 3),
    mod_type='classification'
)

NN_best.summary()
Multi-layer Perceptron (NN)
Data                 : pentathlon_nptb
Response variable    : buyer_yes
Level                : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type           : classification
Nr. of features      : (12, 19)
Nr. of observations  : 420,000
Hidden_layer_sizes   : (3, 3)
Activation function  : tanh
Solver               : lbfgs
Alpha                : 0.01
Batch size           : auto
Learning rate        : 0.001
Maximum iterations   : 10000
random_state         : 1234
AUC                  : 0.89

Raw data             :
  message      age female  income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet
     team 30 to 44     no   55000         19       0.8               0              4           0          4                 0             1
endurance 45 to 59    yes   45000         33       0.7               0              0           0          0                 2             2
    water 45 to 59    yes   25000         24       0.2               0              0           0          0                 0             0
 strength     < 30    yes   25000         18       0.3               0              0           0          0                 0             0
 strength    >= 60    yes   65000         36       1.2               1              1           0          2                 0             3

Estimation data      :
   income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet  message_control  message_endurance  message_racquet  message_strength  message_team  message_water  age_30 to 44  age_45 to 59  age_>= 60  female_no
 0.388655  -0.663713 -0.265321       -0.640336       1.100861   -0.261727   1.892088         -0.690593      0.058975            False              False            False             False          True          False          True         False      False       True
-0.184052   0.279378 -0.479844       -0.640336      -0.706563   -0.261727  -0.620020          1.958233      0.882489            False               True            False             False         False          False         False          True      False      False
-1.329467  -0.326895 -1.552460       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False             False         False           True         False          True      False      False
-1.329467  -0.731077 -1.337937       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False              True         False          False         False         False      False      False
 0.961362   0.481469  0.592772        0.060592      -0.254707   -0.261727   0.636034         -0.690593      1.706002            False              False            False              True         False          False         False         False       True      False
pentathlon_nptb['pred_NN'] = NN_best.predict(pentathlon_nptb)['prediction']
dct = {"train" : pentathlon_nptb.query("training == 1"), "test" : pentathlon_nptb.query("training == 0")}
fig = rsm.gains_plot(dct, "buyer", "yes", "pred_NN")

from sklearn import metrics

# prediction on training set
pred = pentathlon_nptb.query("training == 1")['pred_NN']
actual = pentathlon_nptb.query("training == 1")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(5)
0.89011
# prediction on test set
pred = pentathlon_nptb.query("training == 0")['pred_NN']
actual = pentathlon_nptb.query("training == 0")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(5)
0.88819
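As an aside, sklearn's roc_auc_score computes the same quantity in a single call:

# equivalent AUC calculation in one call
metrics.roc_auc_score(actual, pred).round(5)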

We can conclude that:

The AUC for the training set is 0.890, which indicates that the model has a very good ability to distinguish between the two classes in the training data.

The AUC for the test set is 0.888, nearly identical to the training AUC. This means the model generalizes well to new data and is not overfitting; the small decrease from training to test is negligible.

1. Create predictions

# Check for all unique message values in the dataset and count how many
pentathlon_nptb["message"].value_counts()
message
control        87260
racquet        87088
team           86792
backcountry    86604
water          85120
strength       84280
endurance      82856
Name: count, dtype: int64
# Create predictions for different scenarios where the message variable is set to specific values
pentathlon_nptb["p_control_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["p_racquet_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["p_team_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["p_backcountry_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["p_water_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["p_strength_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["p_endurance_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb
custid buyer total_os message age female income education children freq_endurance ... ordersize_random_reg ep_random_lr pred_NN p_control_nn p_racquet_nn p_team_nn p_backcountry_nn p_water_nn p_strength_nn p_endurance_nn
0 U1 no 0 team 30 to 44 no 55000 19 0.8 0 ... 32.627819 0.156712 0.011532 0.009833 0.011503 0.011532 0.011059 0.010634 0.011800 0.011186
1 U3 no 0 backcountry 45 to 59 no 35000 22 1.0 0 ... 37.181377 0.082633 0.002658 0.002535 0.002404 0.003184 0.002658 0.003101 0.003468 0.002342
2 U13 no 0 endurance 45 to 59 yes 45000 33 0.7 0 ... 40.744326 0.226274 0.008840 0.006773 0.007218 0.008068 0.007635 0.008054 0.009127 0.008840
3 U20 no 0 water 45 to 59 yes 25000 24 0.2 0 ... 15.768889 0.015070 0.001161 0.001065 0.001360 0.000875 0.001357 0.001161 0.001291 0.003342
4 U25 no 0 racquet >= 60 yes 65000 32 1.1 1 ... 48.946607 0.235699 0.009574 0.007788 0.009574 0.008922 0.008792 0.008076 0.008884 0.008817
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
599995 U3462888 no 0 water >= 60 yes 40000 26 0.6 0 ... 38.698478 0.031609 0.001644 0.001504 0.001932 0.001207 0.001928 0.001644 0.001835 0.004317
599996 U3462900 no 0 team < 30 no 55000 32 0.9 3 ... 29.767369 0.091593 0.004586 0.003494 0.003388 0.004586 0.003605 0.004077 0.004503 0.002707
599997 U3462902 no 0 team < 30 yes 55000 32 0.9 0 ... 32.707711 0.107603 0.002481 0.001935 0.002163 0.002481 0.001966 0.001951 0.002062 0.001856
599998 U3462916 no 0 team < 30 no 50000 35 0.6 2 ... 25.117259 0.068280 0.003051 0.002224 0.002025 0.003051 0.002200 0.002624 0.002867 0.001594
599999 U3462922 no 0 endurance 30 to 44 yes 50000 25 0.7 1 ... 29.102240 0.138029 0.002656 0.003538 0.003329 0.004762 0.003617 0.004233 0.004701 0.002656

600000 rows × 63 columns

repl = {
    "p_control_nn": "control", 
    "p_endurance_nn": "endurance", 
    "p_backcountry_nn": "backcountry", 
    "p_racquet_nn": "racquet", 
    "p_strength_nn": "strength", 
    "p_team_nn": "team", 
    "p_water_nn": "water"}
# list of per-message purchase probability columns
predictions_nn = [
    "p_control_nn", 
    "p_endurance_nn", 
    "p_backcountry_nn", 
    "p_racquet_nn", 
    "p_strength_nn", 
    "p_team_nn", 
    "p_water_nn"]
# extending the prediction to the full database to see distribution
pentathlon_nptb["message_nn"] = pentathlon_nptb[predictions_nn].idxmax(axis=1)
pentathlon_nptb["message_nn"].value_counts()
message_nn
p_endurance_nn    321011
p_strength_nn     161291
p_team_nn         106081
p_racquet_nn       11617
Name: count, dtype: int64
# Find the maximum probability of purchase
pentathlon_nptb["p_max_nn"] = pentathlon_nptb[predictions_nn].max(axis=1)

Approach: First, a neural network model was built to predict the next product to buy, using buyer_yes as the response variable and the other variables as predictors. The model was trained on the training data, evaluated on the test data, and then used to score the full database. To obtain the probability of purchase under each message, we again use data_cmd.

Finally, we choose the product with the highest probability using idxmax, which also labels the category with the maximum predicted probability of buying across all the products. Then, p_max_nn was created to store the maximum predicted probability of purchase for the best message selected for each customer.
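As a toy illustration of the idxmax-plus-map pattern used throughout this report (the probabilities below are made up):

# toy example of idxmax(axis=1) + map with hypothetical values
demo = pd.DataFrame({"p_team_nn": [0.02, 0.10], "p_water_nn": [0.05, 0.01]})
demo.idxmax(axis=1).map({"p_team_nn": "team", "p_water_nn": "water"})
# 0    water
# 1     team
# dtype: object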

2. Percentage of customers for whom each message (or no message) is best

# create a crosstab to see which products should we send to each customer.
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].message_nn, columns="count").apply(rsm.format_nr)
col_0 count
message_nn
p_endurance_nn 96,065
p_racquet_nn 3,473
p_strength_nn 48,593
p_team_nn 31,869
# percentage of customers for whom that message is the best message
pentathlon_nptb["message_nn"].value_counts(normalize=True)
message_nn
p_endurance_nn    0.535018
p_strength_nn     0.268818
p_team_nn         0.176802
p_racquet_nn      0.019362
Name: proportion, dtype: float64
  • endurance: Endurance-related messages have the highest predicted purchase probability for over half of customers, suggesting they resonate significantly more than the other messages.

  • strength and team: While not as dominant as endurance, these messages also play an important role, covering roughly 27% and 18% of customers. There might be opportunities to further optimize or tailor these messages to increase their effectiveness.

  • racquet: Only a small proportion of customers (about 2%) have racquet as their best message.

  • water, backcountry, and control never have the highest predicted purchase probability for any customer under this model.

# average purchase probability if we send the message for each product to everyone
pentathlon_nptb.loc[pentathlon_nptb.training == 0, predictions_nn].agg("mean").sort_values(
    ascending=False).apply(rsm.format_nr, perc=True)
p_endurance_nn      2.75%
p_strength_nn       2.64%
p_water_nn          2.42%
p_team_nn           2.37%
p_backcountry_nn    2.33%
p_racquet_nn        2.29%
p_control_nn        2.14%
dtype: object

3. Expected profit

# Calculate the average order size per message (observed=False keeps the current pandas behavior and silences the deprecation warning)
ordersize = pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer_yes == 1)].groupby("message", observed=False)["total_os"].mean()
ordersize
message
backcountry    64.034091
control        49.900598
endurance      55.584893
racquet        56.405620
strength       56.708751
team           56.522449
water          61.957343
Name: total_os, dtype: float64
NN_os = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar = "total_os",
    evar = evar,
    hidden_layer_sizes = (1,),  # simple NN: one hidden layer with a single node
    mod_type = 'regression'
)
NN_os.summary()
Multi-layer Perceptron (NN)
Data                 : pentathlon_nptb
Response variable    : total_os
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type           : regression
Nr. of features      : (12, 19)
Nr. of observations  : 420,000
Hidden_layer_sizes   : (1,)
Activation function  : tanh
Solver               : lbfgs
Alpha                : 0.0001
Batch size           : auto
Learning rate        : 0.001
Maximum iterations   : 10000
random_state         : 1234
Model fit            :
       n     r2    mse    mae
  420000  0.109  0.891  0.179

Raw data             :
  message      age female  income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet
     team 30 to 44     no   55000         19       0.8               0              4           0          4                 0             1
endurance 45 to 59    yes   45000         33       0.7               0              0           0          0                 2             2
    water 45 to 59    yes   25000         24       0.2               0              0           0          0                 0             0
 strength     < 30    yes   25000         18       0.3               0              0           0          0                 0             0
 strength    >= 60    yes   65000         36       1.2               1              1           0          2                 0             3

Estimation data      :
   income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet  message_control  message_endurance  message_racquet  message_strength  message_team  message_water  age_30 to 44  age_45 to 59  age_>= 60  female_no
 0.388655  -0.663713 -0.265321       -0.640336       1.100861   -0.261727   1.892088         -0.690593      0.058975            False              False            False             False          True          False          True         False      False       True
-0.184052   0.279378 -0.479844       -0.640336      -0.706563   -0.261727  -0.620020          1.958233      0.882489            False               True            False             False         False          False         False          True      False      False
-1.329467  -0.326895 -1.552460       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False             False         False           True         False          True      False      False
-1.329467  -0.731077 -1.337937       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False              True         False          False         False         False      False      False
 0.961362   0.481469  0.592772        0.060592      -0.254707   -0.261727   0.636034         -0.690593      1.706002            False              False            False              True         False          False         False         False       True      False

Model Tuning

# NN_os = rsm.load_state("./data/NN_os.pkl")['NN_os']

Again, after fine-tuning the model in a separate notebook, we preserved the model's state with save_state; load_state would reintegrate it here, as in the commented line above. However, due to GitHub's file evaluation constraints, the automated tests do not pass when this code runs, so it is commented out. Should you wish to review our tuned model, please refer to the auxiliary notebooks, where you can replicate our steps using the code above.

# NN_os.best_params_
# NN_os.best_score_.round(3)
NN_os_best = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar="total_os",
    evar= evar,
    alpha = 0.0001,
    hidden_layer_sizes = (3, 3),
    mod_type='regression'
)

NN_os_best.summary()
Multi-layer Perceptron (NN)
Data                 : pentathlon_nptb
Response variable    : total_os
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type           : regression
Nr. of features      : (12, 19)
Nr. of observations  : 420,000
Hidden_layer_sizes   : (3, 3)
Activation function  : tanh
Solver               : lbfgs
Alpha                : 0.0001
Batch size           : auto
Learning rate        : 0.001
Maximum iterations   : 10000
random_state         : 1234
Model fit            :
       n     r2    mse    mae
  420000  0.114  0.886  0.175

Raw data             :
  message      age female  income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet
     team 30 to 44     no   55000         19       0.8               0              4           0          4                 0             1
endurance 45 to 59    yes   45000         33       0.7               0              0           0          0                 2             2
    water 45 to 59    yes   25000         24       0.2               0              0           0          0                 0             0
 strength     < 30    yes   25000         18       0.3               0              0           0          0                 0             0
 strength    >= 60    yes   65000         36       1.2               1              1           0          2                 0             3

Estimation data      :
   income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet  message_control  message_endurance  message_racquet  message_strength  message_team  message_water  age_30 to 44  age_45 to 59  age_>= 60  female_no
 0.388655  -0.663713 -0.265321       -0.640336       1.100861   -0.261727   1.892088         -0.690593      0.058975            False              False            False             False          True          False          True         False      False       True
-0.184052   0.279378 -0.479844       -0.640336      -0.706563   -0.261727  -0.620020          1.958233      0.882489            False               True            False             False         False          False         False          True      False      False
-1.329467  -0.326895 -1.552460       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False             False         False           True         False          True      False      False
-1.329467  -0.731077 -1.337937       -0.640336      -0.706563   -0.261727  -0.620020         -0.690593     -0.764538            False              False            False              True         False          False         False         False      False      False
 0.961362   0.481469  0.592772        0.060592      -0.254707   -0.261727   0.636034         -0.690593      1.706002            False              False            False              True         False          False         False         False       True      False
pentathlon_nptb["pos_control_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["pos_racquet_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["pos_team_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["pos_backcountry_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["pos_water_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["pos_strength_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["pos_endurance_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb["ep_control_nn"] = pentathlon_nptb["p_control_nn"] * pentathlon_nptb["pos_control_reg"] * 0.4
pentathlon_nptb["ep_racquet_nn"] = pentathlon_nptb["p_racquet_nn"] * pentathlon_nptb["pos_racquet_reg"] * 0.4
pentathlon_nptb["ep_team_nn"] = pentathlon_nptb["p_team_nn"] * pentathlon_nptb["pos_team_reg"] * 0.4
pentathlon_nptb["ep_backcountry_nn"] = pentathlon_nptb["p_backcountry_nn"] * pentathlon_nptb["pos_backcountry_reg"] * 0.4
pentathlon_nptb["ep_water_nn"] = pentathlon_nptb["p_water_nn"] * pentathlon_nptb["pos_water_reg"] * 0.4
pentathlon_nptb["ep_strength_nn"] = pentathlon_nptb["p_strength_nn"] * pentathlon_nptb["pos_strength_reg"] * 0.4
pentathlon_nptb["ep_endurance_nn"] = pentathlon_nptb["p_endurance_nn"] * pentathlon_nptb["pos_endurance_reg"] * 0.4
expected_profit_nn = [
    "ep_control_nn", 
    "ep_endurance_nn", 
    "ep_backcountry_nn", 
    "ep_racquet_nn", 
    "ep_strength_nn", 
    "ep_team_nn", 
    "ep_water_nn"]
repl = {"ep_control_nn": "control", "ep_endurance_nn": "endurance", "ep_backcountry_nn": "backcountry", "ep_racquet_nn": "racquet", "ep_strength_nn": "strength", "ep_team_nn": "team", "ep_water_nn": "water"}
pentathlon_nptb["ep_message_nn"] = (
    pentathlon_nptb[expected_profit_nn]
    .idxmax(axis=1)
    .map(repl)
)

4. Percentage of customers for whom each message maximizes expected profit

pentathlon_nptb.ep_message_nn.value_counts(normalize=True)
ep_message_nn
endurance      0.369478
team           0.245130
water          0.221107
strength       0.110613
backcountry    0.024117
racquet        0.019118
control        0.010437
Name: proportion, dtype: float64

The message that most customers respond best to is endurance, which maximizes expected profit for about 37% of customers. Team and water follow at roughly 25% and 22%, with strength at about 11%. Backcountry, racquet, and control are the best choice for very few customers.

5. Expected profit per customer if we customize the message

pentathlon_nptb["ep_max_nn"] = pentathlon_nptb[expected_profit_nn].max(axis=1)
pentathlon_nptb["ep_max_nn"].mean()
0.6816084703184525
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].ep_message_nn, columns="count").apply(rsm.format_nr)
col_0 count
ep_message_nn
backcountry 4,371
control 1,914
endurance 66,071
racquet 3,429
strength 19,986
team 44,334
water 39,895

6. Expected profit per e-mailed customer if every customer receives the same message

(
    pentathlon_nptb
    .loc[pentathlon_nptb.training == 0, expected_profit_nn]
    .agg("mean")
    .sort_values(ascending=False)
    .apply(rsm.format_nr, sym = "$", dec = 2)
)
ep_endurance_nn      $0.61
ep_strength_nn        $0.6
ep_water_nn           $0.6
ep_backcountry_nn    $0.59
ep_team_nn           $0.53
ep_racquet_nn        $0.52
ep_control_nn        $0.43
dtype: object

Unsurprisingly, sending no message yields the lowest expected profit per customer. Endurance yields the highest expected profit ($0.61), with strength and water essentially tied just behind ($0.60) and backcountry close by ($0.59). Team and racquet yield noticeably less, though still more than sending no message.

7. Expected profit per e-mailed customer if every customer is assigned randomly to one of the messages or the no-message condition

# probability of purchase under each customer's observed (randomly assigned) message
pentathlon_nptb["p_random_nn"] = NN_best.predict(pentathlon_nptb)["prediction"]

# expected avg order size where customer is assigned to a random message
pentathlon_nptb["ordersize_random_nn"] = NN_os_best.predict(pentathlon_nptb)["prediction"]

# expected profit where customer is assigned to a random message
pentathlon_nptb["ep_random_nn"] = pentathlon_nptb["p_random_nn"] * pentathlon_nptb["ordersize_random_nn"] * 0.4

# expected profit per customer where customer is assigned to a random message
random_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_random_nn"].mean()
random_profit_per_customer
0.1078448520217575
# expected profit where no-message is sent (control)
pentathlon_nptb["ep_control_nn"]

# expected profit per customer where no-message is sent (control)
control_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_control_nn"].mean()
control_profit_per_customer
0.4335109104691227

8. Expected profit for 5,000,000 customers

# Profit where each customer is assigned to the message with the highest expected profit
profit_nn = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_max_nn"].agg("mean") * 5000000
profit_nn
3430076.6444671643
# Profit where each customer is assigned a random message
random_profit_nn = random_profit_per_customer * 5000000
random_profit_nn
539224.2601087876
# Profit where no message is sent
control_profit_nn = control_profit_per_customer * 5000000
control_profit_nn
2167554.5523456135
profit_improvement_nn = profit_nn - control_profit_nn
profit_improvement_nn
1262522.0921215508
profits_dct = {
    "Customize Message": profit_nn,
    "Randomly Assign": random_profit_nn,
    "No Message Sent": control_profit_nn,
}


import seaborn as sns
import matplotlib.pyplot as plt

# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct.items()), columns=['Model', 'Profit'])
plt.figure(figsize=(10, 5))  
# Plot
sns.set(style="whitegrid")
ax = sns.barplot(x="Model", y="Profit", hue="Model", data=df, palette="viridis", legend=False)

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45) 
plt.show()

Random Forest Model

Train a Random Forest Model to predict the next product to buy

rf = rsm.model.rforest(
    data = {'NPTB': pentathlon_nptb.query("training == 1")},
    rvar = 'buyer',
    lev = 'yes',
    evar = evar,
)
rf.summary()
Random Forest
Data                 : NPTB
Response variable    : buyer
Level                : yes
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
OOB                  : True
Model type           : classification
Nr. of features      : (12, 21)
Nr. of observations  : 420,000
max_features         : sqrt (4)
n_estimators         : 100
min_samples_leaf     : 1
random_state         : 1234
AUC                  : 0.825

Estimation data      :
 income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet  message_backcountry  message_control  message_endurance  message_racquet  message_strength  message_team  message_water  age_< 30  age_30 to 44  age_45 to 59  age_>= 60  female_no
  55000         19       0.8               0              4           0          4                 0             1                False            False              False            False             False          True          False     False          True         False      False       True
  45000         33       0.7               0              0           0          0                 2             2                False            False               True            False             False         False          False     False         False          True      False      False
  25000         24       0.2               0              0           0          0                 0             0                False            False              False            False             False         False           True     False         False          True      False      False
  25000         18       0.3               0              0           0          0                 0             0                False            False              False            False              True         False          False      True         False         False      False      False
  65000         36       1.2               1              1           0          2                 0             3                False            False              False            False              True         False          False     False         False         False       True      False

We also use a random forest model to predict the next product to buy, to see whether we can create a robust NPTB model that accurately predicts which product a customer will purchase based on the message we send them. Above, we train our base random forest model, which we then use as the starting point for hyperparameter tuning below.

A key feature of this output is the AUC. Our base model achieves an out-of-bag AUC of 0.825, a decent value: an AUC of 0.5 corresponds to random guessing, so values this far above 0.5 indicate good discriminative ability.

Random Forest Model Tuning

#from sklearn.model_selection import GridSearchCV

#max_features = [None, 'auto', 'sqrt', 'log2', 0.25, 0.5, 0.75, 1.0]
#n_estimators = [10, 50, 100, 200, 500, 1000]

#param_grid = {"max_features": max_features, "n_estimators": n_estimators}
#scoring = {"AUC": "roc_auc"}

#rf_cv = GridSearchCV(rf.fitted, param_grid, scoring=scoring, cv=5, n_jobs=2, refit="AUC", verbose=5)

Below is the fit call for this very large grid search. We would not recommend running it, as it takes a very long time; the code is included for completeness.

#rf_cv.fit(rf.data_onehot, rf.data.buyer_yes)

#rf_cv.fit(rf_treatment.data_onehot, rf_treatment.data.buyer_yes)

Above, we conduct a grid search to identify which hyperparameters are best for our random forest model. However, with a grid this large we ran into computational cost issues: the search either ran for 4 hours or crashed the kernel. Therefore, we conducted multiple step-wise grid searches, limiting the size of the grid in each round and iterating through combinations to see which parameters created the most robust model. The tuned model from these rounds, rf_tuned, is used for the predictions below.
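For illustration, one reduced round of the step-wise search might look as follows; the specific grid values here are assumptions, while the GridSearchCV setup mirrors the commented code above:

# one reduced round of the step-wise search (grid values assumed)
from sklearn.model_selection import GridSearchCV

param_grid = {"max_features": ["sqrt", "log2"], "n_estimators": [100, 200]}
rf_cv = GridSearchCV(rf.fitted, param_grid, scoring={"AUC": "roc_auc"},
                     cv=5, n_jobs=2, refit="AUC", verbose=5)
# rf_cv.fit(rf.data_onehot, rf.data.buyer_yes)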

1. Create predictions for each product

pentathlon_nptb['p_control_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb['p_racquet_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb['p_team_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb['p_backcountry_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb['p_water_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb['p_strength_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb['p_endurance_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb
custid buyer total_os message age female income education children freq_endurance ... ordersize_random_nn ep_random_nn pred_rf p_control_rf p_racquet_rf p_team_rf p_backcountry_rf p_water_rf p_strength_rf p_endurance_rf
0 U1 no 0 team 30 to 44 no 55000 19 0.8 0 ... 0.404840 0.001868 0.001000 0.004 0.001 0.001 0.008000 0.012 0.028 0.070000
1 U3 no 0 backcountry 45 to 59 no 35000 22 1.0 0 ... 0.162739 0.000173 0.000000 0.000 0.000 0.000 0.000000 0.000 0.000 0.000000
2 U13 no 0 endurance 45 to 59 yes 45000 33 0.7 0 ... 0.604567 0.002138 0.001000 0.002 0.000 0.002 0.001333 0.000 0.000 0.001000
3 U20 no 0 water 45 to 59 yes 25000 24 0.2 0 ... -0.002303 -0.000001 0.000000 0.000 0.000 0.000 0.000000 0.000 0.000 0.000000
4 U25 no 0 racquet >= 60 yes 65000 32 1.1 1 ... 0.780294 0.002988 0.019000 0.009 0.019 0.007 0.023000 0.009 0.010 0.005000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
599995 U3462888 no 0 water >= 60 yes 40000 26 0.6 0 ... -0.482575 -0.000317 0.000000 0.000 0.000 0.000 0.000000 0.000 0.000 0.000000
599996 U3462900 no 0 team < 30 no 55000 32 0.9 3 ... 0.155320 0.000285 0.000000 0.001 0.001 0.000 0.000000 0.000 0.001 0.000000
599997 U3462902 no 0 team < 30 yes 55000 32 0.9 0 ... 0.153595 0.000152 0.000000 0.006 0.003 0.000 0.029000 0.005 0.010 0.003000
599998 U3462916 no 0 team < 30 no 50000 35 0.6 2 ... -0.042065 -0.000051 0.011000 0.002 0.005 0.011 0.001000 0.000 0.000 0.000000
599999 U3462922 no 0 endurance 30 to 44 yes 50000 25 0.7 1 ... 0.483980 0.000514 0.021667 0.002 0.000 0.005 0.002000 0.001 0.003 0.021667

600000 rows × 92 columns

To identify which message will lead to the highest probability of purchase, we use our random forest model to extend predictions to the full database.

repl = {
    "p_control_rf": "control", 
    "p_endurance_rf": "endurance", 
    "p_backcountry_rf": "backcountry", 
    "p_racquet_rf": "racquet", 
    "p_strength_rf": "strength", 
    "p_team_rf": "team", 
    "p_water_rf": "water"}
predictions_rf = [
    "p_control_rf", 
    "p_endurance_rf", 
    "p_backcountry_rf", 
    "p_racquet_rf", 
    "p_strength_rf", 
    "p_team_rf", 
    "p_water_rf"]
pentathlon_nptb['message_rf'] = (
    pentathlon_nptb[predictions_rf]
    .idxmax(axis=1)
    .map(repl)
)

The code above creates a new column in the pentathlon_nptb data, message_rf, which records the name of the message with the highest predicted purchase probability for each customer.

#plot the distribution of message_rf column
pentathlon_nptb['message_rf'].value_counts().plot(kind='bar', title = 'Distribution of message_rf')

pentathlon_nptb['p_max_rf'] = pentathlon_nptb[predictions_rf].max(axis=1)

Approach
Our approach to targeted messaging is to build a random forest model that predicts the next product a customer will buy from a variety of customer features, using buyer_yes as the response variable and the other variables as predictors.

We use the random forest model to extend predictions to the full database, again using data_cmd to predict the probability of purchase under each message.

Finally, we select the product with the highest probability using idxmax, which also labels the category with the maximum predicted probability of buying across all the products. Then, p_max_rf is created to store the maximum predicted purchase probability for the best message for each customer.

2. Percentage of customers for whom each message (or no message) is best

pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].message_rf, columns="count").apply(rsm.format_nr)
col_0 count
message_rf
backcountry 18,952
control 53,019
endurance 29,327
racquet 18,859
strength 23,927
team 18,809
water 17,107
pentathlon_nptb['message_rf'].value_counts(normalize=True)
message_rf
control        0.299772
endurance      0.162343
strength       0.132712
racquet        0.104497
backcountry    0.103977
team           0.102348
water          0.094352
Name: proportion, dtype: float64
  • control: For about 30% of customers, this model predicts the highest purchase probability under the no-message (control) condition, a notably larger share than any single message.

  • endurance, strength: The endurance and strength messages have the next-largest shares (about 16% and 13%). There might be opportunities to further optimize or tailor these messages to increase their effectiveness.

  • backcountry, racquet, water, team: These all cover roughly 10% of customers each, suggesting they are less often the best choice than control, endurance, or strength.

pentathlon_nptb['message_rf'].value_counts(normalize=True).plot(kind='bar', title='Message Distribution')

pentathlon_nptb.loc[pentathlon_nptb.training == 0, predictions_rf].agg("mean").sort_values(
    ascending=False).apply(rsm.format_nr, perc=True)
p_endurance_rf      2.91%
p_strength_rf       2.79%
p_water_rf          2.63%
p_team_rf           2.61%
p_backcountry_rf    2.55%
p_racquet_rf        2.55%
p_control_rf        2.43%
dtype: object

Above is a table of the average purchase probability if we sent each product's message to everyone. Even though the model assigns the control condition to far more customers than any other category, control has the lowest average purchase probability. This is an interesting insight, and we should consider other ways to optimize our messaging strategy, for example by shifting more messages toward the endurance and strength categories.

3. Expected profit

# Average order size per message
ordersize = pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer_yes == 1)].groupby("message_rf")["total_os"].mean()
ordersize
message_rf
backcountry    64.065387
control        49.908751
endurance      55.584893
racquet        56.416727
strength       56.684246
team           56.522449
water          61.926676
Name: total_os, dtype: float64

Create another random forest model to predict the order size

rf_os = rsm.model.rforest(
    data = {"pentathlon_nptb": pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer == "yes")]},
    rvar ="total_os",
    evar = evar,
    n_estimators = 200,
    max_features = 'sqrt'
)
rf_os.summary()
Random Forest
Data                 : pentathlon_nptb
Response variable    : total_os
Level                : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
OOB                  : True
Model type           : classification
Nr. of features      : (12, 21)
Nr. of observations  : 10,080
max_features         : sqrt (4)
n_estimators         : 200
min_samples_leaf     : 1
random_state         : 1234
AUC                  : nan

Estimation data      :
 income  education  children  freq_endurance  freq_strength  freq_water  freq_team  freq_backcountry  freq_racquet  message_backcountry  message_control  message_endurance  message_racquet  message_strength  message_team  message_water  age_< 30  age_30 to 44  age_45 to 59  age_>= 60  female_no
  65000         52       0.5               3              0           0          2                 1             3                False            False               True            False             False         False          False     False         False          True      False       True
  55000         43       0.2               1              2           0          0                 2             0                False             True              False            False             False         False          False     False         False          True      False       True
 160000         82       1.3              13              5           0         15                 5             2                 True            False              False            False             False         False          False     False          True         False      False       True
  90000         60       1.2               7              2           2          5                 0             1                 True            False              False            False             False         False          False     False         False          True      False       True
  55000         40       0.7               0              1           0          1                 1             0                False            False              False            False              True         False          False     False          True         False      False       True
/Users/duyentran/Library/Python/3.11/lib/python/site-packages/pyrsm/model/perf.py:1240: RuntimeWarning:

invalid value encountered in scalar divide
pentathlon_nptb['pos_control_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb['pos_racquet_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb['pos_team_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb['pos_backcountry_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb['pos_water_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb['pos_strength_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb['pos_endurance_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
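The seven nearly identical predict calls above can be collapsed into a loop; a minimal sketch, assuming the same rf_os model and pyrsm predict call used above:

# Sketch: one loop instead of seven copy-pasted predict calls
for m in ["control", "racquet", "team", "backcountry", "water", "strength", "endurance"]:
    pentathlon_nptb[f"pos_{m}_rf"] = rf_os.predict(
        pentathlon_nptb, data_cmd={"message": m}
    )["prediction"]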

Calculating the expected profit for each product
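Each expected-profit column combines the purchase-probability and order-size models with the 40% profit margin implied by 60% COGS: EP_m = P(buy | m) × E[order size | buy, m] × 0.4. For example, a 5% purchase probability and a €56 predicted order size give 0.05 × 56 × 0.4 = €1.12 of expected profit per e-mail.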

pentathlon_nptb['ep_control_rf'] = pentathlon_nptb['p_control_rf'] * pentathlon_nptb['pos_control_rf'] * 0.4
pentathlon_nptb['ep_racquet_rf'] = pentathlon_nptb['p_racquet_rf'] * pentathlon_nptb['pos_racquet_rf'] * 0.4
pentathlon_nptb['ep_team_rf'] = pentathlon_nptb['p_team_rf'] * pentathlon_nptb['pos_team_rf'] * 0.4
pentathlon_nptb['ep_backcountry_rf'] = pentathlon_nptb['p_backcountry_rf'] * pentathlon_nptb['pos_backcountry_rf'] * 0.4
pentathlon_nptb['ep_water_rf'] = pentathlon_nptb['p_water_rf'] * pentathlon_nptb['pos_water_rf'] * 0.4
pentathlon_nptb['ep_strength_rf'] = pentathlon_nptb['p_strength_rf'] * pentathlon_nptb['pos_strength_rf'] * 0.4
pentathlon_nptb['ep_endurance_rf'] = pentathlon_nptb['p_endurance_rf'] * pentathlon_nptb['pos_endurance_rf'] * 0.4
expected_profit_rf = [
    "ep_control_rf", 
    "ep_endurance_rf", 
    "ep_backcountry_rf", 
    "ep_racquet_rf", 
    "ep_strength_rf", 
    "ep_team_rf", 
    "ep_water_rf"]
repl = {"ep_control_rf": "control", "ep_endurance_rf": "endurance", "ep_backcountry_rf": "backcountry", "ep_racquet_rf": "racquet", "ep_strength_rf": "strength", "ep_team_rf": "team", "ep_water_rf": "water"}
pentathlon_nptb["ep_message_rf"] = (
    pentathlon_nptb[expected_profit_rf]
    .idxmax(axis=1)
    .map(repl)
)
pentathlon_nptb['ep_message_rf'].value_counts()
ep_message_rf
control        171206
endurance       88843
water           72279
strength        71154
backcountry     68671
team            65027
racquet         62820
Name: count, dtype: int64
4. Percentage of customers for whom no message maximizes their expected profit
pentathlon_nptb.ep_message_rf.value_counts(normalize=True)
ep_message_rf
control        0.285343
endurance      0.148072
water          0.120465
strength       0.118590
backcountry    0.114452
team           0.108378
racquet        0.104700
Name: proportion, dtype: float64

5. Expected profit per email if we customize the message to each customer

pentathlon_nptb["ep_max_rf"] = pentathlon_nptb[expected_profit_rf].max(axis=1)
pentathlon_nptb["ep_max_rf"].mean()
1.0516237742883443
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].ep_message_rf, columns="count").apply(rsm.format_nr)
col_0            count
ep_message_rf
backcountry     21,405
control         50,280
endurance       26,442
racquet         18,786
strength        21,082
team            19,690
water           22,315

6. Expected profit per e-mailed customer if every customer receives the same message

(
    pentathlon_nptb
    .loc[pentathlon_nptb.training == 0, expected_profit_rf]
    .agg("mean")
    .sort_values(ascending=False)
    .apply(rsm.format_nr, sym = "$", dec = 2)
)
ep_water_rf          $0.65
ep_backcountry_rf    $0.65
ep_endurance_rf      $0.64
ep_strength_rf       $0.63
ep_team_rf           $0.58
ep_racquet_rf        $0.58
ep_control_rf        $0.49
dtype: object

7. Expected profit per e-mailed customer if every customer is assigned randomly to one of the messages or the no-message condition

# probability of purchase when a customer is assigned to a random message
pentathlon_nptb["p_random_rf"] = rf.predict(pentathlon_nptb)["prediction"]

# expected avg order size where customer is assigned to a random message
pentathlon_nptb["ordersize_random_reg"] = reg.predict(pentathlon_nptb)["prediction"]

# expected profit when customer is assigned to a random message
pentathlon_nptb["ep_random_rf"] = pentathlon_nptb["p_random_rf"] * pentathlon_nptb["ordersize_random_reg"] * 0.4

# expected profit/customer when a customer is assigned to a random message
random_profit_per_customer_rf = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_random_rf"].mean()
random_profit_per_customer_rf
0.5981461370014217
# expected profit where no-message is sent (control)
pentathlon_nptb["ep_control_rf"]

# expected profit per customer where no-message is sent (control)
control_profit_per_customer_rf = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_control_rf"].mean()
control_profit_per_customer_rf
0.48744714166587705

8. Profit for 5,000,000 customers

profit_rf = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_max_rf"].agg("mean") * 5000000
profit_rf
4947465.355117234
# Profit when a customer is sent a random message
random_profit_rf = random_profit_per_customer_rf * 5000000
random_profit_rf

# Profit when there is no message sent
control_profit_rf = control_profit_per_customer_rf * 5000000
control_profit_rf
2437235.708329385
profit_improvement_rf = profit_rf - control_profit_rf
profit_improvement_rf
2510229.6467878493
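For comparison with the later XGBoost question, the improvement can also be expressed in percent of the no-message baseline; a one-line follow-up using the values computed above:

# improvement relative to the no-message baseline, in percent
profit_improvement_rf / control_profit_rf * 100  # ≈ 103%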
profits_dct_rf = {
    "Customize Message": profit_rf,
    "Randomly Assign": random_profit_rf,
    "No Message Sent": control_profit_rf,
}


import seaborn as sns
import matplotlib.pyplot as plt

# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct_rf.items()), columns=['Model', 'Profit'])
plt.figure(figsize=(10, 5))  # Adjust the width and height to your preference
# Plot
sns.set(style="white")
ax = sns.barplot(x="Model", y="Profit", data=df, palette="coolwarm")

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45)  # Rotate x labels for better readability
plt.show()
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/2755170672.py:16: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

XGBoost Model

rvar = "buyer_yes"

evar = pentathlon_nptb.columns.to_list()
evar = evar[evar.index("message"):] # all columns after "message"
evar = evar[:evar.index("freq_racquet")+1] # all columns before "freq_racquet"
evar
['message',
 'age',
 'female',
 'income',
 'education',
 'children',
 'freq_endurance',
 'freq_strength',
 'freq_water',
 'freq_team',
 'freq_backcountry',
 'freq_racquet']
# Create X_train, X_test, y_train, y_test using "training" column
X_train = pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, evar]
X_test = pentathlon_nptb.loc[pentathlon_nptb['training'] == 0, evar]
y_train = pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, rvar]
y_test = pentathlon_nptb.loc[pentathlon_nptb['training'] == 0, rvar]
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier


# Define the parameter grid to be used in GridSearchCV
param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100],
    'max_depth': [3, 6, 10]  # depths from 3 to 10
}
# Initialize the XGBClassifier
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss',enable_categorical = True)
# Initialize GridSearchCV
xgb_tuned = GridSearchCV(estimator=xgb, param_grid= param_grid, scoring='roc_auc', cv=5, verbose=5)
# Fit the model
xgb_tuned.fit(X_train, y_train)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.889 total time=   0.8s
[CV 2/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.884 total time=   0.7s
[CV 3/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.890 total time=   0.7s
[CV 4/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.882 total time=   0.7s
[CV 5/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.880 total time=   0.7s
[CV 1/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.891 total time=   1.1s
[CV 2/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.886 total time=   1.1s
[CV 3/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.892 total time=   1.1s
[CV 4/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.884 total time=   1.1s
[CV 5/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.881 total time=   1.1s
[CV 1/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.884 total time=   1.6s
[CV 2/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.879 total time=   1.8s
[CV 3/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.885 total time=   1.6s
[CV 4/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.875 total time=   1.7s
[CV 5/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.874 total time=   1.6s
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.864 total time=   0.6s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.864 total time=   0.6s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.870 total time=   0.6s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.856 total time=   0.6s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.854 total time=   0.6s
[CV 1/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.880 total time=   1.0s
[CV 2/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.876 total time=   1.0s
[CV 3/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.882 total time=   1.0s
[CV 4/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.873 total time=   1.0s
[CV 5/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.870 total time=   1.0s
[CV 1/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.882 total time=   1.6s
[CV 2/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.876 total time=   1.6s
[CV 3/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.885 total time=   1.6s
[CV 4/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.877 total time=   1.6s
[CV 5/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.875 total time=   1.7s
[CV 1/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.850 total time=   0.6s
[CV 2/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.840 total time=   0.5s
[CV 3/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.848 total time=   0.6s
[CV 4/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.837 total time=   0.6s
[CV 5/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.837 total time=   0.5s
[CV 1/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.874 total time=   1.0s
[CV 2/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.869 total time=   1.0s
[CV 3/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.876 total time=   1.0s
[CV 4/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.867 total time=   0.9s
[CV 5/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.866 total time=   0.9s
[CV 1/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.874 total time=   1.5s
[CV 2/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.869 total time=   1.6s
[CV 3/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.880 total time=   1.5s
[CV 4/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.864 total time=   1.4s
[CV 5/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.868 total time=   1.5s
GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, device=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=True,
                                     eval_metric='logloss', feature_types=None,
                                     gamma=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=N...
                                     max_cat_threshold=None,
                                     max_cat_to_onehot=None,
                                     max_delta_step=None, max_depth=None,
                                     max_leaves=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     multi_strategy=None, n_estimators=None,
                                     n_jobs=None, num_parallel_tree=None,
                                     random_state=None, ...),
             param_grid={'learning_rate': [0.1, 0.01, 0.001],
                         'max_depth': [3, 6, 10], 'n_estimators': [100]},
             scoring='roc_auc', verbose=5)
# print best parameters
best_params = xgb_tuned.best_params_
print(best_params)

# print best score
best_score = xgb_tuned.best_score_
print(best_score)
{'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 100}
0.886970840782522
# Train model on best parameters
best_xgb = XGBClassifier(**best_params, use_label_encoder=False, eval_metric="logloss", enable_categorical=True)

# Fit new model to training data
best_xgb.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=True, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=6,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, random_state=None, ...)
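As a quick sanity check (not part of the original analysis), the tuned classifier can be scored on the hold-out split; a minimal sketch, assuming X_test and y_test are still the full test split created earlier:

from sklearn.metrics import roc_auc_score

# AUC on the hold-out (training == 0) split
print(roc_auc_score(y_test, best_xgb.predict_proba(X_test)[:, 1]))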
# Feature importance of xgb model
from xgboost import plot_importance
import matplotlib.pyplot as plt

# Assuming best_xgb is your trained XGBoost model

# Plot feature importance
plot_importance(best_xgb, importance_type='weight')
plt.show()

Questions

1. For each customer, determine the message (i.e., endurance, strength, water, team, backcountry, racquet, or no-message) predicted to lead to the highest probability of purchase.

To determine the message that leads to the highest probability of purchase for each customer, we fit the model separately on the data for each message level and use each fitted model to create predictions for the full dataset. Once we have a predicted probability for every message, we select, for each record, the message with the highest predicted value.

# Create a training copy of the data where 'training' equals 1
pentathlon_nptb_train = pentathlon_nptb[pentathlon_nptb['training'] == 1].copy()

# Define the 7 different "message" values
message_values = ['team', 'backcountry', 'endurance', 'water', 'racquet', 'strength', 'control']

for value in message_values:
    # Filter the training set for the current message
    current_train = pentathlon_nptb_train[pentathlon_nptb_train['message'] == value]
    
    # Create X_train and y_train for the current message
    X_train = current_train[evar]
    y_train = current_train[rvar]
    
    # Fit the model to the training data for the current message
    best_xgb.fit(X_train, y_train)
    
    # Predict purchase probabilities for the entire dataset under this message's model
    pentathlon_nptb[f'p_{value}_xgb'] = best_xgb.predict_proba(pentathlon_nptb[evar])[:, 1]
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/1516221055.py:19: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` (emitted once per message in the loop)
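The PerformanceWarning can be avoided by collecting the new columns first and joining them in a single step, as the warning itself suggests; a minimal sketch of the same per-message loop:

# Build all probability columns at once and join them with pd.concat
preds = {}
for value in message_values:
    current_train = pentathlon_nptb_train[pentathlon_nptb_train['message'] == value]
    best_xgb.fit(current_train[evar], current_train[rvar])
    preds[f'p_{value}_xgb'] = best_xgb.predict_proba(pentathlon_nptb[evar])[:, 1]
pentathlon_nptb = pd.concat(
    [pentathlon_nptb, pd.DataFrame(preds, index=pentathlon_nptb.index)], axis=1
)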
repl = {
    "p_control_xgb": "control", 
    "p_endurance_xgb": "endurance", 
    "p_backcountry_xgb": "backcountry", 
    "p_racquet_xgb": "racquet", 
    "p_strength_xgb": "strength", 
    "p_team_xgb": "team", 
    "p_water_xgb": "water"}
predictions_xgb = [
    "p_control_xgb", 
    "p_endurance_xgb", 
    "p_backcountry_xgb", 
    "p_racquet_xgb", 
    "p_strength_xgb", 
    "p_team_xgb", 
    "p_water_xgb"]
pentathlon_nptb['message_xgb'] = (
    pentathlon_nptb[predictions_xgb]
    .idxmax(axis=1)
    .map(repl)
)
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/3215817490.py:1: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
pd.crosstab(index=pentathlon_nptb.message_xgb, columns="count").sort_values("count", ascending=True).apply(rsm.format_nr)
col_0            count
message_xgb
backcountry     56,966
control         58,865
water           64,209
racquet         70,013
team            71,928
strength        98,794
endurance      179,225

2. For each message, the percentage of customers for whom that message or no-message maximizes their probability of purchase:

We can see from the plot that the endurance message most often maximizes a customer's probability of purchase (about 30% of customers), with strength a distant second (about 16%). Surprisingly, although team, racquet, and water are the next best message groups, each maximizes purchase probability for only 10-12% of customers, which is barely more than sending no message at all (about 10%).

message_percentages= pentathlon_nptb['message_xgb'].value_counts(normalize=True)
message_percentages
message_xgb
endurance      0.298708
strength       0.164657
team           0.119880
racquet        0.116688
water          0.107015
control        0.098108
backcountry    0.094943
Name: proportion, dtype: float64
# Plotting
plt.figure(figsize=(10, 6))
message_percentages.plot(kind='bar')
plt.ylabel('Percentage of Customers')
plt.xlabel('Message Type')
plt.title('Percentage of Customers for Each Message Value')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()

3. For each customer, determine the message (i.e., endurance, strength, water, team, backcountry, racquet, or no-message) predicted to lead to the highest expected profit (COGS is 60%).

The message generating the highest total expected profit is endurance.

To predict order size, we tuned a new XGBoost regressor on the data and predicted the likely order size for each record. We then calculated the expected profit for each record by multiplying the predicted order size by the probability of purchase and by the 40% profit margin (1 minus the 60% COGS).

rvar_os = "total_os"
evar = ['message',
 'age',
 'female',
 'income',
 'education',
 'children',
 'freq_endurance',
 'freq_strength',
 'freq_water',
 'freq_team',
 'freq_backcountry',
 'freq_racquet']
# Create X_train, X_test, y_train, y_test using "training" column
X_train = pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, evar]
X_test = pentathlon_nptb.loc[pentathlon_nptb['training'] == 0, evar]
y_train_os = pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, 'total_os']
y_test_os = pentathlon_nptb.loc[pentathlon_nptb['training'] == 0, 'total_os']
from xgboost import XGBRegressor

# Define the parameter grid to be used in GridSearchCV
param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],  
    'n_estimators': [100],
    'max_depth': [3, 6, 10]  # Depths from 3 to 10
}

# Initialize the XGBRegressor
xgb_os = XGBRegressor(use_label_encoder=False, eval_metric='rmse', enable_categorical=True)

# Initialize GridSearchCV with a scoring metric suitable for regression
xgb_tuned_os = GridSearchCV(estimator=xgb_os, param_grid=param_grid, scoring='neg_root_mean_squared_error', cv=5, verbose=5)

# Assuming X_train and y_train are defined
# Fit the model
xgb_tuned_os.fit(X_train, y_train_os)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-12.399 total time=   0.4s
[CV 2/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-12.532 total time=   0.4s
[CV 3/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-12.659 total time=   0.4s
[CV 4/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-12.504 total time=   0.4s
[CV 5/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-11.961 total time=   0.5s
[CV 1/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.700 total time=   0.6s
[CV 2/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.639 total time=   0.7s
[CV 3/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.807 total time=   0.7s
[CV 4/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.660 total time=   0.7s
[CV 5/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.062 total time=   0.7s
[CV 1/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-13.273 total time=   1.1s
[CV 2/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-13.104 total time=   1.2s
[CV 3/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-13.254 total time=   1.1s
[CV 4/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-12.978 total time=   1.3s
[CV 5/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-12.550 total time=   1.2s
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.550 total time=   0.4s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.691 total time=   0.4s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.816 total time=   0.4s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.654 total time=   0.4s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.135 total time=   0.4s
[CV 1/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.568 total time=   0.8s
[CV 2/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.663 total time=   0.7s
[CV 3/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.795 total time=   0.9s
[CV 4/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.622 total time=   0.8s
[CV 5/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.095 total time=   0.8s
[CV 1/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.658 total time=   1.5s
[CV 2/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.755 total time=   1.5s
[CV 3/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.850 total time=   1.4s
[CV 4/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.661 total time=   1.4s
[CV 5/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.113 total time=   1.5s
[CV 1/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-12.971 total time=   0.4s
[CV 2/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-13.065 total time=   0.4s
[CV 3/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-13.206 total time=   0.4s
[CV 4/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-13.016 total time=   0.4s
[CV 5/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-12.524 total time=   0.5s
[CV 1/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-12.966 total time=   0.8s
[CV 2/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-13.058 total time=   0.8s
[CV 3/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-13.195 total time=   0.7s
[CV 4/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-13.005 total time=   0.7s
[CV 5/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-12.511 total time=   0.8s
[CV 1/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-12.957 total time=   1.6s
[CV 2/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-13.059 total time=   1.6s
[CV 3/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-13.197 total time=   1.5s
[CV 4/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-13.001 total time=   1.8s
[CV 5/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-12.500 total time=   1.6s
GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, device=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=True, eval_metric='rmse',
                                    feature_types=None, gamma=None,
                                    grow_policy=None, importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=None,...
                                    max_cat_to_onehot=None, max_delta_step=None,
                                    max_depth=None, max_leaves=None,
                                    min_child_weight=None, missing=nan,
                                    monotone_constraints=None,
                                    multi_strategy=None, n_estimators=None,
                                    n_jobs=None, num_parallel_tree=None,
                                    random_state=None, ...),
             param_grid={'learning_rate': [0.1, 0.01, 0.001],
                         'max_depth': [3, 6, 10], 'n_estimators': [100]},
             scoring='neg_root_mean_squared_error', verbose=5)
# After fitting, get the best parameters from the order-size grid search and retrain the model
best_params_os = xgb_tuned_os.best_params_
best_xgb_os = XGBRegressor(**best_params_os, use_label_encoder=False, eval_metric="rmse", enable_categorical=True)

# Fit the new model to training data
best_xgb_os.fit(X_train, y_train_os)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=True, eval_metric='rmse', feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=6, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=100, n_jobs=None,
             num_parallel_tree=None, random_state=None, ...)
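Analogous to the classifier, a quick hold-out check for the order-size regressor (not part of the original analysis); a minimal sketch, assuming X_test and y_test_os as defined above:

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE on the hold-out (training == 0) split
print(np.sqrt(mean_squared_error(y_test_os, best_xgb_os.predict(X_test))))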
# Create a training copy of the data where 'training' equals 1
pentathlon_nptb_train = pentathlon_nptb[pentathlon_nptb['training'] == 1].copy()

# Define the 7 different "message" values
message_values = ['team', 'backcountry', 'endurance', 'water', 'racquet', 'strength', 'control']

for value in message_values:
    # Filter the training set for the current message
    current_train = pentathlon_nptb_train[pentathlon_nptb_train['message'] == value]
    
    # Create X_train and y_train for the current message
    X_train = current_train[evar]
    y_train = current_train[rvar_os]
    
    # Fit the model to the training data for the current message
    best_xgb_os.fit(X_train, y_train)
    
    # Predict order size for the entire dataset
    pentathlon_nptb[f'os_{value}_xgb'] = best_xgb_os.predict(pentathlon_nptb[evar]).round(0)
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/777574624.py:19: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` (emitted once per message in the loop)
values = ['team', 'backcountry', 'endurance', 'water', 'racquet', 'strength', 'control']
Profit_margin = 1-0.6


for value in values:
    pentathlon_nptb[f'ep_{value}_xgb'] = pentathlon_nptb[f'p_{value}_xgb'] * pentathlon_nptb[f'os_{value}_xgb'] * Profit_margin
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/1144112266.py:6: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` (emitted once per message in the loop)
expected_profit_xgb = [
    "ep_control_xgb", 
    "ep_endurance_xgb", 
    "ep_backcountry_xgb", 
    "ep_racquet_xgb", 
    "ep_strength_xgb", 
    "ep_team_xgb", 
    "ep_water_xgb"]
repl = {"ep_control_xgb": "control", "ep_endurance_xgb": "endurance", "ep_backcountry_xgb": "backcountry", "ep_racquet_xgb": "racquet", "ep_strength_xgb": "strength", "ep_team_xgb": "team", "ep_water_xgb": "water"}
pentathlon_nptb["ep_message_xgb"] = (
    pentathlon_nptb[expected_profit_xgb]
    .idxmax(axis=1)
    .map(repl)
)
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/3163695219.py:2: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
# Define a function to look up the profit value based on ep_message_xgb
def get_ep(row):
    # Mapping each ep_message_xgb value to the corresponding profit column
    mapping = {
        'control': 'ep_control_xgb',
        'endurance': 'ep_endurance_xgb',
        'backcountry': 'ep_backcountry_xgb',
        'racquet': 'ep_racquet_xgb',
        'strength': 'ep_strength_xgb',
        'team': 'ep_team_xgb',
        'water': 'ep_water_xgb',
    }
    
    # Get the profit column name for the current row's message
    profit_col = mapping.get(row['ep_message_xgb'])
    
    # Return the value from the corresponding profit column for this row
    if profit_col:
        return row[profit_col]
    else:
        return None

# Apply the function to each row to create a new column with the corresponding profit value
pentathlon_nptb['ep_message_max_p'] = pentathlon_nptb.apply(get_ep, axis=1)
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/1040504524.py:24: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
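Because the column picked by idxmax is by definition the largest of the seven expected-profit columns, the row-wise apply above is equivalent to a single vectorized call; a much faster alternative using the column list defined earlier:

# Equivalent, vectorized alternative to the row-wise apply above
pentathlon_nptb['ep_message_max_p'] = pentathlon_nptb[expected_profit_xgb].max(axis=1)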
ep_profit = pentathlon_nptb.groupby('ep_message_xgb')['ep_message_max_p'].sum().sort_values(ascending=False).round(2)
ep_profit
ep_message_xgb
endurance      43977.37
strength       37358.12
water          30991.18
backcountry    26557.67
team           21474.36
racquet        18135.07
control         9657.12
Name: ep_message_max_p, dtype: float64
# Bar chart of total expected profit by message
plt.figure(figsize=(10, 6))
ep_profit.plot(kind='bar')
plt.ylabel('Expected Profit by Message')
plt.xlabel('Message Type')
plt.title('Total Profit')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()

4. Report, for each message (endurance, racquet, etc.) and no-message, the percentage of customers for whom that (no) message maximizes their expected profit.

Message_max = pentathlon_nptb.ep_message_xgb.value_counts(normalize=True)
Message_max
ep_message_xgb
control        0.576705
strength       0.091172
endurance      0.082492
water          0.067795
team           0.066832
backcountry    0.065020
racquet        0.049985
Name: proportion, dtype: float64
# Plotting
plt.figure(figsize=(10, 6))
Message_max.plot(kind='bar')
plt.ylabel('Percentage of Customers')
plt.xlabel('Message Type')
plt.title('Percentage of Customers for Each Message Value')
plt.xticks(rotation=45)  # Rotate labels to make them readable

plt.show()

5. The expected profit we can obtain, on average, per customer if we customize the message to each customer

# ep_message_max_p already includes the 40% margin, so no further adjustment is needed
avg_profit_pc = pentathlon_nptb[pentathlon_nptb.training == 0].groupby('ep_message_xgb')['ep_message_max_p'].mean().sort_values(ascending=False).round(2)
avg_profit_pc
ep_message_xgb
endurance      0.84
strength       0.71
water          0.70
backcountry    0.66
racquet        0.58
team           0.49
control        0.02
Name: ep_message_max_p, dtype: float64
# Bar chart of average expected profit per customer
plt.figure(figsize=(10, 6))
avg_profit_pc.plot(kind='bar')
plt.ylabel('Expected Profit Per Customer')
plt.xlabel('Message Type')
plt.title('Average Profit')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()

6. The expected profit per e-mailed customer if every customer receives the same message

expected_profit_xgb = [
    "ep_control_xgb", 
    "ep_endurance_xgb", 
    "ep_backcountry_xgb", 
    "ep_racquet_xgb", 
    "ep_strength_xgb", 
    "ep_team_xgb", 
    "ep_water_xgb"]

# Expected profit per customer for each message (the ep_ columns already include the 40% margin)
epp_same_message = pentathlon_nptb[pentathlon_nptb.training == 0][expected_profit_xgb].mean().sort_values(ascending=False)
# Bar chart of expected profit per customer by message
plt.figure(figsize=(10, 6))
epp_same_message.plot(kind='bar')
plt.ylabel('Expected Profit Per Customer')
plt.xlabel('Message Type')
plt.title('Average Profit')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()

7. The expected profit per e-mailed customer if every customer is assigned randomly to one of the messages or the no-message condition

# purchase probability from the most recently fitted per-message model; note this
# omits the order-size term, so the figure below understates expected profit
pentathlon_nptb['ep_random_xgb'] = best_xgb.predict_proba(pentathlon_nptb[evar])[:, 1]
xgb_random_ppc = pentathlon_nptb[pentathlon_nptb.training == 0]['ep_random_xgb'].mean() * Profit_margin
xgb_random_ppc
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/3633768350.py:1: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
0.008396312594413757
expected_profit_xgb = [
    "ep_control_xgb", 
    "ep_endurance_xgb", 
    "ep_backcountry_xgb", 
    "ep_racquet_xgb", 
    "ep_strength_xgb", 
    "ep_team_xgb", 
    "ep_water_xgb",
    "ep_random_xgb"]

# Expected profit per customer for each message
epp_same_message = pentathlon_nptb[pentathlon_nptb.training == 0][expected_profit_xgb].mean().sort_values(ascending=False)
# Bar chart of expected profit per customer, including the random-assignment column
plt.figure(figsize=(10, 6))
epp_same_message.plot(kind='bar')
plt.ylabel('Expected Profit Per Customer')
plt.xlabel('Message Type')
plt.title('Average Profit')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()

8. For the typical promotional e-mail blast to 5,000,000 customers, the improvement (in percent and in total Euros) Pentathlon could achieve by customizing the message (or no-message) to each customer:

pentathlon_nptb["ep_max_xgb"] = pentathlon_nptb[expected_profit_xgb].max(axis=1)
pentathlon_nptb["ep_max_xgb"].mean()
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/1479503728.py:1: PerformanceWarning:

DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
0.31611574
# Profit where each customer is assigned to the message with the highest expected profit
profit_xgb = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_max_xgb"].agg("mean") * 5000000
profit_xgb 
1511183.9771270752
# Profit where each customer is sent a random message
random_profit_xgb = xgb_random_ppc * 5000000
random_profit_xgb
41981.56297206879
# expected profit where no-message is sent (control)
pentathlon_nptb["ep_control_xgb"]

# expected profit per customer where no-message is sent (control)
control_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_control_xgb"].mean()
control_profit_per_customer
0.078689866
# Profit where no message is sent
control_profit_xgb = control_profit_per_customer * 5000000
control_profit_xgb
393449.3288397789
profit_improvement_xgb = profit_xgb - control_profit_xgb
profit_improvement_xgb
1117734.6482872963
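The question also asks for the improvement in percent; a one-line follow-up using the values computed above:

# improvement relative to the no-message baseline, in percent
profit_improvement_xgb / control_profit_xgb * 100  # ≈ 284%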
profits_dct = {
    "Customize Message": profit_xgb,
    "No Message Sent": control_profit_xgb,
    "Randomly Assign": random_profit_xgb,
}


import seaborn as sns
import matplotlib.pyplot as plt

# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct.items()), columns=['Model', 'Profit'])
plt.figure(figsize=(10, 5))  
# Plot
sns.set(style="whitegrid")
ax = sns.barplot(x="Model", y="Profit", data=df, palette="viridis")

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45) 
plt.show()
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/1574157434.py:17: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

Profit improvement for the four models

mod_perf = pd.DataFrame({
    "model": ["logit", "nn", "rf", "xgb"],
    "profit": [profit_improvement_lr, profit_improvement_nn, profit_improvement_rf, profit_improvement_xgb]
})
mod_perf
   model        profit
0  logit  1.238407e+06
1  nn     1.262522e+06
2  rf     2.510230e+06
3  xgb    1.117735e+06
profits_dct = {
    "Logit": profit_improvement_lr,
    "Neural Network": profit_improvement_nn,
    "Random Forest" : profit_improvement_rf,
    "XGBoost": profit_improvement_xgb
    
}


import seaborn as sns
import matplotlib.pyplot as plt

# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct.items()), columns=['Model', 'Profit'])
plt.figure(figsize=(10, 5))  
# Plot
sns.set(style="whitegrid")
ax = sns.barplot(x="Model", y="Profit", data=df, palette="viridis")

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit Improvement by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45) 
plt.show()
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/2950013552.py:18: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

According to the chart, Random Forest outperforms the other models in terms of profit improvement, but it took more than 2 hours to tune.

  • Weaknesses of the Random Forest Model:

Computational Complexity: Random Forest can be computationally intensive, especially as the number of trees and the depth of each tree increase; this is the likely cause of the long tuning time. The complexity grows with the size of the dataset, the number of features, and the model parameters.

Scalability: Because of the computational cost, scaling Random Forest to very large datasets or high-dimensional data can be challenging and might require significant computational resources.

Model Interpretability: While not as complex as some other models like neural networks, Random Forests are still considered a “black box” compared to simpler models like Logistic Regression. This can make it difficult to understand the decision-making process of the model and to explain individual predictions.

Tuning Required: Random Forest models have several hyperparameters (like the number of trees, max depth, min samples split, etc.) that can greatly influence performance. Finding the optimal settings can be time-consuming and require a lot of trial and error or automated search techniques like grid search or random search, which can further increase the computational burden.

  • Suggested Improvement:

Model Simplification: We could simplify the model by reducing the number of trees or the depth of each tree, which would reduce training time at the cost of some accuracy.

Feature Selection: Reducing the number of features through feature selection can significantly decrease training time and sometimes even improve model performance by eliminating noise.

Incremental Learning: Some models allow for incremental learning, where the model is trained in stages or on chunks of the dataset sequentially, which can reduce memory consumption and potentially training time; a sketch of this idea follows.
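A minimal sketch of staged training using scikit-learn's warm_start option, which grows the forest in increments so each fit call only adds new trees. This assumes a one-hot encoded version of the classification training split from the XGBoost section, since scikit-learn forests require numeric inputs:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# scikit-learn forests need numeric inputs, so one-hot encode the categoricals
X_train_enc = pd.get_dummies(pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, evar])
y_train_cls = pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, 'buyer_yes']

# Grow the forest in stages: each fit() call only adds the new trees
rf_inc = RandomForestClassifier(n_estimators=50, warm_start=True, n_jobs=-1, random_state=1234)
for n_trees in (50, 100, 150, 200):
    rf_inc.set_params(n_estimators=n_trees)
    rf_inc.fit(X_train_enc, y_train_cls)
    # stop early here if out-of-sample performance has plateaued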