import pandas as pd
import pyrsm as rsm
Pentathlon: Next Product to Buy Models
PROJECT SUMMARY
• Project Goal: Optimize profit by tailoring email campaigns for 7 departments to individual customers based on predictive modeling.
Modeling Approach:
• Logistic Regression: Utilized for its ability to predict the probability of customer engagement based on email content.
• Linear Regression: Employed to predict average order size from each customer as a continuous outcome.
• Random Forest: Chosen for its robustness and ability to capture non-linear patterns and feature interactions.
• Neural Network: Applied to leverage its deep learning capabilities in identifying complex patterns within customer data.
• XGBoost: Selected for its efficiency, model performance, and ability to handle a variety of data types and distributions.
Model Training and Tuning:
• Conducted comprehensive training and hyperparameter tuning for each model to ensure optimal performance.
• Employed methods like cross-validation and grid search, particularly for computationally intensive models like Random Forest.
Profit Analysis:
• Evaluated the predicted profit increase from each model to identify the most effective approach.
• Analyzed model-specific trade-offs, such as training time (e.g., Random Forest’s extensive training period).
Email Policy Evaluation:
• Assessed the impact of each model on the email policy proposal to ensure alignment with profit optimization goals.
Data Preparation
## loading the data - this dataset must NOT be changed
pentathlon_nptb = pd.read_parquet("pentathlon_nptb.parquet")
Pentathlon: Next Product To Buy
The available data is based on the last e-mail sent to each Pentathlon customer. Hence, an observation or row in the data is a “customer-promotional e-mail” pair. The data contains the following basic demographic information available to Pentathlon:
- “age”: Customer age (coded in 4 buckets: “<30”, “30 to 44”, “45 to 59”, and “>= 60”)
- “female”: Gender identity coded as Female “yes” or “no”
- “income”: Income in Euros, rounded to the nearest EUR5,000
- “education”: Percentage of college graduates in the customer’s neighborhood, coded from 0-100
- “children”: Average number of children in the customer’s neighborhood
The data also contains basic historical information about customer purchases, specifically, a department-specific frequency measure.
- “freq_endurance-freq_racquet”: Number of purchases in each department in the last year, excluding any purchase in response to the last email.
The key outcome variables are:
- “buyer”: Did the customer click on the e-mail and complete a purchase within two days of receiving the e-mail (“yes” or “no”)?
- “total_os”: Total order size (in Euros) conditional on the customer having purchased (buyer == “yes”). This measures spending across all departments, not just the department that sent the message
Note: In addition to the six message groups, a seventh group of customers received no promotional e-mails for the duration of the test (“control”).
pentathlon_nptb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600000 entries, 0 to 599999
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 custid 600000 non-null object
1 buyer 600000 non-null category
2 total_os 600000 non-null int32
3 message 600000 non-null category
4 age 600000 non-null category
5 female 600000 non-null category
6 income 600000 non-null int32
7 education 600000 non-null int32
8 children 600000 non-null float64
9 freq_endurance 600000 non-null int32
10 freq_strength 600000 non-null int32
11 freq_water 600000 non-null int32
12 freq_team 600000 non-null int32
13 freq_backcountry 600000 non-null int32
14 freq_racquet 600000 non-null int32
15 training 600000 non-null float64
dtypes: category(4), float64(2), int32(9), object(1)
memory usage: 36.6+ MB
pentathlon_nptb['buyer_yes'] = rsm.ifelse(pentathlon_nptb['buyer'] == 'yes', 1, 0)
pentathlon_nptb['buyer_yes'].value_counts()
buyer_yes
0 585600
1 14400
Name: count, dtype: int64
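For reference, the baseline response rate implied by these counts is about 2.4% (14,400 buyers out of 600,000 emails). A one-line check, not part of the required analysis:

# overall response rate across all customer-email pairs
pentathlon_nptb["buyer_yes"].mean()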
Check the training and test data to make sure they are split in a 70-30 ratio.
"message"], pentathlon_nptb["training"]) pd.crosstab(pentathlon_nptb[
training | 0.0 | 1.0 |
---|---|---|
message | ||
backcountry | 26179 | 60425 |
control | 26043 | 61217 |
endurance | 24773 | 58083 |
racquet | 26316 | 60772 |
strength | 25251 | 59029 |
team | 25942 | 60850 |
water | 25496 | 59624 |
pentathlon_nptb.training.value_counts(normalize=True)
training
1.0 0.7
0.0 0.3
Name: proportion, dtype: float64
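As an additional check (a minimal sketch), the 70-30 split can also be verified within each message group by normalizing the crosstab by row:

# share of test (0.0) vs. training (1.0) observations within each message group
pd.crosstab(pentathlon_nptb["message"], pentathlon_nptb["training"], normalize="index").round(2)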
pentathlon_nptb.hist(figsize=(10, 10), bins=30)
[Histogram grid (30 bins each) for: total_os, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet, training, and buyer_yes]
Logistic Regression Model
Check the cross-tabs between the outcome variable and the key predictor variables.
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "age")
ct.summary(output="perc_row")
ct.plot(plots="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, age
Null hyp : There is no association between buyer and age
Alt. hyp : There is an association between buyer and age
Row percentages:
age < 30 30 to 44 45 to 59 >= 60 Total
buyer
yes 9.42% 40.74% 38.18% 11.65% 100.0%
no 17.7% 31.16% 32.42% 18.72% 100.0%
Total 17.5% 31.39% 32.56% 18.55% 100.0%
Chi-squared: 1038.94 df(3), p.value < .001
0.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "children")
ct.summary(output="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, children
Null hyp : There is no association between buyer and children
Alt. hyp : There is an association between buyer and children
Row percentages:
children 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 \
buyer
yes 1.58% 2.16% 3.52% 4.06% 5.13% 7.48% 7.55% 10.71% 10.63%
no 2.98% 4.2% 6.18% 7.23% 7.9% 10.42% 9.56% 10.81% 9.11%
Total 2.94% 4.15% 6.12% 7.15% 7.84% 10.35% 9.51% 10.8% 9.15%
children 1.1 ... 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.8 \
buyer ...
yes 9.74% ... 0.0% 0.0% 0.0% 0.01% 0.0% 0.0% 0.0% 0.0% 0.0%
no 7.18% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
Total 7.24% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
children Total
buyer
yes 100.0%
no 100.0%
Total 100.0%
[3 rows x 53 columns]
Chi-squared: 2028.29 df(51), p.value < .001
700.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "message")
ct.summary(output="perc_row")
ct.plot(plots="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, message
Null hyp : There is no association between buyer and message
Alt. hyp : There is an association between buyer and message
Row percentages:
message backcountry control endurance racquet strength team water Total
buyer
yes 13.97% 13.27% 15.37% 13.77% 14.85% 14.58% 14.19% 100.0%
no 14.4% 14.61% 13.79% 14.49% 14.03% 14.49% 14.2% 100.0%
Total 14.39% 14.58% 13.83% 14.47% 14.05% 14.49% 14.2% 100.0%
Chi-squared: 39.15 df(6), p.value < .001
0.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_endurance")
ct.summary(output="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, freq_endurance
Null hyp : There is no association between buyer and freq_endurance
Alt. hyp : There is an association between buyer and freq_endurance
Row percentages:
freq_endurance 0 1 2 3 4 5 6 7 \
buyer
yes 24.2% 16.15% 14.66% 12.23% 9.58% 7.76% 5.24% 3.76%
no 56.16% 22.7% 10.21% 5.33% 2.82% 1.42% 0.72% 0.34%
Total 55.4% 22.54% 10.32% 5.5% 2.98% 1.58% 0.83% 0.42%
freq_endurance 8 9 10 11 12 13 14 15 Total
buyer
yes 2.56% 1.72% 0.97% 0.57% 0.34% 0.14% 0.11% 0.02% 100.0%
no 0.16% 0.08% 0.03% 0.01% 0.01% 0.0% 0.0% 0.0% 100.0%
Total 0.22% 0.12% 0.05% 0.03% 0.01% 0.01% 0.0% 0.0% 100.0%
Chi-squared: 21259.53 df(15), p.value < .001
150.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_strength")
ct.summary(output="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, freq_strength
Null hyp : There is no association between buyer and freq_strength
Alt. hyp : There is an association between buyer and freq_strength
Row percentages:
freq_strength 0 1 2 3 4 5 6 7 \
buyer
yes 21.4% 9.33% 9.92% 8.99% 8.33% 8.32% 6.73% 6.43%
no 46.85% 18.9% 11.59% 7.77% 5.32% 3.54% 2.26% 1.51%
Total 46.24% 18.67% 11.55% 7.8% 5.4% 3.65% 2.37% 1.63%
freq_strength 8 9 ... 15 16 17 18 19 20 \
buyer ...
yes 5.1% 4.09% ... 0.65% 0.51% 0.23% 0.25% 0.11% 0.1%
no 0.94% 0.55% ... 0.02% 0.01% 0.01% 0.0% 0.0% 0.0%
Total 1.04% 0.64% ... 0.03% 0.02% 0.01% 0.01% 0.0% 0.0%
freq_strength 21 22 23 Total
buyer
yes 0.04% 0.05% 0.01% 100.0%
no 0.0% 0.0% 0.0% 100.0%
Total 0.0% 0.0% 0.0% 100.0%
[3 rows x 25 columns]
Chi-squared: 22032.18 df(23), p.value < .001
300.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_water")
ct.summary(output="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, freq_water
Null hyp : There is no association between buyer and freq_water
Alt. hyp : There is an association between buyer and freq_water
Row percentages:
freq_water 0 1 2 3 4 5 6 7 8 \
buyer
yes 64.79% 18.38% 9.3% 4.7% 1.88% 0.66% 0.21% 0.06% 0.01%
no 92.53% 5.72% 1.33% 0.32% 0.08% 0.01% 0.0% 0.0% 0.0%
Total 91.87% 6.03% 1.52% 0.42% 0.12% 0.03% 0.01% 0.0% 0.0%
freq_water Total
buyer
yes 100.0%
no 100.0%
Total 100.0%
Chi-squared: 16794.94 df(8), p.value < .001
125.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_team")
ct.summary(output="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, freq_team
Null hyp : There is no association between buyer and freq_team
Alt. hyp : There is an association between buyer and freq_team
Row percentages:
freq_team 0 1 2 3 4 5 6 7 8 \
buyer
yes 39.2% 10.78% 9.88% 9.12% 8.45% 6.8% 4.74% 4.07% 2.71%
no 59.12% 16.61% 10.14% 6.28% 3.73% 2.08% 1.08% 0.53% 0.24%
Total 58.64% 16.47% 10.14% 6.35% 3.84% 2.19% 1.17% 0.61% 0.3%
freq_team 9 10 11 12 13 14 15 16 Total
buyer
yes 1.87% 1.07% 0.62% 0.32% 0.12% 0.2% 0.05% 0.02% 100.0%
no 0.11% 0.05% 0.02% 0.01% 0.0% 0.0% 0.0% 0.0% 100.0%
Total 0.16% 0.07% 0.04% 0.02% 0.01% 0.01% 0.0% 0.0% 100.0%
Chi-squared: 13743.4 df(16), p.value < .001
175.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_backcountry")
ct.summary(output="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, freq_backcountry
Null hyp : There is no association between buyer and freq_backcountry
Alt. hyp : There is an association between buyer and freq_backcountry
Row percentages:
freq_backcountry 0 1 2 3 4 5 6 Total
buyer
yes 45.69% 20.77% 18.53% 10.39% 3.85% 0.73% 0.03% 100.0%
no 61.34% 28.61% 8.24% 1.57% 0.22% 0.02% 0.0% 100.0%
Total 60.97% 28.42% 8.49% 1.78% 0.3% 0.03% 0.0% 100.0%
Chi-squared: 11914.73 df(6), p.value < .001
50.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "freq_racquet")
ct.summary(output="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, freq_racquet
Null hyp : There is no association between buyer and freq_racquet
Alt. hyp : There is an association between buyer and freq_racquet
Row percentages:
freq_racquet 0 1 2 3 4 5 6 7 \
buyer
yes 23.83% 18.48% 17.7% 13.82% 10.43% 7.43% 4.32% 2.42%
no 48.75% 28.86% 12.68% 5.76% 2.48% 0.97% 0.34% 0.12%
Total 48.15% 28.61% 12.8% 5.95% 2.68% 1.13% 0.44% 0.17%
freq_racquet 8 9 10 11 Total
buyer
yes 0.97% 0.46% 0.13% 0.02% 100.0%
no 0.03% 0.01% 0.0% 0.0% 100.0%
Total 0.05% 0.02% 0.0% 0.0% 100.0%
Chi-squared: 18774.72 df(11), p.value < .001
100.0% of cells have expected values below 5
ct = rsm.basics.cross_tabs(pentathlon_nptb.query("training == 1"), "buyer", "income")
ct.summary(output="perc_row")
Cross-tabs
Data : Not provided
Variables: buyer, income
Null hyp : There is no association between buyer and income
Alt. hyp : There is an association between buyer and income
Row percentages:
income 0 5000 10000 15000 20000 25000 30000 35000 40000 \
buyer
yes 0.0% 0.01% 0.0% 0.16% 0.26% 0.52% 0.84% 1.3% 2.22%
no 0.0% 0.03% 0.19% 0.73% 2.13% 4.78% 8.77% 11.78% 14.0%
Total 0.0% 0.03% 0.18% 0.71% 2.09% 4.68% 8.58% 11.53% 13.72%
income 45000 ... 160000 165000 170000 175000 180000 185000 190000 195000 \
buyer ...
yes 3.83% ... 0.26% 0.37% 0.19% 0.14% 0.19% 0.14% 0.1% 0.11%
no 12.71% ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0%
Total 12.5% ... 0.01% 0.01% 0.01% 0.01% 0.01% 0.0% 0.0% 0.0%
income 200000 Total
buyer
yes 0.54% 100.0%
no 0.01% 100.0%
Total 0.02% 100.0%
[3 rows x 42 columns]
Chi-squared: 40347.2 df(40), p.value < .001
400.0% of cells have expected values below 5
"training"] == 1] pentathlon_nptb[pentathlon_nptb[
custid | buyer | total_os | message | age | female | income | education | children | freq_endurance | freq_strength | freq_water | freq_team | freq_backcountry | freq_racquet | training | buyer_yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U1 | no | 0 | team | 30 to 44 | no | 55000 | 19 | 0.8 | 0 | 4 | 0 | 4 | 0 | 1 | 1.0 | 0 |
2 | U13 | no | 0 | endurance | 45 to 59 | yes | 45000 | 33 | 0.7 | 0 | 0 | 0 | 0 | 2 | 2 | 1.0 | 0 |
3 | U20 | no | 0 | water | 45 to 59 | yes | 25000 | 24 | 0.2 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 |
5 | U28 | no | 0 | strength | < 30 | yes | 25000 | 18 | 0.3 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 |
8 | U59 | no | 0 | strength | >= 60 | yes | 65000 | 36 | 1.2 | 1 | 1 | 0 | 2 | 0 | 3 | 1.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
599989 | U3462835 | no | 0 | control | 30 to 44 | yes | 45000 | 20 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1.0 | 0 |
599992 | U3462858 | no | 0 | control | 45 to 59 | yes | 40000 | 16 | 1.1 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 |
599993 | U3462877 | no | 0 | backcountry | < 30 | no | 65000 | 30 | 1.0 | 2 | 3 | 1 | 0 | 1 | 1 | 1.0 | 0 |
599995 | U3462888 | no | 0 | water | >= 60 | yes | 40000 | 26 | 0.6 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0 |
599997 | U3462902 | no | 0 | team | < 30 | yes | 55000 | 32 | 0.9 | 0 | 5 | 0 | 2 | 1 | 2 | 1.0 | 0 |
420000 rows × 17 columns
Build a logistic regression model
evar = pentathlon_nptb.columns.to_list()
evar = evar[evar.index("message"):]
evar = evar[:evar.index("freq_racquet") + 1]
evar
['message',
'age',
'female',
'income',
'education',
'children',
'freq_endurance',
'freq_strength',
'freq_water',
'freq_team',
'freq_backcountry',
'freq_racquet']
lr = rsm.model.logistic(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar="buyer_yes",
    evar=evar,
)
lr.summary()
Logistic regression (GLM)
Data : pentathlon_nptb
Response variable : buyer_yes
Level : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Null hyp.: There is no effect of x on buyer_yes
Alt. hyp.: There is an effect of x on buyer_yes
OR OR% coefficient std.error z.value p.value
Intercept 0.000 -100.0% -8.28 0.064 -128.837 < .001 ***
message[control] 0.906 -9.4% -0.10 0.042 -2.335 0.02 *
message[endurance] 1.250 25.0% 0.22 0.041 5.462 < .001 ***
message[racquet] 0.993 -0.7% -0.01 0.042 -0.171 0.864
message[strength] 1.162 16.2% 0.15 0.041 3.660 < .001 ***
message[team] 1.035 3.5% 0.03 0.041 0.823 0.41
message[water] 1.056 5.6% 0.05 0.042 1.311 0.19
age[30 to 44] 2.386 138.6% 0.87 0.039 22.353 < .001 ***
age[45 to 59] 2.186 118.6% 0.78 0.040 19.555 < .001 ***
age[>= 60] 1.201 20.1% 0.18 0.048 3.805 < .001 ***
female[no] 1.353 35.3% 0.30 0.024 12.574 < .001 ***
income 1.000 0.0% 0.00 0.000 22.105 < .001 ***
education 1.037 3.7% 0.04 0.001 32.809 < .001 ***
children 1.618 61.8% 0.48 0.030 16.210 < .001 ***
freq_endurance 1.101 10.1% 0.10 0.005 18.300 < .001 ***
freq_strength 1.115 11.5% 0.11 0.003 33.044 < .001 ***
freq_water 1.178 17.8% 0.16 0.014 12.026 < .001 ***
freq_team 1.050 5.0% 0.05 0.004 10.797 < .001 ***
freq_backcountry 1.188 18.8% 0.17 0.010 17.313 < .001 ***
freq_racquet 1.110 11.0% 0.10 0.007 15.866 < .001 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared (McFadden): 0.261
Pseudo R-squared (McFadden adjusted): 0.26
Area under the ROC Curve (AUC): 0.884
Log-likelihood: -35159.009, AIC: 70358.018, BIC: 70576.978
Chi-squared: 24788.884, df(19), p.value < 0.001
Nr obs: 420,000
lr.coef.round(3)
index | OR | OR% | coefficient | std.error | z.value | p.value | ||
---|---|---|---|---|---|---|---|---|
0 | Intercept | 0.000 | -99.975 | -8.283 | 0.064 | -128.837 | 0.000 | *** |
1 | message[T.control] | 0.906 | -9.372 | -0.098 | 0.042 | -2.335 | 0.020 | * |
2 | message[T.endurance] | 1.250 | 24.991 | 0.223 | 0.041 | 5.462 | 0.000 | *** |
3 | message[T.racquet] | 0.993 | -0.711 | -0.007 | 0.042 | -0.171 | 0.864 | |
4 | message[T.strength] | 1.162 | 16.227 | 0.150 | 0.041 | 3.660 | 0.000 | *** |
5 | message[T.team] | 1.035 | 3.458 | 0.034 | 0.041 | 0.823 | 0.410 | |
6 | message[T.water] | 1.056 | 5.606 | 0.055 | 0.042 | 1.311 | 0.190 | |
7 | age[T.30 to 44] | 2.386 | 138.587 | 0.870 | 0.039 | 22.353 | 0.000 | *** |
8 | age[T.45 to 59] | 2.186 | 118.557 | 0.782 | 0.040 | 19.555 | 0.000 | *** |
9 | age[T.>= 60] | 1.201 | 20.099 | 0.183 | 0.048 | 3.805 | 0.000 | *** |
10 | female[T.no] | 1.353 | 35.328 | 0.303 | 0.024 | 12.574 | 0.000 | *** |
11 | income | 1.000 | 0.002 | 0.000 | 0.000 | 22.105 | 0.000 | *** |
12 | education | 1.037 | 3.736 | 0.037 | 0.001 | 32.809 | 0.000 | *** |
13 | children | 1.618 | 61.764 | 0.481 | 0.030 | 16.210 | 0.000 | *** |
14 | freq_endurance | 1.101 | 10.083 | 0.096 | 0.005 | 18.300 | 0.000 | *** |
15 | freq_strength | 1.115 | 11.550 | 0.109 | 0.003 | 33.044 | 0.000 | *** |
16 | freq_water | 1.178 | 17.847 | 0.164 | 0.014 | 12.026 | 0.000 | *** |
17 | freq_team | 1.050 | 4.978 | 0.049 | 0.004 | 10.797 | 0.000 | *** |
18 | freq_backcountry | 1.188 | 18.795 | 0.172 | 0.010 | 17.313 | 0.000 | *** |
19 | freq_racquet | 1.110 | 11.008 | 0.104 | 0.007 | 15.866 | 0.000 | *** |
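As a sanity check on how the odds ratios relate to the coefficients, note that OR = exp(coefficient). A quick illustration (values rounded, using numpy):

import numpy as np

# exp(0.870) is approximately 2.387, matching the reported OR for age[30 to 44] up to rounding
np.exp(0.870).round(3)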
lr.summary(main=False, vif=True)
Pseudo R-squared (McFadden): 0.261
Pseudo R-squared (McFadden adjusted): 0.26
Area under the ROC Curve (AUC): 0.884
Log-likelihood: -35159.009, AIC: 70358.018, BIC: 70576.978
Chi-squared: 24788.884, df(19), p.value < 0.001
Nr obs: 420,000
Variance inflation factors:
vif Rsq
education 3.219 0.689
income 2.814 0.645
freq_endurance 1.686 0.407
freq_racquet 1.603 0.376
freq_strength 1.470 0.320
children 1.423 0.297
freq_team 1.339 0.253
age 1.307 0.235
freq_water 1.268 0.212
freq_backcountry 1.145 0.126
female 1.112 0.101
message 1.001 0.001
"vimp") lr.plot(
pentathlon_nptb['pred_lr'] = lr.predict(pentathlon_nptb)['prediction']
= {"train" : pentathlon_nptb.query("training == 1"), "test" : pentathlon_nptb.query("training == 0")}
dct = rsm.gains_plot(dct, "buyer", "yes", "pred_lr") fig
from sklearn import metrics
# prediction on training set
pred = pentathlon_nptb.query("training == 1")['pred_lr']
actual = pentathlon_nptb.query("training == 1")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(3)
0.884
# prediction on test set
pred = pentathlon_nptb.query("training == 0")['pred_lr']
actual = pentathlon_nptb.query("training == 0")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(3)
0.883
We can conclude that:
The AUC for the training set is 0.884, which indicates that the model has a very good ability to distinguish between the two classes in the training data.
The AUC for the test set is 0.883, which is slightly lower but still indicates a very good predictive performance on unseen data.
The fact that the test AUC is close to the training AUC suggests that the model is generalizing well and not overfitting significantly to the training data. A small decrease from training to test set performance is normal because models will usually perform slightly better on the data they were trained on.
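To avoid repeating this AUC code for every model fit below, a small helper can be used (a sketch; the helper name and structure are ours, not part of pyrsm):

from sklearn import metrics

def auc_by_split(df, pred_col, actual_col="buyer_yes"):
    """Return the AUC on the training and test split for a prediction column."""
    out = {}
    for label, flag in [("train", 1), ("test", 0)]:
        subset = df[df["training"] == flag]
        fpr, tpr, _ = metrics.roc_curve(subset[actual_col], subset[pred_col])
        out[label] = round(metrics.auc(fpr, tpr), 3)
    return out

# example: auc_by_split(pentathlon_nptb, "pred_lr") should reproduce the values above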
1. For each customer, determine the message (i.e., endurance, strength, water, team, backcountry, racquet, or no-message) predicted to lead to the highest probability of purchase
# Check the distribution of the message variable (one level per department, plus control)
pentathlon_nptb["message"].value_counts()
message
control 87260
racquet 87088
team 86792
backcountry 86604
water 85120
strength 84280
endurance 82856
Name: count, dtype: int64
# Create predictions
"p_control_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["p_racquet_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["p_team_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["p_backcountry_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["p_water_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["p_strength_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["p_endurance_lr"] = lr.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb[ pentathlon_nptb
custid | buyer | total_os | message | age | female | income | education | children | freq_endurance | ... | training | buyer_yes | pred_lr | p_control_lr | p_racquet_lr | p_team_lr | p_backcountry_lr | p_water_lr | p_strength_lr | p_endurance_lr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U1 | no | 0 | team | 30 to 44 | no | 55000 | 19 | 0.8 | 0 | ... | 1.0 | 0 | 0.013031 | 0.011433 | 0.012512 | 0.013031 | 0.012601 | 0.013298 | 0.014615 | 0.015700 |
1 | U3 | no | 0 | backcountry | 45 to 59 | no | 35000 | 22 | 1.0 | 0 | ... | 0.0 | 0 | 0.005123 | 0.004645 | 0.005086 | 0.005299 | 0.005123 | 0.005408 | 0.005949 | 0.006395 |
2 | U13 | no | 0 | endurance | 45 to 59 | yes | 45000 | 33 | 0.7 | 0 | ... | 1.0 | 0 | 0.011948 | 0.008692 | 0.009515 | 0.009910 | 0.009582 | 0.010114 | 0.011120 | 0.011948 |
3 | U20 | no | 0 | water | 45 to 59 | yes | 25000 | 24 | 0.2 | 0 | ... | 1.0 | 0 | 0.002361 | 0.002027 | 0.002220 | 0.002313 | 0.002236 | 0.002361 | 0.002598 | 0.002793 |
4 | U25 | no | 0 | racquet | >= 60 | yes | 65000 | 32 | 1.1 | 1 | ... | 0.0 | 0 | 0.011717 | 0.010706 | 0.011717 | 0.012203 | 0.011800 | 0.012453 | 0.013688 | 0.014705 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
599995 | U3462888 | no | 0 | water | >= 60 | yes | 40000 | 26 | 0.6 | 0 | ... | 1.0 | 0 | 0.002182 | 0.001873 | 0.002052 | 0.002138 | 0.002067 | 0.002182 | 0.002401 | 0.002582 |
599996 | U3462900 | no | 0 | team | < 30 | no | 55000 | 32 | 0.9 | 3 | ... | 0.0 | 0 | 0.007347 | 0.006442 | 0.007053 | 0.007347 | 0.007103 | 0.007499 | 0.008247 | 0.008863 |
599997 | U3462902 | no | 0 | team | < 30 | yes | 55000 | 32 | 0.9 | 0 | ... | 1.0 | 0 | 0.009142 | 0.008017 | 0.008776 | 0.009142 | 0.008839 | 0.009330 | 0.010258 | 0.011023 |
599998 | U3462916 | no | 0 | team | < 30 | no | 50000 | 35 | 0.6 | 2 | ... | 0.0 | 0 | 0.006615 | 0.005799 | 0.006350 | 0.006615 | 0.006395 | 0.006751 | 0.007425 | 0.007981 |
599999 | U3462922 | no | 0 | endurance | 30 to 44 | yes | 50000 | 25 | 0.7 | 1 | ... | 0.0 | 0 | 0.010293 | 0.007484 | 0.008194 | 0.008535 | 0.008252 | 0.008710 | 0.009578 | 0.010293 |
600000 rows × 25 columns
We then extend the prediction to the full database and see how the predictions are distributed.
"message_lr"] = pentathlon_nptb[[
pentathlon_nptb["p_control_lr",
"p_endurance_lr",
"p_backcountry_lr",
"p_racquet_lr",
"p_strength_lr",
"p_team_lr",
"p_water_lr"]].idxmax(axis=1)
"message_lr"].value_counts() pentathlon_nptb[
message_lr
p_endurance_lr 600000
Name: count, dtype: int64
After checking the distribution, we can see that the model predicts that every customer should be sent the same message (endurance). The problem is that the specified model is not flexible enough to allow customization across customers: without interactions between message and the other predictors, the seven per-message predictions differ only by the message coefficients, so the ranking of messages is identical for everyone.
To allow the best message to vary across customers, we need a model that captures interaction effects between message and the other explanatory variables. We build that model next and then use it to predict the next product to buy for the full database.
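A quick way to see the problem with the main-effects model (a minimal sketch, reusing the p_*_lr columns created above): on the log-odds scale the per-message predictions differ only by the message coefficients, so the gap between any two messages is the same constant for every customer and the argmax never varies.

import numpy as np

def logit(p):
    return np.log(p / (1 - p))

# log-odds gap between the endurance and control predictions; its standard deviation should be ~0
gap = logit(pentathlon_nptb["p_endurance_lr"]) - logit(pentathlon_nptb["p_control_lr"])
gap.describe().round(3)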
=[f"{e}:message" for e in evar if e != "message"]
ivar ivar
['age:message',
'female:message',
'income:message',
'education:message',
'children:message',
'freq_endurance:message',
'freq_strength:message',
'freq_water:message',
'freq_team:message',
'freq_backcountry:message',
'freq_racquet:message']
The model is then:
lr_int = rsm.model.logistic(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb.training == 1]},
    rvar="buyer",
    lev="yes",
    evar=evar,
    ivar=ivar
)
lr_int.summary()
Logistic regression (GLM)
Data : pentathlon_nptb
Response variable : buyer
Level : yes
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Null hyp.: There is no effect of x on buyer
Alt. hyp.: There is an effect of x on buyer
OR OR% coefficient std.error z.value p.value
Intercept 0.000 -100.0% -8.25 0.153 -53.948 < .001 ***
message[control] 1.115 11.5% 0.11 0.220 0.496 0.62
message[endurance] 1.199 19.9% 0.18 0.213 0.850 0.395
message[racquet] 1.084 8.4% 0.08 0.217 0.370 0.711
message[strength] 0.959 -4.1% -0.04 0.216 -0.195 0.846
message[team] 0.881 -11.9% -0.13 0.217 -0.581 0.561
message[water] 0.903 -9.7% -0.10 0.219 -0.468 0.64
age[30 to 44] 1.979 97.9% 0.68 0.100 6.802 < .001 ***
age[45 to 59] 2.056 105.6% 0.72 0.102 7.033 < .001 ***
age[>= 60] 0.992 -0.8% -0.01 0.127 -0.060 0.952
female[no] 1.365 36.5% 0.31 0.064 4.833 < .001 ***
age[30 to 44]:message[control] 1.196 19.6% 0.18 0.146 1.225 0.22
age[45 to 59]:message[control] 1.016 1.6% 0.02 0.150 0.107 0.914
age[>= 60]:message[control] 1.155 15.5% 0.14 0.184 0.785 0.432
age[30 to 44]:message[endurance] 1.302 30.2% 0.26 0.142 1.855 0.064 .
age[45 to 59]:message[endurance] 1.142 14.2% 0.13 0.146 0.907 0.364
age[>= 60]:message[endurance] 1.473 47.3% 0.39 0.176 2.203 0.028 *
age[30 to 44]:message[racquet] 1.053 5.3% 0.05 0.144 0.358 0.72
age[45 to 59]:message[racquet] 1.004 0.4% 0.00 0.147 0.030 0.976
age[>= 60]:message[racquet] 1.269 26.9% 0.24 0.178 1.334 0.182
age[30 to 44]:message[strength] 1.307 30.7% 0.27 0.142 1.886 0.059 .
age[45 to 59]:message[strength] 0.947 -5.3% -0.05 0.146 -0.375 0.708
age[>= 60]:message[strength] 1.084 8.4% 0.08 0.178 0.456 0.649
age[30 to 44]:message[team] 1.268 26.8% 0.24 0.144 1.654 0.098 .
age[45 to 59]:message[team] 1.111 11.1% 0.10 0.147 0.713 0.476
age[>= 60]:message[team] 1.164 16.4% 0.15 0.180 0.841 0.4
age[30 to 44]:message[water] 1.359 35.9% 0.31 0.147 2.081 0.037 *
age[45 to 59]:message[water] 1.264 26.4% 0.23 0.151 1.552 0.121
age[>= 60]:message[water] 1.362 36.2% 0.31 0.184 1.677 0.094 .
female[no]:message[control] 0.974 -2.6% -0.03 0.092 -0.283 0.777
female[no]:message[endurance] 0.884 -11.6% -0.12 0.089 -1.391 0.164
female[no]:message[racquet] 1.192 19.2% 0.18 0.092 1.911 0.056 .
female[no]:message[strength] 1.015 1.5% 0.02 0.090 0.168 0.867
female[no]:message[team] 1.019 1.9% 0.02 0.090 0.206 0.837
female[no]:message[water] 0.897 -10.3% -0.11 0.091 -1.190 0.234
income 1.000 0.0% 0.00 0.000 8.324 < .001 ***
income:message[control] 1.000 0.0% 0.00 0.000 0.020 0.984
income:message[endurance] 1.000 0.0% 0.00 0.000 1.688 0.091 .
income:message[racquet] 1.000 0.0% 0.00 0.000 0.191 0.849
income:message[strength] 1.000 -0.0% -0.00 0.000 -0.368 0.713
income:message[team] 1.000 0.0% 0.00 0.000 0.053 0.958
income:message[water] 1.000 -0.0% -0.00 0.000 -0.353 0.724
education 1.041 4.1% 0.04 0.003 13.546 < .001 ***
education:message[control] 0.995 -0.5% -0.01 0.004 -1.273 0.203
education:message[endurance] 0.993 -0.7% -0.01 0.004 -1.722 0.085 .
education:message[racquet] 0.992 -0.8% -0.01 0.004 -1.959 0.05 .
education:message[strength] 1.002 0.2% 0.00 0.004 0.551 0.581
education:message[team] 1.000 -0.0% -0.00 0.004 -0.071 0.943
education:message[water] 0.997 -0.3% -0.00 0.004 -0.718 0.473
children 1.723 72.3% 0.54 0.078 6.982 < .001 ***
children:message[control] 0.838 -16.2% -0.18 0.114 -1.552 0.121
children:message[endurance] 0.885 -11.5% -0.12 0.109 -1.120 0.263
children:message[racquet] 0.933 -6.7% -0.07 0.110 -0.629 0.529
children:message[strength] 1.023 2.3% 0.02 0.110 0.207 0.836
children:message[team] 0.960 -4.0% -0.04 0.111 -0.369 0.712
children:message[water] 0.935 -6.5% -0.07 0.112 -0.602 0.547
freq_endurance 1.081 8.1% 0.08 0.014 5.617 < .001 ***
freq_endurance:message[control] 1.029 2.9% 0.03 0.020 1.442 0.149
freq_endurance:message[endurance] 0.990 -1.0% -0.01 0.020 -0.512 0.609
freq_endurance:message[racquet] 1.016 1.6% 0.02 0.020 0.792 0.428
freq_endurance:message[strength] 1.016 1.6% 0.02 0.020 0.804 0.421
freq_endurance:message[team] 1.043 4.3% 0.04 0.019 2.157 0.031 *
freq_endurance:message[water] 1.036 3.6% 0.04 0.020 1.807 0.071 .
freq_strength 1.114 11.4% 0.11 0.009 12.347 < .001 ***
freq_strength:message[control] 1.001 0.1% 0.00 0.012 0.045 0.964
freq_strength:message[endurance] 0.991 -0.9% -0.01 0.012 -0.732 0.464
freq_strength:message[racquet] 1.002 0.2% 0.00 0.012 0.142 0.887
freq_strength:message[strength] 0.998 -0.2% -0.00 0.012 -0.174 0.862
freq_strength:message[team] 1.003 0.3% 0.00 0.012 0.249 0.803
freq_strength:message[water] 1.015 1.5% 0.01 0.012 1.162 0.245
freq_water 1.139 13.9% 0.13 0.035 3.678 < .001 ***
freq_water:message[control] 1.036 3.6% 0.04 0.051 0.699 0.484
freq_water:message[endurance] 1.067 6.7% 0.06 0.051 1.272 0.204
freq_water:message[racquet] 1.028 2.8% 0.03 0.051 0.539 0.59
freq_water:message[strength] 1.014 1.4% 0.01 0.051 0.268 0.788
freq_water:message[team] 0.970 -3.0% -0.03 0.050 -0.602 0.547
freq_water:message[water] 1.158 15.8% 0.15 0.051 2.859 0.004 **
freq_team 1.037 3.7% 0.04 0.012 3.069 0.002 **
freq_team:message[control] 1.005 0.5% 0.01 0.017 0.302 0.763
freq_team:message[endurance] 0.999 -0.1% -0.00 0.017 -0.057 0.955
freq_team:message[racquet] 1.025 2.5% 0.02 0.017 1.431 0.152
freq_team:message[strength] 1.031 3.1% 0.03 0.017 1.788 0.074 .
freq_team:message[team] 0.993 -0.7% -0.01 0.017 -0.406 0.685
freq_team:message[water] 1.036 3.6% 0.04 0.017 2.081 0.037 *
freq_backcountry 1.210 21.0% 0.19 0.026 7.256 < .001 ***
freq_backcountry:message[control] 0.973 -2.7% -0.03 0.037 -0.727 0.467
freq_backcountry:message[endurance] 0.992 -0.8% -0.01 0.037 -0.224 0.823
freq_backcountry:message[racquet] 0.968 -3.2% -0.03 0.037 -0.864 0.388
freq_backcountry:message[strength] 0.982 -1.8% -0.02 0.037 -0.492 0.623
freq_backcountry:message[team] 0.976 -2.4% -0.02 0.037 -0.658 0.511
freq_backcountry:message[water] 0.981 -1.9% -0.02 0.038 -0.517 0.605
freq_racquet 1.100 10.0% 0.10 0.017 5.456 < .001 ***
freq_racquet:message[control] 1.034 3.4% 0.03 0.025 1.356 0.175
freq_racquet:message[endurance] 1.034 3.4% 0.03 0.025 1.375 0.169
freq_racquet:message[racquet] 1.039 3.9% 0.04 0.025 1.523 0.128
freq_racquet:message[strength] 0.975 -2.5% -0.03 0.025 -1.023 0.306
freq_racquet:message[team] 0.992 -0.8% -0.01 0.024 -0.345 0.73
freq_racquet:message[water] 0.998 -0.2% -0.00 0.025 -0.090 0.928
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pseudo R-squared (McFadden): 0.262
Pseudo R-squared (McFadden adjusted): 0.26
Area under the ROC Curve (AUC): 0.884
Log-likelihood: -35102.941, AIC: 70401.883, BIC: 71474.788
Chi-squared: 24901.02, df(97), p.value < 0.001
Nr obs: 420,000
"pred_lr_int"] = lr_int.predict(pentathlon_nptb)["prediction"] pentathlon_nptb[
= {"train": pentathlon_nptb.query("training == 1"), "test": pentathlon_nptb.query("training == 0")}
dct = rsm.gains_plot(dct, "buyer", "yes", "pred_lr_int") fig
fig = rsm.gains_plot(
    pentathlon_nptb,
    "buyer", "yes",
    ["pred_lr", "pred_lr_int"]
)
# prediction on training set
pred = pentathlon_nptb.query("training == 1")['pred_lr_int']
actual = pentathlon_nptb.query("training == 1")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(3)
0.884
# prediction on test set
pred = pentathlon_nptb.query("training == 0")['pred_lr_int']
actual = pentathlon_nptb.query("training == 0")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(3)
0.883
Now, let's repeat the analysis above with the new model.
# Create predictions
"p_controli_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["p_racqueti_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["p_teami_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["p_backcountryi_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["p_wateri_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["p_strengthi_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["p_endurancei_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb[ pentathlon_nptb
custid | buyer | total_os | message | age | female | income | education | children | freq_endurance | ... | p_endurance_lr | message_lr | pred_lr_int | p_controli_lr | p_racqueti_lr | p_teami_lr | p_backcountryi_lr | p_wateri_lr | p_strengthi_lr | p_endurancei_lr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U1 | no | 0 | team | 30 to 44 | no | 55000 | 19 | 0.8 | 0 | ... | 0.015700 | p_endurance_lr | 0.012008 | 0.012022 | 0.014499 | 0.012008 | 0.011131 | 0.012604 | 0.015452 | 0.015682 |
1 | U3 | no | 0 | backcountry | 45 to 59 | no | 35000 | 22 | 1.0 | 0 | ... | 0.006395 | p_endurance_lr | 0.005556 | 0.004605 | 0.005858 | 0.005279 | 0.005556 | 0.004981 | 0.005475 | 0.006014 |
2 | U13 | no | 0 | endurance | 45 to 59 | yes | 45000 | 33 | 0.7 | 0 | ... | 0.011948 | p_endurance_lr | 0.013884 | 0.009140 | 0.008789 | 0.009533 | 0.010718 | 0.009680 | 0.009343 | 0.013884 |
3 | U20 | no | 0 | water | 45 to 59 | yes | 25000 | 24 | 0.2 | 0 | ... | 0.002793 | p_endurance_lr | 0.002389 | 0.002252 | 0.002089 | 0.002264 | 0.002339 | 0.002389 | 0.002196 | 0.002970 |
4 | U25 | no | 0 | racquet | >= 60 | yes | 65000 | 32 | 1.1 | 1 | ... | 0.014705 | p_endurance_lr | 0.012039 | 0.010784 | 0.012039 | 0.011119 | 0.011526 | 0.011437 | 0.011463 | 0.019675 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
599995 | U3462888 | no | 0 | water | >= 60 | yes | 40000 | 26 | 0.6 | 0 | ... | 0.002582 | p_endurance_lr | 0.002042 | 0.001966 | 0.002118 | 0.001946 | 0.001947 | 0.002042 | 0.002089 | 0.003217 |
599996 | U3462900 | no | 0 | team | < 30 | no | 55000 | 32 | 0.9 | 3 | ... | 0.008863 | p_endurance_lr | 0.007692 | 0.006959 | 0.008257 | 0.007692 | 0.007884 | 0.005832 | 0.008117 | 0.007731 |
599997 | U3462902 | no | 0 | team | < 30 | yes | 55000 | 32 | 0.9 | 0 | ... | 0.011023 | p_endurance_lr | 0.008225 | 0.008535 | 0.008976 | 0.008225 | 0.010062 | 0.008286 | 0.009822 | 0.011345 |
599998 | U3462916 | no | 0 | team | < 30 | no | 50000 | 35 | 0.6 | 2 | ... | 0.007981 | p_endurance_lr | 0.006796 | 0.006378 | 0.007352 | 0.006796 | 0.007164 | 0.005272 | 0.007284 | 0.006971 |
599999 | U3462922 | no | 0 | endurance | 30 to 44 | yes | 50000 | 25 | 0.7 | 1 | ... | 0.010293 | p_endurance_lr | 0.011857 | 0.008527 | 0.007495 | 0.008401 | 0.007500 | 0.008214 | 0.009222 | 0.011857 |
600000 rows × 34 columns
We then extend the prediction to the full database and see how the predictions are distributed.
repl = {
    "p_controli_lr": "control",
    "p_endurancei_lr": "endurance",
    "p_backcountryi_lr": "backcountry",
    "p_racqueti_lr": "racquet",
    "p_strengthi_lr": "strength",
    "p_teami_lr": "team",
    "p_wateri_lr": "water"}
predictions_lr = [
    "p_controli_lr",
    "p_endurancei_lr",
    "p_backcountryi_lr",
    "p_racqueti_lr",
    "p_strengthi_lr",
    "p_teami_lr",
    "p_wateri_lr"]
Find the highest probability of purchase for each customer and label it with the message (product) that has the highest predicted probability of purchase.
"messagei_lr"] = (
pentathlon_nptb[
pentathlon_nptb[predictions_lr]=1)
.idxmax(axismap(repl)
.
) pentathlon_nptb
custid | buyer | total_os | message | age | female | income | education | children | freq_endurance | ... | message_lr | pred_lr_int | p_controli_lr | p_racqueti_lr | p_teami_lr | p_backcountryi_lr | p_wateri_lr | p_strengthi_lr | p_endurancei_lr | messagei_lr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U1 | no | 0 | team | 30 to 44 | no | 55000 | 19 | 0.8 | 0 | ... | p_endurance_lr | 0.012008 | 0.012022 | 0.014499 | 0.012008 | 0.011131 | 0.012604 | 0.015452 | 0.015682 | endurance |
1 | U3 | no | 0 | backcountry | 45 to 59 | no | 35000 | 22 | 1.0 | 0 | ... | p_endurance_lr | 0.005556 | 0.004605 | 0.005858 | 0.005279 | 0.005556 | 0.004981 | 0.005475 | 0.006014 | endurance |
2 | U13 | no | 0 | endurance | 45 to 59 | yes | 45000 | 33 | 0.7 | 0 | ... | p_endurance_lr | 0.013884 | 0.009140 | 0.008789 | 0.009533 | 0.010718 | 0.009680 | 0.009343 | 0.013884 | endurance |
3 | U20 | no | 0 | water | 45 to 59 | yes | 25000 | 24 | 0.2 | 0 | ... | p_endurance_lr | 0.002389 | 0.002252 | 0.002089 | 0.002264 | 0.002339 | 0.002389 | 0.002196 | 0.002970 | endurance |
4 | U25 | no | 0 | racquet | >= 60 | yes | 65000 | 32 | 1.1 | 1 | ... | p_endurance_lr | 0.012039 | 0.010784 | 0.012039 | 0.011119 | 0.011526 | 0.011437 | 0.011463 | 0.019675 | endurance |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
599995 | U3462888 | no | 0 | water | >= 60 | yes | 40000 | 26 | 0.6 | 0 | ... | p_endurance_lr | 0.002042 | 0.001966 | 0.002118 | 0.001946 | 0.001947 | 0.002042 | 0.002089 | 0.003217 | endurance |
599996 | U3462900 | no | 0 | team | < 30 | no | 55000 | 32 | 0.9 | 3 | ... | p_endurance_lr | 0.007692 | 0.006959 | 0.008257 | 0.007692 | 0.007884 | 0.005832 | 0.008117 | 0.007731 | racquet |
599997 | U3462902 | no | 0 | team | < 30 | yes | 55000 | 32 | 0.9 | 0 | ... | p_endurance_lr | 0.008225 | 0.008535 | 0.008976 | 0.008225 | 0.010062 | 0.008286 | 0.009822 | 0.011345 | endurance |
599998 | U3462916 | no | 0 | team | < 30 | no | 50000 | 35 | 0.6 | 2 | ... | p_endurance_lr | 0.006796 | 0.006378 | 0.007352 | 0.006796 | 0.007164 | 0.005272 | 0.007284 | 0.006971 | racquet |
599999 | U3462922 | no | 0 | endurance | 30 to 44 | yes | 50000 | 25 | 0.7 | 1 | ... | p_endurance_lr | 0.011857 | 0.008527 | 0.007495 | 0.008401 | 0.007500 | 0.008214 | 0.009222 | 0.011857 | endurance |
600000 rows × 35 columns
# Find the maximum probability of purchase
"p_max"] = pentathlon_nptb[predictions_lr].max(axis=1) pentathlon_nptb[
Approach: First, we build a logistic regression model to predict the next product to buy, using buyer_yes as the response variable and the other variables as predictors. After realizing that this model predicts that every customer should be sent the same product (endurance), we build a second model that captures the interaction effects between message and the other predictors.
We then extend the predictions to the full database. To predict the probability of purchase under each message, we use the data_cmd argument.
Finally, we choose the product with the highest probability using idxmax, which labels each customer with the category that has the maximum predicted probability of buying across all products. We then create p_max to store the maximum predicted probability of purchase for the best message selected for each customer.
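The same predict / label / store pattern can also be written more compactly with a loop. A sketch (it simply recreates the columns defined above):

messages = ["control", "endurance", "backcountry", "racquet", "strength", "team", "water"]

# predicted purchase probability for every customer under each possible message
for m in messages:
    pentathlon_nptb[f"p_{m}i_lr"] = lr_int.predict(pentathlon_nptb, data_cmd={"message": m})["prediction"]

prob_cols = [f"p_{m}i_lr" for m in messages]
labels = dict(zip(prob_cols, messages))
pentathlon_nptb["messagei_lr"] = pentathlon_nptb[prob_cols].idxmax(axis=1).map(labels)  # best message per customer
pentathlon_nptb["p_max"] = pentathlon_nptb[prob_cols].max(axis=1)                       # highest purchase probability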
2. For each message, report the percentage of customers for whom that message or no-message maximizes their probability of purchase
Let’s create a crosstab to see which product we should send to each customer.
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].messagei_lr, columns="count").apply(rsm.format_nr)
col_0 | count |
---|---|
messagei_lr | |
backcountry | 1,447 |
endurance | 125,861 |
racquet | 12,299 |
strength | 36,584 |
team | 1,680 |
water | 2,129 |
Report the percentage of customers for whom each message (or no message) is the best choice.
"messagei_lr"].value_counts(normalize=True) pentathlon_nptb[
messagei_lr
endurance 0.699675
strength 0.202202
racquet 0.069037
water 0.011852
team 0.009358
backcountry 0.007875
control 0.000002
Name: proportion, dtype: float64
- endurance: The distribution suggests that endurance-related messages resonate with customers significantly more than the other messages.
- strength: While not as dominant as endurance, strength-related messages also play an important role. There might be opportunities to further optimize or tailor these messages to increase their effectiveness.
- Other categories such as racquet, water, team, and backcountry have much smaller proportions, suggesting that they are less often the best choice for maximizing the probability of purchase.
- control: The negligible role of control in maximizing the probability suggests that any engagement, even if not perfectly optimized, tends to be better than no engagement at all. However, it’s also essential to consider the context and the potential costs of sending messages to customers.
Create a table with the average purchase probability if we were to send each product's message to everyone.
pentathlon_nptb.loc[pentathlon_nptb.training == 0, predictions_lr].agg("mean").sort_values(
    ascending=False).apply(rsm.format_nr, perc=True)
p_endurancei_lr 2.79%
p_strengthi_lr 2.64%
p_wateri_lr 2.44%
p_teami_lr 2.39%
p_backcountryi_lr 2.32%
p_racqueti_lr 2.31%
p_controli_lr 2.14%
dtype: object
3. Expected profit
# Calculate the actual average order size by message among buyers in the training data
ordersize = pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer_yes == 1)].groupby("message")["total_os"].mean()
ordersize
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/476348601.py:2: FutureWarning:
The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
message
backcountry 64.034091
control 49.900598
endurance 55.584893
racquet 56.405620
strength 56.708751
team 56.522449
water 61.957343
Name: total_os, dtype: float64
Create a linear regression model to predict order size.
reg = rsm.model.regress(
    data={"pentathlon_nptb": pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer == "yes")]},
    rvar="total_os",
    evar=evar,
    ivar=ivar
)
reg.summary()
Linear regression (OLS)
Data : pentathlon_nptb
Response variable : total_os
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Null hyp.: the effect of x on total_os is zero
Alt. hyp.: the effect of x on total_os is not zero
coefficient std.error t.value p.value
Intercept -9.616 8.976 -1.071 0.284
message[control] 17.412 12.953 1.344 0.179
message[endurance] 9.106 12.058 0.755 0.45
message[racquet] 4.179 12.822 0.326 0.745
message[strength] 1.414 12.601 0.112 0.911
message[team] 10.566 12.573 0.840 0.401
message[water] 6.659 12.905 0.516 0.606
age[30 to 44] 4.166 5.701 0.731 0.465
age[45 to 59] 6.059 5.712 1.061 0.289
age[>= 60] -0.679 7.066 -0.096 0.923
female[no] 3.578 3.548 1.008 0.313
age[30 to 44]:message[control] -6.037 8.297 -0.728 0.467
age[45 to 59]:message[control] -3.440 8.437 -0.408 0.683
age[>= 60]:message[control] 5.342 10.292 0.519 0.604
age[30 to 44]:message[endurance] -3.088 8.017 -0.385 0.7
age[45 to 59]:message[endurance] -12.434 8.132 -1.529 0.126
age[>= 60]:message[endurance] 6.071 9.755 0.622 0.534
age[30 to 44]:message[racquet] -1.337 8.176 -0.163 0.87
age[45 to 59]:message[racquet] -8.117 8.242 -0.985 0.325
age[>= 60]:message[racquet] -5.150 10.010 -0.514 0.607
age[30 to 44]:message[strength] -0.696 7.975 -0.087 0.93
age[45 to 59]:message[strength] 0.160 8.076 0.020 0.984
age[>= 60]:message[strength] 5.451 9.893 0.551 0.582
age[30 to 44]:message[team] 8.784 8.077 1.088 0.277
age[45 to 59]:message[team] 6.381 8.161 0.782 0.434
age[>= 60]:message[team] 9.798 10.020 0.978 0.328
age[30 to 44]:message[water] -7.117 8.285 -0.859 0.39
age[45 to 59]:message[water] -12.193 8.378 -1.455 0.146
age[>= 60]:message[water] 5.907 10.209 0.579 0.563
female[no]:message[control] -4.082 5.058 -0.807 0.42
female[no]:message[endurance] -3.209 4.877 -0.658 0.511
female[no]:message[racquet] -2.000 5.096 -0.392 0.695
female[no]:message[strength] -3.970 4.920 -0.807 0.42
female[no]:message[team] -8.944 4.904 -1.824 0.068 .
female[no]:message[water] 1.978 4.972 0.398 0.691
income 0.000 0.000 3.970 < .001 ***
income:message[control] -0.000 0.000 -1.166 0.244
income:message[endurance] -0.000 0.000 -0.710 0.478
income:message[racquet] 0.000 0.000 0.702 0.483
income:message[strength] -0.000 0.000 -1.205 0.228
income:message[team] -0.000 0.000 -1.392 0.164
income:message[water] -0.000 0.000 -1.353 0.176
education 0.773 0.161 4.812 < .001 ***
education:message[control] -0.370 0.230 -1.613 0.107
education:message[endurance] -0.238 0.218 -1.090 0.276
education:message[racquet] -0.292 0.228 -1.283 0.199
education:message[strength] 0.005 0.221 0.025 0.98
education:message[team] -0.337 0.224 -1.505 0.132
education:message[water] -0.088 0.226 -0.390 0.696
children 7.307 4.406 1.659 0.097 .
children:message[control] 4.666 6.217 0.750 0.453
children:message[endurance] 1.572 5.994 0.262 0.793
children:message[racquet] 5.236 6.255 0.837 0.403
children:message[strength] -0.699 6.269 -0.112 0.911
children:message[team] 4.814 6.233 0.772 0.44
children:message[water] 11.040 6.482 1.703 0.089 .
freq_endurance -1.400 0.698 -2.004 0.045 *
freq_endurance:message[control] 1.082 0.995 1.088 0.277
freq_endurance:message[endurance] -0.138 0.974 -0.142 0.887
freq_endurance:message[racquet] -0.927 1.005 -0.922 0.356
freq_endurance:message[strength] 0.160 0.981 0.163 0.871
freq_endurance:message[team] 1.424 0.963 1.479 0.139
freq_endurance:message[water] -0.298 0.968 -0.308 0.758
freq_strength -2.181 0.437 -4.988 < .001 ***
freq_strength:message[control] 0.352 0.626 0.563 0.573
freq_strength:message[endurance] 0.286 0.607 0.471 0.638
freq_strength:message[racquet] 0.029 0.635 0.045 0.964
freq_strength:message[strength] 1.337 0.614 2.179 0.029 *
freq_strength:message[team] 0.807 0.615 1.312 0.19
freq_strength:message[water] 1.055 0.614 1.718 0.086 .
freq_water 2.979 1.679 1.774 0.076 .
freq_water:message[control] 0.263 2.417 0.109 0.913
freq_water:message[endurance] -1.171 2.375 -0.493 0.622
freq_water:message[racquet] -2.418 2.392 -1.011 0.312
freq_water:message[strength] 0.200 2.369 0.084 0.933
freq_water:message[team] -1.236 2.347 -0.527 0.598
freq_water:message[water] -3.641 2.368 -1.538 0.124
freq_team -0.075 0.599 -0.125 0.901
freq_team:message[control] -0.229 0.847 -0.270 0.787
freq_team:message[endurance] 2.188 0.829 2.638 0.008 **
freq_team:message[racquet] -0.563 0.861 -0.654 0.513
freq_team:message[strength] -0.657 0.850 -0.774 0.439
freq_team:message[team] 0.320 0.829 0.386 0.7
freq_team:message[water] -0.932 0.836 -1.115 0.265
freq_backcountry 2.686 1.320 2.035 0.042 *
freq_backcountry:message[control] -0.962 1.902 -0.506 0.613
freq_backcountry:message[endurance] 2.745 1.848 1.486 0.137
freq_backcountry:message[racquet] -0.198 1.887 -0.105 0.916
freq_backcountry:message[strength] -0.097 1.864 -0.052 0.958
freq_backcountry:message[team] -0.529 1.858 -0.284 0.776
freq_backcountry:message[water] 1.071 1.862 0.575 0.565
freq_racquet 0.297 0.868 0.343 0.732
freq_racquet:message[control] -0.775 1.246 -0.622 0.534
freq_racquet:message[endurance] -0.021 1.186 -0.018 0.986
freq_racquet:message[racquet] 0.944 1.237 0.764 0.445
freq_racquet:message[strength] 1.148 1.200 0.957 0.339
freq_racquet:message[team] 0.199 1.207 0.165 0.869
freq_racquet:message[water] 1.503 1.214 1.238 0.216
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.089, Adjusted R-squared: 0.08
F-statistic: 10.021 df(97, 9982), p.value < 0.001
Nr obs: 10,080
reg.coef.round(3)
index | coefficient | std.error | t.value | p.value | ||
---|---|---|---|---|---|---|
0 | Intercept | -9.616 | 8.976 | -1.071 | 0.284 | |
1 | message[T.control] | 17.412 | 12.953 | 1.344 | 0.179 | |
2 | message[T.endurance] | 9.106 | 12.058 | 0.755 | 0.450 | |
3 | message[T.racquet] | 4.179 | 12.822 | 0.326 | 0.745 | |
4 | message[T.strength] | 1.414 | 12.601 | 0.112 | 0.911 | |
... | ... | ... | ... | ... | ... | ... |
93 | freq_racquet:message[T.endurance] | -0.021 | 1.186 | -0.018 | 0.986 | |
94 | freq_racquet:message[T.racquet] | 0.944 | 1.237 | 0.764 | 0.445 | |
95 | freq_racquet:message[T.strength] | 1.148 | 1.200 | 0.957 | 0.339 | |
96 | freq_racquet:message[T.team] | 0.199 | 1.207 | 0.165 | 0.869 | |
97 | freq_racquet:message[T.water] | 1.503 | 1.214 | 1.238 | 0.216 |
98 rows × 6 columns
"pos_control_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["pos_racquet_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["pos_team_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["pos_backcountry_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["pos_water_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["pos_strength_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["pos_endurance_reg"] = reg.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"] pentathlon_nptb[
Calculate the expected profit for each product
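The calculation below combines the two models. For each message m,

expected profit(m) = Pr(buy | message = m) × E(order size | buy, message = m) × 0.4

where 0.4 is the profit-margin multiplier applied in the code that follows.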
"ep_control_lr"] = pentathlon_nptb["p_control_lr"] * pentathlon_nptb["pos_control_reg"] * 0.4
pentathlon_nptb["ep_racquet_lr"] = pentathlon_nptb["p_racquet_lr"] * pentathlon_nptb["pos_racquet_reg"] * 0.4
pentathlon_nptb["ep_team_lr"] = pentathlon_nptb["p_team_lr"] * pentathlon_nptb["pos_team_reg"] * 0.4
pentathlon_nptb["ep_backcountry_lr"] = pentathlon_nptb["p_backcountry_lr"] * pentathlon_nptb["pos_backcountry_reg"] * 0.4
pentathlon_nptb["ep_water_lr"] = pentathlon_nptb["p_water_lr"] * pentathlon_nptb["pos_water_reg"] * 0.4
pentathlon_nptb["ep_strength_lr"] = pentathlon_nptb["p_strength_lr"] * pentathlon_nptb["pos_strength_reg"] * 0.4
pentathlon_nptb["ep_endurance_lr"] = pentathlon_nptb["p_endurance_lr"] * pentathlon_nptb["pos_endurance_reg"] * 0.4 pentathlon_nptb[
To determine the message that maximizes the expected profit for each customer, we can use the following code:
expected_profit_lr = [
    "ep_control_lr",
    "ep_endurance_lr",
    "ep_backcountry_lr",
    "ep_racquet_lr",
    "ep_strength_lr",
    "ep_team_lr",
    "ep_water_lr"]
repl = {"ep_control_lr": "control", "ep_endurance_lr": "endurance", "ep_backcountry_lr": "backcountry", "ep_racquet_lr": "racquet", "ep_strength_lr": "strength", "ep_team_lr": "team", "ep_water_lr": "water"}
pentathlon_nptb["ep_message_lr"] = (
    pentathlon_nptb[expected_profit_lr]
    .idxmax(axis=1)
    .map(repl)
)
To predict the order size, we use total_os as the response variable, with the other variables as predictors as well as the interactions between message and those variables. We then extend the predictions to the full database and see how they are distributed.
As with the previous model, we use data_cmd to predict the order size under each message.
After that, we calculate the expected profit for each product by multiplying the probability of purchase and the predicted order size for that product. We then choose the product with the highest expected profit using idxmax, which labels each customer with the category that has the maximum expected profit. Finally, we create ep_max to store the maximum expected profit for the best message selected for each customer.
4. Report, for each message (endurance, racquet, etc.) and the no-message condition, the percentage of customers for whom that (no) message maximizes their expected profit.
pentathlon_nptb.ep_message_lr.value_counts(normalize=True)
ep_message_lr
water 0.340890
endurance 0.306053
team 0.206722
backcountry 0.073778
strength 0.066898
racquet 0.004322
control 0.001337
Name: proportion, dtype: float64
5. What expected profit can we obtain, on average, per customer if we customize the message to each customer?
"ep_max"] = pentathlon_nptb[expected_profit_lr].max(axis=1)
pentathlon_nptb["ep_max"].mean() pentathlon_nptb[
0.6754439327865267
Let's create the crosstab of the profit-maximizing message in the test data.
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].ep_message_lr, columns="count").apply(rsm.format_nr)
col_0 | count |
---|---|
ep_message_lr | |
backcountry | 13,379 |
control | 238 |
endurance | 54,956 |
racquet | 740 |
strength | 12,092 |
team | 37,539 |
water | 61,056 |
6. Expected profit per e-mailed customer if every customer receives the same message
(
    pentathlon_nptb
    .loc[pentathlon_nptb.training == 0, ["ep_control_lr", "ep_endurance_lr", "ep_backcountry_lr", "ep_racquet_lr", "ep_strength_lr", "ep_team_lr", "ep_water_lr"]]
    .agg("mean")
    .sort_values(ascending=False)
    .apply(rsm.format_nr, sym="$", dec=2)
)
ep_endurance_lr $0.62
ep_water_lr $0.6
ep_strength_lr $0.6
ep_backcountry_lr $0.59
ep_team_lr $0.54
ep_racquet_lr $0.53
ep_control_lr $0.43
dtype: object
7. Expected profit per e-mailed customer if every customer is assigned randomly to one of the messages or the no-message condition?
# probability of purchase when the customer is assigned to a random message
pentathlon_nptb["p_random_lr"] = lr_int.predict(pentathlon_nptb)["prediction"]

# expected average order size when the customer is assigned to a random message
pentathlon_nptb["ordersize_random_reg"] = reg.predict(pentathlon_nptb)["prediction"]

# expected profit when the customer is assigned to a random message
pentathlon_nptb["ep_random_lr"] = pentathlon_nptb["p_random_lr"] * pentathlon_nptb["ordersize_random_reg"] * 0.4

# expected profit per customer when the customer is assigned to a random message
random_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_random_lr"].mean()
random_profit_per_customer
0.5569904087502335
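An equivalent sketch of the random-assignment benchmark (assuming each of the seven conditions is equally likely) averages the seven expected-profit columns per customer; because messages in the test data were assigned essentially at random, the result should be close to the value above:

(
    pentathlon_nptb
    .loc[pentathlon_nptb.training == 0, expected_profit_lr]
    .mean(axis=1)   # average expected profit over the 7 conditions for each customer
    .mean()         # then average over customers
)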
# expected profit where no-message is sent (control)
"ep_control_lr"]
pentathlon_nptb[
# expected profit per customer where no-message is sent (control)
= pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_control_lr"].mean()
control_profit_per_customer control_profit_per_customer
0.4339519244630221
8. Profit for 5,000,000 customers
# Profit where each customer is assigned to the message with the highest expected profit
profit_logit = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_max"].agg("mean") * 5000000
profit_logit
3408166.566437483
# Profit where each customer is sent a random message
random_profit = random_profit_per_customer * 5000000
random_profit
2784952.0437511676
# Profit where no message is sent
control_profit = control_profit_per_customer * 5000000
control_profit
2169759.6223151106
profit_improvement_lr = profit_logit - control_profit
profit_improvement_lr
1238406.9441223722
profits_dct = {
    "Customize Message": profit_logit,
    "Randomly Assign": random_profit,
    "No Message Sent": control_profit,
}
import seaborn as sns
import matplotlib.pyplot as plt
# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct.items()), columns=['Model', 'Profit'])
plt.figure(figsize=(10, 5))  # Adjust the width and height to your preference

# Plot
sns.set(style="white")
ax = sns.barplot(x="Model", y="Profit", data=df, palette="magma")

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45)  # Rotate x labels for better readability
plt.show()
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/1123293812.py:16: FutureWarning:
Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
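As an aside, the FutureWarning above can be silenced by following the suggestion in the warning itself; a minimal sketch of the updated call (seaborn >= 0.13), using the same df built above:

# Same barplot, but with `x` also mapped to `hue` and the redundant legend disabled
ax = sns.barplot(x="Model", y="Profit", hue="Model", data=df, palette="magma", legend=False)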
Neural Networks Model
"training"] == 1] pentathlon_nptb[pentathlon_nptb[
custid | buyer | total_os | message | age | female | income | education | children | freq_endurance | ... | ep_team_lr | ep_backcountry_lr | ep_water_lr | ep_strength_lr | ep_endurance_lr | ep_message_lr | ep_max | p_random_lr | ordersize_random_reg | ep_random_lr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U1 | no | 0 | team | 30 to 44 | no | 55000 | 19 | 0.8 | 0 | ... | 0.170065 | 0.152448 | 0.165323 | 0.127039 | 0.216326 | endurance | 0.216326 | 0.012008 | 32.627819 | 0.156712 |
2 | U13 | no | 0 | endurance | 45 to 59 | yes | 45000 | 33 | 0.7 | 0 | ... | 0.197624 | 0.190290 | 0.186217 | 0.204221 | 0.194732 | strength | 0.204221 | 0.013884 | 40.744326 | 0.226274 |
3 | U20 | no | 0 | water | 45 to 59 | yes | 25000 | 24 | 0.2 | 0 | ... | 0.028568 | 0.022970 | 0.014892 | 0.024216 | 0.016298 | team | 0.028568 | 0.002389 | 15.768889 | 0.015070 |
5 | U28 | no | 0 | strength | < 30 | yes | 25000 | 18 | 0.3 | 0 | ... | 0.006081 | 0.005424 | 0.007153 | 0.005244 | 0.008038 | endurance | 0.008038 | 0.000929 | 13.078011 | 0.004860 |
8 | U59 | no | 0 | strength | >= 60 | yes | 65000 | 36 | 1.2 | 1 | ... | 0.244028 | 0.211679 | 0.291270 | 0.243492 | 0.301579 | endurance | 0.301579 | 0.011213 | 47.059571 | 0.211075 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
599989 | U3462835 | no | 0 | control | 30 to 44 | yes | 45000 | 20 | 0.8 | 1 | ... | 0.081865 | 0.060368 | 0.063264 | 0.057766 | 0.070770 | team | 0.081865 | 0.004871 | 32.245840 | 0.062828 |
599992 | U3462858 | no | 0 | control | 45 to 59 | yes | 40000 | 16 | 1.1 | 0 | ... | 0.056256 | 0.041872 | 0.041478 | 0.040307 | 0.037076 | team | 0.056256 | 0.003052 | 38.458959 | 0.046946 |
599993 | U3462877 | no | 0 | backcountry | < 30 | no | 65000 | 30 | 1.0 | 2 | ... | 0.176618 | 0.232750 | 0.283269 | 0.223465 | 0.266163 | water | 0.283269 | 0.014259 | 45.079704 | 0.257119 |
599995 | U3462888 | no | 0 | water | >= 60 | yes | 40000 | 26 | 0.6 | 0 | ... | 0.030821 | 0.023935 | 0.033779 | 0.028076 | 0.036240 | endurance | 0.036240 | 0.002042 | 38.698478 | 0.031609 |
599997 | U3462902 | no | 0 | team | < 30 | yes | 55000 | 32 | 0.9 | 0 | ... | 0.119600 | 0.121016 | 0.170371 | 0.139849 | 0.178244 | endurance | 0.178244 | 0.008225 | 32.707711 | 0.107603 |
420000 rows × 55 columns
NN = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar='buyer_yes',
    evar=evar,
    hidden_layer_sizes=(1,),  # Simple NN with 1 hidden layer
    mod_type='classification'
)
NN.summary()
Multi-layer Perceptron (NN)
Data : pentathlon_nptb
Response variable : buyer_yes
Level : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type : classification
Nr. of features : (12, 19)
Nr. of observations : 420,000
Hidden_layer_sizes : (1,)
Activation function : tanh
Solver : lbfgs
Alpha : 0.0001
Batch size : auto
Learning rate : 0.001
Maximum iterations : 10000
random_state : 1234
AUC : 0.884
Raw data :
message age female income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet
team 30 to 44 no 55000 19 0.8 0 4 0 4 0 1
endurance 45 to 59 yes 45000 33 0.7 0 0 0 0 2 2
water 45 to 59 yes 25000 24 0.2 0 0 0 0 0 0
strength < 30 yes 25000 18 0.3 0 0 0 0 0 0
strength >= 60 yes 65000 36 1.2 1 1 0 2 0 3
Estimation data :
income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet message_control message_endurance message_racquet message_strength message_team message_water age_30 to 44 age_45 to 59 age_>= 60 female_no
0.388655 -0.663713 -0.265321 -0.640336 1.100861 -0.261727 1.892088 -0.690593 0.058975 False False False False True False True False False True
-0.184052 0.279378 -0.479844 -0.640336 -0.706563 -0.261727 -0.620020 1.958233 0.882489 False True False False False False False True False False
-1.329467 -0.326895 -1.552460 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False False False True False True False False
-1.329467 -0.731077 -1.337937 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False True False False False False False False
0.961362 0.481469 0.592772 0.060592 -0.254707 -0.261727 0.636034 -0.690593 1.706002 False False False True False False False False True False
NN100 = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar='buyer_yes',
    evar=evar,
    hidden_layer_sizes=(100,),  # more complex NN with 100 nodes in a single hidden layer
    mod_type='classification'
)
NN100.summary()
Multi-layer Perceptron (NN)
Data : pentathlon_nptb
Response variable : buyer_yes
Level : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type : classification
Nr. of features : (12, 19)
Nr. of observations : 420,000
Hidden_layer_sizes : (100,)
Activation function : tanh
Solver : lbfgs
Alpha : 0.0001
Batch size : auto
Learning rate : 0.001
Maximum iterations : 10000
random_state : 1234
AUC : 0.892
Raw data :
message age female income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet
team 30 to 44 no 55000 19 0.8 0 4 0 4 0 1
endurance 45 to 59 yes 45000 33 0.7 0 0 0 0 2 2
water 45 to 59 yes 25000 24 0.2 0 0 0 0 0 0
strength < 30 yes 25000 18 0.3 0 0 0 0 0 0
strength >= 60 yes 65000 36 1.2 1 1 0 2 0 3
Estimation data :
income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet message_control message_endurance message_racquet message_strength message_team message_water age_30 to 44 age_45 to 59 age_>= 60 female_no
0.388655 -0.663713 -0.265321 -0.640336 1.100861 -0.261727 1.892088 -0.690593 0.058975 False False False False True False True False False True
-0.184052 0.279378 -0.479844 -0.640336 -0.706563 -0.261727 -0.620020 1.958233 0.882489 False True False False False False False True False False
-1.329467 -0.326895 -1.552460 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False False False True False True False False
-1.329467 -0.731077 -1.337937 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False True False False False False False False
0.961362 0.481469 0.592772 0.060592 -0.254707 -0.261727 0.636034 -0.690593 1.706002 False False False True False False False False True False
Model Tuning
# NN_cv = rsm.load_state("./data/NN_cv.pkl")['NN_cv']
After fine-tuning the model in a separate notebook, we preserved its state with save_state; load_state then brings it back into this notebook. However, because of GitHub's file evaluation constraints, the automated tests do not pass when this code is run, so it is left commented out. To review the tuned model, please refer to the auxiliary notebooks, where the steps can be replicated with the code shown above.
# NN_cv.best_params_
# NN_cv.best_score_.round(3)
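For completeness, a minimal sketch of the tuning-and-persistence workflow used in the auxiliary notebook. The attribute names NN.fitted, NN.data_onehot, and NN.data are assumed by analogy with the random forest objects used later in this notebook, the parameter grid is only an example, and plain pickle stands in for rsm.save_state, whose exact signature may differ. Like the other tuning cells, it is left commented out so it does not execute here:

# Sketch only -- run in a separate notebook (assumed attribute names, example grid)
# from sklearn.model_selection import GridSearchCV
# import pickle
#
# param_grid = {"hidden_layer_sizes": [(1,), (3, 3), (5, 5)], "alpha": [0.0001, 0.001, 0.01]}
# NN_cv = GridSearchCV(NN.fitted, param_grid, scoring="roc_auc", cv=5, n_jobs=2, verbose=5)
# NN_cv.fit(NN.data_onehot, NN.data.buyer_yes)
#
# NN_cv.best_params_, NN_cv.best_score_.round(3)
# with open("./data/NN_cv.pkl", "wb") as f:
#     pickle.dump({"NN_cv": NN_cv}, f)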
NN_best = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar='buyer_yes',
    evar=evar,
    alpha=0.01,
    hidden_layer_sizes=(3, 3),
    mod_type='classification'
)
NN_best.summary()
Multi-layer Perceptron (NN)
Data : pentathlon_nptb
Response variable : buyer_yes
Level : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type : classification
Nr. of features : (12, 19)
Nr. of observations : 420,000
Hidden_layer_sizes : (3, 3)
Activation function : tanh
Solver : lbfgs
Alpha : 0.01
Batch size : auto
Learning rate : 0.001
Maximum iterations : 10000
random_state : 1234
AUC : 0.89
Raw data :
message age female income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet
team 30 to 44 no 55000 19 0.8 0 4 0 4 0 1
endurance 45 to 59 yes 45000 33 0.7 0 0 0 0 2 2
water 45 to 59 yes 25000 24 0.2 0 0 0 0 0 0
strength < 30 yes 25000 18 0.3 0 0 0 0 0 0
strength >= 60 yes 65000 36 1.2 1 1 0 2 0 3
Estimation data :
income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet message_control message_endurance message_racquet message_strength message_team message_water age_30 to 44 age_45 to 59 age_>= 60 female_no
0.388655 -0.663713 -0.265321 -0.640336 1.100861 -0.261727 1.892088 -0.690593 0.058975 False False False False True False True False False True
-0.184052 0.279378 -0.479844 -0.640336 -0.706563 -0.261727 -0.620020 1.958233 0.882489 False True False False False False False True False False
-1.329467 -0.326895 -1.552460 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False False False True False True False False
-1.329467 -0.731077 -1.337937 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False True False False False False False False
0.961362 0.481469 0.592772 0.060592 -0.254707 -0.261727 0.636034 -0.690593 1.706002 False False False True False False False False True False
pentathlon_nptb['pred_NN'] = NN_best.predict(pentathlon_nptb)['prediction']
= {"train" : pentathlon_nptb.query("training == 1"), "test" : pentathlon_nptb.query("training == 0")}
dct = rsm.gains_plot(dct, "buyer", "yes", "pred_NN") fig
from sklearn import metrics
# prediction on training set
pred = pentathlon_nptb.query("training == 1")['pred_NN']
actual = pentathlon_nptb.query("training == 1")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(5)
0.89011
# prediction on test set
pred = pentathlon_nptb.query("training == 0")['pred_NN']
actual = pentathlon_nptb.query("training == 0")['buyer_yes']
fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
metrics.auc(fpr, tpr).round(5)
0.88819
We can conclude that:
The AUC for the training set is 0.890, which indicates that the model has a very good ability to distinguish between the two classes in the training data.
The AUC for the test set is 0.888, essentially the same as the training AUC. This means the model generalizes well to new data and is not overfitting; the small decrease from training to test is negligible.
1. Create predictions
# Check for all unique message values in the dataset and count how many
"message"].value_counts() pentathlon_nptb[
message
control 87260
racquet 87088
team 86792
backcountry 86604
water 85120
strength 84280
endurance 82856
Name: count, dtype: int64
# Create predictions for different scenarios where the message variable is set to specific values
"p_control_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["p_racquet_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["p_team_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["p_backcountry_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["p_water_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["p_strength_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["p_endurance_nn"] = NN_best.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb[ pentathlon_nptb
custid | buyer | total_os | message | age | female | income | education | children | freq_endurance | ... | ordersize_random_reg | ep_random_lr | pred_NN | p_control_nn | p_racquet_nn | p_team_nn | p_backcountry_nn | p_water_nn | p_strength_nn | p_endurance_nn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U1 | no | 0 | team | 30 to 44 | no | 55000 | 19 | 0.8 | 0 | ... | 32.627819 | 0.156712 | 0.011532 | 0.009833 | 0.011503 | 0.011532 | 0.011059 | 0.010634 | 0.011800 | 0.011186 |
1 | U3 | no | 0 | backcountry | 45 to 59 | no | 35000 | 22 | 1.0 | 0 | ... | 37.181377 | 0.082633 | 0.002658 | 0.002535 | 0.002404 | 0.003184 | 0.002658 | 0.003101 | 0.003468 | 0.002342 |
2 | U13 | no | 0 | endurance | 45 to 59 | yes | 45000 | 33 | 0.7 | 0 | ... | 40.744326 | 0.226274 | 0.008840 | 0.006773 | 0.007218 | 0.008068 | 0.007635 | 0.008054 | 0.009127 | 0.008840 |
3 | U20 | no | 0 | water | 45 to 59 | yes | 25000 | 24 | 0.2 | 0 | ... | 15.768889 | 0.015070 | 0.001161 | 0.001065 | 0.001360 | 0.000875 | 0.001357 | 0.001161 | 0.001291 | 0.003342 |
4 | U25 | no | 0 | racquet | >= 60 | yes | 65000 | 32 | 1.1 | 1 | ... | 48.946607 | 0.235699 | 0.009574 | 0.007788 | 0.009574 | 0.008922 | 0.008792 | 0.008076 | 0.008884 | 0.008817 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
599995 | U3462888 | no | 0 | water | >= 60 | yes | 40000 | 26 | 0.6 | 0 | ... | 38.698478 | 0.031609 | 0.001644 | 0.001504 | 0.001932 | 0.001207 | 0.001928 | 0.001644 | 0.001835 | 0.004317 |
599996 | U3462900 | no | 0 | team | < 30 | no | 55000 | 32 | 0.9 | 3 | ... | 29.767369 | 0.091593 | 0.004586 | 0.003494 | 0.003388 | 0.004586 | 0.003605 | 0.004077 | 0.004503 | 0.002707 |
599997 | U3462902 | no | 0 | team | < 30 | yes | 55000 | 32 | 0.9 | 0 | ... | 32.707711 | 0.107603 | 0.002481 | 0.001935 | 0.002163 | 0.002481 | 0.001966 | 0.001951 | 0.002062 | 0.001856 |
599998 | U3462916 | no | 0 | team | < 30 | no | 50000 | 35 | 0.6 | 2 | ... | 25.117259 | 0.068280 | 0.003051 | 0.002224 | 0.002025 | 0.003051 | 0.002200 | 0.002624 | 0.002867 | 0.001594 |
599999 | U3462922 | no | 0 | endurance | 30 to 44 | yes | 50000 | 25 | 0.7 | 1 | ... | 29.102240 | 0.138029 | 0.002656 | 0.003538 | 0.003329 | 0.004762 | 0.003617 | 0.004233 | 0.004701 | 0.002656 |
600000 rows × 63 columns
repl = {
    "p_control_nn": "control",
    "p_endurance_nn": "endurance",
    "p_backcountry_nn": "backcountry",
    "p_racquet_nn": "racquet",
    "p_strength_nn": "strength",
    "p_team_nn": "team",
    "p_water_nn": "water"}

# extending the prediction to the full database to see the distribution
predictions_nn = [
    "p_control_nn",
    "p_endurance_nn",
    "p_backcountry_nn",
    "p_racquet_nn",
    "p_strength_nn",
    "p_team_nn",
    "p_water_nn"]

# pick the message with the highest predicted purchase probability for each customer
pentathlon_nptb["message_nn"] = pentathlon_nptb[predictions_nn].idxmax(axis=1)
pentathlon_nptb["message_nn"].value_counts()
message_nn
p_endurance_nn 321011
p_strength_nn 161291
p_team_nn 106081
p_racquet_nn 11617
Name: count, dtype: int64
# Find the maximum probability of purchase
"p_max_nn"] = pentathlon_nptb[predictions_nn].max(axis=1) pentathlon_nptb[
Approach: First, a neural network model was built to predict the next product to buy, using buyer_yes as the response variable and the remaining variables as predictors. The model was trained on the training data, used to predict purchase probabilities for the test data, and then extended to the full database. To predict the probability of purchasing under each message, we use the data_cmd argument to set the message variable to a specific value before predicting. Finally, we choose the message with the highest predicted probability using idxmax, which returns the label of the column with the maximum predicted purchase probability for each customer, and p_max_nn stores that maximum probability for the selected message. A compact loop version of the per-message predictions is sketched below.
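As a side note, the seven per-message prediction columns created above could equivalently be built in a loop over the message levels; a small sketch using the same NN_best model and data_cmd argument:

# Equivalent loop over the seven conditions instead of seven separate statements
messages = ["control", "racquet", "team", "backcountry", "water", "strength", "endurance"]
for m in messages:
    pentathlon_nptb[f"p_{m}_nn"] = NN_best.predict(
        pentathlon_nptb, data_cmd={"message": m}
    )["prediction"]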
2. Percentage of customers for whom each message (or no message) is best
# create a crosstab to see which message we should send to each customer
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].message_nn, columns="count").apply(rsm.format_nr)
col_0 | count |
---|---|
message_nn | |
p_endurance_nn | 96,065 |
p_racquet_nn | 3,473 |
p_strength_nn | 48,593 |
p_team_nn | 31,869 |
# percentage of customers for whom that message is the best message
"message_nn"].value_counts(normalize=True) pentathlon_nptb[
message_nn
p_endurance_nn 0.535018
p_strength_nn 0.268818
p_team_nn 0.176802
p_racquet_nn 0.019362
Name: proportion, dtype: float64
endurance: the distribution suggests that endurance-related messages resonate with far more customers than any other message; endurance is the best choice for roughly 54% of customers. strength and team follow at about 27% and 18% respectively, so there may be opportunities to further optimize or tailor those messages to increase their effectiveness. racquet is the best message for only about 2% of customers, and backcountry, water, and control never have the highest predicted purchase probability for any customer.
# average purchase probability if we send the message for each product to everyone
pentathlon_nptb.loc[pentathlon_nptb.training == 0, predictions_nn].agg("mean").sort_values(
    ascending=False).apply(rsm.format_nr, perc=True)
p_endurance_nn 2.75%
p_strength_nn 2.64%
p_water_nn 2.42%
p_team_nn 2.37%
p_backcountry_nn 2.33%
p_racquet_nn 2.29%
p_control_nn 2.14%
dtype: object
3. Expected profit
# Calculate the avg order size/message
ordersize = pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer_yes == 1)].groupby("message")["total_os"].mean()
ordersize
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/3828228129.py:2: FutureWarning:
The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
message
backcountry 64.034091
control 49.900598
endurance 55.584893
racquet 56.405620
strength 56.708751
team 56.522449
water 61.957343
Name: total_os, dtype: float64
NN_os = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar="total_os",
    evar=evar,
    hidden_layer_sizes=(1,),  # Simple NN with 1 hidden layer
    mod_type='regression'
)
NN_os.summary()
Multi-layer Perceptron (NN)
Data : pentathlon_nptb
Response variable : total_os
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type : regression
Nr. of features : (12, 19)
Nr. of observations : 420,000
Hidden_layer_sizes : (1,)
Activation function : tanh
Solver : lbfgs
Alpha : 0.0001
Batch size : auto
Learning rate : 0.001
Maximum iterations : 10000
random_state : 1234
Model fit :
n r2 mse mae
420000 0.109 0.891 0.179
Raw data :
message age female income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet
team 30 to 44 no 55000 19 0.8 0 4 0 4 0 1
endurance 45 to 59 yes 45000 33 0.7 0 0 0 0 2 2
water 45 to 59 yes 25000 24 0.2 0 0 0 0 0 0
strength < 30 yes 25000 18 0.3 0 0 0 0 0 0
strength >= 60 yes 65000 36 1.2 1 1 0 2 0 3
Estimation data :
income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet message_control message_endurance message_racquet message_strength message_team message_water age_30 to 44 age_45 to 59 age_>= 60 female_no
0.388655 -0.663713 -0.265321 -0.640336 1.100861 -0.261727 1.892088 -0.690593 0.058975 False False False False True False True False False True
-0.184052 0.279378 -0.479844 -0.640336 -0.706563 -0.261727 -0.620020 1.958233 0.882489 False True False False False False False True False False
-1.329467 -0.326895 -1.552460 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False False False True False True False False
-1.329467 -0.731077 -1.337937 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False True False False False False False False
0.961362 0.481469 0.592772 0.060592 -0.254707 -0.261727 0.636034 -0.690593 1.706002 False False False True False False False False True False
Model Tuning
# NN_os = rsm.load_state("./data/NN_os.pkl")['NN_os']
Again, after fine-tuning the model in a separate notebook, we preserved its state with save_state; load_state then brings it back into this notebook. Because of GitHub's file evaluation constraints, the automated tests do not pass when this code is run, so it is left commented out. To review the tuned model, please refer to the auxiliary notebooks, where the steps can be replicated with the code shown above.
# NN_os.best_params_
# NN_cv.best_score_.round(3)
NN_os_best = rsm.model.mlp(
    data={"pentathlon_nptb": pentathlon_nptb[pentathlon_nptb["training"] == 1]},
    rvar="total_os",
    evar=evar,
    alpha=0.0001,
    hidden_layer_sizes=(3, 3),
    mod_type='regression'
)
NN_os_best.summary()
Multi-layer Perceptron (NN)
Data : pentathlon_nptb
Response variable : total_os
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
Model type : regression
Nr. of features : (12, 19)
Nr. of observations : 420,000
Hidden_layer_sizes : (3, 3)
Activation function : tanh
Solver : lbfgs
Alpha : 0.0001
Batch size : auto
Learning rate : 0.001
Maximum iterations : 10000
random_state : 1234
Model fit :
n r2 mse mae
420000 0.114 0.886 0.175
Raw data :
message age female income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet
team 30 to 44 no 55000 19 0.8 0 4 0 4 0 1
endurance 45 to 59 yes 45000 33 0.7 0 0 0 0 2 2
water 45 to 59 yes 25000 24 0.2 0 0 0 0 0 0
strength < 30 yes 25000 18 0.3 0 0 0 0 0 0
strength >= 60 yes 65000 36 1.2 1 1 0 2 0 3
Estimation data :
income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet message_control message_endurance message_racquet message_strength message_team message_water age_30 to 44 age_45 to 59 age_>= 60 female_no
0.388655 -0.663713 -0.265321 -0.640336 1.100861 -0.261727 1.892088 -0.690593 0.058975 False False False False True False True False False True
-0.184052 0.279378 -0.479844 -0.640336 -0.706563 -0.261727 -0.620020 1.958233 0.882489 False True False False False False False True False False
-1.329467 -0.326895 -1.552460 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False False False True False True False False
-1.329467 -0.731077 -1.337937 -0.640336 -0.706563 -0.261727 -0.620020 -0.690593 -0.764538 False False False True False False False False False False
0.961362 0.481469 0.592772 0.060592 -0.254707 -0.261727 0.636034 -0.690593 1.706002 False False False True False False False False True False
"pos_control_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb["pos_racquet_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb["pos_team_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb["pos_backcountry_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb["pos_water_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb["pos_strength_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb["pos_endurance_nn"] = NN_os_best.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"] pentathlon_nptb[
"ep_control_nn"] = pentathlon_nptb["p_control_nn"] * pentathlon_nptb["pos_control_reg"] * 0.4
pentathlon_nptb["ep_racquet_nn"] = pentathlon_nptb["p_racquet_nn"] * pentathlon_nptb["pos_racquet_reg"] * 0.4
pentathlon_nptb["ep_team_nn"] = pentathlon_nptb["p_team_nn"] * pentathlon_nptb["pos_team_reg"] * 0.4
pentathlon_nptb["ep_backcountry_nn"] = pentathlon_nptb["p_backcountry_nn"] * pentathlon_nptb["pos_backcountry_reg"] * 0.4
pentathlon_nptb["ep_water_nn"] = pentathlon_nptb["p_water_nn"] * pentathlon_nptb["pos_water_reg"] * 0.4
pentathlon_nptb["ep_strength_nn"] = pentathlon_nptb["p_strength_nn"] * pentathlon_nptb["pos_strength_reg"] * 0.4
pentathlon_nptb["ep_endurance_nn"] = pentathlon_nptb["p_endurance_nn"] * pentathlon_nptb["pos_endurance_reg"] * 0.4 pentathlon_nptb[
expected_profit_nn = [
    "ep_control_nn",
    "ep_endurance_nn",
    "ep_backcountry_nn",
    "ep_racquet_nn",
    "ep_strength_nn",
    "ep_team_nn",
    "ep_water_nn"]

repl = {"ep_control_nn": "control", "ep_endurance_nn": "endurance", "ep_backcountry_nn": "backcountry", "ep_racquet_nn": "racquet", "ep_strength_nn": "strength", "ep_team_nn": "team", "ep_water_nn": "water"}

pentathlon_nptb["ep_message_nn"] = (
    pentathlon_nptb[expected_profit_nn]
    .idxmax(axis=1)
    .map(repl)
)
4. Percentage of customers for whom each message maximizes expected profit
pentathlon_nptb.ep_message_nn.value_counts(normalize=True)
ep_message_nn
endurance 0.369478
team 0.245130
water 0.221107
strength 0.110613
backcountry 0.024117
racquet 0.019118
control 0.010437
Name: proportion, dtype: float64
The message that maximizes expected profit for the largest share of customers is endurance, at roughly 37%. team and water follow at about 25% and 22%, with strength at around 11%. backcountry, racquet, and control are the profit-maximizing choice for only a small fraction of customers.
5. Expected profit/customer
"ep_max_nn"] = pentathlon_nptb[expected_profit_nn].max(axis=1)
pentathlon_nptb["ep_max_nn"].mean() pentathlon_nptb[
0.6816084703184525
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].ep_message_nn, columns="count").apply(rsm.format_nr)
col_0 | count |
---|---|
ep_message_nn | |
backcountry | 4,371 |
control | 1,914 |
endurance | 66,071 |
racquet | 3,429 |
strength | 19,986 |
team | 44,334 |
water | 39,895 |
6. Expected profit per e-mailed customer if every customer receives the same message
(
    pentathlon_nptb
    .loc[pentathlon_nptb.training == 0, expected_profit_nn]
    .agg("mean")
    .sort_values(ascending=False)
    .apply(rsm.format_nr, sym="$", dec=2)
)
ep_endurance_nn $0.61
ep_strength_nn $0.6
ep_water_nn $0.6
ep_backcountry_nn $0.59
ep_team_nn $0.53
ep_racquet_nn $0.52
ep_control_nn $0.43
dtype: object
Unsurprisingly, sending no message yields the lowest expected profit per customer ($0.43). endurance maximizes expected profit per e-mailed customer at about $0.61, with strength and water close behind at roughly $0.60 and backcountry at $0.59. team and racquet yield somewhat less ($0.53 and $0.52), though racquet still beats sending no message.
7. Expected profit per e-mailed customer if every customer is assigned randomly to one of the message options
# probability of purchase if the customer is assigned to a random message
pentathlon_nptb["p_random_nn"] = NN_best.predict(pentathlon_nptb)["prediction"]

# expected avg order size if the customer is assigned to a random message
pentathlon_nptb["ordersize_random_nn"] = NN_os_best.predict(pentathlon_nptb)["prediction"]

# expected profit if the customer is assigned to a random message
pentathlon_nptb["ep_random_nn"] = pentathlon_nptb["p_random_nn"] * pentathlon_nptb["ordersize_random_nn"] * 0.4

# expected profit per customer if the customer is assigned to a random message
random_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_random_nn"].mean()
random_profit_per_customer
0.1078448520217575
# expected profit where no message is sent (control)
pentathlon_nptb["ep_control_nn"]

# expected profit per customer where no message is sent (control)
control_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_control_nn"].mean()
control_profit_per_customer
0.4335109104691227
8. Expected profit for 5,000,000 customers
# Profit where each customer is assigned to the message with the highest expected profit
profit_nn = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_max_nn"].agg("mean") * 5000000
profit_nn
3430076.6444671643
# Profit where each customer is sent a random message
random_profit_nn = random_profit_per_customer * 5000000
random_profit_nn
539224.2601087876
# Profit where no message is sent
control_profit_nn = control_profit_per_customer * 5000000
control_profit_nn
2167554.5523456135
profit_improvement_nn = profit_nn - control_profit_nn
profit_improvement_nn
1262522.0921215508
profits_dct = {
    "Customize Message": profit_nn,
    "Randomly Assign": random_profit_nn,
    "No Message Sent": control_profit_nn,
}
import seaborn as sns
import matplotlib.pyplot as plt
# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct.items()), columns=['Model', 'Profit'])

plt.figure(figsize=(10, 5))

# Plot
sns.set(style="whitegrid")
ax = sns.barplot(x="Model", y="Profit", data=df, palette="viridis")

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45)
plt.show()
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/3855792413.py:16: FutureWarning:
Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
Random Forest Model
Train a Random Forest Model to predict the next product to buy
rf = rsm.model.rforest(
    data={'NPTB': pentathlon_nptb.query("training == 1")},
    rvar='buyer',
    lev='yes',
    evar=evar,
)
rf.summary()
Random Forest
Data : NPTB
Response variable : buyer
Level : yes
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
OOB : True
Model type : classification
Nr. of features : (12, 21)
Nr. of observations : 420,000
max_features : sqrt (4)
n_estimators : 100
min_samples_leaf : 1
random_state : 1234
AUC : 0.825
Estimation data :
income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet message_backcountry message_control message_endurance message_racquet message_strength message_team message_water age_< 30 age_30 to 44 age_45 to 59 age_>= 60 female_no
55000 19 0.8 0 4 0 4 0 1 False False False False False True False False True False False True
45000 33 0.7 0 0 0 0 2 2 False False True False False False False False False True False False
25000 24 0.2 0 0 0 0 0 0 False False False False False False True False False True False False
25000 18 0.3 0 0 0 0 0 0 False False False False True False False True False False False False
65000 36 1.2 1 1 0 2 0 3 False False False False True False False False False False True False
We also want to use a random forest model for the next-product-to-buy problem and see whether it can accurately predict which product a customer would purchase given the message we send. Above, we train the base random forest model; below, we use it as the starting point for hyperparameter tuning.
One key feature of this output is the AUC. The base model achieves an (out-of-bag) AUC of 0.825, a decent value: an AUC of 0.5 corresponds to random guessing, so values well above that indicate real discriminating power.
Random Forest Model Tuning
#max_features = [None, 'auto', 'sqrt', 'log2', 0.25, 0.5, 0.75, 1.0]
#n_estimators = [10, 50, 100, 200, 500, 1000]
#param_grid = {"max_features": max_features, "n_estimators": n_estimators}
#scoring = {"AUC": "roc_auc"}
#rf_cv = GridSearchCV(rf.fitted, param_grid, scoring=scoring, cv=5, n_jobs=2, refit="AUC", verbose=5)
Below is the code to fit the grid search over this very large parameter matrix. We would not recommend running it, as it takes a very long time; it is included here, commented out, for reference.
#rf_cv.fit(rf.data_onehot, rf.data.buyer_yes)
#rf_cv.fit(rf_treatment.data_onehot, rf_treatment.data.buyer_yes)
Above, we set up a grid search to identify the best hyperparameters for our random forest model. However, running a grid search this large created a computational cost problem: it either ran for over 4 hours or crashed the kernel. Therefore, we conducted multiple step-wise grid searches, limiting the size of the grid each time and iterating through combinations to see which parameters produced the most robust model. Below, you can see the various rounds of model tuning.
Random Forest Model Tuning - Stepwise, limiting the grid search
#max_features1 = ['sqrt']
#n_estimators1 = [1000, 2000]
#param_grid = {"max_features": max_features1, "n_estimators": n_estimators1}
#scoring = {"AUC": "roc_auc"}
#rf_cv_treatment1 = GridSearchCV(rf_treatment.fitted, param_grid, scoring=scoring, cv=4, n_jobs=2, refit="AUC", verbose=5)
Because our random forest model took over 4 hours to run previously, we decided to limit the grid search to 9 iterations. Our goal is to create a balance between exploring the hyperparameter space and keeping the computational cost reasonable.
#rf_cv_treatment1.fit(rf_treatment.data_onehot, rf_treatment.data.buyer_yes)
#rf_cv_treatment1.best_params_
As seen above, the grid search identified the best parameters as follows:
- max_features: 'sqrt'
- n_estimators: 1000
When we run our random forest model with the best parameters identified by the grid search, the (out-of-bag) AUC is approximately 0.864.
rf_tuned = rsm.model.rforest(
    data={'NPTB': pentathlon_nptb.query("training == 1")},
    rvar='buyer',
    lev='yes',
    evar=evar,
    n_estimators=1000,
    max_features='sqrt'
)
rf_tuned.summary()
Random Forest
Data : NPTB
Response variable : buyer
Level : yes
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
OOB : True
Model type : classification
Nr. of features : (12, 21)
Nr. of observations : 420,000
max_features : sqrt (4)
n_estimators : 1000
min_samples_leaf : 1
random_state : 1234
AUC : 0.864
Estimation data :
income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet message_backcountry message_control message_endurance message_racquet message_strength message_team message_water age_< 30 age_30 to 44 age_45 to 59 age_>= 60 female_no
55000 19 0.8 0 4 0 4 0 1 False False False False False True False False True False False True
45000 33 0.7 0 0 0 0 2 2 False False True False False False False False False True False False
25000 24 0.2 0 0 0 0 0 0 False False False False False False True False False True False False
25000 18 0.3 0 0 0 0 0 0 False False False False True False False True False False False False
65000 36 1.2 1 1 0 2 0 3 False False False False True False False False False False True False
pentathlon_nptb['pred_rf'] = rf_tuned.predict(pentathlon_nptb)['prediction']
pentathlon_nptb['pred_rf'].mean()
0.025235874483417724
= {"train_rf": pentathlon_nptb.query("training == 1"), "test_rf": pentathlon_nptb.query("training == 0")}
dct_rf = rsm.gains_plot(dct_rf, "buyer", "yes", "pred_rf") fig
The gains chart above is a visual counterpart to the AUC score: it shows the percentage of all positive cases (buyers) captured as we target an increasing share of the population, ranked by the model's scores. The steep rise at the beginning of the curve indicates that the model ranks buyers higher than non-buyers; the steeper the initial part of the curve, the better the model is at identifying buyers early when targeting is based on the model's scores.
In this specific chart, the model performs fairly well: the initial sections of both curves are quite steep, indicating that a large percentage of buyers can be captured by targeting a relatively small percentage of the population.
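To make explicit what the gains chart plots, here is a small sketch that computes the cumulative gains by hand for the test set using the pred_rf column created above (rsm.gains_plot does this for us; the snippet is only for illustration):

import numpy as np

# Sort test-set customers by predicted purchase probability (highest first),
# then track the cumulative share of actual buyers captured
test = pentathlon_nptb.query("training == 0").sort_values("pred_rf", ascending=False)
cum_buyers = (test["buyer"] == "yes").cumsum() / (test["buyer"] == "yes").sum()
cum_customers = np.arange(1, len(test) + 1) / len(test)

# e.g., share of all buyers captured when targeting the top 20% of customers
cum_buyers[cum_customers <= 0.2].max()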
from sklearn import metrics
import numpy as np

def apply_rand_message(data, pred_col, message_col):
    # randomly assign half the customers to "no message" and keep the model prediction otherwise
    np.random.seed(1234)
    data["random_number"] = np.random.randint(2, size=len(data))
    data["random_message"] = rsm.ifelse(data["random_number"] == 1, data[message_col], "no message")
    data["random_pred"] = rsm.ifelse(data["random_number"] == 1, data[pred_col], 0)
    return data
# prediction on training data
pred_rf = pentathlon_nptb.query("training == 1")["pred_rf"]
actual_rf = pentathlon_nptb.query("training == 1")["buyer_yes"]
fpr, tpr, thresholds = metrics.roc_curve(actual_rf, pred_rf)
metrics.auc(fpr, tpr).round(3)
1.0
# prediction on test data
pred_rf = pentathlon_nptb.query("training == 0")["pred_rf"]
actual_rf = pentathlon_nptb.query("training == 0")["buyer_yes"]
fpr, tpr, thresholds = metrics.roc_curve(actual_rf, pred_rf)
metrics.auc(fpr, tpr).round(3)
0.869
Based on the gains chart and the AUC values for the training and test sets, the model performs quite well. The training AUC of 1.0 is expected for a random forest with fully grown trees (min_samples_leaf = 1): in-sample predictions essentially memorize the training data, so it is the test (or out-of-bag) AUC that matters. The AUC for the test set is 0.869, which is high and indicates very good predictive performance on unseen data, so the apparent overfitting on the training set is not a practical concern.
1. Create prediction for each product
pentathlon_nptb['p_control_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb['p_racquet_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb['p_team_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb['p_backcountry_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb['p_water_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb['p_strength_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb['p_endurance_rf'] = rf_tuned.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
pentathlon_nptb
custid | buyer | total_os | message | age | female | income | education | children | freq_endurance | ... | ordersize_random_nn | ep_random_nn | pred_rf | p_control_rf | p_racquet_rf | p_team_rf | p_backcountry_rf | p_water_rf | p_strength_rf | p_endurance_rf | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | U1 | no | 0 | team | 30 to 44 | no | 55000 | 19 | 0.8 | 0 | ... | 0.404840 | 0.001868 | 0.001000 | 0.004 | 0.001 | 0.001 | 0.008000 | 0.012 | 0.028 | 0.070000 |
1 | U3 | no | 0 | backcountry | 45 to 59 | no | 35000 | 22 | 1.0 | 0 | ... | 0.162739 | 0.000173 | 0.000000 | 0.000 | 0.000 | 0.000 | 0.000000 | 0.000 | 0.000 | 0.000000 |
2 | U13 | no | 0 | endurance | 45 to 59 | yes | 45000 | 33 | 0.7 | 0 | ... | 0.604567 | 0.002138 | 0.001000 | 0.002 | 0.000 | 0.002 | 0.001333 | 0.000 | 0.000 | 0.001000 |
3 | U20 | no | 0 | water | 45 to 59 | yes | 25000 | 24 | 0.2 | 0 | ... | -0.002303 | -0.000001 | 0.000000 | 0.000 | 0.000 | 0.000 | 0.000000 | 0.000 | 0.000 | 0.000000 |
4 | U25 | no | 0 | racquet | >= 60 | yes | 65000 | 32 | 1.1 | 1 | ... | 0.780294 | 0.002988 | 0.019000 | 0.009 | 0.019 | 0.007 | 0.023000 | 0.009 | 0.010 | 0.005000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
599995 | U3462888 | no | 0 | water | >= 60 | yes | 40000 | 26 | 0.6 | 0 | ... | -0.482575 | -0.000317 | 0.000000 | 0.000 | 0.000 | 0.000 | 0.000000 | 0.000 | 0.000 | 0.000000 |
599996 | U3462900 | no | 0 | team | < 30 | no | 55000 | 32 | 0.9 | 3 | ... | 0.155320 | 0.000285 | 0.000000 | 0.001 | 0.001 | 0.000 | 0.000000 | 0.000 | 0.001 | 0.000000 |
599997 | U3462902 | no | 0 | team | < 30 | yes | 55000 | 32 | 0.9 | 0 | ... | 0.153595 | 0.000152 | 0.000000 | 0.006 | 0.003 | 0.000 | 0.029000 | 0.005 | 0.010 | 0.003000 |
599998 | U3462916 | no | 0 | team | < 30 | no | 50000 | 35 | 0.6 | 2 | ... | -0.042065 | -0.000051 | 0.011000 | 0.002 | 0.005 | 0.011 | 0.001000 | 0.000 | 0.000 | 0.000000 |
599999 | U3462922 | no | 0 | endurance | 30 to 44 | yes | 50000 | 25 | 0.7 | 1 | ... | 0.483980 | 0.000514 | 0.021667 | 0.002 | 0.000 | 0.005 | 0.002000 | 0.001 | 0.003 | 0.021667 |
600000 rows × 92 columns
To identify which message will lead to the highest probability of purchase, we will use our random forest model to extrapolate predictions to the full database.
repl = {
    "p_control_rf": "control",
    "p_endurance_rf": "endurance",
    "p_backcountry_rf": "backcountry",
    "p_racquet_rf": "racquet",
    "p_strength_rf": "strength",
    "p_team_rf": "team",
    "p_water_rf": "water"}

predictions_rf = [
    "p_control_rf",
    "p_endurance_rf",
    "p_backcountry_rf",
    "p_racquet_rf",
    "p_strength_rf",
    "p_team_rf",
    "p_water_rf"]

pentathlon_nptb['message_rf'] = (
    pentathlon_nptb[predictions_rf]
    .idxmax(axis=1)
    .map(repl)
)
The code above creates a new column in our pentathlon_nptb database, called message_rf, which identifies which of the predicted messages has the highest purchase probability for each customer and fills the column with that message's name.
#plot the distribution of message_rf column
pentathlon_nptb['message_rf'].value_counts().plot(kind='bar', title='Distribution of message_rf')
pentathlon_nptb['p_max_rf'] = pentathlon_nptb[predictions_rf].max(axis=1)
Approach
Our approach to targeted messaging is to build a random forest model that predicts the next product a customer will buy from a variety of customer features, using buyer_yes as the response variable and the other variables as predictors.
We use the random forest model to extend predictions to the full database. To predict the probability of purchasing under each message, we use the data_cmd argument. Finally, we select the message with the highest predicted probability using idxmax, which returns the label of the column with the maximum predicted purchase probability, and create p_max_rf to store that maximum probability for each customer.
2. Percentage of customers to message and not to message
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].message_rf, columns="count").apply(rsm.format_nr)
col_0 | count |
---|---|
message_rf | |
backcountry | 18,952 |
control | 53,019 |
endurance | 29,327 |
racquet | 18,859 |
strength | 23,927 |
team | 18,809 |
water | 17,107 |
pentathlon_nptb['message_rf'].value_counts(normalize=True)
message_rf
control 0.299772
endurance 0.162343
strength 0.132712
racquet 0.104497
backcountry 0.103977
team 0.102348
water 0.094352
Name: proportion, dtype: float64
control: based on this model, the control (no-message) condition has the highest predicted purchase probability for about 30% of customers, more than any single message. endurance and strength: these messages have roughly similar shares (about 16% and 13%), and there may be opportunities to further optimize or tailor them to increase their effectiveness. backcountry, racquet, water, and team each account for roughly 10% of customers, suggesting these messages are less often the best choice than control and endurance.
pentathlon_nptb['message_rf'].value_counts(normalize=True).plot(kind='bar', title='Message Distribution')
pentathlon_nptb.loc[pentathlon_nptb.training == 0, predictions_rf].agg("mean").sort_values(
    ascending=False).apply(rsm.format_nr, perc=True)
p_endurance_rf 2.91%
p_strength_rf 2.79%
p_water_rf 2.63%
p_team_rf 2.61%
p_backcountry_rf 2.55%
p_racquet_rf 2.55%
p_control_rf 2.43%
dtype: object
Above is a table of the average purchase probability if we were to send each product's message to everyone. Even though the model assigns the control condition to many more customers than any single message, this does not translate into a higher average purchase probability. This is an interesting insight, and we should consider other ways to optimize our messaging strategy, such as increasing the number of messages sent to customers in the endurance and strength categories.
3. Expected profit
#Average Ordersize per message
ordersize = pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer_yes == 1)].groupby("message_rf")["total_os"].mean()
ordersize
message_rf
backcountry 64.065387
control 49.908751
endurance 55.584893
racquet 56.416727
strength 56.684246
team 56.522449
water 61.926676
Name: total_os, dtype: float64
Create another random forest model to predict the order size
rf_os = rsm.model.rforest(
    data={"pentathlon_nptb": pentathlon_nptb[(pentathlon_nptb.training == 1) & (pentathlon_nptb.buyer == "yes")]},
    rvar="total_os",
    evar=evar,
    n_estimators=200,
    max_features='sqrt'
)
rf_os.summary()
Random Forest
Data : pentathlon_nptb
Response variable : total_os
Level : None
Explanatory variables: message, age, female, income, education, children, freq_endurance, freq_strength, freq_water, freq_team, freq_backcountry, freq_racquet
OOB : True
Model type : classification
Nr. of features : (12, 21)
Nr. of observations : 10,080
max_features : sqrt (4)
n_estimators : 200
min_samples_leaf : 1
random_state : 1234
AUC : nan
Estimation data :
income education children freq_endurance freq_strength freq_water freq_team freq_backcountry freq_racquet message_backcountry message_control message_endurance message_racquet message_strength message_team message_water age_< 30 age_30 to 44 age_45 to 59 age_>= 60 female_no
65000 52 0.5 3 0 0 2 1 3 False False True False False False False False False True False True
55000 43 0.2 1 2 0 0 2 0 False True False False False False False False False True False True
160000 82 1.3 13 5 0 15 5 2 True False False False False False False False True False False True
90000 60 1.2 7 2 2 5 0 1 True False False False False False False False False True False True
55000 40 0.7 0 1 0 1 1 0 False False False False True False False False True False False True
/Users/duyentran/Library/Python/3.11/lib/python/site-packages/pyrsm/model/perf.py:1240: RuntimeWarning:
invalid value encountered in scalar divide
pentathlon_nptb['pos_control_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "control"})["prediction"]
pentathlon_nptb['pos_racquet_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "racquet"})["prediction"]
pentathlon_nptb['pos_team_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "team"})["prediction"]
pentathlon_nptb['pos_backcountry_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "backcountry"})["prediction"]
pentathlon_nptb['pos_water_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "water"})["prediction"]
pentathlon_nptb['pos_strength_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "strength"})["prediction"]
pentathlon_nptb['pos_endurance_rf'] = rf_os.predict(pentathlon_nptb, data_cmd={"message": "endurance"})["prediction"]
Calculating the expected profit for each product
# expected profit per message: RF purchase probability x predicted order size
# (the pos_*_reg columns from the linear regression model) x 40% margin
pentathlon_nptb['ep_control_rf'] = pentathlon_nptb['p_control_rf'] * pentathlon_nptb['pos_control_reg'] * 0.4
pentathlon_nptb['ep_racquet_rf'] = pentathlon_nptb['p_racquet_rf'] * pentathlon_nptb['pos_racquet_reg'] * 0.4
pentathlon_nptb['ep_team_rf'] = pentathlon_nptb['p_team_rf'] * pentathlon_nptb['pos_team_reg'] * 0.4
pentathlon_nptb['ep_backcountry_rf'] = pentathlon_nptb['p_backcountry_rf'] * pentathlon_nptb['pos_backcountry_reg'] * 0.4
pentathlon_nptb['ep_water_rf'] = pentathlon_nptb['p_water_rf'] * pentathlon_nptb['pos_water_reg'] * 0.4
pentathlon_nptb['ep_strength_rf'] = pentathlon_nptb['p_strength_rf'] * pentathlon_nptb['pos_strength_reg'] * 0.4
pentathlon_nptb['ep_endurance_rf'] = pentathlon_nptb['p_endurance_rf'] * pentathlon_nptb['pos_endurance_reg'] * 0.4
expected_profit_rf = [
    "ep_control_rf",
    "ep_endurance_rf",
    "ep_backcountry_rf",
    "ep_racquet_rf",
    "ep_strength_rf",
    "ep_team_rf",
    "ep_water_rf"]

repl = {"ep_control_rf": "control", "ep_endurance_rf": "endurance", "ep_backcountry_rf": "backcountry", "ep_racquet_rf": "racquet", "ep_strength_rf": "strength", "ep_team_rf": "team", "ep_water_rf": "water"}

pentathlon_nptb["ep_message_rf"] = (
    pentathlon_nptb[expected_profit_rf]
    .idxmax(axis=1)
    .map(repl)
)
pentathlon_nptb['ep_message_rf'].value_counts()
ep_message_rf
control 171206
endurance 88843
water 72279
strength 71154
backcountry 68671
team 65027
racquet 62820
Name: count, dtype: int64
4. Percentage of customers for whom no message maximizes their expected profit
pentathlon_nptb.ep_message_rf.value_counts(normalize=True)
ep_message_rf
control 0.285343
endurance 0.148072
water 0.120465
strength 0.118590
backcountry 0.114452
team 0.108378
racquet 0.104700
Name: proportion, dtype: float64
5. Expected profit per e-mail if we customize the message to each customer
"ep_max_rf"] = pentathlon_nptb[expected_profit_rf].max(axis=1)
pentathlon_nptb["ep_max_rf"].mean() pentathlon_nptb[
1.0516237742883443
pd.crosstab(index=pentathlon_nptb[pentathlon_nptb.training == 0].ep_message_rf, columns="count").apply(rsm.format_nr)
col_0 | count |
---|---|
ep_message_rf | |
backcountry | 21,405 |
control | 50,280 |
endurance | 26,442 |
racquet | 18,786 |
strength | 21,082 |
team | 19,690 |
water | 22,315 |
6. Expected profit per e-mailed customer if every customer receives the same message
(
    pentathlon_nptb
    .loc[pentathlon_nptb.training == 0, expected_profit_rf]
    .agg("mean")
    .sort_values(ascending=False)
    .apply(rsm.format_nr, sym="$", dec=2)
)
ep_water_rf $0.65
ep_backcountry_rf $0.65
ep_endurance_rf $0.64
ep_strength_rf $0.63
ep_team_rf $0.58
ep_racquet_rf $0.58
ep_control_rf $0.49
dtype: object
7. Expected profit per e-mailed customer if every customer is assigned randomly to one of the messages or the no-message condition
# probability of purchase if the customer is assigned to a random message
pentathlon_nptb["p_random_rf"] = rf.predict(pentathlon_nptb)["prediction"]

# expected avg order size if the customer is assigned to a random message
pentathlon_nptb["ordersize_random_reg"] = reg.predict(pentathlon_nptb)["prediction"]

# expected profit when the customer is assigned to a random message
pentathlon_nptb["ep_random_rf"] = pentathlon_nptb["p_random_rf"] * pentathlon_nptb["ordersize_random_reg"] * 0.4

# expected profit per customer when the customer is assigned to a random message
random_profit_per_customer_rf = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_random_rf"].mean()
random_profit_per_customer_rf
0.5981461370014217
# expected profit where no message is sent (control)
pentathlon_nptb["ep_control_rf"]

# expected profit per customer where no message is sent (control)
control_profit_per_customer_rf = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_control_rf"].mean()
control_profit_per_customer_rf
0.48744714166587705
8. Profit for 5,000,000 customers
# Profit where each customer is assigned to the message with the highest expected profit
profit_rf = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_max_rf"].agg("mean") * 5000000
profit_rf
4947465.355117234
# Profit when a customer is sent a random message
random_profit_rf = random_profit_per_customer_rf * 5000000
random_profit_rf
# Profit when no message is sent
control_profit_rf = control_profit_per_customer_rf * 5000000
control_profit_rf
2437235.708329385
profit_improvement_rf = profit_rf - control_profit_rf
profit_improvement_rf
2510229.6467878493
profits_dct_rf = {
    "Customize Message": profit_rf,
    "Randomly Assign": random_profit_rf,
    "No Message Sent": control_profit_rf,
}
import seaborn as sns
import matplotlib.pyplot as plt
# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct_rf.items()), columns=['Model', 'Profit'])

plt.figure(figsize=(10, 5))  # Adjust the width and height to your preference

# Plot
sns.set(style="white")
ax = sns.barplot(x="Model", y="Profit", data=df, palette="coolwarm")

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45)  # Rotate x labels for better readability
plt.show()
/var/folders/28/cfl1_cfs3bb536qkz8wkys_w0000gn/T/ipykernel_31928/2755170672.py:16: FutureWarning:
Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
XGBoost Model
= "buyer_yes"
rvar
= pentathlon_nptb.columns.to_list()
evar = evar[evar.index("message"):] # all columns after "message"
evar = evar[:evar.index("freq_racquet")+1] # all columns before "freq_racquet"
evar evar
['message',
'age',
'female',
'income',
'education',
'children',
'freq_endurance',
'freq_strength',
'freq_water',
'freq_team',
'freq_backcountry',
'freq_racquet']
# Create X_train, X_test, y_train, y_test using "training" column
X_train = pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, evar]
X_test = pentathlon_nptb.loc[pentathlon_nptb['training'] == 0, evar]
y_train = pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, rvar]
y_test = pentathlon_nptb.loc[pentathlon_nptb['training'] == 0, rvar]
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
# Define the parameter grid to be used in GridSearchCV
param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100],
    'max_depth': [3, 6, 10]  # depths from 3 to 10
}

# Initialize the XGBClassifier
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', enable_categorical=True)

# Initialize GridSearchCV
xgb_tuned = GridSearchCV(estimator=xgb, param_grid=param_grid, scoring='roc_auc', cv=5, verbose=5)

# Fit the model
xgb_tuned.fit(X_train, y_train)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.889 total time= 0.8s
[CV 2/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.884 total time= 0.7s
[CV 3/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.890 total time= 0.7s
[CV 4/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.882 total time= 0.7s
[CV 5/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=0.880 total time= 0.7s
[CV 1/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.891 total time= 1.1s
[CV 2/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.886 total time= 1.1s
[CV 3/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.892 total time= 1.1s
[CV 4/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.884 total time= 1.1s
[CV 5/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=0.881 total time= 1.1s
[CV 1/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.884 total time= 1.6s
[CV 2/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.879 total time= 1.8s
[CV 3/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.885 total time= 1.6s
[CV 4/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.875 total time= 1.7s
[CV 5/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=0.874 total time= 1.6s
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.864 total time= 0.6s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.864 total time= 0.6s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.870 total time= 0.6s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.856 total time= 0.6s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.854 total time= 0.6s
[CV 1/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.880 total time= 1.0s
[CV 2/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.876 total time= 1.0s
[CV 3/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.882 total time= 1.0s
[CV 4/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.873 total time= 1.0s
[CV 5/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=0.870 total time= 1.0s
[CV 1/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.882 total time= 1.6s
[CV 2/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.876 total time= 1.6s
[CV 3/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.885 total time= 1.6s
[CV 4/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.877 total time= 1.6s
[CV 5/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=0.875 total time= 1.7s
[CV 1/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.850 total time= 0.6s
[CV 2/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.840 total time= 0.5s
[CV 3/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.848 total time= 0.6s
[CV 4/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.837 total time= 0.6s
[CV 5/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=0.837 total time= 0.5s
[CV 1/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.874 total time= 1.0s
[CV 2/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.869 total time= 1.0s
[CV 3/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.876 total time= 1.0s
[CV 4/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.867 total time= 0.9s
[CV 5/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=0.866 total time= 0.9s
[CV 1/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.874 total time= 1.5s
[CV 2/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.869 total time= 1.6s
[CV 3/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.880 total time= 1.5s
[CV 4/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.864 total time= 1.4s
[CV 5/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=0.868 total time= 1.5s
GridSearchCV(cv=5,
             estimator=XGBClassifier(enable_categorical=True, eval_metric='logloss', ...),
             param_grid={'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 6, 10], 'n_estimators': [100]},
             scoring='roc_auc', verbose=5)
# Print the best parameters
best_params = xgb_tuned.best_params_
print(best_params)

# Print the best cross-validated AUC
best_score = xgb_tuned.best_score_
print(best_score)
{'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 100}
0.886970840782522
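Rather than scanning the verbose fold-by-fold log above, the grid search results can also be summarized in a single table. A minimal sketch using the `cv_results_` attribute of the fitted `xgb_tuned` object:

# Summarize the grid search: mean and std of the CV AUC for each parameter combination
cv_summary = pd.DataFrame(xgb_tuned.cv_results_)[
    ["param_learning_rate", "param_max_depth", "mean_test_score", "std_test_score"]
].sort_values("mean_test_score", ascending=False)
print(cv_summary)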
# Train a new model using the best parameters
best_xgb = XGBClassifier(**best_params, use_label_encoder=False, eval_metric="logloss", enable_categorical=True)

# Fit the new model to the training data
best_xgb.fit(X_train, y_train)
XGBClassifier(enable_categorical=True, eval_metric='logloss', learning_rate=0.1, max_depth=6, n_estimators=100, ...)
# Feature importance of the tuned XGBoost model
from xgboost import plot_importance
import matplotlib.pyplot as plt

# Plot feature importance from the fitted best_xgb model
plot_importance(best_xgb, importance_type='weight')
plt.show()
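Since the rendered importance plot is not reproduced here, the same information can be pulled as a sorted table. A small sketch, assuming `best_xgb` is the tuned classifier fitted above:

# Feature importance as a sorted table (weight = number of splits that use each feature)
importance = best_xgb.get_booster().get_score(importance_type='weight')
print(pd.Series(importance).sort_values(ascending=False))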
Questions
1. For each customer, determine the message (i.e., endurance, strength, water, team, backcountry, racquet, or no-message) predicted to lead to the highest probability of purchase.
To determine the message that leads to the highest probability of purchase for each customer, we fit a separate model on the training data for each message level and then use each fitted model to score every customer. Once we have a predicted purchase probability for every message for each customer, we select the message with the highest predicted probability for that record.
# Create a training copy of the data where 'training' equals 1
pentathlon_nptb_train = pentathlon_nptb[pentathlon_nptb['training'] == 1].copy()

# Define the 7 different "message" values
message_values = ['team', 'backcountry', 'endurance', 'water', 'racquet', 'strength', 'control']

for value in message_values:
    # Filter the training set for the current message
    current_train = pentathlon_nptb_train[pentathlon_nptb_train['message'] == value]

    # Create X_train and y_train for the current message
    X_train = current_train[evar]
    y_train = current_train[rvar]

    # Fit the model to the training data for the current message
    best_xgb.fit(X_train, y_train)

    # Predict purchase probabilities for the entire dataset
    pentathlon_nptb[f'p_{value}_xgb'] = best_xgb.predict_proba(pentathlon_nptb[evar])[:, 1]
repl = {
    "p_control_xgb": "control",
    "p_endurance_xgb": "endurance",
    "p_backcountry_xgb": "backcountry",
    "p_racquet_xgb": "racquet",
    "p_strength_xgb": "strength",
    "p_team_xgb": "team",
    "p_water_xgb": "water"}

predictions_xgb = [
    "p_control_xgb",
    "p_endurance_xgb",
    "p_backcountry_xgb",
    "p_racquet_xgb",
    "p_strength_xgb",
    "p_team_xgb",
    "p_water_xgb"]

# Message with the highest predicted purchase probability for each customer
pentathlon_nptb['message_xgb'] = (
    pentathlon_nptb[predictions_xgb]
    .idxmax(axis=1)
    .map(repl)
)
pd.crosstab(index=pentathlon_nptb.message_xgb, columns="count").apply(rsm.format_nr).sort_values("count", ascending=True)
| message_xgb | count |
|---|---|
| endurance | 179,225 |
| backcountry | 56,966 |
| control | 58,865 |
| water | 64,209 |
| racquet | 70,013 |
| team | 71,928 |
| strength | 98,794 |
2. For each message, the percentage of customers for whom that message or no-message maximizes their probability of purchase:
We can see from the plot that the endurance message most often maximizes a customer's probability of purchase. After endurance, the strength message is the second most common value that maximizes purchase probability. Surprisingly, although team, racquet, and water are the next best message groups, none of the remaining messages appear to provide a meaningful improvement over sending no message at all.
message_percentages = pentathlon_nptb['message_xgb'].value_counts(normalize=True)
message_percentages
message_xgb
endurance 0.298708
strength 0.164657
team 0.119880
racquet 0.116688
water 0.107015
control 0.098108
backcountry 0.094943
Name: proportion, dtype: float64
# Plotting
plt.figure(figsize=(10, 6))
message_percentages.plot(kind='bar')
plt.ylabel('Percentage of Customers')
plt.xlabel('Message Type')
plt.title('Percentage of Customers for Each Message Value')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()
3. For each customer, determine the message (i.e., endurance, strength, water, team, backcountry, racquet, or no-message) predicted to lead to the highest expected profit (COGS is 60%).
The message generating the highest total expected profit is endurance.
To predict order size, we tuned a new XGBoost regression model and used it to predict the likely order size for each record. We then combined the predicted order size with the predicted purchase probability to calculate the expected profit for each record: expected profit = purchase probability x predicted order size x (1 - COGS), where COGS is 60% of revenue.
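As a quick illustration of that calculation (the numbers below are hypothetical, not taken from the data):

# Expected profit for a single customer-message pair (hypothetical inputs)
p_purchase = 0.05        # predicted probability of purchase for this message
pred_order_size = 60.0   # predicted order size in EUR, conditional on purchase
cogs = 0.60              # cost of goods sold as a share of revenue

expected_profit = p_purchase * pred_order_size * (1 - cogs)
print(expected_profit)   # 0.05 * 60.0 * 0.4 = 1.2 EUR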
= "total_os"
rvar_os = ['message',
evar 'age',
'female',
'income',
'education',
'children',
'freq_endurance',
'freq_strength',
'freq_water',
'freq_team',
'freq_backcountry',
'freq_racquet']
# Create X_train, X_test, y_train, y_test using "training" column
= pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, evar]
X_train = pentathlon_nptb.loc[pentathlon_nptb['training'] == 0, evar]
X_test = pentathlon_nptb.loc[pentathlon_nptb['training'] == 1, 'total_os']
y_train_os = pentathlon_nptb.loc[pentathlon_nptb['training'] == 0, 'total_os'] y_test_os
from xgboost import XGBRegressor

# Define the parameter grid to be used in GridSearchCV
param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100],
    'max_depth': [3, 6, 10]  # tree depths to try
}

# Initialize the XGBRegressor
xgb_os = XGBRegressor(use_label_encoder=False, eval_metric='rmse', enable_categorical=True)

# Initialize GridSearchCV with a scoring metric suitable for regression
xgb_tuned_os = GridSearchCV(estimator=xgb_os, param_grid=param_grid,
                            scoring='neg_root_mean_squared_error', cv=5, verbose=5)

# Fit the model
xgb_tuned_os.fit(X_train, y_train_os)
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-12.399 total time= 0.4s
[CV 2/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-12.532 total time= 0.4s
[CV 3/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-12.659 total time= 0.4s
[CV 4/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-12.504 total time= 0.4s
[CV 5/5] END learning_rate=0.1, max_depth=3, n_estimators=100;, score=-11.961 total time= 0.5s
[CV 1/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.700 total time= 0.6s
[CV 2/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.639 total time= 0.7s
[CV 3/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.807 total time= 0.7s
[CV 4/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.660 total time= 0.7s
[CV 5/5] END learning_rate=0.1, max_depth=6, n_estimators=100;, score=-12.062 total time= 0.7s
[CV 1/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-13.273 total time= 1.1s
[CV 2/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-13.104 total time= 1.2s
[CV 3/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-13.254 total time= 1.1s
[CV 4/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-12.978 total time= 1.3s
[CV 5/5] END learning_rate=0.1, max_depth=10, n_estimators=100;, score=-12.550 total time= 1.2s
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.550 total time= 0.4s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.691 total time= 0.4s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.816 total time= 0.4s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.654 total time= 0.4s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=-12.135 total time= 0.4s
[CV 1/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.568 total time= 0.8s
[CV 2/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.663 total time= 0.7s
[CV 3/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.795 total time= 0.9s
[CV 4/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.622 total time= 0.8s
[CV 5/5] END learning_rate=0.01, max_depth=6, n_estimators=100;, score=-12.095 total time= 0.8s
[CV 1/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.658 total time= 1.5s
[CV 2/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.755 total time= 1.5s
[CV 3/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.850 total time= 1.4s
[CV 4/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.661 total time= 1.4s
[CV 5/5] END learning_rate=0.01, max_depth=10, n_estimators=100;, score=-12.113 total time= 1.5s
[CV 1/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-12.971 total time= 0.4s
[CV 2/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-13.065 total time= 0.4s
[CV 3/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-13.206 total time= 0.4s
[CV 4/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-13.016 total time= 0.4s
[CV 5/5] END learning_rate=0.001, max_depth=3, n_estimators=100;, score=-12.524 total time= 0.5s
[CV 1/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-12.966 total time= 0.8s
[CV 2/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-13.058 total time= 0.8s
[CV 3/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-13.195 total time= 0.7s
[CV 4/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-13.005 total time= 0.7s
[CV 5/5] END learning_rate=0.001, max_depth=6, n_estimators=100;, score=-12.511 total time= 0.8s
[CV 1/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-12.957 total time= 1.6s
[CV 2/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-13.059 total time= 1.6s
[CV 3/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-13.197 total time= 1.5s
[CV 4/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-13.001 total time= 1.8s
[CV 5/5] END learning_rate=0.001, max_depth=10, n_estimators=100;, score=-12.500 total time= 1.6s
GridSearchCV(cv=5,
             estimator=XGBRegressor(enable_categorical=True, eval_metric='rmse', ...),
             param_grid={'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 6, 10], 'n_estimators': [100]},
             scoring='neg_root_mean_squared_error', verbose=5)
# After fitting, get the best parameters from the order-size grid search and retrain the model
best_params_os = xgb_tuned_os.best_params_
best_xgb_os = XGBRegressor(**best_params_os, use_label_encoder=False, eval_metric="rmse", enable_categorical=True)

# Fit the new model to the training data
best_xgb_os.fit(X_train, y_train_os)
XGBRegressor(enable_categorical=True, eval_metric='rmse', ...)
# Create a training copy of the data where 'training' equals 1
pentathlon_nptb_train = pentathlon_nptb[pentathlon_nptb['training'] == 1].copy()

# Define the 7 different "message" values
message_values = ['team', 'backcountry', 'endurance', 'water', 'racquet', 'strength', 'control']

for value in message_values:
    # Filter the training set for the current message
    current_train = pentathlon_nptb_train[pentathlon_nptb_train['message'] == value]

    # Create X_train and y_train for the current message
    X_train = current_train[evar]
    y_train = current_train[rvar_os]

    # Fit the order-size model to the training data for the current message
    best_xgb_os.fit(X_train, y_train)

    # Predict order size for the entire dataset
    pentathlon_nptb[f'os_{value}_xgb'] = best_xgb_os.predict(pentathlon_nptb[evar]).round(0)
values = ['team', 'backcountry', 'endurance', 'water', 'racquet', 'strength', 'control']
Profit_margin = 1 - 0.6  # COGS is 60% of revenue, so the profit margin is 40%

# Expected profit = purchase probability * predicted order size * profit margin
for value in values:
    pentathlon_nptb[f'ep_{value}_xgb'] = (
        pentathlon_nptb[f'p_{value}_xgb'] * pentathlon_nptb[f'os_{value}_xgb'] * Profit_margin
    )
expected_profit_xgb = [
    "ep_control_xgb",
    "ep_endurance_xgb",
    "ep_backcountry_xgb",
    "ep_racquet_xgb",
    "ep_strength_xgb",
    "ep_team_xgb",
    "ep_water_xgb"]

repl = {
    "ep_control_xgb": "control",
    "ep_endurance_xgb": "endurance",
    "ep_backcountry_xgb": "backcountry",
    "ep_racquet_xgb": "racquet",
    "ep_strength_xgb": "strength",
    "ep_team_xgb": "team",
    "ep_water_xgb": "water"}

# Message with the highest expected profit for each customer
pentathlon_nptb["ep_message_xgb"] = (
    pentathlon_nptb[expected_profit_xgb]
    .idxmax(axis=1)
    .map(repl)
)
# Define a function to look up the expected profit value based on ep_message_xgb
def get_ep(row):
    # Map each ep_message_xgb value to the corresponding expected profit column
    mapping = {
        'control': 'ep_control_xgb',
        'endurance': 'ep_endurance_xgb',
        'backcountry': 'ep_backcountry_xgb',
        'racquet': 'ep_racquet_xgb',
        'strength': 'ep_strength_xgb',
        'team': 'ep_team_xgb',
        'water': 'ep_water_xgb',
    }

    # Get the expected profit column name for the current row's message
    profit_col = mapping.get(row['ep_message_xgb'])

    # Return the value from the corresponding profit column for this row
    if profit_col:
        return row[profit_col]
    else:
        return None

# Apply the function to each row to create a new column with the corresponding expected profit
pentathlon_nptb['ep_message_max_p'] = pentathlon_nptb.apply(get_ep, axis=1)
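Because ep_message_xgb was itself defined as the idxmax over the expected-profit columns, an equivalent and much faster alternative to the row-by-row apply is a vectorized row-wise max (a sketch of the same result):

# Vectorized equivalent: the expected profit of the best message is simply the row-wise max
pentathlon_nptb['ep_message_max_p'] = pentathlon_nptb[expected_profit_xgb].max(axis=1)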
ep_profit = pentathlon_nptb.groupby('ep_message_xgb')['ep_message_max_p'].sum().sort_values(ascending=False).round(2)
ep_profit
ep_message_xgb
endurance 43977.37
strength 37358.12
water 30991.18
backcountry 26557.67
team 21474.36
racquet 18135.07
control 9657.12
Name: ep_message_max_p, dtype: float64
# Bar chart of ep_profit
plt.figure(figsize=(10, 6))
ep_profit.plot(kind='bar')
plt.ylabel('Total Profit')
plt.xlabel('Message Type')
plt.title('Expected Profit by Message')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()
4. For each message, i.e., endurance, racquet, etc., and no-message, report the percentage of customers for whom that (no) message maximizes their expected profit.
Message_max = pentathlon_nptb.ep_message_xgb.value_counts(normalize=True)
Message_max
ep_message_xgb
control 0.576705
strength 0.091172
endurance 0.082492
water 0.067795
team 0.066832
backcountry 0.065020
racquet 0.049985
Name: proportion, dtype: float64
# Plotting
plt.figure(figsize=(10, 6))
Message_max.plot(kind='bar')
plt.ylabel('Percentage of Customers')
plt.xlabel('Message Type')
plt.title('Percentage of Customers for Each Message Value')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()
5. The expected profit we can obtain, on average, per customer if we customize the message to each customer
avg_profit_pc = pentathlon_nptb[pentathlon_nptb.training == 0].groupby('ep_message_xgb')['ep_message_max_p'].mean().sort_values(ascending=False).round(2) * Profit_margin
avg_profit_pc
ep_message_xgb
endurance 0.336
strength 0.284
water 0.280
backcountry 0.264
racquet 0.232
team 0.196
control 0.008
Name: ep_message_max_p, dtype: float64
# Bar chart of avg_profit_pc
plt.figure(figsize=(10, 6))
avg_profit_pc.plot(kind='bar')
plt.ylabel('Average Profit')
plt.xlabel('Message Type')
plt.title('Expected Profit Per Customer')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()
6. The expected profit per e-mailed customer if every customer receives the same message
expected_profit_xgb = [
    "ep_control_xgb",
    "ep_endurance_xgb",
    "ep_backcountry_xgb",
    "ep_racquet_xgb",
    "ep_strength_xgb",
    "ep_team_xgb",
    "ep_water_xgb"]

# Expected profit per customer if every customer receives the same message
epp_same_message = pentathlon_nptb[pentathlon_nptb.training == 0][expected_profit_xgb].mean().sort_values(ascending=False) * Profit_margin
# Bar chart of epp_same_message
plt.figure(figsize=(10, 6))
epp_same_message.plot(kind='bar')
plt.ylabel('Average Profit')
plt.xlabel('Message Type')
plt.title('Expected Profit Per Customer')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()
7. The expected profit per e-mailed customer if every customer is assigned randomly to one of the messages or the no-message condition
pentathlon_nptb['ep_random_xgb'] = best_xgb.predict_proba(pentathlon_nptb[evar])[:, 1]
xgb_random_ppc = pentathlon_nptb[pentathlon_nptb.training == 0]['ep_random_xgb'].mean() * Profit_margin
xgb_random_ppc
0.008396312594413757
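Note that ep_random_xgb above reuses the purchase probabilities from the last fitted classifier. An alternative reading of "assigned randomly", sketched below under the assumption that each of the seven conditions is equally likely, averages the expected-profit columns for each customer:

# Sketch: expected profit per customer under a uniform random assignment across the 7 conditions
ep_cols = ["ep_control_xgb", "ep_endurance_xgb", "ep_backcountry_xgb", "ep_racquet_xgb",
           "ep_strength_xgb", "ep_team_xgb", "ep_water_xgb"]
random_ppc_uniform = pentathlon_nptb.loc[pentathlon_nptb.training == 0, ep_cols].mean(axis=1).mean()
print(random_ppc_uniform)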
expected_profit_xgb = [
    "ep_control_xgb",
    "ep_endurance_xgb",
    "ep_backcountry_xgb",
    "ep_racquet_xgb",
    "ep_strength_xgb",
    "ep_team_xgb",
    "ep_water_xgb",
    "ep_random_xgb"]

# Expected profit per customer for each message, including the random-assignment benchmark
epp_same_message = pentathlon_nptb[pentathlon_nptb.training == 0][expected_profit_xgb].mean().sort_values(ascending=False)
# Bar chart of epp_same_message, including the random-assignment benchmark
plt.figure(figsize=(10, 6))
epp_same_message.plot(kind='bar')
plt.ylabel('Average Profit')
plt.xlabel('Message Type')
plt.title('Expected Profit Per Customer')
plt.xticks(rotation=45)  # Rotate labels to make them readable
plt.show()
8. For a typical promotional e-mail blast to 5,000,000 customers, the improvement (in percent and in total Euros) Pentathlon could achieve by customizing the message (or no-message) for each customer:
"ep_max_xgb"] = pentathlon_nptb[expected_profit_xgb].max(axis=1)
pentathlon_nptb["ep_max_xgb"].mean() pentathlon_nptb[
0.31611574
# Profit where each customer is assigned the message with the highest expected profit
profit_xgb = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_max_xgb"].agg("mean") * 5000000
profit_xgb
1511183.9771270752
# Profit where each customer is sent a random message
random_profit_xgb = xgb_random_ppc * 5000000
random_profit_xgb
41981.56297206879
# Expected profit where no message is sent (control)
pentathlon_nptb["ep_control_xgb"]

# Expected profit per customer where no message is sent (control)
control_profit_per_customer = pentathlon_nptb.loc[pentathlon_nptb.training == 0, "ep_control_xgb"].mean()
control_profit_per_customer
0.078689866
# Profit where no message is sent
control_profit_xgb = control_profit_per_customer * 5000000
control_profit_xgb
393449.3288397789
profit_improvement_xgb = profit_xgb - control_profit_xgb
profit_improvement_xgb
1117734.6482872963
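The question also asks for the improvement in percent; a minimal sketch of that calculation, relative to the no-message (control) baseline computed above:

# Improvement from customization expressed as a percentage of the no-message baseline
pct_improvement_xgb = profit_improvement_xgb / control_profit_xgb * 100
print(f"EUR {profit_improvement_xgb:,.0f} improvement, about {pct_improvement_xgb:.0f}% over the control baseline")
# With the values above, this is roughly a 284% increase over the control baseline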
profits_dct = {
    "Customize Message": profit_xgb,
    "No Message Sent": control_profit_xgb,
    "Randomly Assign": random_profit_xgb,
}
import seaborn as sns
import matplotlib.pyplot as plt

# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct.items()), columns=['Model', 'Profit'])

plt.figure(figsize=(10, 5))

# Plot
sns.set(style="whitegrid")
ax = sns.barplot(x="Model", y="Profit", data=df, palette="viridis")

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45)
plt.show()
Profit Improvement for the 4 Models
mod_perf = pd.DataFrame({
    "model": ["logit", "nn", "rf", "xgb"],
    "profit": [profit_improvement_lr, profit_improvement_nn, profit_improvement_rf, profit_improvement_xgb]
})
mod_perf
|   | model | profit |
|---|---|---|
| 0 | logit | 1.238407e+06 |
| 1 | nn | 1.262522e+06 |
| 2 | rf | 2.510230e+06 |
| 3 | xgb | 1.117735e+06 |
profits_dct = {
    "Logit": profit_improvement_lr,
    "Neural Network": profit_improvement_nn,
    "Random Forest": profit_improvement_rf,
    "XGBoost": profit_improvement_xgb
}
import seaborn as sns
import matplotlib.pyplot as plt

# Convert dictionary to DataFrame
df = pd.DataFrame(list(profits_dct.items()), columns=['Model', 'Profit'])

plt.figure(figsize=(10, 5))

# Plot
sns.set(style="whitegrid")
ax = sns.barplot(x="Model", y="Profit", data=df, palette="viridis")

# Annotations
for index, row in df.iterrows():
    ax.text(index, row.Profit, f'${row.Profit:.2f}', ha='center')

# Set labels and title
ax.set_xlabel("Model Type", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_ylabel("Profit", fontdict={'family': 'serif', 'color': 'black', 'size': 12})
ax.set_title("Profit Improvement by Model", fontdict={'family': 'serif', 'color': 'black', 'size': 15})

plt.xticks(rotation=45)
plt.show()
According to the chart, Random Forest outperforms the other models in terms of profit improvement but takes more than 2 hours to tune the model.
- Weaknesses of the Random Forest Model:
Computational Complexity: Random Forest can be computationally intensive, especially as the number of trees and the depth of each tree increase. This is likely the cause of the long tuning time. The complexity grows with the size of the dataset, the number of features, and the model parameters.
Scalability: Because of the computational cost, scaling Random Forest to very large datasets or high-dimensional data can be challenging and might require significant computational resources.
Model Interpretability: While not as complex as some other models like neural networks, Random Forests are still considered a “black box” compared to simpler models like Logistic Regression. This can make it difficult to understand the decision-making process of the model and to explain individual predictions.
Tuning Required: Random Forest models have several hyperparameters (like the number of trees, max depth, min samples split, etc.) that can greatly influence performance. Finding the optimal settings can be time-consuming and require a lot of trial and error or automated search techniques like grid search or random search, which can further increase the computational burden.
- Suggested Improvement:
Model Simplification: We could simplify the model by reducing the number of trees or the depth of each tree, which might reduce training time at the cost of some accuracy (see the sketch after this list).
Feature Selection: Reducing the number of features through feature selection can significantly decrease training time and sometimes even improve model performance by eliminating noise.
Incremental Learning: Some models allow for incremental learning, where the model is trained on chunks of the dataset sequentially, which can reduce memory consumption and potentially training time.
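As a rough illustration of the first two suggestions, the sketch below (not the tuning code used earlier in this report) caps the tuning cost by using a smaller, bounded search space and a randomized rather than exhaustive search:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# A smaller, bounded search space keeps training time in check
param_dist = {
    "n_estimators": [100, 200],
    "max_depth": [6, 10, None],
    "min_samples_split": [2, 10],
}

rf = RandomForestClassifier(n_jobs=-1, random_state=1234)  # build trees in parallel

# n_iter=5 samples only 5 of the 12 combinations instead of fitting all of them
rf_search = RandomizedSearchCV(rf, param_dist, n_iter=5, scoring="roc_auc",
                               cv=3, n_jobs=-1, random_state=1234, verbose=1)
# rf_search.fit(X_train, y_train)  # fit as with the grid searches above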