Optimizing Online Sport Retail Revenue

Sports clothing is a booming sector!

In this notebook, I will use my SQL skills to analyze product data for an online sports retail company.

I will work with numeric, string, and timestamp data on pricing and revenue, ratings, reviews, descriptions, and website traffic then use techniques such as aggregation, cleaning, labeling, Common Table Expressions, and correlation to produce recommendations on how the company can maximize revenue!

Project Tasks

Counting missing values
Nike vs Adidas pricing
Labeling price ranges
Average discount by brand
Correlation between revenue and reviews
Ratings and reviews by product description length
Reviews by month and brand
Footwear product performance
Clothing product performance

1. Counting missing values

Sports clothing and athleisure attire is a huge industry, worth approximately $193 billion in 2021 with a strong growth forecast over the next decade!

In this notebook, we play the role of a product analyst for an online sports clothing company. The company is specifically interested in how it can improve revenue. We will dive into product data such as pricing, reviews, descriptions, and ratings, as well as revenue and website traffic, to produce recommendations for its marketing and sales teams.

The database provided to us, sports, contains five tables, with product_id being the primary key for all of them:

`info`

column	data type	description
`product_name`	`varchar`	Name of the product
`product_id`	`varchar`	Unique ID for product
`description`	`varchar`	Description of the product

`finance`

column	data type	description
`product_id`	`varchar`	Unique ID for product
`listing_price`	`float`	Listing price for product
`sale_price`	`float`	Price of the product when on sale
`discount`	`float`	Discount, as a decimal, applied to the sale price
`revenue`	`float`	Amount of revenue generated by each product, in US dollars

`reviews`

column	data type	description
`product_name`	`varchar`	Name of the product
`product_id`	`varchar`	Unique ID for product
`rating`	`float`	Product rating, scored from `1.0` to `5.0`
`reviews`	`float`	Number of reviews for the product

`traffic`

column	data type	description
`product_id`	`varchar`	Unique ID for product
`last_visited`	`timestamp`	Date and time the product was last viewed on the website

`brands`

column	data type	description
`product_id`	`varchar`	Unique ID for product
`brand`	`varchar`	Brand of the product

We will be dealing with missing data as well as numeric, string, and timestamp data types to draw insights about the products in the online store. Let’s start by finding out how complete the data is.

import pandas as pd
import sqlite3

# Define the names of your CSV files and corresponding table names
names = ['finance', 'info', 'reviews', 'traffic', 'brands']

# Create or open a single SQLite database
conn = sqlite3.connect('consolidated.db')

# Load each CSV data into a DataFrame and write it to a SQL table in the single database
for name in names:
    df = pd.read_csv(f'{name}.csv')
    df.to_sql(name, conn, if_exists='replace', index=False)

# Close the database connection
conn.close()

%load_ext sql
%sql sqlite:///consolidated.db

The sql extension is already loaded. To reload it, use:
  %reload_ext sql

%%sql
SELECT 
    COUNT(*) AS total_rows,
    SUM(CASE WHEN i.description IS NULL THEN 0 ELSE 1 END) AS count_description,
    SUM(CASE WHEN f.listing_price IS NULL THEN 0 ELSE 1 END) AS count_listing_price,
    SUM(CASE WHEN t.last_visited IS NULL THEN 0 ELSE 1 END) AS count_last_visited
FROM info i
JOIN finance f ON i.product_id = f.product_id
JOIN traffic t ON f.product_id = t.product_id;

 * sqlite:///consolidated.db
Done.

total_rows	count_description	count_listing_price	count_last_visited
3179	3117	3120	2928

2. Nike vs Adidas pricing

We can see the database contains 3,179 products in total. Of the columns we previewed, only one — last_visited — is missing more than five percent of its values. Now let’s turn our attention to pricing.

How do the price points of Nike and Adidas products differ? Answering this question can help us build a picture of the company’s stock range and customer market. We will run a query to produce a distribution of the listing_price and the count for each price, grouped by brand.

%%sql

SELECT 
    b.brand,
    CAST(f.listing_price AS INTEGER),
    COUNT(*)
FROM finance AS f
JOIN brands AS b
ON b.product_id = f.product_id
WHERE f.listing_price > 0
AND (b.brand = 'Adidas' OR b.brand = 'Nike')
GROUP BY b.brand, f.listing_price
ORDER BY f.listing_price DESC;

 * sqlite:///consolidated.db
Done.

brand	CAST(f.listing_price AS INTEGER)	COUNT(*)
Adidas	299	2
Adidas	279	4
Adidas	239	5
Adidas	229	8
Adidas	219	11
Adidas	199	8
Nike	199	1
Adidas	189	7
Nike	189	2
Adidas	179	34
Nike	179	4
Adidas	169	27
Nike	169	14
Adidas	159	28
Nike	159	31
Adidas	149	41
Nike	149	6
Adidas	139	36
Nike	139	12
Adidas	129	96
Nike	129	12
Adidas	119	115
Nike	119	16
Adidas	109	91
Nike	109	17
Adidas	99	72
Nike	99	14
Adidas	95	2
Nike	94	1
Adidas	89	89
Nike	89	13
Adidas	85	7
Adidas	84	1
Nike	84	5
Adidas	79	322
Nike	79	16
Nike	78	1
Adidas	75	149
Adidas	74	1
Nike	74	7
Adidas	69	87
Nike	69	4
Adidas	65	102
Nike	64	1
Adidas	62	1
Adidas	59	211
Nike	59	2
Adidas	55	174
Adidas	54	2
Adidas	52	43
Adidas	49	183
Nike	49	5
Adidas	47	42
Nike	47	1
Adidas	45	163
Adidas	44	1
Nike	44	3
Adidas	42	51
Adidas	39	81
Nike	39	1
Adidas	37	24
Adidas	35	25
Adidas	32	24
Adidas	29	37
Nike	29	2
Adidas	27	38
Adidas	26	18
Adidas	24	28
Adidas	22	1
Adidas	19	8
Adidas	17	4
Adidas	15	4
Adidas	14	27
Adidas	12	27
Adidas	11	1
Adidas	9	11
Adidas	8	1

3. Labeling price ranges

It turns out there are 77 unique prices for the products in our database, which makes the output of our last query quite difficult to analyze.

Let’s build on our previous query by assigning labels to different price ranges, grouping by brand and label. We will also include the total revenue for each price range and brand.

%%sql
SELECT 
    b.brand,
    COUNT(*),
    SUM(f.revenue) AS total_revenue,
    CASE 
        WHEN f.listing_price < 42 THEN 'Budget'
        WHEN f.listing_price >= 42 AND f.listing_price < 74 THEN 'Average'
        WHEN f.listing_price >= 74 AND f.listing_price < 129 THEN 'Expensive'
        ELSE 'Elite'
    END AS price_category
FROM finance AS f
JOIN brands AS b 
ON b.product_id = f.product_id
WHERE b.brand IS NOT NULL
GROUP BY b.brand, price_category
ORDER BY total_revenue DESC;

 * sqlite:///consolidated.db
Done.

brand	COUNT(*)	total_revenue	price_category
Adidas	849	4626980.07	Expensive
Adidas	1060	3233661.06	Average
Adidas	307	3014316.83	Elite
Adidas	359	651661.12	Budget
Nike	357	595341.02	Budget
Nike	82	128475.59	Elite
Nike	90	71843.15	Expensive
Nike	16	6623.5	Average

4. Average discount by brand

Interestingly, grouping products by brand and price range allows us to see that Adidas items generate more total revenue regardless of price category! Specifically, “Elite” Adidas products priced $129 or more typically generate the highest revenue, so the company can potentially increase revenue by shifting their stock to have a larger proportion of these products!

Note we have been looking at listing_price so far. The listing_price may not be the price that the product is ultimately sold for. To understand revenue better, let’s take a look at the discount, which is the percent reduction in the listing_price when the product is actually sold. We would like to know whether there is a difference in the amount of discount offered between brands, as this could be influencing revenue.

%%sql
SELECT 
    b.brand,
    AVG(f.discount) * 100 AS average_discount
FROM finance AS f
INNER JOIN brands AS b 
ON b.product_id = f.product_id
GROUP BY brand
HAVING brand IS NOT NULL;

 * sqlite:///consolidated.db
Done.

brand	average_discount
Adidas	33.45242718446602
Nike	0.0

5. Correlation between revenue and reviews

Strangely, no discount is offered on Nike products! In comparison, not only do Adidas products generate the most revenue, but these products are also heavily discounted!

To improve revenue further, the company could try to reduce the amount of discount offered on Adidas products, and monitor sales volume to see if it remains stable. Alternatively, it could try offering a small discount on Nike products. This would reduce average revenue for these products, but may increase revenue overall if there is an increase in the volume of Nike products sold.

Now explore whether relationships exist between the columns in our database. We will check the strength and direction of a correlation between revenue and reviews.

# Connect to the SQLite database
conn = sqlite3.connect('consolidated.db')

# Fetch data
query = """
SELECT r.reviews, f.revenue
FROM finance AS f
JOIN reviews AS r ON r.product_id = f.product_id
"""
df = pd.read_sql(query, conn)

# Calculate correlation using pandas
correlation = df['reviews'].corr(df['revenue'])
print("Correlation between reviews and revenue:", correlation)

# Close the database connection
conn.close()

Correlation between reviews and revenue: 0.6518512283481297

6. Ratings and reviews by product description length

Interestingly, there is a strong positive correlation between revenue and reviews. This means, potentially, if we can get more reviews on the company’s website, it may increase sales of those items with a larger number of reviews.

Perhaps the length of a product’s description might influence a product’s rating and reviews — if so, the company can produce content guidelines for listing products on their website and test if this influences revenue. Let’s check this out!

%%sql
SELECT
    (LENGTH(i.description) / 100) * 100 AS description_length,  -- Truncate to nearest 100
    ROUND(AVG(r.rating), 2) AS average_rating  -- Calculate average and round
FROM reviews r
JOIN info i ON r.product_id = i.product_id
WHERE i.description IS NOT NULL
GROUP BY description_length
ORDER BY description_length;

 * sqlite:///consolidated.db
Done.

description_length	average_rating
0	1.87
100	3.21
200	3.27
300	3.29
400	3.32
500	3.12
600	3.65

7. Reviews by month and brand

Unfortunately, there doesn’t appear to be a clear pattern between the length of a product’s description and its rating.

As we know a correlation exists between reviews and revenue, one approach the company could take is to run experiments with different sales processes encouraging more reviews from customers about their purchases, such as by offering a small discount on future purchases.

Let’s take a look at the volume of reviews by month to see if there are any trends or gaps we can look to exploit.

%%sql

SELECT 
    b.brand,
    strftime('%m', t.last_visited) AS month,  -- Extracts the month as 'MM'
    COUNT(r.product_id) AS num_reviews
FROM reviews AS r
JOIN traffic AS t ON r.product_id = t.product_id
JOIN brands AS b ON r.product_id = b.product_id
WHERE b.brand IS NOT NULL AND t.last_visited IS NOT NULL
GROUP BY b.brand, strftime('%m', t.last_visited)
ORDER BY b.brand, month;

 * sqlite:///consolidated.db
Done.

brand	month	num_reviews
Adidas	01	253
Adidas	02	272
Adidas	03	269
Adidas	04	180
Adidas	05	172
Adidas	06	159
Adidas	07	170
Adidas	08	189
Adidas	09	181
Adidas	10	192
Adidas	11	150
Adidas	12	190
Nike	01	52
Nike	02	52
Nike	03	55
Nike	04	42
Nike	05	41
Nike	06	43
Nike	07	37
Nike	08	29
Nike	09	28
Nike	10	47
Nike	11	38
Nike	12	35

8. Footwear product performance

Looks like product reviews are highest in the first quarter of the calendar year, so there is scope to run experiments aiming to increase the volume of reviews in the other nine months!

So far, we have been primarily analyzing Adidas vs Nike products. Now, let’s switch our attention to the type of products being sold. As there are no labels for product type, we will create a Common Table Expression (CTE) that filters description for keywords, then use the results to find out how much of the company’s stock consists of footwear products and the median revenue generated by these items.

%%sql
WITH footwear AS (
    SELECT 
        i.description, 
        f.revenue
    FROM info AS i
    INNER JOIN finance AS f ON i.product_id = f.product_id
    WHERE (LOWER(i.description) LIKE '%shoe%' OR 
           LOWER(i.description) LIKE '%trainer%' OR 
           LOWER(i.description) LIKE '%foot%') 
          AND i.description IS NOT NULL
)

SELECT
    COUNT(*) AS num_footwear_products,
    (SELECT f.revenue
     FROM footwear f
     ORDER BY f.revenue
     LIMIT 1
     OFFSET (SELECT (COUNT(*) - 1) / 2 FROM footwear)) AS median_footwear_revenue
FROM footwear;

 * sqlite:///consolidated.db
Done.

num_footwear_products	median_footwear_revenue
2700	3118.36

9. Clothing product performance

Recall from the first task that we found there are 3,117 products without missing values for description. Of those, 2,700 are footwear products, which accounts for around 85% of the company’s stock. They also generate a median revenue of over $3000 dollars!

This is interesting, but we have no point of reference for whether footwear’s median_revenue is good or bad compared to other products. So, for our final task, let’s examine how this differs to clothing products. We will re-use footwear, adding a filter afterward to count the number of products and median_revenue of products that are not in footwear.

%%sql
WITH footwear AS (
    SELECT 
        i.product_id
    FROM info AS i
    WHERE i.description IS NOT NULL
      AND (LOWER(i.description) LIKE '%shoe%' OR 
           LOWER(i.description) LIKE '%trainer%' OR 
           LOWER(i.description) LIKE '%foot%')
)

SELECT 
    COUNT(*) AS num_clothing_products,
    (SELECT f.revenue
     FROM finance AS f
     JOIN info AS i ON i.product_id = f.product_id
     WHERE i.product_id NOT IN (SELECT product_id FROM footwear)
     ORDER BY f.revenue
     LIMIT 1
     OFFSET (SELECT COUNT(*) / 2 FROM finance AS f
             JOIN info AS i ON i.product_id = f.product_id
             WHERE i.product_id NOT IN (SELECT product_id FROM footwear))) AS median_clothing_revenue
FROM info AS i
WHERE i.product_id NOT IN (SELECT product_id FROM footwear);

 * sqlite:///consolidated.db
Done.

num_clothing_products	median_clothing_revenue
479	388.37