What truly influences demand and sales for Walmart?¶
Contribution Details:¶
Contribution Checkpoints:
A: Project idea - 5%
B: Dataset Curation and Preprocessing - 10%
C: Data Exploration and Summary Statistics - 10%
D: ML Algorithm Design/Development - 25%
E: ML Algorithm Training and Test Data Analysis - 20%
F: Visualization, Result Analysis, Conclusion - 15%
G: Final Tutorial Report Creation - 10%
H: Additional (not listed above, if any) - 5%
Member 1: Christopher Perez Lebron, Contribution: 100%
Member 2: Ever Campos, Contribution: 100%
Member 3: Abbas Islaw, Contribution: 100%
"We, all team members, agree together that the above information is true, and we are confident about our contributions to this submitted project/final tutorial."
Christopher Perez Lebron, 5/7/24
Ever Campos, 5/7/24
Abbas Islaw, 5/7/24
Contribution Details, Continued:¶
- Christopher Perez Lebron:
- Worked on everything
- Ever Campos:
- Worked on everything
- Abbas Islaw:
- Worked on everything
NOTE: Our group frequently met via Discord and walked through the project as a team. We discussed and implemented decisions on call as a single unit, so we all participated equally in every step of the project. No one completed any section on their own; we all contributed ideas and code.
Spring 2024 Data Science Project¶
Christopher Perez Lebron, Ever Campos, Abbas Islaw
Demand. It's the heart of business. Virtually everyone in the business world is curious to find out the answer to a simple question - what is influencing demand for a particular product? For Walmart, a multinational corporation that has hundreds - if not thousands - of different products, it can be complicated to find out what factors are driving demand at a store-wide scale.
Is it the selection of products themselves?
Could it be the status of the economy?
These questions are simple, yet hard to pinpoint.
We were particularly curious to find out what drives the demand and values of weekly sales for a large corporation such as Walmart. We wanted to dig deep, figure out the factors influencing weekly sales, and possibly predict weekly sales for different stores.
Data Curation / Importing Data¶
For our data, we decided to use a dataset from Kaggle, "Walmart Sales": https://www.kaggle.com/datasets/mikhail1681/walmart-sales
This dataset has thousands of datapoints covering different Walmart stores at different points in time, including their weekly sales, the Consumer Price Index (CPI) at the time, fuel prices, holiday weeks, and more.
Imports needed:
import pandas as pd
import numpy as np
import scipy as sci
import matplotlib.pyplot as plt
import datetime as dt
import math
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LearningCurveDisplay, learning_curve
from tqdm import tqdm
from sklearn.metrics import mean_squared_error, r2_score
from scipy.ndimage import gaussian_filter1d
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Let's take a quick look at our data:
df = pd.read_csv("Walmart_sales.csv")
df.head()
Index | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
---|---|---|---|---|---|---|---|---|
0 | 1 | 05-02-2010 | 1643690.90 | 0 | 42.31 | 2.572 | 211.096358 | 8.106 |
1 | 1 | 12-02-2010 | 1641957.44 | 1 | 38.51 | 2.548 | 211.242170 | 8.106 |
2 | 1 | 19-02-2010 | 1611968.17 | 0 | 39.93 | 2.514 | 211.289143 | 8.106 |
3 | 1 | 26-02-2010 | 1409727.59 | 0 | 46.63 | 2.561 | 211.319643 | 8.106 |
4 | 1 | 05-03-2010 | 1554806.68 | 0 | 46.50 | 2.625 | 211.350143 | 8.106 |
Data Cleaning¶
Moreover, let's convert the date column into datetime:
df["Date"] = pd.to_datetime(df['Date'], format="%d-%m-%Y")
print(df.dtypes)
df.head()
Store int64 Date datetime64[ns] Weekly_Sales float64 Holiday_Flag int64 Temperature float64 Fuel_Price float64 CPI float64 Unemployment float64 dtype: object
Index | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
---|---|---|---|---|---|---|---|---|
0 | 1 | 2010-02-05 | 1643690.90 | 0 | 42.31 | 2.572 | 211.096358 | 8.106 |
1 | 1 | 2010-02-12 | 1641957.44 | 1 | 38.51 | 2.548 | 211.242170 | 8.106 |
2 | 1 | 2010-02-19 | 1611968.17 | 0 | 39.93 | 2.514 | 211.289143 | 8.106 |
3 | 1 | 2010-02-26 | 1409727.59 | 0 | 46.63 | 2.561 | 211.319643 | 8.106 |
4 | 1 | 2010-03-05 | 1554806.68 | 0 | 46.50 | 2.625 | 211.350143 | 8.106 |
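Before moving on, a quick check for missing values and duplicate rows rounds out the cleaning step; a minimal sketch using only the columns already in the dataset:
# Quick data-quality checks
print(df.shape)               # number of rows and columns
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows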
Exploratory Data Analysis¶
Individual Feature Exploration¶
Stores¶
Taking a look at the number of stores¶
numStores = len(df["Store"].unique())
print("Number of Stores: " + str(numStores))
Number of Stores: 45
Clearly, we have a sizeable number of stores, though still only a small fraction of Walmart's several thousand US locations.
Taking a look at the number of holidays per store¶
print(df[df["Holiday_Flag"] == 1].groupby("Store")["Holiday_Flag"].count().hist())
Axes(0.125,0.11;0.775x0.77)
It looks like each store has exactly 10 holidays.
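To double-check that claim directly rather than reading it off the histogram, a minimal sketch:
# Count holiday weeks per store and summarize the counts
holidaysPerStore = df[df["Holiday_Flag"] == 1].groupby("Store")["Holiday_Flag"].count()
print(holidaysPerStore.value_counts())  # a single row (e.g. 10 -> 45) would confirm every store has the same count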
Taking a look at the average sales per store¶
# Average sales per store differs
df.groupby("Store")["Weekly_Sales"].mean().hist()
<Axes: >
It seems like the average sales per store varies.
Taking a look at average temperature per store¶
df.groupby("Store")["Temperature"].mean().hist()
<Axes: >
The distribution of average temperature per store is bimodal with a peak around 52 degrees and another peak around 70 degrees.
It seems like the stores are in different regions and/or that the dates for each store span different time frames.
Spoiler alert: when you take a look at average CPI per store it becomes clear that there is a temporal difference between stores. However, the geographic difference cannot be asserted.
Taking a look at the average CPI per store¶
df.groupby("Store")["CPI"].mean().hist()
<Axes: >
This supports the assertion that the stores come from different moments in time because CPI is fixed for each day. Thus, if the stores were all overlapping in time we'd see the same average for every store.
Taking a look at the average unemployment rate per store¶
df.groupby("Store")["Unemployment"].mean().hist()
<Axes: >
This once again supports the assertion that the data from these stores don't come from the same period of time.
However, it is weird that the average unemployment rate per store and the avg CPI per store don't follow a similar distribution. This might be worth exploring.
Summary of findings after exploring the Store column¶
Observed¶
- Stores aren't aligned in time. That is, the dates of their sales data differ.
- Average sales per store differ. Different levels of performance.
- Every store has the same number of holidays.
- There are 45 stores.
Questions¶
- Are the stores in different regions?
- Why is the distribution of average unemployment rates different from the distribution of average CPI?
- Is unemployment rate local or national?
Date¶
df.groupby("Date")["Date"].count().hist()
numUnique = len(df["Date"].unique())
print("The number of unique dates are: " + str(numUnique))
The number of unique dates are: 143
So we have 143 unique dates w/ each date having 45 corresponding instances.
df.groupby(df["Date"].dt.month)["Date"].count().plot(kind="bar")
<Axes: xlabel='Date'>
We do not have the same amount of data for each month of the year. So this, again, supports the assertion that the data for each store is not set in the same time frame.
Now, lets take a look at the amount of data we have for each year.
df.groupby(df["Date"].dt.year)["Date"].count().plot(kind="bar")
<Axes: xlabel='Date'>
We have the most sales data for 2011, followed by 2010, with the least for 2012.
Weekly Sales¶
df["Weekly_Sales"].hist()
mean = df["Weekly_Sales"].mean()
stdDev = df["Weekly_Sales"].std()
print("Mean: " + str(mean))
print("Std Dev: " + str(stdDev))
Mean: 1046964.8775617715 Std Dev: 564366.6220536975
This is a really weird distribution... why is the distribution this way? Does this have to do with how the sales data was collected, with the way Walmart operates its stores, or some other factor we cannot think of?
Holiday Flag¶
# Define the bin edges to align with tick marks
bins = [-0.25, 0.25, 0.75, 1.25]
non_holiday = df[df['Holiday_Flag'] == 0]
holiday = df[df['Holiday_Flag'] == 1]
plt.hist(non_holiday['Holiday_Flag'], bins=bins, color='blue', label='Non-Holiday', alpha=0.5)
plt.hist(holiday['Holiday_Flag'], bins=bins, color='green', label='Holiday', alpha=0.5)
plt.xlabel('Holiday Flag')
plt.ylabel('Frequency')
plt.title('Distribution of Holiday Flag')
plt.legend()
# Customizing x-axis ticks and labels
plt.xticks([0, 0.5, 1], ['Non-Holiday', '', 'Holiday'])
plt.show()
From this analysis, we can see that only a small percentage of weeks are holiday weeks (flagged as 1). Because holiday weeks are so much less frequent than non-holiday weeks, we will take a closer look at the relationship between holidays and sales for stores 1 through 4 further below.
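We can also compute the share of holiday weeks directly; a quick sketch:
# Holiday_Flag is 0/1, so its mean is the fraction of holiday rows
holidayShare = df["Holiday_Flag"].mean()
print(f"Holiday weeks make up {holidayShare:.1%} of all rows")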
Temperature¶
print(df["Temperature"].describe())
df["Temperature"].hist()
count 6435.000000 mean 60.663782 std 18.444933 min -2.060000 25% 47.460000 50% 62.670000 75% 74.940000 max 100.140000 Name: Temperature, dtype: float64
<Axes: >
There is a large variation in temperatures. This could mean one of three things:
- The stores are in different geographic locations.
- The stores are not in different geographic locations, but are all located in a region that sees both temperature extremes over the course of a year.
- Climate change
Fuel Price¶
# View histogram for fuel price
df['Fuel_Price'].hist()
<Axes: >
We have a bimodal distribution for our fuel prices.
# View summary statistics
mean = df['Fuel_Price'].mean()
stdDev = df['Fuel_Price'].std()
print(f"Mean: {mean}")
print(f"Standard Devation: {stdDev}")
Mean: 3.358606837606838 Standard Devation: 0.4590197071928525
In today's dollars, adjusted for inflation, that would be a mean gas price of about $4.70. This unusually high average may affect the generalizability of our ML model and conclusions.
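For transparency, here is roughly how such an inflation adjustment can be computed. The CPI values below are assumed approximations (they are not taken from the dataset), so treat this as a sketch:
# Assumed approximate CPI-U values: ~225 for 2011 (middle of the data) and ~313 for 2024
cpi_2011 = 224.9
cpi_2024 = 313.5
meanFuel = df["Fuel_Price"].mean()
print(f"Mean fuel price in 2024 dollars: ${meanFuel * cpi_2024 / cpi_2011:.2f}")  # roughly $4.70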
Consumer Price Index¶
# View histogram
df['CPI'].hist()
<Axes: >
It looks like we have a really weird, discrete-looking set of values, with a chunk missing in the middle. From our understanding, CPI is a continuous index that shouldn't simply jump around like that.
To dig deeper into why this occurred, let's look at the CPI for each store (if the stores all covered the same time frame, CPI should be roughly the same across them, so a per-store look should help determine the cause).
# Histogram for store 1
df[df['Store'] == 1]['CPI'].hist()
<Axes: >
frame = df[df['Store'] == 1]['CPI']
print(f"Range: [{frame.min()}, {frame.max()}]")
Range: [210.3374261, 223.4442513]
It looks like store 1's CPI is continuous over the rough range [210, 223].
# View histogram of cpi for store 2
df[df['Store'] == 2]['CPI'].hist()
<Axes: >
frame = df[df['Store'] == 2]['CPI']
print(f"Range: [{frame.min()}, {frame.max()}]")
Range: [209.9984585, 223.0783366]
Store 2's CPI values seem to be continuous in a very similar range as store 1.
# View histogram of CPI's for store 3
df[df['Store'] == 3]['CPI'].hist()
<Axes: >
frame = df[df['Store'] == 3]['CPI']
print(f"Range: [{frame.min()}, {frame.max()}]")
Range: [213.6196139, 226.9873637]
Again, a very similar range of CPI values. So stores 1, 2, and 3 all have similar CPI ranges; let's check whether store 4 follows the same pattern or is the outlier in the CPI data.
# View histogram of CPI values for store 4
df[df['Store'] == 4]['CPI'].hist()
<Axes: >
frame = df[df['Store'] == 4]['CPI']
print(f"Range: [{frame.min()}, {frame.max()}]")
Range: [126.064, 131.1930968]
As suspected, store 4 is completely misaligned with the other stores in terms of its CPI values, hence the huge gap in the histogram of ALL CPI values.
Unemployment¶
df['Unemployment'].hist()
print("Avg unemployment rate: " + str(df["Unemployment"].mean()))
Avg unemployment rate: 7.99915104895105
It looks like we have relatively high unemployment rates throughout most of our data. This is an important attribute to note as it may affect the generalizability of our ML Model and/or conclusions.
Feature vs Sales¶
Now, let's take a look at each feature individually and how they affect sales.
Tempature vs Sales¶
We want to use a hypothesis test to figure out whether temperature affects sales.
Null Hypothesis: The average temperature of the area has no effect on the sales.
Alternative Hypothesis: The average temperature of the area does have an effect on the sales.
Assume that we have a significance level of 0.05.
First, let's plot the stores, with their respective average temperature and weekly sales. We'll use a scatter plot, with each dot being a store from the set.
store_avgs = df.groupby("Store").mean("Temperature")
temp = store_avgs["Temperature"]
sales = store_avgs["Weekly_Sales"]
ts = np.vstack([temp,sales])
z = sci.stats.gaussian_kde(ts)(ts)
fig, plot = plt.subplots()
plt.title("Temperature vs. Weekly Sales")
plt.xlabel("Average Temperature")
plt.ylabel("Weekly Sales (in million USD)")
plot.scatter(temp, sales, c=z, s=100)
plt.show()
We can see that the stores are scattered around pretty evenly, with small clusters around the average temperature of 45-55 and 65-75 degree range.
Let's check the correlation between the two variables.
result = sci.stats.pearsonr(store_avgs["Temperature"], store_avgs["Weekly_Sales"], alternative="two-sided")
print(result)
PearsonRResult(statistic=-0.07638774908646385, pvalue=0.6179663423198961)
Notice that the Pearson correlation coefficient is -0.076, which is slightly negative, meaning that as the average temperature increases, weekly sales tend to decrease slightly. However, since the p-value of 0.617 is larger than our alpha of 0.05, we fail to reject the null hypothesis. In other words, we cannot conclude that the average temperature has an effect on weekly sales.
Fuel Price vs Sales¶
We can also use a hypothesis test to check if fuel prices have an effect on sales.
Null Hypothesis: Fuel Prices do not have any effect on weekly sales.
Alternative Hypothesis: Fuel prices do have an effect on weekly sales.
Assume the significance level as 0.05.
Let's first create a scatter plot graphing the correlation between fuel price and weekly sales.
store_avgs = df.groupby("Store").mean("Fuel_Price")
temp = store_avgs["Fuel_Price"]
sales = store_avgs["Weekly_Sales"]
ts = np.vstack([temp,sales])
z = sci.stats.gaussian_kde(ts)(ts)
fig, plot = plt.subplots()
plt.title("Fuel Price vs. Weekly Sales")
plt.xlabel("Average Fuel Price")
plt.ylabel("Weekly Sales (in million USD)")
plot.scatter(temp, sales, c=z, s=100)
plt.show()
The fuel prices seem divided into three clusters, centered roughly around 3.20, 3.45, and 3.60 USD per gallon. Within those three clusters, the weekly sales seem evenly spread out. On inspection, there appears to be a very, very slight downward trend as the fuel price goes up.
Let's run a pearson correlation coefficient test to find out the correlation:
result = sci.stats.pearsonr(store_avgs["Fuel_Price"], store_avgs["Weekly_Sales"], alternative="two-sided")
print(result)
PearsonRResult(statistic=0.06773422508957694, pvalue=0.6584189067151822)
Notice that the correlation coefficient is 0.068, which is slightly positive: as fuel prices go up, weekly sales increase very slightly. However, the effect is far too small to be meaningful.
Since our p-value of 0.66 is much higher than our alpha of 0.05, we fail to reject the null hypothesis. In other words, we cannot conclude that fuel price has an effect on weekly sales.
CPI vs Sales¶
We want to use a hypothesis test to figure out whether CPI affects sales.
Null Hypothesis: The average CPI has no effect on sales.
Alternative Hypothesis: The average CPI does have an effect on sales.
Assume that we have a significance level of 0.05.
First, let's plot the stores, with their respective average CPI and weekly sales. We'll use a scatter plot, with each dot being a store from the set.
store_avgs = df.groupby("Store").mean("CPI")
temp = store_avgs["CPI"]
sales = store_avgs["Weekly_Sales"]
ts = np.vstack([temp,sales])
z = sci.stats.gaussian_kde(ts)(ts)
fig, plot = plt.subplots()
plt.title("CPI vs. Weekly Sales")
plt.xlabel("CPI Temperature")
plt.ylabel("Weekly Sales (in million USD)")
plot.scatter(temp, sales, c=z, s=100)
plt.show()
We observe that the stores are divided into two groups: some stores cluster around 140 CPI on the left-hand side, while others cluster around 200 CPI on the right-hand side. Interestingly, there appears to be an even distribution of revenue among these points, indicating comparable sales performance across stores despite differences in CPI.
result = sci.stats.pearsonr(store_avgs["CPI"], store_avgs["Weekly_Sales"], alternative="two-sided")
print(result)
PearsonRResult(statistic=-0.07656886071136423, pvalue=0.6171309899200195)
Notice that the Pearson correlation coefficient is -0.0765, indicating a slight negative correlation: as the average CPI increases, weekly sales tend to decrease slightly. However, since the p-value of 0.617 is larger than our alpha of 0.05, we fail to reject the null hypothesis. In other words, we cannot conclude that the average CPI has a significant effect on weekly sales. This is supported by the scatter plot, where no noticeable trend in weekly sales is observed as CPI increases or decreases.
Unemployment vs Sales¶
We want to now test if unemployment rates have an effect on the stores' average weekly sales.
Null Hypothesis: The unemployment rate has no effect on the stores' average weekly sales.
Alternative Hypothesis: The unemployment rate does have an effect on the stores' average weekly sales.
Assume that our significance level is 0.05.
We can graph a scatter plot of each store:
temp = store_avgs["Unemployment"]
sales = store_avgs["Weekly_Sales"]
ts = np.vstack([temp,sales])
z = sci.stats.gaussian_kde(ts)(ts)
fig, plot = plt.subplots()
plt.title("Unemployment vs. Weekly Sales")
plt.xlabel("Average Unemployment Rate")
plt.ylabel("Weekly Sales (in million USD)")
plot.scatter(temp, sales, c=z, s=100)
plt.show()
We can see that most of the stores' average unemployment rates are between 6-9 percent, with sales being scattered pretty evenly across the board.
Let's now check the correlation between the two variables:
result = sci.stats.pearsonr(store_avgs["Unemployment"], store_avgs["Weekly_Sales"], alternative="two-sided")
print(result)
PearsonRResult(statistic=-0.11228079769921509, pvalue=0.4627451284572275)
Notice that the correlation coefficient is -0.112, which is slightly negative, meaning that as unemployment rates go up, weekly sales go slightly down. Since our p-value of 0.46 is greater than 0.05, we fail to reject the null hypothesis. In other words, we cannot conclude that the unemployment rate has an effect on weekly sales.
Holiday vs Sales¶
We will take a look at the relationship between holidays and sales for stores 1 through 4 since, as we saw from the Holiday_Flag distribution, holiday weeks are far less frequent than non-holiday weeks.
# Store 1, non-holiday weeks
non_holiday_1 = df[(df['Store'] == 1) & (df['Holiday_Flag'] == 0)].iloc[:140]
# Store 1, holiday weeks
holiday_1 = df[(df['Store'] == 1) & (df['Holiday_Flag'] == 1)].iloc[:140]
# Plot sales data for Store 1 during non-holiday weeks
plt.plot(non_holiday_1.index, non_holiday_1['Weekly_Sales'], color='blue', label='Non-Holiday Weeks')
# Plot sales data for Store 1 during holiday weeks
plt.plot(holiday_1.index, holiday_1['Weekly_Sales'], color='red', label='Holiday Weeks')
plt.xlabel('Week')
plt.ylabel('Weekly Sales (In Millions)')
plt.title('Weekly Sales for Store 1')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The contrast between the sales data during non-holiday and holiday weeks highlights the impact of holidays on consumer behavior and sales activity for Store 1, providing valuable insight for strategic decision-making in retail management.
# Store 2, non-holiday weeks
non_holiday_2 = df[(df['Store'] == 2) & (df['Holiday_Flag'] == 0)].iloc[:140]
# Store 2, holiday weeks
holiday_2 = df[(df['Store'] == 2) & (df['Holiday_Flag'] == 1)].iloc[:140]
# Plot sales data for Store 2 during non-holiday weeks
plt.plot(non_holiday_2.index, non_holiday_2['Weekly_Sales'], color='blue', label='Non-Holiday Weeks')
# Plot sales data for Store 2 during holiday weeks
plt.plot(holiday_2.index, holiday_2['Weekly_Sales'], color='red', label='Holiday Weeks')
plt.xlabel('Week')
plt.ylabel('Weekly Sales (In Millions) ')
plt.title('Weekly Sales for Store 2')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
For the second store, we notice that the weekly sales amounts are significantly higher than those of Store 1.
# Store 3, non-holiday weeks
non_holiday_3 = df[(df['Store'] == 3) & (df['Holiday_Flag'] == 0)].iloc[:140]
# Store 3, holiday weeks
holiday_3 = df[(df['Store'] == 3) & (df['Holiday_Flag'] == 1)].iloc[:140]
# Plot sales data for Store 3 during non-holiday weeks
plt.plot(non_holiday_3.index, non_holiday_3['Weekly_Sales'], color='blue', label='Non-Holiday Weeks')
# Plot sales data for Store 3 during holiday weeks
plt.plot(holiday_3.index, holiday_3['Weekly_Sales'], color='red', label='Holiday Weeks')
plt.xlabel('Week')
plt.ylabel('Weekly Sales')
plt.title('Weekly Sales for Store 3')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Store 3 consistently reports among the lowest weekly sales figures of the stores examined. Despite occasional spikes during holidays, its weekly sales (topping out around 550k) fall significantly short of Stores 1 and 2, whose strongest weeks reach well into the millions.
# Store 4 with non-Holidays
non_holiday_4 = df[(df['Store'] == 4) & (df['Holiday_Flag'] == 0)].iloc[:140]
# Store 4 with Holidays
holiday__4 = df[(df['Store'] == 4) & (df['Holiday_Flag'] == 1)].iloc[:140]
plt.plot(non_holiday_4.index, non_holiday_4['Weekly_Sales'], color='blue', label='Non-Holiday Weeks')
plt.plot(holiday__4.index, holiday__4['Weekly_Sales'], color='red', label='Holiday Weeks')
plt.xlabel('Week')
plt.ylabel('Weekly Sales(In Millions)')
plt.title('Weekly Sales for Store 4')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Store 4 demonstrates consistent sales performance akin to that of Store 1 and Store 2. This observation suggests that Stores 1, 2, and 4 likely cater to a comparable population size and exhibit similar consumer foot traffic patterns, in contrast to Store 3.
Thus, from our analysis of stores 1 through 4, we see a similar spike in sales during holiday weeks, as shown in the visualizations. This indicates that these stores bring in higher revenue and see more consumers during holiday weeks, so they should continue to prioritize those weeks, since that is when they generate the greatest revenue.
Moreover, to confirm this, we can create a hypothesis test.
Null Hypothesis: Holiday weeks don't have an effect on weekly sales.
Alternative Hypothesis: Holiday weeks do have an effect on weekly sales.
Assume our alpha level is 0.05.
We can use a t test to determine if the means of sales during holiday weeks vs. non-holiday weeks are different:
holiday_diff = df.groupby("Holiday_Flag").mean("Weekly_Sales")
fig, ax = plt.subplots()
divisions = ["Non-Holiday", "Holiday"]
counts = [holiday_diff["Weekly_Sales"][0], holiday_diff["Weekly_Sales"][1]]
ax.bar(divisions, counts, color=["tab:red", "tab:orange"])
ax.set_ylabel("Weekly Sales (in million USD)")
ax.set_title("Non-Holiday & Holiday vs. Weekly Sales")
plt.show()
Now, let's run the t test to see if the differences in results are statistically significant:
holiday = df[df["Holiday_Flag"] == 1]["Weekly_Sales"]
no_holiday = df[df["Holiday_Flag"] == 0]["Weekly_Sales"]
result = sci.stats.ttest_ind(holiday, no_holiday)
print(result)
Ttest_indResult(statistic=2.9608919093259036, pvalue=0.003078699263818616)
Notice that our p value is 0.003, which is way lower than the alpha level of 0.05. Through this, we can reject the null hypothesis, meaning that holiday weeks do have an effect on weekly sales! We can also see the differences in the weekly sales for holidays vs. non-holiday weeks in the bar graph above.
Month vs. Sales¶
We can use a hypothesis test to figure out if the month affects the weekly sales per store. Note how it wouldn't make sense if we combined stores' weekly sales for each month, because some stores do way better in general than others. If we did do that, then we wouldn't be able to see the differences in month for a particular store.
Null Hypothesis: The month of the year does not have an effect on weekly sales.
Alternative Hypothesis: The month of the year does have an effect on weekly sales.
Assume a significance level of 0.05.
First of all, let's draw a graph noting the average weekly sales for each month for Store 1:
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
def determine_month(date):
return months[date.month-1]
df["Month"] = df["Date"].apply(determine_month)
def avg(lst):
return sum(lst) / len(lst)
fig, ax = plt.subplots()
list_of_months = [[],[],[],[],[],[],[],[],[],[],[],[]]
for index in range(len(months)):
curr_month = []
for ind, item in df[(df["Month"] == months[index]) & (df["Store"] == 1)].iterrows():
curr_month.append(float(item["Weekly_Sales"]))
list_of_months[index] = avg(curr_month)
divisions = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
counts = list_of_months
ax.bar(divisions, counts, color=["tab:red"])
ax.set_ylabel("Weekly Sales (in million USD)")
ax.set_title("Month vs. Weekly Sales for Store 1")
plt.show()
Note that Store 1 seems to be doing way better in sales during the months of November, December, and February. For every other month, it seems to be doing roughly the same numbers (around 1.5 million USD).
Now, let's conduct ANOVA tests for every store.
Note that we want to see if there exists a difference in the means of weekly sales for each month for each store:
p_value_results = []
def anova_test(store):
january = []
february = []
march = []
april = []
may = []
june = []
july = []
august = []
september = []
october = []
november = []
december = []
list_of_months = [january, february, march, april, may, june, july, august, september, october, november, december]
for index in range(len(months)):
for ind, item in df[(df["Month"] == months[index]) & (df["Store"] == store)].iterrows():
list_of_months[index].append(float(item["Weekly_Sales"]))
result = sci.stats.f_oneway(january, february, march, april, may, june, july, august, september, october, november, december)
print("Store " + str(store) + " test pvalue: " + str(result.pvalue))
p_value_results.append(result.pvalue)
for num in range(1, 46):
anova_test(num)
Store 1 test pvalue: 2.7724825133205463e-07 Store 2 test pvalue: 4.330415081536038e-12 Store 3 test pvalue: 2.3031287514697588e-11 Store 4 test pvalue: 5.044527323318062e-08 Store 5 test pvalue: 3.281519930519039e-08 Store 6 test pvalue: 3.475224266634057e-14 Store 7 test pvalue: 2.2422407017372865e-16 Store 8 test pvalue: 8.009722452114776e-11 Store 9 test pvalue: 2.4807011905056473e-09 Store 10 test pvalue: 1.5526974366305664e-16 Store 11 test pvalue: 1.628313504395449e-12 Store 12 test pvalue: 1.6336003498407762e-12 Store 13 test pvalue: 8.359846513589073e-11 Store 14 test pvalue: 3.9656955165561375e-08 Store 15 test pvalue: 4.640380261674347e-13 Store 16 test pvalue: 1.2039358996831871e-18 Store 17 test pvalue: 0.00011345095945005678 Store 18 test pvalue: 2.3721149491082532e-11 Store 19 test pvalue: 1.0812412446782531e-10 Store 20 test pvalue: 7.210105978784641e-13 Store 21 test pvalue: 1.7292012536913808e-12 Store 22 test pvalue: 9.50619842442727e-14 Store 23 test pvalue: 7.742627133025905e-15 Store 24 test pvalue: 5.094605594267623e-10 Store 25 test pvalue: 3.14810055322353e-18 Store 26 test pvalue: 7.469303673990814e-11 Store 27 test pvalue: 1.538364002585109e-09 Store 28 test pvalue: 2.0184361415396076e-06 Store 29 test pvalue: 5.697335574183187e-12 Store 30 test pvalue: 0.00986728581291941 Store 31 test pvalue: 9.014256351204674e-07 Store 32 test pvalue: 7.904159131772652e-10 Store 33 test pvalue: 0.0014863778723760027 Store 34 test pvalue: 2.391261801278534e-09 Store 35 test pvalue: 0.0002707253217012239 Store 36 test pvalue: 0.555865376310303 Store 37 test pvalue: 0.00016745376727450595 Store 38 test pvalue: 0.9958102761851833 Store 39 test pvalue: 2.099085865751722e-08 Store 40 test pvalue: 4.913845032367567e-10 Store 41 test pvalue: 7.995742455192548e-09 Store 42 test pvalue: 0.6693766507442821 Store 43 test pvalue: 0.40241402241743385 Store 44 test pvalue: 0.3549543354932103 Store 45 test pvalue: 1.8172742505154496e-13
Some stores (Stores 36, 38, 42, 43, and 44) exhibit p-values above our significance level, ranging as high as roughly 0.99, so for those stores we fail to reject the null hypothesis. However, most stores exhibit very low p-values, far below our significance level, meaning we reject the null hypothesis for them. Let's count how many stores had a result below 0.05 and how many had a result above it:
num_less = 0
num_more = 0
for value in p_value_results:
if value <= 0.05:
num_less = num_less + 1
elif value > 0.05:
num_more = num_more + 1
print("Number of stores with p values less than 0.05: " + str(num_less))
print("Number of stores with p values more than 0.05: " + str(num_more))
Number of stores with p values less than 0.05: 40 Number of stores with p values more than 0.05: 5
We can conclude that for the most part, the months of the year do affect the average sales per store.
Primary Analysis (Model Exploration) and Visualization¶
We will now bring in some ML models to predict weekly sales given all the other features available. We will be taking a causal approach where we assume that the features in a given row can be utilized to predict the weekly sales for that week.
Standardizing Features¶
First, we will standardize our features to ensure that differences in scale across columns do not artificially inflate the importance of any one feature.
# Create a copy of the original df (incase we want the original to do a comparison)
standardizedDf = df.copy()
standardizedDf
from scipy.stats import zscore
standardizedDf['Temperature'] = zscore(standardizedDf['Temperature'])
standardizedDf['Fuel_Price'] = zscore(standardizedDf['Fuel_Price'])
standardizedDf['CPI'] = zscore(standardizedDf['CPI'])
standardizedDf['Unemployment'] = zscore(standardizedDf['Unemployment'])
# What do we do with dates?
# If we use them, how do we handle them?
# We could do categorical columns for Year, Month, and Week
# If we choose not to use them:
# Cool, fine, lets see how it performs :)
standardizedDf
Index | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment | Month |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2010-02-05 | 1643690.90 | 0 | -0.995136 | -1.713800 | 1.004175 | 0.056964 | February |
1 | 1 | 2010-02-12 | 1641957.44 | 1 | -1.201170 | -1.766089 | 1.007880 | 0.056964 | February |
2 | 1 | 2010-02-19 | 1611968.17 | 0 | -1.124178 | -1.840166 | 1.009074 | 0.056964 | February |
3 | 1 | 2010-02-26 | 1409727.59 | 0 | -0.760907 | -1.737766 | 1.009849 | 0.056964 | February |
4 | 1 | 2010-03-05 | 1554806.68 | 0 | -0.767955 | -1.598328 | 1.010624 | 0.056964 | March |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6430 | 45 | 2012-09-28 | 713173.95 | 0 | 0.228602 | 1.390883 | 0.519270 | 0.365109 | September |
6431 | 45 | 2012-10-05 | 733455.07 | 0 | 0.229144 | 1.364738 | 0.523256 | 0.356046 | October |
6432 | 45 | 2012-10-12 | 734464.36 | 0 | -0.335825 | 1.397419 | 0.527241 | 0.356046 | October |
6433 | 45 | 2012-10-19 | 718125.53 | 0 | -0.227385 | 1.329879 | 0.527332 | 0.356046 | October |
6434 | 45 | 2012-10-26 | 760281.43 | 0 | -0.098343 | 1.140330 | 0.526775 | 0.356046 | October |
6435 rows × 9 columns
Converting Date Into Ordinal¶
After standardizing our data we will be converting our date into an integer so that our model can use it to predict the weekly sales.
standardizedDf['Date']=standardizedDf['Date'].map(dt.datetime.toordinal)
# Also, remove month column
standardizedDf = standardizedDf.drop('Month', axis=1)
standardizedDf
Index | Store | Date | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI | Unemployment |
---|---|---|---|---|---|---|---|---|
0 | 1 | 733808 | 1643690.90 | 0 | -0.995136 | -1.713800 | 1.004175 | 0.056964 |
1 | 1 | 733815 | 1641957.44 | 1 | -1.201170 | -1.766089 | 1.007880 | 0.056964 |
2 | 1 | 733822 | 1611968.17 | 0 | -1.124178 | -1.840166 | 1.009074 | 0.056964 |
3 | 1 | 733829 | 1409727.59 | 0 | -0.760907 | -1.737766 | 1.009849 | 0.056964 |
4 | 1 | 733836 | 1554806.68 | 0 | -0.767955 | -1.598328 | 1.010624 | 0.056964 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
6430 | 45 | 734774 | 713173.95 | 0 | 0.228602 | 1.390883 | 0.519270 | 0.365109 |
6431 | 45 | 734781 | 733455.07 | 0 | 0.229144 | 1.364738 | 0.523256 | 0.356046 |
6432 | 45 | 734788 | 734464.36 | 0 | -0.335825 | 1.397419 | 0.527241 | 0.356046 |
6433 | 45 | 734795 | 718125.53 | 0 | -0.227385 | 1.329879 | 0.527332 | 0.356046 |
6434 | 45 | 734802 | 760281.43 | 0 | -0.098343 | 1.140330 | 0.526775 | 0.356046 |
6435 rows × 8 columns
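For reference, toordinal maps each date to the number of days since January 1 of year 1 (the proleptic Gregorian ordinal), which is why the Date column now holds values like 733808:
# Sanity check of the ordinal conversion
print(dt.date(2010, 2, 5).toordinal())  # 733808, matching the first row above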
Comparing Linear Regression, Decision Tree, and Random Forest¶
Now let's take some time to compare our options. We will start with some of the most popular models for regression, including:
- Linear Regression
- Decision Tree
- Random Forest
After we explore each of these options, we can make a decision to either stick with one of them or explore another set of models that may work well for our data!
We will be using a k-fold validation technique that splits the training set into k subsets and cycles through utilizing each subset as the evaluation set once. During this cycle, we will keep track of the R^2 value and take the average to determine on average how well each model explains the variance observed in the data.
This takes the focus off of "getting lucky" with the training data and makes our findings more robust and repeatable.
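To make the cycling concrete, here is a tiny illustration of how KFold hands out folds (toy indices, separate from our data):
# Each of the 5 folds takes a turn as the validation set
toy_X = np.arange(10).reshape(-1, 1)
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True, random_state=42).split(toy_X)):
    print(f"Fold {fold}: train rows {train_idx}, validation rows {val_idx}")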
# Consider setting aside train, validate, test
# Here, we could show training data set accuracy vs validation set accuracy as we go through training epochs
# This could help us show overfitting vs underfitting
# Define X and y
y = standardizedDf['Weekly_Sales']
X = standardizedDf.drop('Weekly_Sales', axis=1)
# Get train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Perform k-fold Cross-Validation for each model
k_folds = 5
skf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
# Compare a Decision Tree, a Random Forest, and polynomial Linear Regression of increasing degree
for k in range(-1, 8):
if k == -1:
model = DecisionTreeRegressor()
score = cross_val_score(model, X_train, y_train, cv=skf, scoring='r2')
print("Decision Tree Regressor R^2 Val: " + str(score.mean()))
print("-----------------------------------------------------")
continue
if k == 0:
model = RandomForestRegressor()
score = cross_val_score(model, X_train, y_train, cv=skf, scoring='r2')
print("Random Forest Regressor R^2 Val: " + str(score.mean()))
print("-----------------------------------------------------")
continue
model = make_pipeline(PolynomialFeatures(k), LinearRegression())
score = cross_val_score(model, X_train, y_train, cv=skf, scoring='r2')
print("Linear Regressor of degree " + str(k) + " R^2 Val: " + str(score.mean()))
print("-----------------------------------------------------")
Decision Tree Regressor R^2 Val: 0.8778749895542678 ----------------------------------------------------- Random Forest Regressor R^2 Val: 0.9304599256542752 ----------------------------------------------------- Linear Regressor of degree 1 R^2 Val: 0.13397245632480764 ----------------------------------------------------- Linear Regressor of degree 2 R^2 Val: 0.25047806329282685 ----------------------------------------------------- Linear Regressor of degree 3 R^2 Val: 0.3371864409081696 ----------------------------------------------------- Linear Regressor of degree 4 R^2 Val: 0.3378576496910995 ----------------------------------------------------- Linear Regressor of degree 5 R^2 Val: 0.3377816180759822 ----------------------------------------------------- Linear Regressor of degree 6 R^2 Val: 0.32211118760617324 ----------------------------------------------------- Linear Regressor of degree 7 R^2 Val: 0.28173126974169926 -----------------------------------------------------
It looks like Random Forest Regressor is giving us the best performance so we will be utilizing that model to perform our final predictions!
Final Model Selection: RandomForestRegressor¶
Why?
- Far better R2 score!
- It seems to be doing a far better job at capturing non-linearity in the data
Optimizing Hyperparameters¶
Now that we have decided which model we want to use, we should take some time to optimize some of the hyperparameters using a validation data set. The validation data set will be a portion of the dataset that we use to adjust hyperparameters after training.
The hyperparameters we will be looking at are the following:
- Depth Limit: This applies a height limit to each tree in the ensemble model, which can help us avoid overfitting.
- The # of Trees: This hyperparameter - as the name suggests - modifies the number of trees in our model. More trees could possibly offer improved performance, but at the cost of additional computation.
Depth Limit¶
Let's take a look at how our validation and training loss change as we modify the depth. We will be using MSE, the mean squared error, to quantify our loss: the average of the squared differences between our model's predictions and the expected outputs from our labeled data.
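As a quick toy illustration of what MSE measures (made-up values, not from our dataset):
# MSE computed by hand vs. sklearn's helper, on hypothetical numbers
toy_true = np.array([1.0, 2.0, 3.0])
toy_pred = np.array([1.5, 1.5, 3.5])
print(np.mean((toy_true - toy_pred) ** 2))     # 0.25
print(mean_squared_error(toy_true, toy_pred))  # same value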
# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Split train further to get validation set
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=0.20, random_state=42)
trainScores = []
validationScores = []
depths = []
for depth in tqdm(range(1, 20)):
model = RandomForestRegressor(max_depth= depth)
model.fit(X_train, y_train)
trainScores.append(mean_squared_error(y_train,model.predict(X_train)))
validationScores.append(mean_squared_error(y_validation, model.predict(X_validation)))
depths.append(depth)
tLossSmooothed = gaussian_filter1d(trainScores, sigma=2)
vLossSmooothed = gaussian_filter1d(validationScores, sigma=2)
plt.plot(depths, tLossSmooothed, label='Training loss')
plt.plot(depths, vLossSmooothed, label='Validation loss')
plt.title(label="Smoothed MSE loss")
plt.xlabel(xlabel="depth")
plt.legend()
plt.show()
plt.plot(depths, trainScores, label='Training MSE loss')
plt.plot(depths, validationScores, label='Validation MSE loss')
plt.title(label="MSE loss")
plt.xlabel(xlabel="depth")
plt.legend()
plt.show()
100%|██████████| 19/19 [00:25<00:00, 1.34s/it]
It seems like the optimal depth limit is around 8 to 12!
Number of Decision Trees¶
Let's do the same for the number of trees in our model. That is, let's see how the loss changes as we increase the number of trees and try to pinpoint the optimal count to use for our model.
# Fixing depth to 12
depth = 12
trainScores = []
validationScores = []
trees = []
for numTrees in tqdm(range(80, 500, 5)):
model = RandomForestRegressor(max_depth= depth, n_estimators=numTrees)
model.fit(X_train, y_train)
trainScores.append(mean_squared_error(y_train,model.predict(X_train)))
validationScores.append(mean_squared_error(y_validation, model.predict(X_validation)))
trees.append(numTrees)
tLossSmooothed = gaussian_filter1d(trainScores, sigma=2)
vLossSmooothed = gaussian_filter1d(validationScores, sigma=2)
plt.plot(trees, tLossSmooothed, label='Training loss')
plt.plot(trees, vLossSmooothed, label='Validation loss')
plt.title(label="Smoothed MSE loss")
plt.xlabel(xlabel="number of trees")
plt.legend()
plt.show()
plt.plot(trees, trainScores, label='Training MSE loss')
plt.plot(trees, validationScores, label='Validation MSE loss')
plt.title(label="MSE loss")
plt.xlabel(xlabel="number of trees")
plt.legend()
plt.show()
100%|██████████| 84/84 [06:14<00:00, 4.46s/it]
It seems like increasing the number of decision trees doesn't have an appreciable effect on the loss. Hence, we will be using the default number of trees, 100, for our model.
Training the Random Forest Model¶
Let's go ahead and train a few RandomForestRegressors on our training data and evaluate them on our test data. We will try a few different depth options and dig deeper into their performance on the test set.
# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# The best depth found above appears to be between 8 and 12.
# Digging down to find the best depth
for i in range(8, 13):
# Create the model with i depth and default number of trees
model = RandomForestRegressor(max_depth=i)
model.fit(X_train, y_train) # Train model on the initial training set
# Predict the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_test - y_pred))
r2 = r2_score(y_test, y_pred)
print("STATISTICS FOR DEPTH %i" % i)
print("-----------------------------------------")
print(f"Test MSE: {mse}")
print(f"Test sqrt of mse: {math.sqrt(mse)}")
print(f"Test RMSE: {rmse}")
print(f"Test MAE: {mae}")
print(f"Test R² score: {r2}")
# Visualization of Actual vs Predicted
plt.figure()
plt.scatter(y_test, y_pred, alpha=0.3, color='blue') # alpha for transparency
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4) # Ideal line
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted at Depth %i' % i)
plt.show()
# Residuals plot
residuals = y_test - y_pred
plt.figure()
plt.scatter(y_pred, residuals, alpha=0.3, color='red')
plt.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), colors='black', linestyles='dashed')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residuals Plot at Depth %i' % i)
plt.show()
STATISTICS FOR DEPTH 8 ----------------------------------------- Test MSE: 29778496329.666405 Test sqrt of mse: 172564.470067469 Test RMSE: 172564.470067469 Test MAE: 97833.58436878757 Test R² score: 0.9068608891772729
STATISTICS FOR DEPTH 9 ----------------------------------------- Test MSE: 24801365010.294613 Test sqrt of mse: 157484.49133262175 Test RMSE: 157484.49133262175 Test MAE: 87091.17147131398 Test R² score: 0.9224280145419078
STATISTICS FOR DEPTH 10 ----------------------------------------- Test MSE: 22152826791.83312 Test sqrt of mse: 148838.25715128862 Test RMSE: 148838.25715128862 Test MAE: 80184.27408842757 Test R² score: 0.9307119282733665
STATISTICS FOR DEPTH 11 ----------------------------------------- Test MSE: 21431918678.967293 Test sqrt of mse: 146396.44353250967 Test RMSE: 146396.44353250967 Test MAE: 78543.61870800186 Test R² score: 0.9329667345561914
STATISTICS FOR DEPTH 12 ----------------------------------------- Test MSE: 20968460038.47253 Test sqrt of mse: 144804.9033647429 Test RMSE: 144804.9033647429 Test MAE: 77155.06833436356 Test R² score: 0.9344163082754591
Notice that the plots look quite similar across depths, but depth 12 has a better R2 score than depth 8 (0.907 vs. 0.934)!
Evaluating Our Initial Model¶
model = RandomForestRegressor(max_depth=12)
model.fit(X_train, y_train) # Train model on the initial training set
t_sizes, t_scores, test_scores = learning_curve(model, X_train, y_train)
display = LearningCurveDisplay(train_sizes=t_sizes, train_scores=t_scores, test_scores=test_scores, score_name = "R2 Score")
display.plot()
plt.title("R2 Score vs. Number of samples in training set")
plt.show()
As expected, the R2 score of our model increases drastically once the training set has more than about 1,000 samples, with the validation (test) score climbing to roughly 0.93.
For our training set, the peak R2 score appears to be around 0.97 - 0.98.
# Checking out the most 'important' features in predicting weekly sales
feature_importances = model.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
feature_importance_df.head(10)
Index | Feature | Importance |
---|---|---|
0 | Store | 0.660073 |
5 | CPI | 0.160227 |
6 | Unemployment | 0.118357 |
3 | Temperature | 0.025772 |
4 | Fuel_Price | 0.016556 |
1 | Date | 0.015550 |
2 | Holiday_Flag | 0.003467 |
In this case, the top 3 most impactful features are the Store, CPI, and unemployment.
The store itself influences sales by reflecting differences in demographics, location, and what products local customers purchase; changes in the Consumer Price Index affect purchasing power; and unemployment squeezes consumer budgets, reducing discretionary spending. Together, these factors shape weekly sales for the company.
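If a visual is preferred, the same importances can be plotted as a quick bar chart (a minimal sketch using the DataFrame built above):
feature_importance_df.plot(kind="bar", x="Feature", y="Importance", legend=False)
plt.ylabel("Importance")
plt.title("Random Forest Feature Importances")
plt.tight_layout()
plt.show()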
Insights and Conclusions¶
It is clear that our model is able to predict weekly sales values fairly accurately by relying on the causal relationship between the features and the output label. This is exemplified by our high R^2 value and our relatively low mean error compared to the mean and standard deviation of weekly sales!
We were able to identify the most important features that helped our RandomForestRegressor deduce fairly accurate predictions.
The most important feature was the store number. This generally makes sense: you expect each store to have reproducible sales numbers, seeing that Walmart is in the mature stage of the business life cycle. Moreover, this feature captures the fluctuations between stores that may have vastly different locations, competitors, distribution strategies, or any other factor that could affect the sales data for a specific store.
The second most important feature was CPI. This makes sense as the consumer price index is the most well-known indicator of inflation. Inflation is an extremely relevant construct when it comes to sales. It is one of the many ways to check the status of the economy, and it influences how people spend their money on certain products (When inflation is high, some products might not seem as worth it to purchase compared to others).
As a leader in supply chain management, Walmart heavily relies on accurate forecasting to drive its basis of competition. Walmart, as the slogan suggests, competes on low cost, and in order to achieve these low costs they must be able to reduce their inventory on hand. This is done by improving forecasting accuracy. Higher forecasting accuracy reduces the need for safety stock and thus makes their supply chain more competitive as less stock means cost reductions for all the distribution partners in the network. Accurate forecasting has huge implications for supply chain management which can be considered the primary backbone for any business in today's globalized & high volume business environment.