import pandas as pd
import plotly.express as px
# stats
from scipy.stats import levene
from scipy.stats import ttest_ind
from scipy.stats import kruskal
import scikit_posthocs as sp
Ice Games - Global Sales Exploratory Data Analysis
Introduction
Exploratory data analysis for the purpose of determining profitable 2017 marketing strategy
The purpose of this study is to determine a profitable marketing strategy based on the provided data. As of December 2016, this data will be leveraged to determine significant time ranges for model building, profitable platform categories, and regionally targeted campaign strategies.
Data Preprocessing
Loading the Games Dataset
games = pd.read_csv('../datasets/games.csv')

display(games.head())
games.info()
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006.0 | Sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8 | E |
1 | Super Mario Bros. | NES | 1985.0 | Platform | 29.08 | 3.58 | 6.81 | 0.77 | NaN | NaN | NaN |
2 | Mario Kart Wii | Wii | 2008.0 | Racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E |
3 | Wii Sports Resort | Wii | 2009.0 | Sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8 | E |
4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 11.27 | 8.89 | 10.22 | 1.00 | NaN | NaN | NaN |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 16713 non-null object
1 Platform 16715 non-null object
2 Year_of_Release 16446 non-null float64
3 Genre 16713 non-null object
4 NA_sales 16715 non-null float64
5 EU_sales 16715 non-null float64
6 JP_sales 16715 non-null float64
7 Other_sales 16715 non-null float64
8 Critic_Score 8137 non-null float64
9 User_Score 10014 non-null object
10 Rating 9949 non-null object
dtypes: float64(6), object(5)
memory usage: 1.4+ MB
Columns Breakdown:
Name (object): The name of the game (e.g., Wii Sports, Super Mario Bros.). This is a string data type.
Platform (object): The platform on which the game was released (e.g., Wii, NES, DS). This is also a string data type.
Year_of_Release (float64): The year the game was released. Some rows have missing values (NaN), which forces the column to be stored as float rather than integer.
Genre (object): The genre of the game (e.g., Sports, Platform, Racing). This is a string data type.
NA_sales (float64): Total sales of the game in North America (in million USD). It’s a numeric column.
EU_sales (float64): Total sales of the game in Europe (in million USD). This is also numeric.
JP_sales (float64): Total sales of the game in Japan (in million USD). Numeric with no missing values.
Other_sales (float64): Total sales of the game in regions other than North America, Europe, and Japan (in million USD). Also numeric with no missing values.
Critic_Score (float64): The average critic score for the game. This column has missing values, as some games lack a critic score.
User_Score (object): The average user score for the game. Stored as a string because of placeholder 'tbd' values; like the critic score, some values are missing.
Rating (object): The game’s rating (e.g., E, M for mature, etc.), indicating the age group or audience suitable for the game.
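A quick check (not part of the original output) confirms which columns actually contain missing values:

# Count missing values per column to confirm the breakdown above
print(games.isna().sum())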
Preparing The Data
1. Duplicate Handling
# Count of unique entries based on Name and Platform
games.groupby(['Name', 'Platform']).ngroups
16709
# Find duplicate entries based on 'Name' and 'Platform'
duplicates = games[games.duplicated(subset=['Name', 'Platform'], keep=False)]

# Display the DataFrame with non-unique entries sorted so the dups are together
duplicates_sorted = duplicates.sort_values(by=['Name', 'Platform'])
display(duplicates_sorted)
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
604 | Madden NFL 13 | PS3 | 2012.0 | Sports | 2.11 | 0.22 | 0.00 | 0.23 | 83.0 | 5.5 | E |
16230 | Madden NFL 13 | PS3 | 2012.0 | Sports | 0.00 | 0.01 | 0.00 | 0.00 | 83.0 | 5.5 | E |
5972 | Need for Speed: Most Wanted | PC | 2005.0 | Racing | 0.02 | 0.23 | 0.00 | 0.04 | 82.0 | 8.5 | T |
11715 | Need for Speed: Most Wanted | PC | 2012.0 | Racing | 0.00 | 0.06 | 0.00 | 0.02 | 82.0 | 8.5 | T |
1190 | Need for Speed: Most Wanted | X360 | 2012.0 | Racing | 0.62 | 0.78 | 0.01 | 0.15 | 83.0 | 8.5 | T |
1591 | Need for Speed: Most Wanted | X360 | 2005.0 | Racing | 1.00 | 0.13 | 0.02 | 0.10 | 83.0 | 8.5 | T |
1745 | Sonic the Hedgehog | PS3 | 2006.0 | Platform | 0.41 | 0.06 | 0.04 | 0.66 | 43.0 | 4.1 | E10+ |
4127 | Sonic the Hedgehog | PS3 | NaN | Platform | 0.00 | 0.48 | 0.00 | 0.00 | 43.0 | 4.1 | E10+ |
659 | NaN | GEN | 1993.0 | NaN | 1.78 | 0.53 | 0.00 | 0.08 | NaN | NaN | NaN |
14244 | NaN | GEN | 1993.0 | NaN | 0.00 | 0.00 | 0.03 | 0.00 | NaN | NaN | NaN |
Grouping on the Name of the game and the Platform it was released on yields 16709 unique combinations out of 16715 entries. Looking at the duplicated entries based on Name and Platform, we can make several notes:

- Need for Speed: Most Wanted: Upon investigation, this title had two separate release dates, 2005 and 2012. These are not in fact duplicates but two separate games, released by separate developers (Wikipedia Source). Appending the release year to the Name feature of these titles will delineate them from each other, at least when comparing on both Name and Platform.
- NaN: The NaN entries on the GEN platform will be appropriately handled in the following stage.
- Madden NFL 13: At first glance this may seem a simple duplication error. However, Madden NFL 13 had different regional release dates within 2012, and the sales data associated with entry 16230 appears to belong to the EU release. Combining the entries via EU_sales resolves this data issue while retaining the data.
- Sonic the Hedgehog: Similar to Madden NFL 13, there were several regional releases of this title. Entry 4127 appears to correspond to the EU release in 2007. Combining the entries via EU_sales resolves this data issue.
# Define the indices of the Need for Speed rows
need_for_updating = [5972, 11715, 1190, 1591]

# Update the 'Name' column by appending the 'Year_of_Release'
games.loc[need_for_updating, 'Name'] = games.loc[need_for_updating].apply(
    lambda row: f"{row['Name']} ({int(row['Year_of_Release'])})", axis=1
)
games.loc[need_for_updating]
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
5972 | Need for Speed: Most Wanted (2005) | PC | 2005.0 | Racing | 0.02 | 0.23 | 0.00 | 0.04 | 82.0 | 8.5 | T |
11715 | Need for Speed: Most Wanted (2012) | PC | 2012.0 | Racing | 0.00 | 0.06 | 0.00 | 0.02 | 82.0 | 8.5 | T |
1190 | Need for Speed: Most Wanted (2012) | X360 | 2012.0 | Racing | 0.62 | 0.78 | 0.01 | 0.15 | 83.0 | 8.5 | T |
1591 | Need for Speed: Most Wanted (2005) | X360 | 2005.0 | Racing | 1.00 | 0.13 | 0.02 | 0.10 | 83.0 | 8.5 | T |
# Handling 'Madden NFL 13' case (Performed first because the index to drop is higher)
OG_Madden13_loc = 604     # index of the main 'Madden NFL 13' entry.
Dup_Madden13_loc = 16230  # index of the dup 'Madden NFL 13' entry.

# Sum the 'EU_sales' values between the two rows and save the sum to the main entry
games.at[OG_Madden13_loc, 'EU_sales'] += games.at[Dup_Madden13_loc, 'EU_sales']

# Drop the duplicated Madden 13
games = games.drop(index=Dup_Madden13_loc)

games.loc[[OG_Madden13_loc]]
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
604 | Madden NFL 13 | PS3 | 2012.0 | Sports | 2.11 | 0.23 | 0.0 | 0.23 | 83.0 | 5.5 | E |
# Handling 'Sonic the Hedgehog' case
OG_Hedgehog_loc = 1745   # index of the main 'Sonic the Hedgehog' entry.
Dup_Hedgehog_loc = 4127  # index of the dup 'Sonic the Hedgehog' entry.

# Sum the 'EU_sales' values between the two rows and save the sum to the main entry
games.at[OG_Hedgehog_loc, 'EU_sales'] += games.at[Dup_Hedgehog_loc, 'EU_sales']

# Drop the duplicated hedgehog
games = games.drop(index=Dup_Hedgehog_loc)

games.loc[[OG_Hedgehog_loc]]
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
1745 | Sonic the Hedgehog | PS3 | 2006.0 | Platform | 0.41 | 0.54 | 0.04 | 0.66 | 43.0 | 4.1 | E10+ |
2. Data Types & Missing Values
Multiple imputation by chained equations (MICE) was chosen to handle the large amount of missing data in the critic and user scores.
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work?. International journal of methods in psychiatric research, 20(1), 40–49. https://doi.org/10.1002/mpr.329
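As a rough sketch of the idea on toy data (illustrative only, not the project dataset): each incomplete column is regressed on the others, and the columns are imputed in turn over several rounds until the estimates stabilize.

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy frame with one hole in each score column
toy = pd.DataFrame({
    'critic_score': [80.0, np.nan, 70.0, 90.0, 60.0],
    'user_score':   [8.0, 7.5, np.nan, 9.0, 6.0],
})

# Each column is modeled from the other and filled iteratively
print(IterativeImputer(max_iter=10, random_state=0).fit_transform(toy))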
# imports for more sophisticated imputation techniques
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Convert all column names to lowercase
games.columns = games.columns.str.lower()

# Convert 'user_score' to numeric, setting 'tbd' as NaN
games['user_score'] = pd.to_numeric(games['user_score'], errors='coerce')

# Remove rows with missing 'year_of_release' and/or 'rating' values
games = games.dropna(subset=['year_of_release', 'rating'])

# Convert 'year_of_release' to integer
games['year_of_release'] = games['year_of_release'].astype('Int64')

# Instantiate the imputer
mice_imputer = IterativeImputer()

# Apply the MICE imputation to `critic_score` and `user_score`
games[['critic_score', 'user_score']] = mice_imputer.fit_transform(games[['critic_score', 'user_score']])

display(games.head(10))
print(games.info())
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006 | Sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8.0 | E |
2 | Mario Kart Wii | Wii | 2008 | Racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E |
3 | Wii Sports Resort | Wii | 2009 | Sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8.0 | E |
6 | New Super Mario Bros. | DS | 2006 | Platform | 11.28 | 9.14 | 6.50 | 2.88 | 89.0 | 8.5 | E |
7 | Wii Play | Wii | 2006 | Misc | 13.96 | 9.18 | 2.93 | 2.84 | 58.0 | 6.6 | E |
8 | New Super Mario Bros. Wii | Wii | 2009 | Platform | 14.44 | 6.94 | 4.70 | 2.24 | 87.0 | 8.4 | E |
11 | Mario Kart DS | DS | 2005 | Racing | 9.71 | 7.47 | 4.13 | 1.90 | 91.0 | 8.6 | E |
13 | Wii Fit | Wii | 2007 | Sports | 8.92 | 8.03 | 3.60 | 2.15 | 80.0 | 7.7 | E |
14 | Kinect Adventures! | X360 | 2010 | Misc | 15.00 | 4.89 | 0.24 | 1.69 | 61.0 | 6.3 | E |
15 | Wii Fit Plus | Wii | 2009 | Sports | 9.01 | 8.49 | 2.53 | 1.77 | 80.0 | 7.4 | E |
<class 'pandas.core.frame.DataFrame'>
Index: 9767 entries, 0 to 16706
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 9767 non-null object
1 platform 9767 non-null object
2 year_of_release 9767 non-null Int64
3 genre 9767 non-null object
4 na_sales 9767 non-null float64
5 eu_sales 9767 non-null float64
6 jp_sales 9767 non-null float64
7 other_sales 9767 non-null float64
8 critic_score 9767 non-null float64
9 user_score 9767 non-null float64
10 rating 9767 non-null object
dtypes: Int64(1), float64(6), object(4)
memory usage: 925.2+ KB
None
Here all the column names are transformed to lowercase. The tbd entries in user_score are neatly handled by coercing the values to numeric, which forces them to become NaN.

For rating, I had considered utilizing the IGDB API to pull in potential ratings for each title missing one. However, since this is a Jupyter notebook submission, there is no real way to handle the secrets associated with performing requests against this API. If there is, I would appreciate some advice on this topic.

Handling the missing data of features that cannot be imputed must occur before any of the imputation techniques we wish to employ. For this reason, entries with missing data in year_of_release and/or rating were dropped.
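On the secrets question: one common pattern is to read credentials from environment variables (or an untracked .env file) at runtime, so the notebook itself never contains them. A sketch, with hypothetical variable names:

import os

# Hypothetical credential names; set these in the shell, not in the notebook
igdb_client_id = os.environ.get("IGDB_CLIENT_ID")
igdb_client_secret = os.environ.get("IGDB_CLIENT_SECRET")

if igdb_client_id is None or igdb_client_secret is None:
    print("No IGDB credentials found; skipping rating enrichment.")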
Total Sales
# Calculate total sales for each game
games['total_sales'] = games[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)
games.head(10)
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006 | Sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8.0 | E | 82.54 |
2 | Mario Kart Wii | Wii | 2008 | Racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E | 35.52 |
3 | Wii Sports Resort | Wii | 2009 | Sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8.0 | E | 32.77 |
6 | New Super Mario Bros. | DS | 2006 | Platform | 11.28 | 9.14 | 6.50 | 2.88 | 89.0 | 8.5 | E | 29.80 |
7 | Wii Play | Wii | 2006 | Misc | 13.96 | 9.18 | 2.93 | 2.84 | 58.0 | 6.6 | E | 28.91 |
8 | New Super Mario Bros. Wii | Wii | 2009 | Platform | 14.44 | 6.94 | 4.70 | 2.24 | 87.0 | 8.4 | E | 28.32 |
11 | Mario Kart DS | DS | 2005 | Racing | 9.71 | 7.47 | 4.13 | 1.90 | 91.0 | 8.6 | E | 23.21 |
13 | Wii Fit | Wii | 2007 | Sports | 8.92 | 8.03 | 3.60 | 2.15 | 80.0 | 7.7 | E | 22.70 |
14 | Kinect Adventures! | X360 | 2010 | Misc | 15.00 | 4.89 | 0.24 | 1.69 | 61.0 | 6.3 | E | 21.82 |
15 | Wii Fit Plus | Wii | 2009 | Sports | 9.01 | 8.49 | 2.53 | 1.77 | 80.0 | 7.4 | E | 21.80 |
Calculating total sales by taking the aggregate sum of all the regional sale data allows us to look at the overall performance of each release without having to focus on the regional profiles.
Analysis
3.1 Games released by year
# Count the number of games released per year
games_per_year = games['year_of_release'].value_counts().sort_index()
build for plotly graph representing game releases per year.
# Count the number of games released per year
games_per_year = games['year_of_release'].value_counts().sort_index()

# Convert the Series to a DataFrame
games_per_year_df = games_per_year.reset_index()
games_per_year_df.columns = ['year_of_release', 'number_of_games']

# Create a line plot
fig = px.line(
    games_per_year_df,
    x='year_of_release',
    y='number_of_games',
    markers=True,
    labels={'year_of_release': 'Year of Release', 'number_of_games': 'Number of Games Released'},
    title='Number of Games Released Per Year',
    width=900,
    height=500
)

# Add a salmon rectangle to highlight the years 1985 through 1996
fig.add_shape(type="rect",
    x0=1985, x1=1996,
    y0=0, y1=games_per_year_df['number_of_games'].max(),
    fillcolor="salmon",
    opacity=0.1,
    line_width=0,
)

# Add a green rectangle to highlight the years 1996 through 2016
fig.add_shape(type="rect",
    x0=1996, x1=2016,
    y0=0, y1=games_per_year_df['number_of_games'].max(),
    fillcolor="green",
    opacity=0.1,
    line_width=0,
)

# Update the layout for better appearance
fig.update_layout(
    xaxis_title='Year of Release',
    yaxis_title='Number of Games Released',
    template='plotly_white'
)
fig.update_layout(title_pad_b=100,
                  title_font_variant="all-small-caps")

fig.update_xaxes(range=[1984, 2017])

fig.show()
As the returned data and the chart show, 1985-1996 saw comparatively little video game release activity. We can test this formally through a series of simple hypothesis tests.
3.1.1 Variance Test
Perform a hypothesis test to see whether the variances between the two time ranges of interest are equal.

null hypothesis: The variances of the number of games released per year are the same between the time ranges 1985-1996 and 1997-2016.

significance level: A significance level of 0.05 is chosen as it is the typical standard, and there is no current evidence to justify a different level for this particular test.
# Filter data for both periods
early_period = games_per_year_df[(games_per_year_df['year_of_release'] >= 1985) &
                                 (games_per_year_df['year_of_release'] <= 1996)]['number_of_games']

later_period = games_per_year_df[games_per_year_df['year_of_release'] > 1996]['number_of_games']
# Perform Levene's test
stat, p_value = levene(early_period, later_period)

print(f"Levene’s Test Statistic: {stat}, P-value: {p_value}")
Levene’s Test Statistic: 17.596312980670877, P-value: 0.000346497887645125
Since the p-value of Levene's test is 0.00034, which is much smaller than the significance level of 0.05, we can reject the null hypothesis that the variances are equal. This means the variances between the two time ranges (1985-1996 and the following period) are significantly different.
3.1.2 T-Test
Perform a hypothesis test to see whether the means between the two time ranges of interest are equal.

null hypothesis: The mean number of games released per year is the same between the time ranges 1985-1996 and 1997-2016.

significance level: Again, a significance level of 0.05 is chosen as it is the typical standard, and there is no current evidence to justify a different level for this particular test.
import numpy as np
early_period = np.array(early_period, dtype=float)
later_period = np.array(later_period, dtype=float)
Converting early_period and later_period to plain float arrays ensures that the t-test functions properly.

equal_var=False was chosen based on the variance test in 3.1.1.
# Perform t-test
t_stat, p_value = ttest_ind(early_period, later_period, equal_var=False)

print(f"T-statistic: {t_stat}, P-value: {p_value}")
T-statistic: -7.0209097150232145, P-value: 1.0968840169623663e-06
- The p-value indicates that the difference in means between the two time periods (1985-1996 and the following years) is statistically significant.
- The negative T-statistic suggests that the mean number of games released per year in the 1985-1996 period is significantly lower compared to the later period.
Data Significance Based on Yearly Video Game Releases
While the games released during the 1985-1996 time range are significant enough for archival purposes, further analytical procedures will treat this time period as insignificant.
3.2 Variation of sales platform to platform
# Calculate the total sales for each platform
platform_sales = games.groupby('platform')['total_sales'].sum().sort_values(ascending=False)

print(platform_sales)
platform
PS2 1044.25
X360 900.39
PS3 837.87
Wii 794.93
DS 558.90
PS4 245.94
PS 234.56
XB 227.81
PSP 212.08
PC 203.48
GBA 188.43
GC 171.24
XOne 134.06
3DS 131.98
WiiU 67.61
PSV 34.67
DC 4.55
Name: total_sales, dtype: float64
# Calculate variance and standard deviation of total sales across platforms
sales_variance = platform_sales.var()
sales_std_dev = platform_sales.std()

print(f"Variance in sales across platforms: {sales_variance}")
print(f"Standard deviation in sales across platforms: {sales_std_dev}")
Variance in sales across platforms: 112170.9189264706
Standard deviation in sales across platforms: 334.91927225298724
3.2.1 Choose the platforms with the greatest total sales and build a distribution based on data for each year
top_n = 5
top_platforms = platform_sales.head(top_n).index.tolist()

print(f"Top {top_n} Platforms by Total Sales:")
print(platform_sales.head(top_n))
Top 5 Platforms by Total Sales:
platform
PS2 1044.25
X360 900.39
PS3 837.87
Wii 794.93
DS 558.90
Name: total_sales, dtype: float64
# Filter the data to include only the top platforms
top_platforms_data = games[games['platform'].isin(top_platforms)]

top_platforms_data = top_platforms_data.sort_values(by='year_of_release')
top_platforms_data['year_of_release'].unique()
<IntegerArray>
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
2013, 2014, 2015, 2016]
Length: 17, dtype: Int64
Sales outliers in top_platforms_data have been removed according to the IQR method.
build for graph representing distribution of sales.
# Define upper and lower bounds for sales to remove extreme outliers
q1 = top_platforms_data['total_sales'].quantile(0.25)
q3 = top_platforms_data['total_sales'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Filter the data to exclude outliers
filtered_data = top_platforms_data[(top_platforms_data['total_sales'] >= lower_bound) &
                                   (top_platforms_data['total_sales'] <= upper_bound)]

# Create the box plot without outliers
fig = px.box(
    filtered_data,
    x='platform',
    y='total_sales',
    color='platform',
    color_discrete_map={
        'PS2': '#003f5c',
        'DS': '#444e86',
        'X360': '#955196',
        'Wii': '#dd5182',
        'PS3': '#ff6e54'
    },
    facet_col='year_of_release',
    facet_col_wrap=4,
    width=930,
    height=940
)

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(text="Distribution of Game Sales for Top Platforms by Year",
                             x=0.50,
                             y=1.0))

fig.show()
When considering the top 5 platforms by total sales (million USD), all of their sales fall within the 2000-2015 time range, making it the period with the highest density of sales by platform. Within this range, 2000-2003 shows the PS2 as the only (and therefore highest) performer among the top platforms. This suggests that the PS2 was not only the first of this filtered group to release, but maintained its position at the top until its last year of sales in 2010. The PS2 remains the highest grossing platform in terms of total video game sales. Continuing with the remaining PlayStation platforms, the PS3 released in 2006 and posted the highest median total sales that same year.
3.2.2 Find platforms that used to be popular but now have zero sales
# Filter relevant columns
games_zero_df = games[['name', 'platform', 'year_of_release', 'total_sales']]

# Identify platforms with non-zero sales in any year
platforms_with_sales = games_zero_df[games_zero_df['total_sales'] > 0]['platform'].unique()

# Identify platforms with zero sales in the most recent years
most_recent_year = games_zero_df['year_of_release'].max()
recent_years = most_recent_year - 4
recent_sales_range = games_zero_df[games_zero_df['year_of_release'] >= recent_years]
platforms_with_zero_sales = recent_sales_range[recent_sales_range['total_sales'] == 0]['platform'].unique()

# Platforms that had sales in the past but have zero sales now
platforms_no_sales_now = set(platforms_with_sales) & set(platforms_with_zero_sales)

print("Platforms that used to be popular but now have zero sales:", platforms_no_sales_now)
Platforms that used to be popular but now have zero sales: set()
When set() is returned after performing this procedure, it means there is no overlap between the two sets. In this case, there are no platforms that had sales in the past but have zero sales now.

Considering the minimum total_sales value is 0.01 million USD, there are in fact no zero total_sales entries at all, and thus no such platforms when analyzing from the scope of total_sales. Delineating between regional sales, however, might tell a different story.
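As a sketch of that regional follow-up (JP chosen arbitrarily; not part of the original analysis):

# Platforms whose releases in the last five recorded years sold nothing in Japan
recent = games[games['year_of_release'] >= most_recent_year - 4]
jp_recent_sales = recent.groupby('platform')['jp_sales'].sum()
print("Platforms with no recent JP sales:", jp_recent_sales[jp_recent_sales == 0].index.tolist())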
3.2.3 How long does it generally take for new platforms to appear and old ones to fade?
# Create a DataFrame with the first and last year of sales for each platform
platform_lifespan = games_zero_df.groupby('platform')['year_of_release'].agg(['min', 'max']).reset_index()
platform_lifespan.columns = ['platform', 'first_year', 'last_year']
platform_lifespan['lifespan'] = platform_lifespan['last_year'] - platform_lifespan['first_year']

# Calculate the average time it takes for new platforms to appear
new_platforms = platform_lifespan[platform_lifespan['first_year'] > 2000]
average_time_to_appear = new_platforms['first_year'].min() - 2000

# Calculate the average time it takes for old platforms to fade
faded_platforms = platform_lifespan[platform_lifespan['last_year'] < most_recent_year]
average_time_to_fade = most_recent_year - faded_platforms['last_year'].max()

print(f"Average time for new platforms to appear: {average_time_to_appear} years")
print(f"Average time for old platforms to fade: {average_time_to_fade} years")
Average time for new platforms to appear: 1 years
Average time for old platforms to fade: 3 years
In this lifespan analysis, 2000 was chosen as a baseline, marking the milestone when the popular platforms began to release (as determined by the Distribution of Game Sales for Top Platforms by Year chart).
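A complementary summary (a small addition; the value is not shown in the original output) is the mean lifespan across all platforms:

# Mean platform lifespan as a complementary summary
print(f"Mean platform lifespan: {platform_lifespan['lifespan'].mean():.1f} years")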
build for graph representing the lifespan of platforms.
# Create a scatter plot for platform lifespan
fig = px.scatter(
    platform_lifespan,
    x='first_year',
    y='last_year',
    size='lifespan',
    color='platform',
    hover_name='platform',
    labels={'first_year': 'First Year of Sales', 'last_year': 'Last Year of Sales'},
    width=900,
    height=600
)

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(text="Platform Lifespan",
                             x=0.50,
                             y=1.0))

fig.show()
The most noticeable point, standing alone and uniquely large given its lifespan value, is the PC platform. This is easily explained: "PC" is represented as a single unchanging label, unlike the console platforms. With PS2, PS3, and PS4 represented separately, the generational differences between those platforms are visible. A game released for the PC in 1985 and a game released for the PC in 2016 did not target the same hardware, which is why PC is over-represented in this particular metric.
3.3 Significant time range for future model training
We already determined the significant time range for video game sales on the whole. However, to build a model that informs 2017 strategies, we must restrict the time range so that we are not capturing outdated patterns. In this case, 2013 and onward will be used.
relevant_games = games[games['year_of_release'] > 2012]
relevant_games.head(10)
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | Grand Theft Auto V | PS3 | 2013 | Action | 7.02 | 9.09 | 0.98 | 3.96 | 97.0 | 8.2 | M | 21.05 |
23 | Grand Theft Auto V | X360 | 2013 | Action | 9.66 | 5.14 | 0.06 | 1.41 | 97.0 | 8.1 | M | 16.27 |
42 | Grand Theft Auto V | PS4 | 2014 | Action | 3.96 | 6.31 | 0.38 | 1.97 | 97.0 | 8.3 | M | 12.62 |
60 | Call of Duty: Ghosts | X360 | 2013 | Shooter | 6.73 | 2.56 | 0.04 | 0.91 | 73.0 | 2.6 | M | 10.24 |
69 | Call of Duty: Ghosts | PS3 | 2013 | Shooter | 4.10 | 3.63 | 0.38 | 1.25 | 71.0 | 2.6 | M | 9.36 |
77 | FIFA 16 | PS4 | 2015 | Sports | 1.12 | 6.12 | 0.06 | 1.28 | 82.0 | 4.3 | E | 8.58 |
92 | Call of Duty: Advanced Warfare | PS4 | 2014 | Shooter | 2.81 | 3.48 | 0.14 | 1.23 | 83.0 | 5.7 | M | 7.66 |
94 | FIFA 17 | PS4 | 2016 | Sports | 0.66 | 5.75 | 0.08 | 1.11 | 85.0 | 5.0 | E | 7.60 |
105 | Fallout 4 | PS4 | 2015 | Role-Playing | 2.53 | 3.27 | 0.24 | 1.13 | 87.0 | 6.5 | M | 7.17 |
110 | Mario Kart 8 | WiiU | 2014 | Racing | 3.15 | 2.15 | 1.28 | 0.51 | 88.0 | 9.1 | E | 7.09 |
3.4 Which platforms are leading in sales? Which ones are growing or shrinking? Select several potentially profitable platforms.
platform_sales_relevant = relevant_games.groupby('platform')['total_sales'].sum().sort_values(ascending=False)
platform_sales_relevant.head(6)
platform
PS4 245.94
PS3 151.91
XOne 134.06
X360 116.70
WiiU 52.76
3DS 45.78
Name: total_sales, dtype: float64
The top performing platforms in the updated time range have shifted. PS2 no longer appears; it is superseded by the PS4, which acts as the top performer in this range, with PS3 still represented. Expanding to the top 6 platforms will provide more data relevant for modeling 2017.
3.5 Global sales of all games by platform
# Select the top 6 platforms
top_platforms2 = platform_sales_relevant.head(6).index.tolist()

# Filter the data to include only the top platforms
top_platforms_data2 = games[games['platform'].isin(top_platforms2)]
top_platforms_data2 = top_platforms_data2.sort_values(by='year_of_release')
build for graph representing global sales distribution.
# Remove outliers using IQR method for each platform (make it a function this time)
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Filter the dataframe to exclude outliers
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Remove outliers from the dataset (apply the function)
filtered_games = top_platforms_data2.groupby('platform', group_keys=False).apply(remove_outliers, 'total_sales')
# Create the box plot using Plotly Express
fig = px.box(filtered_games, x='platform', y='total_sales',
             title='Global Sales Distribution by Platform (2013-2016)',
             labels={'total_sales': 'Global Sales (in million USD)', 'platform': 'Platform'},
             color='platform',
             color_discrete_map={
                 '3DS': '#003f5c',
                 'PS3': '#444e86',
                 'PS4': '#955196',
                 'WiiU': '#dd5182',
                 'X360': '#ff6e54',
                 'XOne': '#ffa600'
             })

# Customize the plot layout
fig.update_layout(xaxis_title='Platform', yaxis_title='Global Sales (in million USD)',
                  title={'x': 0.5}, width=900, height=600)
fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(text="Global Sales Distribution by Platform (2013-2016)",
                             x=0.50,
                             y=1.0))

fig.show()
/tmp/ipykernel_74871/2607202076.py:13: DeprecationWarning:
DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
To determine whether the differences in sales are significant, we can employ the Kruskal-Wallis H-test.
# Get the unique platforms
platforms = filtered_games['platform'].unique()

# Create a list of groups (global sales for each platform)
groups = [filtered_games[filtered_games['platform'] == platform]['total_sales'] for platform in platforms]

# Perform the Kruskal-Wallis test
stat, p_value = kruskal(*groups)

print(f'Kruskal-Wallis test statistic: {stat}')
print(f'Kruskal-Wallis p-value: {p_value}')
Kruskal-Wallis test statistic: 86.53904927975694
Kruskal-Wallis p-value: 3.580039286380217e-17
According to the SciPy API documentation, this tests the null hypothesis that the population medians of all of the groups are equal. The test works on 2 or more independent samples, which may have different sizes. Some might use the ANOVA test on a problem such as this, but its data assumptions are not preferable for this particular dataset.
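As a quick illustration of why those assumptions are doubtful (a sketch, not part of the original analysis), a Shapiro-Wilk normality check can be run on the per-platform groups built above; heavily right-skewed sales distributions are unlikely to pass:

from scipy.stats import shapiro

# Normality check per platform; small p-values argue against ANOVA's normality assumption
for platform, group in zip(platforms, groups):
    sw_stat, sw_p = shapiro(group)
    print(f"{platform}: Shapiro-Wilk p-value = {sw_p:.3g}")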
The Dunn’s test was selected for post-hoc analysis of the Kruskal-Wallis test to check the pairwise comparisons between each platform.
# Prepare data for Dunn's test
# The data should be in the form of a pandas Series for Global Sales with corresponding platform labels
dunn_data = filtered_games[['total_sales', 'platform']].dropna()

# Perform Dunn's test with Bonferroni correction for multiple comparisons
dunn_results = sp.posthoc_dunn(dunn_data, val_col='total_sales', group_col='platform', p_adjust='bonferroni')

print(dunn_results)
3DS PS3 PS4 WiiU X360 XOne
3DS 1.000000e+00 1.390504e-14 0.147226 1.000000 1.551596e-07 0.023295
PS3 1.390504e-14 1.000000e+00 0.000005 0.000475 1.533063e-03 0.003958
PS4 1.472264e-01 5.378783e-06 1.000000 1.000000 1.243940e-01 1.000000
WiiU 1.000000e+00 4.749854e-04 1.000000 1.000000 2.079647e-01 1.000000
X360 1.551596e-07 1.533063e-03 0.124394 0.207965 1.000000e+00 1.000000
XOne 2.329531e-02 3.957731e-03 1.000000 1.000000 1.000000e+00 1.000000
Significant Comparisons:
3DS vs PS3, X360, XOne: The p-values for these comparisons (1.39e-14, 1.55e-07, 0.023) are very small, indicating significant differences in median global sales between the 3DS and PS3, X360, and XOne platforms.
PS3 vs PS4, WiiU, X360, XOne: The p-values (5.38e-06, 0.000475, 0.001533, 0.003958) indicate significant differences in sales between PS3 and these platforms, showing that PS3 has distinct sales patterns.
X360 vs 3DS, PS3: Significant differences are evident here, with p-values of 1.55e-07 and 0.001533 respectively.
XOne vs 3DS, PS3: The p-values of 0.023 and 0.003958 suggest significant differences between XOne and these platforms.
Non-Significant Comparisons:
3DS vs PS4, WiiU: P-values of 0.147 and 1.0 indicate no significant difference between 3DS and these platforms.
PS4 vs WiiU, X360, XOne: With p-values of 1.0, 0.124, and 1.0, PS4 does not show significant differences in median sales compared to WiiU, X360, or XOne.
WiiU vs 3DS, PS4, X360, XOne: All p-values are well above 0.05, indicating no significant differences in sales between WiiU and these platforms.
Conclusion:
There are significant differences in median global sales between several platform pairs, particularly involving 3DS, PS3, and X360, which show distinctive sales patterns. However, platforms like PS4, WiiU, and XOne exhibit more similarity in sales distribution, as there are no significant differences between them.
3.6 Professional reviews' effect on sales for PS4
PS4_platform = filtered_games[filtered_games['platform'] == 'PS4']
PS4_platform.head()
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
7296 | Angry Birds Star Wars | PS4 | 2013 | Strategy | 0.10 | 0.08 | 0.00 | 0.04 | 47.0 | 2.0 | E | 0.22 |
863 | Need for Speed Rivals | PS4 | 2013 | Racing | 0.73 | 0.92 | 0.03 | 0.30 | 80.0 | 6.2 | E10+ | 1.98 |
6911 | NBA Live 14 | PS4 | 2013 | Sports | 0.15 | 0.05 | 0.00 | 0.04 | 43.0 | 2.7 | E | 0.24 |
2307 | Madden NFL 25 | PS4 | 2013 | Sports | 0.63 | 0.13 | 0.00 | 0.15 | 74.0 | 5.2 | E | 0.91 |
4917 | Skylanders SWAP Force | PS4 | 2013 | Platform | 0.20 | 0.12 | 0.00 | 0.06 | 79.0 | 5.2 | E10+ | 0.38 |
build for graph representing correlation between professional reviews and sales
fig = px.scatter(PS4_platform, x="critic_score", y="total_sales", color='total_sales',
                 title="Total Sale Correlation with Professional Reviews (PS4 games)",
                 labels={'total_sales': 'Total Sales (in millions)', 'critic_score': 'Critic Score'})

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(x=0.50,
                             y=1.0))

fig.show()
print(PS4_platform['critic_score'].corr(PS4_platform['total_sales']))
0.31816367517799193
With a correlation coefficient of 0.318, there is only a weak positive linear relationship between professional reviews and total sales on the PS4 platform.
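Since the relationship need not be linear, a rank-based check is a reasonable complement (a sketch; the Spearman value is not part of the original output):

# Spearman rank correlation captures monotonic (not just linear) association
print(PS4_platform['critic_score'].corr(PS4_platform['total_sales'], method='spearman'))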
3.7 Compare the sales of the same games on other platforms.
To do this, we first filter for the games that appear both on the PS4 and on the other top platforms, then exclude the PS4 entries themselves.
PS4_games = PS4_platform['name'].unique()

# Find games that are shared with other platforms, excluding PS4
shared_games = filtered_games[(filtered_games['name'].isin(PS4_games)) & (filtered_games['platform'] != 'PS4')]
shared_games.sample(5)
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
8265 | Madden NFL 17 | X360 | 2016 | Sports | 0.14 | 0.02 | 0.0 | 0.02 | 29.312523 | 0.7 | E | 0.18 |
11391 | Transformers: Devastation | PS3 | 2015 | Action | 0.03 | 0.04 | 0.0 | 0.01 | 69.019708 | 7.1 | T | 0.08 |
1455 | Tom Clancy's Rainbow Six: Siege | XOne | 2015 | Shooter | 0.77 | 0.45 | 0.0 | 0.12 | 74.000000 | 6.9 | M | 1.34 |
9629 | NBA 2K17 | PS3 | 2016 | Sports | 0.06 | 0.04 | 0.0 | 0.02 | 37.378045 | 2.0 | E | 0.12 |
9861 | NBA 2K17 | X360 | 2016 | Sports | 0.09 | 0.02 | 0.0 | 0.01 | 46.063992 | 3.4 | E | 0.12 |
build for graph representing correlation between professional reviews and sales (shared games)
fig = px.scatter(shared_games, x="critic_score", y="total_sales", color='total_sales', facet_col='platform',
                 title="Total Sale Correlation with Professional Reviews (shared games)",
                 labels={'total_sales': 'Total Sales (in millions)', 'critic_score': 'Critic Score'})

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(x=0.50,
                             y=1.0))

fig.show()
# Group by platform and calculate the Pearson correlation
correlations = shared_games.groupby('platform').apply(lambda group: group['total_sales'].corr(group['critic_score']))

# Print the correlation for each platform
for platform, corr in correlations.items():
    print(f"Correlation between total_sales and critic_score for {platform}: {corr}")
Correlation between total_sales and critic_score for 3DS: -0.3817347562062789
Correlation between total_sales and critic_score for PS3: 0.22682294202250627
Correlation between total_sales and critic_score for WiiU: 0.4033118763354005
Correlation between total_sales and critic_score for X360: 0.21228717186696933
Correlation between total_sales and critic_score for XOne: 0.33775568867121897
/tmp/ipykernel_74871/529713270.py:2: DeprecationWarning:
DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
Restricting the professional review analysis to games shared with the PS4 shows that the WiiU, while still expressing a correlation coefficient below 0.5, was the platform most affected by professional reviews among those shown.
3.8 Distribution of games by genre
build for graph representing distribution of total sales by genre.
# Remove outliers from the dataset (apply the function)
games2 = games.groupby('genre', group_keys=False).apply(remove_outliers, 'total_sales')

# Visualize the total sales by genre
fig = px.box(games2, x='genre', y='total_sales',
             title='Distribution of Total Sales by Genre',
             labels={'total_sales': 'Total Sales (in millions)', 'genre': 'Genre'},
             height=600,
             width=900)

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(x=0.50,
                             y=1.0))

fig.show()
/tmp/ipykernel_74871/325689584.py:2: DeprecationWarning:
DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
According to the above chart, Action and Sports lead in overall sales volume. However, looking at median total sales, genres such as Platform, Role-Playing, Shooter, and Fighting are better represented, even when compared to Action and Sports.
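To put numbers behind that reading (a small addition; the values are not shown in the original output), the medians can be listed directly:

# Median total sales per genre after outlier removal
print(games2.groupby('genre')['total_sales'].median().sort_values(ascending=False))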
Regional Profiles
4.3 Regional effect of ESRB ratings on sales
# Group by 'rating' and sum the sales for each region
rating_sales = games.groupby('rating')[['na_sales', 'eu_sales', 'jp_sales']].sum().reset_index()

# Calculate total sales for each region
total_na_sales_rating = rating_sales['na_sales'].sum()
total_eu_sales_rating = rating_sales['eu_sales'].sum()
total_jp_sales_rating = rating_sales['jp_sales'].sum()
# Calculate market share for each rating in each region
rating_sales['market_share_na'] = (rating_sales['na_sales'] / total_na_sales_rating) * 100
rating_sales['market_share_eu'] = (rating_sales['eu_sales'] / total_eu_sales_rating) * 100
rating_sales['market_share_jp'] = (rating_sales['jp_sales'] / total_jp_sales_rating) * 100
build for graph representing regional rating market share.
rating_sales_melted = pd.melt(rating_sales, id_vars='rating',
                              value_vars=['market_share_na', 'market_share_eu', 'market_share_jp'],
                              var_name='region', value_name='market_share')

rating_sales_melted['region'] = rating_sales_melted['region'].replace({
    'market_share_na': 'NA',
    'market_share_eu': 'EU',
    'market_share_jp': 'JP'
})

fig = px.bar(rating_sales_melted, x='rating', y='market_share', color='region', barmode='group',
             color_discrete_map={
                 'NA': '#003f5c',
                 'EU': '#bc5090',
                 'JP': '#ffa600'
             },
             title="Market Share of ESRB Ratings by Region", labels={'market_share': 'Market Share (%)'})

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(x=0.50,
                             y=1.0))

fig.show()
The E rating dominates the global space for video games, while AO, EC, K-A, and RP are hardly represented, especially when delineated across separate regions. M and T are more broadly expressed globally, with JP holding a lead on T-rated game sales.
Hypothesis Tests
Average user ratings of the Xbox One and PC platforms.
Null Hypothesis: There is no significant difference between the average user ratings of the Xbox One and PC platforms.
Alternative Hypothesis: The average user ratings of the Xbox One and PC platforms are different.
Average user ratings for the Action and Sports genres.
Null Hypothesis: There is no significant difference between the average user ratings of the Action and Sports genres.
Alternative Hypothesis: The average user ratings of the Action and Sports genres are different.
# Filter for Xbox One and PC platforms
xbox_one_ratings = games[games['platform'] == 'XOne']['user_score']
pc_ratings = games[games['platform'] == 'PC']['user_score']

# Filter for Action and Sports genres
action_ratings = games[games['genre'] == 'Action']['user_score']
sports_ratings = games[games['genre'] == 'Sports']['user_score']
First we must determine whether the variances of these groups are equal, in order to conduct the proper hypothesis tests.
# Levene's test for equality of variances
levene_stat_xbox_pc, levene_p_value_xbox_pc = levene(xbox_one_ratings, pc_ratings)
levene_stat_action_sports, levene_p_value_action_sports = levene(action_ratings, sports_ratings)
# For Xbox One vs. PC
if levene_p_value_xbox_pc < 0.05:
    # Unequal variances
    t_stat_xbox_pc, p_value_xbox_pc = ttest_ind(xbox_one_ratings, pc_ratings, equal_var=False)
    print("Levene's test indicates unequal variances. Using Welch's t-test.")
else:
    # Equal variances
    t_stat_xbox_pc, p_value_xbox_pc = ttest_ind(xbox_one_ratings, pc_ratings, equal_var=True)
    print("Levene's test indicates equal variances. Using standard t-test.")

# For Action vs. Sports
if levene_p_value_action_sports < 0.05:
    # Unequal variances
    t_stat_action_sports, p_value_action_sports = ttest_ind(action_ratings, sports_ratings, equal_var=False)
    print("Levene's test indicates unequal variances. Using Welch's t-test.")
else:
    # Equal variances
    t_stat_action_sports, p_value_action_sports = ttest_ind(action_ratings, sports_ratings, equal_var=True)
    print("Levene's test indicates equal variances. Using standard t-test.")
Levene's test indicates equal variances. Using standard t-test.
Levene's test indicates equal variances. Using standard t-test.
The variances were determined to be equal between the groups of interest, so a standard t-test could be carried out.
alpha = 0.025

# Results for Hypothesis 1
if p_value_xbox_pc < alpha:
    print("Reject the null hypothesis: Average user ratings for Xbox One and PC are different.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average user ratings for Xbox One and PC.")

# Results for Hypothesis 2
if p_value_action_sports < alpha:
    print("Reject the null hypothesis: Average user ratings for Action and Sports genres are different.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average user ratings for Action and Sports genres.")
Reject the null hypothesis: Average user ratings for Xbox One and PC are different.
Fail to reject the null hypothesis: No significant difference in average user ratings for Action and Sports genres.
According to the above tests:
Xbox One and PC - Average user ratings for Xbox One and PC are different.
Action and Sports genres - There is no significant difference in average user ratings between the Action and Sports genres.
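For the record, the underlying statistics behind those printed decisions can be shown directly (a small addition; the values themselves are omitted here):

# Report the t-statistics and p-values supporting the two decisions above
print(f"Xbox One vs PC: t = {t_stat_xbox_pc:.3f}, p = {p_value_xbox_pc:.3g}")
print(f"Action vs Sports: t = {t_stat_action_sports:.3f}, p = {p_value_action_sports:.3g}")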
Conclusion
When looking at the data as a whole, there is much that can be derived. In terms of data preparation and preprocessing, handling the implicit duplicates was an essential step in ensuring accurate analysis, especially when it came to differentiating games that shared the same name yet were released at different times on separate platforms. Additionally, handling the data types to ensure a reasonable representation of the year data was required given the original form of the data.
Continuing with the analysis, most of the analytical structures and directions revolved around the total_sales variable, created via the summation of the regional sales values for every entry. The further grouping and aggregation by platform, genre, and rating were all done separately to avoid doubling up aggregations. The zero-sales analysis from the scope of total_sales yielded no further value, since on a global scale there is not a single platform without a sale. Perhaps further analysis on a regional scale could be completed on the same topic.
On the same thread, reviewing the top 5 and top 6 performing platforms revealed a consistent range of time, 2000-2016, in which all of the top performing platforms by global sales were expressed. This allowed the previous years to be disregarded in further analysis.
On the whole, there is a clear trend across the PlayStation platforms and their respective global sales performance. Given their leading global market share as a platform family, there is little doubt that investment in title development for future PlayStation platforms would be the most profitable. Additionally, investment in PlayStation titles in the Action genre would be most advisable considering its global market share. Targeted development of Role-Playing titles can be accomplished successfully in Japan.
In addition to the main analysis for 2017 campaign strategies, the hypothesis test regarding potential parity between Xbox One and PC user ratings showed that the averages are in fact different. Also, there is no significant difference in the average user rating between the Action and Sports genres; therefore, one could conclude that user ratings had little to no effect on the clear and significant difference between those genres' market shares.