import pandas as pd
import plotly.express as px
# stats
from scipy.stats import levene
from scipy.stats import ttest_ind
from scipy.stats import kruskal
import scikit_posthocs as sp
Ice Games - Global Sales Exploratory Data Analysis
Introduction
Exploratory data analysis for the purpose of determining profitable 2017 marketing strategy
The purpose of this study is to determine a profitable marketing strategy based on the provided data. As of December 2016, this data will be leveraged to determine significant time ranges for model building, profitable platform categories, and regionally targeted campaign strategies.
Data Preprocessing
Loading the Games Dataset
games = pd.read_csv('../datasets/games.csv')

display(games.head())
games.info()
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006.0 | Sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8 | E |
1 | Super Mario Bros. | NES | 1985.0 | Platform | 29.08 | 3.58 | 6.81 | 0.77 | NaN | NaN | NaN |
2 | Mario Kart Wii | Wii | 2008.0 | Racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E |
3 | Wii Sports Resort | Wii | 2009.0 | Sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8 | E |
4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 11.27 | 8.89 | 10.22 | 1.00 | NaN | NaN | NaN |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 16713 non-null object
1 Platform 16715 non-null object
2 Year_of_Release 16446 non-null float64
3 Genre 16713 non-null object
4 NA_sales 16715 non-null float64
5 EU_sales 16715 non-null float64
6 JP_sales 16715 non-null float64
7 Other_sales 16715 non-null float64
8 Critic_Score 8137 non-null float64
9 User_Score 10014 non-null object
10 Rating 9949 non-null object
dtypes: float64(6), object(5)
memory usage: 1.4+ MB
Columns Breakdown:
Name (object): The name of the game (e.g., Wii Sports, Super Mario Bros.). This is a string data type.
Platform (object): The platform on which the game was released (e.g., Wii, NES, DS). This is also a string data type.
Year_of_Release (float64): The year the game was released. Some rows have missing values (NaN), which forces the column to be stored as float rather than integer.
Genre (object): The genre of the game (e.g., Sports, Platform, Racing). This is a string data type.
NA_sales (float64): Total sales of the game in North America (in million USD). It’s a numeric column.
EU_sales (float64): Total sales of the game in Europe (in million USD). This is also numeric.
JP_sales (float64): Total sales of the game in Japan (in million USD). Numeric with no missing values.
Other_sales (float64): Total sales of the game in regions other than North America, Europe, and Japan (in million USD). Also numeric with no missing values.
Critic_Score (float64): The average critic score for the game. This column has missing values, as some games lack a critic score.
User_Score (object): The average user score for the game. Stored as a string because of placeholder 'tbd' values; like the critic score, some values are missing.
Rating (object): The game’s rating (e.g., E, M for mature, etc.), indicating the age group or audience suitable for the game.
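A quick check (not part of the original output) confirms which columns actually contain missing values:

# Count missing values per column to confirm the breakdown above
print(games.isna().sum())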
Preparing The Data
1. Duplicate Handling
# Count of unique entries based on Name and Platform
games.groupby(['Name', 'Platform']).ngroups
16709
# Find duplicate entries based on 'Name' and 'Platform'
duplicates = games[games.duplicated(subset=['Name', 'Platform'], keep=False)]

# Display the DataFrame with non-unique entries sorted so the dups are together
duplicates_sorted = duplicates.sort_values(by=['Name', 'Platform'])
display(duplicates_sorted)
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
604 | Madden NFL 13 | PS3 | 2012.0 | Sports | 2.11 | 0.22 | 0.00 | 0.23 | 83.0 | 5.5 | E |
16230 | Madden NFL 13 | PS3 | 2012.0 | Sports | 0.00 | 0.01 | 0.00 | 0.00 | 83.0 | 5.5 | E |
5972 | Need for Speed: Most Wanted | PC | 2005.0 | Racing | 0.02 | 0.23 | 0.00 | 0.04 | 82.0 | 8.5 | T |
11715 | Need for Speed: Most Wanted | PC | 2012.0 | Racing | 0.00 | 0.06 | 0.00 | 0.02 | 82.0 | 8.5 | T |
1190 | Need for Speed: Most Wanted | X360 | 2012.0 | Racing | 0.62 | 0.78 | 0.01 | 0.15 | 83.0 | 8.5 | T |
1591 | Need for Speed: Most Wanted | X360 | 2005.0 | Racing | 1.00 | 0.13 | 0.02 | 0.10 | 83.0 | 8.5 | T |
1745 | Sonic the Hedgehog | PS3 | 2006.0 | Platform | 0.41 | 0.06 | 0.04 | 0.66 | 43.0 | 4.1 | E10+ |
4127 | Sonic the Hedgehog | PS3 | NaN | Platform | 0.00 | 0.48 | 0.00 | 0.00 | 43.0 | 4.1 | E10+ |
659 | NaN | GEN | 1993.0 | NaN | 1.78 | 0.53 | 0.00 | 0.08 | NaN | NaN | NaN |
14244 | NaN | GEN | 1993.0 | NaN | 0.00 | 0.00 | 0.03 | 0.00 | NaN | NaN | NaN |
Grouping on the Name of the game and the Platform it was released on yields 16709 unique combinations out of 16715 entries. Looking at the duplicated entries based on Name and Platform, we can make several notes:

- Need for Speed: Most Wanted: Upon investigation, this title had two separate release dates, 2005 and 2012. These are not in fact duplicates but two separate games, released by separate developers (Wikipedia Source). Appending the release year to the Name feature of these titles will delineate them from each other, at least when comparing on both Name and Platform.
- NaN: The NaN entries on the GEN platform will be appropriately handled in the following stage.
- Madden NFL 13: At first glance this may seem a simple duplication error. However, Madden NFL 13 had different regional release dates within 2012, and the sales data associated with entry 16230 appears to belong to the EU release. Combining the entries via EU_sales resolves this data issue while retaining the data.
- Sonic the Hedgehog: Similar to Madden NFL 13, there were several regional releases of this title. Entry 4127 appears to correspond to the EU release in 2007. Combining the entries via EU_sales resolves this data issue.
# Define the indices of the Need for Speed rows
need_for_updating = [5972, 11715, 1190, 1591]

# Update the 'Name' column by appending the 'Year_of_Release'
games.loc[need_for_updating, 'Name'] = games.loc[need_for_updating].apply(
    lambda row: f"{row['Name']} ({int(row['Year_of_Release'])})", axis=1
)
games.loc[need_for_updating]
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
5972 | Need for Speed: Most Wanted (2005) | PC | 2005.0 | Racing | 0.02 | 0.23 | 0.00 | 0.04 | 82.0 | 8.5 | T |
11715 | Need for Speed: Most Wanted (2012) | PC | 2012.0 | Racing | 0.00 | 0.06 | 0.00 | 0.02 | 82.0 | 8.5 | T |
1190 | Need for Speed: Most Wanted (2012) | X360 | 2012.0 | Racing | 0.62 | 0.78 | 0.01 | 0.15 | 83.0 | 8.5 | T |
1591 | Need for Speed: Most Wanted (2005) | X360 | 2005.0 | Racing | 1.00 | 0.13 | 0.02 | 0.10 | 83.0 | 8.5 | T |
# Handling 'Madden NFL 13' case (Performed first because the index to drop is higher)
OG_Madden13_loc = 604     # index of the main 'Madden NFL 13' entry.
Dup_Madden13_loc = 16230  # index of the dup 'Madden NFL 13' entry.

# Sum the 'EU_sales' values between the two rows and save the sum to the main entry
games.at[OG_Madden13_loc, 'EU_sales'] += games.at[Dup_Madden13_loc, 'EU_sales']

# Drop the duplicated Madden 13
games = games.drop(index=Dup_Madden13_loc)

games.loc[[OG_Madden13_loc]]
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
604 | Madden NFL 13 | PS3 | 2012.0 | Sports | 2.11 | 0.23 | 0.0 | 0.23 | 83.0 | 5.5 | E |
# Handling 'Sonic the Hedgehog' case
OG_Hedgehog_loc = 1745   # index of the main 'Sonic the Hedgehog' entry.
Dup_Hedgehog_loc = 4127  # index of the dup 'Sonic the Hedgehog' entry.

# Sum the 'EU_sales' values between the two rows and save the sum to the main entry
games.at[OG_Hedgehog_loc, 'EU_sales'] += games.at[Dup_Hedgehog_loc, 'EU_sales']

# Drop the duplicated hedgehog
games = games.drop(index=Dup_Hedgehog_loc)

games.loc[[OG_Hedgehog_loc]]
| | Name | Platform | Year_of_Release | Genre | NA_sales | EU_sales | JP_sales | Other_sales | Critic_Score | User_Score | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
1745 | Sonic the Hedgehog | PS3 | 2006.0 | Platform | 0.41 | 0.54 | 0.04 | 0.66 | 43.0 | 4.1 | E10+ |
2. Data Types & Missing Values
Multiple imputation by chained equations (MICE) was chosen to handle the large amount of missing data in the critic and user scores.
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work?. International journal of methods in psychiatric research, 20(1), 40–49. https://doi.org/10.1002/mpr.329
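As a rough sketch of the idea on toy data (illustrative only, not the project dataset): each incomplete column is regressed on the others, and the columns are imputed in turn over several rounds until the estimates stabilize.

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy frame with one hole in each score column
toy = pd.DataFrame({
    'critic_score': [80.0, np.nan, 70.0, 90.0, 60.0],
    'user_score':   [8.0, 7.5, np.nan, 9.0, 6.0],
})

# Each column is modeled from the other and filled iteratively
print(IterativeImputer(max_iter=10, random_state=0).fit_transform(toy))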
# imports for more sophisticated imputation techniques
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Convert all column names to lowercase
games.columns = games.columns.str.lower()

# Convert 'user_score' to numeric, setting 'tbd' as NaN
games['user_score'] = pd.to_numeric(games['user_score'], errors='coerce')

# Remove rows with missing 'year_of_release' and/or 'rating' values
games = games.dropna(subset=['year_of_release', 'rating'])

# Convert 'year_of_release' to integer
games['year_of_release'] = games['year_of_release'].astype('Int64')

# Instantiate the imputer
mice_imputer = IterativeImputer()

# Apply the MICE imputation to `critic_score` and `user_score`
games[['critic_score', 'user_score']] = mice_imputer.fit_transform(games[['critic_score', 'user_score']])

display(games.head(10))
print(games.info())
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006 | Sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8.0 | E |
2 | Mario Kart Wii | Wii | 2008 | Racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E |
3 | Wii Sports Resort | Wii | 2009 | Sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8.0 | E |
6 | New Super Mario Bros. | DS | 2006 | Platform | 11.28 | 9.14 | 6.50 | 2.88 | 89.0 | 8.5 | E |
7 | Wii Play | Wii | 2006 | Misc | 13.96 | 9.18 | 2.93 | 2.84 | 58.0 | 6.6 | E |
8 | New Super Mario Bros. Wii | Wii | 2009 | Platform | 14.44 | 6.94 | 4.70 | 2.24 | 87.0 | 8.4 | E |
11 | Mario Kart DS | DS | 2005 | Racing | 9.71 | 7.47 | 4.13 | 1.90 | 91.0 | 8.6 | E |
13 | Wii Fit | Wii | 2007 | Sports | 8.92 | 8.03 | 3.60 | 2.15 | 80.0 | 7.7 | E |
14 | Kinect Adventures! | X360 | 2010 | Misc | 15.00 | 4.89 | 0.24 | 1.69 | 61.0 | 6.3 | E |
15 | Wii Fit Plus | Wii | 2009 | Sports | 9.01 | 8.49 | 2.53 | 1.77 | 80.0 | 7.4 | E |
<class 'pandas.core.frame.DataFrame'>
Index: 9767 entries, 0 to 16706
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 9767 non-null object
1 platform 9767 non-null object
2 year_of_release 9767 non-null Int64
3 genre 9767 non-null object
4 na_sales 9767 non-null float64
5 eu_sales 9767 non-null float64
6 jp_sales 9767 non-null float64
7 other_sales 9767 non-null float64
8 critic_score 9767 non-null float64
9 user_score 9767 non-null float64
10 rating 9767 non-null object
dtypes: Int64(1), float64(6), object(4)
memory usage: 925.2+ KB
None
Here all the column names are transformed to lowercase. The tbd entries in user_score are neatly handled by coercing the values to numeric, which forces them to become NaN.

For rating, I had considered utilizing the IGDB API to pull in potential ratings for each title missing one. However, since this is a Jupyter notebook submission, there is no real way to handle the secrets associated with performing requests against this API. If there is, I would appreciate some advice on this topic.

Handling the missing data of features that cannot be imputed must occur before any of the imputation techniques we wish to employ. For this reason, entries with missing data in year_of_release and/or rating were dropped.
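On the secrets question: one common pattern is to read credentials from environment variables (or an untracked .env file) at runtime, so the notebook itself never contains them. A sketch, with hypothetical variable names:

import os

# Hypothetical credential names; set these in the shell, not in the notebook
igdb_client_id = os.environ.get("IGDB_CLIENT_ID")
igdb_client_secret = os.environ.get("IGDB_CLIENT_SECRET")

if igdb_client_id is None or igdb_client_secret is None:
    print("No IGDB credentials found; skipping rating enrichment.")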
Total Sales
# Calculate total sales for each game
games['total_sales'] = games[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)
games.head(10)
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006 | Sports | 41.36 | 28.96 | 3.77 | 8.45 | 76.0 | 8.0 | E | 82.54 |
2 | Mario Kart Wii | Wii | 2008 | Racing | 15.68 | 12.76 | 3.79 | 3.29 | 82.0 | 8.3 | E | 35.52 |
3 | Wii Sports Resort | Wii | 2009 | Sports | 15.61 | 10.93 | 3.28 | 2.95 | 80.0 | 8.0 | E | 32.77 |
6 | New Super Mario Bros. | DS | 2006 | Platform | 11.28 | 9.14 | 6.50 | 2.88 | 89.0 | 8.5 | E | 29.80 |
7 | Wii Play | Wii | 2006 | Misc | 13.96 | 9.18 | 2.93 | 2.84 | 58.0 | 6.6 | E | 28.91 |
8 | New Super Mario Bros. Wii | Wii | 2009 | Platform | 14.44 | 6.94 | 4.70 | 2.24 | 87.0 | 8.4 | E | 28.32 |
11 | Mario Kart DS | DS | 2005 | Racing | 9.71 | 7.47 | 4.13 | 1.90 | 91.0 | 8.6 | E | 23.21 |
13 | Wii Fit | Wii | 2007 | Sports | 8.92 | 8.03 | 3.60 | 2.15 | 80.0 | 7.7 | E | 22.70 |
14 | Kinect Adventures! | X360 | 2010 | Misc | 15.00 | 4.89 | 0.24 | 1.69 | 61.0 | 6.3 | E | 21.82 |
15 | Wii Fit Plus | Wii | 2009 | Sports | 9.01 | 8.49 | 2.53 | 1.77 | 80.0 | 7.4 | E | 21.80 |
Calculating total sales by taking the aggregate sum of all the regional sale data allows us to look at the overall performance of each release without having to focus on the regional profiles.
Analysis
3.1 Games released by year
# Count the number of games released per year
games_per_year = games['year_of_release'].value_counts().sort_index()
build for plotly graph representing game releases per year.
# Count the number of games released per year
games_per_year = games['year_of_release'].value_counts().sort_index()

# Convert the Series to a DataFrame
games_per_year_df = games_per_year.reset_index()
games_per_year_df.columns = ['year_of_release', 'number_of_games']

# Create a line plot
fig = px.line(
    games_per_year_df,
    x='year_of_release',
    y='number_of_games',
    markers=True,
    labels={'year_of_release': 'Year of Release', 'number_of_games': 'Number of Games Released'},
    title='Number of Games Released Per Year',
    width=900,
    height=500
)

# Add a salmon rectangle to highlight the years 1985 through 1996
fig.add_shape(type="rect",
    x0=1985, x1=1996,
    y0=0, y1=games_per_year_df['number_of_games'].max(),
    fillcolor="salmon",
    opacity=0.1,
    line_width=0,
)

# Add a green rectangle to highlight the years 1996 through 2016
fig.add_shape(type="rect",
    x0=1996, x1=2016,
    y0=0, y1=games_per_year_df['number_of_games'].max(),
    fillcolor="green",
    opacity=0.1,
    line_width=0,
)

# Update the layout for better appearance
fig.update_layout(
    xaxis_title='Year of Release',
    yaxis_title='Number of Games Released',
    template='plotly_white'
)
fig.update_layout(title_pad_b=100,
                  title_font_variant="all-small-caps")

fig.update_xaxes(range=[1984, 2017])

fig.show()
As the returned data and the chart show, 1985-1996 saw comparatively little video game release activity. We can test this formally through a series of simple hypothesis tests.
3.1.1 Variance Test
Perform a hypothesis test to see whether the variances between the two time ranges of interest are equal.

null hypothesis: The variances of the number of games released per year are the same between the time ranges 1985-1996 and 1997-2016.

significance level: A significance level of 0.05 is chosen as it is the typical standard, and there is no current evidence to justify a different level for this particular test.
# Filter data for both periods
early_period = games_per_year_df[(games_per_year_df['year_of_release'] >= 1985) &
                                 (games_per_year_df['year_of_release'] <= 1996)]['number_of_games']

later_period = games_per_year_df[games_per_year_df['year_of_release'] > 1996]['number_of_games']
# Perform Levene's test
stat, p_value = levene(early_period, later_period)

print(f"Levene’s Test Statistic: {stat}, P-value: {p_value}")
Levene’s Test Statistic: 17.596312980670877, P-value: 0.000346497887645125
Since the p-value of Levene's test is 0.00034, which is much smaller than the significance level of 0.05, we can reject the null hypothesis that the variances are equal. This means the variances between the two time ranges (1985-1996 and the following period) are significantly different.
3.1.2 T-Test
Perform a hypothesis test to see whether the means between the two time ranges of interest are equal.

null hypothesis: The mean number of games released per year is the same between the time ranges 1985-1996 and 1997-2016.

significance level: Again, a significance level of 0.05 is chosen as it is the typical standard, and there is no current evidence to justify a different level for this particular test.
import numpy as np
early_period = np.array(early_period, dtype=float)
later_period = np.array(later_period, dtype=float)
Converting early_period and later_period to plain float arrays ensures that the t-test functions properly.

equal_var=False was chosen based on the variance test in 3.1.1.
# Perform t-test
t_stat, p_value = ttest_ind(early_period, later_period, equal_var=False)

print(f"T-statistic: {t_stat}, P-value: {p_value}")
T-statistic: -7.0209097150232145, P-value: 1.0968840169623663e-06
- The p-value indicates that the difference in means between the two time periods (1985-1996 and the following years) is statistically significant.
- The negative T-statistic suggests that the mean number of games released per year in the 1985-1996 period is significantly lower compared to the later period.
Data Significance Based on Yearly Video Game Releases
While the games released during the 1985-1996 time range are significant enough for archival purposes, further analytical procedures will treat this time period as insignificant.
3.2 Variation of sales platform to platform
# Calculate the total sales for each platform
platform_sales = games.groupby('platform')['total_sales'].sum().sort_values(ascending=False)

print(platform_sales)
platform
PS2 1044.25
X360 900.39
PS3 837.87
Wii 794.93
DS 558.90
PS4 245.94
PS 234.56
XB 227.81
PSP 212.08
PC 203.48
GBA 188.43
GC 171.24
XOne 134.06
3DS 131.98
WiiU 67.61
PSV 34.67
DC 4.55
Name: total_sales, dtype: float64
# Calculate variance and standard deviation of total sales across platforms
sales_variance = platform_sales.var()
sales_std_dev = platform_sales.std()

print(f"Variance in sales across platforms: {sales_variance}")
print(f"Standard deviation in sales across platforms: {sales_std_dev}")
Variance in sales across platforms: 112170.9189264706
Standard deviation in sales across platforms: 334.91927225298724
3.2.1 Choose the platforms with the greatest total sales and build a distribution based on data for each year
top_n = 5
top_platforms = platform_sales.head(top_n).index.tolist()

print(f"Top {top_n} Platforms by Total Sales:")
print(platform_sales.head(top_n))
Top 5 Platforms by Total Sales:
platform
PS2 1044.25
X360 900.39
PS3 837.87
Wii 794.93
DS 558.90
Name: total_sales, dtype: float64
# Filter the data to include only the top platforms
top_platforms_data = games[games['platform'].isin(top_platforms)]

top_platforms_data = top_platforms_data.sort_values(by='year_of_release')
top_platforms_data['year_of_release'].unique()
<IntegerArray>
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
2013, 2014, 2015, 2016]
Length: 17, dtype: Int64
Sales outliers in top_platforms_data have been removed according to the IQR method.
build for graph representing distribution of sales.
# Define upper and lower bounds for sales to remove extreme outliers
q1 = top_platforms_data['total_sales'].quantile(0.25)
q3 = top_platforms_data['total_sales'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Filter the data to exclude outliers
filtered_data = top_platforms_data[(top_platforms_data['total_sales'] >= lower_bound) &
                                   (top_platforms_data['total_sales'] <= upper_bound)]

# Create the box plot without outliers
fig = px.box(
    filtered_data,
    x='platform',
    y='total_sales',
    color='platform',
    color_discrete_map={
        'PS2': '#003f5c',
        'DS': '#444e86',
        'X360': '#955196',
        'Wii': '#dd5182',
        'PS3': '#ff6e54'
    },
    facet_col='year_of_release',
    facet_col_wrap=4,
    width=930,
    height=940
)

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(text="Distribution of Game Sales for Top Platforms by Year",
                             x=0.50,
                             y=1.0))

fig.show()
When considering the top 5 platforms by total sales (million USD), all of their sales fall within the 2000-2015 time range, making it the period with the highest density of sales by platform. Within this range, 2000-2003 shows the PS2 as the only (and therefore highest) performer among the top platforms. This suggests that the PS2 was not only the first of this filtered group to release, but maintained its position at the top until its last year of sales in 2010. The PS2 remains the highest grossing platform in terms of total video game sales. Continuing with the remaining PlayStation platforms, the PS3 released in 2006 and posted the highest median total sales that same year.
3.2.2 Find platforms that used to be popular but now have zero sales
# Filter relevant columns
games_zero_df = games[['name', 'platform', 'year_of_release', 'total_sales']]

# Identify platforms with non-zero sales in any year
platforms_with_sales = games_zero_df[games_zero_df['total_sales'] > 0]['platform'].unique()

# Identify platforms with zero sales in the most recent years
most_recent_year = games_zero_df['year_of_release'].max()
recent_years = most_recent_year - 4
recent_sales_range = games_zero_df[games_zero_df['year_of_release'] >= recent_years]
platforms_with_zero_sales = recent_sales_range[recent_sales_range['total_sales'] == 0]['platform'].unique()

# Platforms that had sales in the past but have zero sales now
platforms_no_sales_now = set(platforms_with_sales) & set(platforms_with_zero_sales)

print("Platforms that used to be popular but now have zero sales:", platforms_no_sales_now)
Platforms that used to be popular but now have zero sales: set()
When set() is returned after performing this procedure, it means there is no overlap between the two sets. In this case, there are no platforms that had sales in the past but have zero sales now.

Considering the minimum total_sales value is 0.01 million USD, there are in fact no zero total_sales entries at all, and thus no such platforms when analyzing from the scope of total_sales. Delineating between regional sales, however, might tell a different story.
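As a sketch of that regional follow-up (JP chosen arbitrarily; not part of the original analysis):

# Platforms whose releases in the last five recorded years sold nothing in Japan
recent = games[games['year_of_release'] >= most_recent_year - 4]
jp_recent_sales = recent.groupby('platform')['jp_sales'].sum()
print("Platforms with no recent JP sales:", jp_recent_sales[jp_recent_sales == 0].index.tolist())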
3.2.3 How long does it generally take for new platforms to appear and old ones to fade?
# Create a DataFrame with the first and last year of sales for each platform
platform_lifespan = games_zero_df.groupby('platform')['year_of_release'].agg(['min', 'max']).reset_index()
platform_lifespan.columns = ['platform', 'first_year', 'last_year']
platform_lifespan['lifespan'] = platform_lifespan['last_year'] - platform_lifespan['first_year']

# Calculate the average time it takes for new platforms to appear
new_platforms = platform_lifespan[platform_lifespan['first_year'] > 2000]
average_time_to_appear = new_platforms['first_year'].min() - 2000

# Calculate the average time it takes for old platforms to fade
faded_platforms = platform_lifespan[platform_lifespan['last_year'] < most_recent_year]
average_time_to_fade = most_recent_year - faded_platforms['last_year'].max()

print(f"Average time for new platforms to appear: {average_time_to_appear} years")
print(f"Average time for old platforms to fade: {average_time_to_fade} years")
Average time for new platforms to appear: 1 years
Average time for old platforms to fade: 3 years
In this lifespan analysis, 2000 was chosen as a baseline, marking the milestone when the popular platforms began to release (as determined by the Distribution of Game Sales for Top Platforms by Year chart).
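A complementary summary (a small addition; the value is not shown in the original output) is the mean lifespan across all platforms:

# Mean platform lifespan as a complementary summary
print(f"Mean platform lifespan: {platform_lifespan['lifespan'].mean():.1f} years")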
build for graph representing the lifespan of platforms.
# Create a scatter plot for platform lifespan
fig = px.scatter(
    platform_lifespan,
    x='first_year',
    y='last_year',
    size='lifespan',
    color='platform',
    hover_name='platform',
    labels={'first_year': 'First Year of Sales', 'last_year': 'Last Year of Sales'},
    width=900,
    height=600
)

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(text="Platform Lifespan",
                             x=0.50,
                             y=1.0))

fig.show()
The most noticeable point, standing alone and uniquely large given its lifespan value, is the PC platform. This is easily explained: "PC" is represented as a single unchanging label, unlike the console platforms. With PS2, PS3, and PS4 represented separately, the generational differences between those platforms are visible. A game released for the PC in 1985 and a game released for the PC in 2016 did not target the same hardware, which is why PC is over-represented in this particular metric.
3.3 Significant time range for future model training
We already determined the significant time range for video game sales on the whole. However, to build a model that informs 2017 strategies, we must restrict the time range so that we are not capturing outdated patterns. In this case, 2013 and onward will be used.
relevant_games = games[games['year_of_release'] > 2012]
relevant_games.head(10)
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | Grand Theft Auto V | PS3 | 2013 | Action | 7.02 | 9.09 | 0.98 | 3.96 | 97.0 | 8.2 | M | 21.05 |
23 | Grand Theft Auto V | X360 | 2013 | Action | 9.66 | 5.14 | 0.06 | 1.41 | 97.0 | 8.1 | M | 16.27 |
42 | Grand Theft Auto V | PS4 | 2014 | Action | 3.96 | 6.31 | 0.38 | 1.97 | 97.0 | 8.3 | M | 12.62 |
60 | Call of Duty: Ghosts | X360 | 2013 | Shooter | 6.73 | 2.56 | 0.04 | 0.91 | 73.0 | 2.6 | M | 10.24 |
69 | Call of Duty: Ghosts | PS3 | 2013 | Shooter | 4.10 | 3.63 | 0.38 | 1.25 | 71.0 | 2.6 | M | 9.36 |
77 | FIFA 16 | PS4 | 2015 | Sports | 1.12 | 6.12 | 0.06 | 1.28 | 82.0 | 4.3 | E | 8.58 |
92 | Call of Duty: Advanced Warfare | PS4 | 2014 | Shooter | 2.81 | 3.48 | 0.14 | 1.23 | 83.0 | 5.7 | M | 7.66 |
94 | FIFA 17 | PS4 | 2016 | Sports | 0.66 | 5.75 | 0.08 | 1.11 | 85.0 | 5.0 | E | 7.60 |
105 | Fallout 4 | PS4 | 2015 | Role-Playing | 2.53 | 3.27 | 0.24 | 1.13 | 87.0 | 6.5 | M | 7.17 |
110 | Mario Kart 8 | WiiU | 2014 | Racing | 3.15 | 2.15 | 1.28 | 0.51 | 88.0 | 9.1 | E | 7.09 |
3.4 Which platforms are leading in sales? Which ones are growing or shrinking? Select several potentially profitable platforms.
platform_sales_relevant = relevant_games.groupby('platform')['total_sales'].sum().sort_values(ascending=False)
platform_sales_relevant.head(6)
platform
PS4 245.94
PS3 151.91
XOne 134.06
X360 116.70
WiiU 52.76
3DS 45.78
Name: total_sales, dtype: float64
The top performing platforms in the updated time range have shifted. PS2 no longer appears; it is superseded by the PS4, which acts as the top performer in this range, with PS3 still represented. Expanding to the top 6 platforms will provide more data relevant for modeling 2017.
3.5 Global sales of all games by platform
# Select the top 6 platforms
top_platforms2 = platform_sales_relevant.head(6).index.tolist()

# Filter the data to include only the top platforms
top_platforms_data2 = games[games['platform'].isin(top_platforms2)]
top_platforms_data2 = top_platforms_data2.sort_values(by='year_of_release')
build for graph representing global sales distribution.
# Remove outliers using IQR method for each platform (make it a function this time)
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Filter the dataframe to exclude outliers
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Remove outliers from the dataset (apply the function)
filtered_games = top_platforms_data2.groupby('platform', group_keys=False).apply(remove_outliers, 'total_sales')
# Create the box plot using Plotly Express
fig = px.box(filtered_games, x='platform', y='total_sales',
             title='Global Sales Distribution by Platform (2013-2016)',
             labels={'total_sales': 'Global Sales (in million USD)', 'platform': 'Platform'},
             color='platform',
             color_discrete_map={
                 '3DS': '#003f5c',
                 'PS3': '#444e86',
                 'PS4': '#955196',
                 'WiiU': '#dd5182',
                 'X360': '#ff6e54',
                 'XOne': '#ffa600'
             })

# Customize the plot layout
fig.update_layout(xaxis_title='Platform', yaxis_title='Global Sales (in million USD)',
                  title={'x': 0.5}, width=900, height=600)
fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(text="Global Sales Distribution by Platform (2013-2016)",
                             x=0.50,
                             y=1.0))

fig.show()
/tmp/ipykernel_74871/2607202076.py:13: DeprecationWarning:
DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
To determine whether the differences in sales are significant, we can employ the Kruskal-Wallis H-test.
# Get the unique platforms
platforms = filtered_games['platform'].unique()

# Create a list of groups (global sales for each platform)
groups = [filtered_games[filtered_games['platform'] == platform]['total_sales'] for platform in platforms]

# Perform the Kruskal-Wallis test
stat, p_value = kruskal(*groups)

print(f'Kruskal-Wallis test statistic: {stat}')
print(f'Kruskal-Wallis p-value: {p_value}')
Kruskal-Wallis test statistic: 86.53904927975694
Kruskal-Wallis p-value: 3.580039286380217e-17
According to the SciPy API documentation, this tests the null hypothesis that the population medians of all of the groups are equal. The test works on 2 or more independent samples, which may have different sizes. Some might use the ANOVA test on a problem such as this, but its data assumptions are not preferable for this particular dataset.
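As a quick illustration of why those assumptions are doubtful (a sketch, not part of the original analysis), a Shapiro-Wilk normality check can be run on the per-platform groups built above; heavily right-skewed sales distributions are unlikely to pass:

from scipy.stats import shapiro

# Normality check per platform; small p-values argue against ANOVA's normality assumption
for platform, group in zip(platforms, groups):
    sw_stat, sw_p = shapiro(group)
    print(f"{platform}: Shapiro-Wilk p-value = {sw_p:.3g}")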
The Dunn’s test was selected for post-hoc analysis of the Kruskal-Wallis test to check the pairwise comparisons between each platform.
# Prepare data for Dunn's test
# The data should be in the form of a pandas Series for Global Sales with corresponding platform labels
dunn_data = filtered_games[['total_sales', 'platform']].dropna()

# Perform Dunn's test with Bonferroni correction for multiple comparisons
dunn_results = sp.posthoc_dunn(dunn_data, val_col='total_sales', group_col='platform', p_adjust='bonferroni')

print(dunn_results)
3DS PS3 PS4 WiiU X360 XOne
3DS 1.000000e+00 1.390504e-14 0.147226 1.000000 1.551596e-07 0.023295
PS3 1.390504e-14 1.000000e+00 0.000005 0.000475 1.533063e-03 0.003958
PS4 1.472264e-01 5.378783e-06 1.000000 1.000000 1.243940e-01 1.000000
WiiU 1.000000e+00 4.749854e-04 1.000000 1.000000 2.079647e-01 1.000000
X360 1.551596e-07 1.533063e-03 0.124394 0.207965 1.000000e+00 1.000000
XOne 2.329531e-02 3.957731e-03 1.000000 1.000000 1.000000e+00 1.000000
Significant Comparisons:
3DS vs PS3, X360, XOne: The p-values for these comparisons (1.39e-14, 1.55e-07, 0.023) are very small, indicating significant differences in median global sales between the 3DS and PS3, X360, and XOne platforms.
PS3 vs PS4, WiiU, X360, XOne: The p-values (5.38e-06, 0.000475, 0.001533, 0.003958) indicate significant differences in sales between PS3 and these platforms, showing that PS3 has distinct sales patterns.
X360 vs 3DS, PS3: Significant differences are evident here, with p-values of 1.55e-07 and 0.001533 respectively.
XOne vs 3DS, PS3: The p-values of 0.023 and 0.003958 suggest significant differences between XOne and these platforms.
Non-Significant Comparisons:
3DS vs PS4, WiiU: P-values of 0.147 and 1.0 indicate no significant difference between 3DS and these platforms.
PS4 vs WiiU, X360, XOne: With p-values of 1.0, 0.124, and 1.0, PS4 does not show significant differences in median sales compared to WiiU, X360, or XOne.
WiiU vs 3DS, PS4, X360, XOne: All p-values are well above 0.05, indicating no significant differences in sales between WiiU and these platforms.
Conclusion:
There are significant differences in median global sales between several platform pairs, particularly involving 3DS, PS3, and X360, which show distinctive sales patterns. However, platforms like PS4, WiiU, and XOne exhibit more similarity in sales distribution, as there are no significant differences between them.
3.6 Professional reviews' effect on sales for PS4
PS4_platform = filtered_games[filtered_games['platform'] == 'PS4']
PS4_platform.head()
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
7296 | Angry Birds Star Wars | PS4 | 2013 | Strategy | 0.10 | 0.08 | 0.00 | 0.04 | 47.0 | 2.0 | E | 0.22 |
863 | Need for Speed Rivals | PS4 | 2013 | Racing | 0.73 | 0.92 | 0.03 | 0.30 | 80.0 | 6.2 | E10+ | 1.98 |
6911 | NBA Live 14 | PS4 | 2013 | Sports | 0.15 | 0.05 | 0.00 | 0.04 | 43.0 | 2.7 | E | 0.24 |
2307 | Madden NFL 25 | PS4 | 2013 | Sports | 0.63 | 0.13 | 0.00 | 0.15 | 74.0 | 5.2 | E | 0.91 |
4917 | Skylanders SWAP Force | PS4 | 2013 | Platform | 0.20 | 0.12 | 0.00 | 0.06 | 79.0 | 5.2 | E10+ | 0.38 |
build for graph representing correlation between professional reviews and sales
fig = px.scatter(PS4_platform, x="critic_score", y="total_sales", color='total_sales',
                 title="Total Sale Correlation with Professional Reviews (PS4 games)",
                 labels={'total_sales': 'Total Sales (in millions)', 'critic_score': 'Critic Score'})

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(x=0.50,
                             y=1.0))

fig.show()
print(PS4_platform['critic_score'].corr(PS4_platform['total_sales']))
0.31816367517799193
With a correlation coefficient of 0.318, there is only a weak positive linear relationship between professional reviews and total sales on the PS4 platform.
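Since the relationship need not be linear, a rank-based check is a reasonable complement (a sketch; the Spearman value is not part of the original output):

# Spearman rank correlation captures monotonic (not just linear) association
print(PS4_platform['critic_score'].corr(PS4_platform['total_sales'], method='spearman'))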
3.7 Compare the sales of the same games on other platforms.
To do this, we first filter for the games that appear both on the PS4 and on the other top platforms, then exclude the PS4 entries themselves.
PS4_games = PS4_platform['name'].unique()

# Find games that are shared with other platforms, excluding PS4
shared_games = filtered_games[(filtered_games['name'].isin(PS4_games)) & (filtered_games['platform'] != 'PS4')]
shared_games.sample(5)
| | name | platform | year_of_release | genre | na_sales | eu_sales | jp_sales | other_sales | critic_score | user_score | rating | total_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
8265 | Madden NFL 17 | X360 | 2016 | Sports | 0.14 | 0.02 | 0.0 | 0.02 | 29.312523 | 0.7 | E | 0.18 |
11391 | Transformers: Devastation | PS3 | 2015 | Action | 0.03 | 0.04 | 0.0 | 0.01 | 69.019708 | 7.1 | T | 0.08 |
1455 | Tom Clancy's Rainbow Six: Siege | XOne | 2015 | Shooter | 0.77 | 0.45 | 0.0 | 0.12 | 74.000000 | 6.9 | M | 1.34 |
9629 | NBA 2K17 | PS3 | 2016 | Sports | 0.06 | 0.04 | 0.0 | 0.02 | 37.378045 | 2.0 | E | 0.12 |
9861 | NBA 2K17 | X360 | 2016 | Sports | 0.09 | 0.02 | 0.0 | 0.01 | 46.063992 | 3.4 | E | 0.12 |
build for graph representing correlation between professional reviews and sales (shared games)
fig = px.scatter(shared_games, x="critic_score", y="total_sales", color='total_sales', facet_col='platform',
                 title="Total Sale Correlation with Professional Reviews (shared games)",
                 labels={'total_sales': 'Total Sales (in millions)', 'critic_score': 'Critic Score'})

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(x=0.50,
                             y=1.0))

fig.show()
# Group by platform and calculate the Pearson correlation
correlations = shared_games.groupby('platform').apply(lambda group: group['total_sales'].corr(group['critic_score']))

# Print the correlation for each platform
for platform, corr in correlations.items():
    print(f"Correlation between total_sales and critic_score for {platform}: {corr}")
Correlation between total_sales and critic_score for 3DS: -0.3817347562062789
Correlation between total_sales and critic_score for PS3: 0.22682294202250627
Correlation between total_sales and critic_score for WiiU: 0.4033118763354005
Correlation between total_sales and critic_score for X360: 0.21228717186696933
Correlation between total_sales and critic_score for XOne: 0.33775568867121897
/tmp/ipykernel_74871/529713270.py:2: DeprecationWarning:
DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
Restricting the professional review analysis to games shared with the PS4 shows that the WiiU, while still expressing a correlation coefficient below 0.5, was the platform most affected by professional reviews among those shown.
3.8 Distribution of games by genre
build for graph representing distribution of total sales by genre.
# Remove outliers from the dataset (apply the function)
games2 = games.groupby('genre', group_keys=False).apply(remove_outliers, 'total_sales')

# Visualize the total sales by genre
fig = px.box(games2, x='genre', y='total_sales',
             title='Distribution of Total Sales by Genre',
             labels={'total_sales': 'Total Sales (in millions)', 'genre': 'Genre'},
             height=600,
             width=900)

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(x=0.50,
                             y=1.0))

fig.show()
/tmp/ipykernel_74871/325689584.py:2: DeprecationWarning:
DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
According to the above chart, Action and Sports lead in overall sales volume. However, looking at median total sales, genres such as Platform, Role-Playing, Shooter, and Fighting are better represented, even when compared to Action and Sports.
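To put numbers behind that reading (a small addition; the values are not shown in the original output), the medians can be listed directly:

# Median total sales per genre after outlier removal
print(games2.groupby('genre')['total_sales'].median().sort_values(ascending=False))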
Regional Profiles
4.3 Regional effect of ESRB ratings on sales
# Group by 'rating' and sum the sales for each region
rating_sales = games.groupby('rating')[['na_sales', 'eu_sales', 'jp_sales']].sum().reset_index()

# Calculate total sales for each region
total_na_sales_rating = rating_sales['na_sales'].sum()
total_eu_sales_rating = rating_sales['eu_sales'].sum()
total_jp_sales_rating = rating_sales['jp_sales'].sum()
# Calculate market share for each rating in each region
rating_sales['market_share_na'] = (rating_sales['na_sales'] / total_na_sales_rating) * 100
rating_sales['market_share_eu'] = (rating_sales['eu_sales'] / total_eu_sales_rating) * 100
rating_sales['market_share_jp'] = (rating_sales['jp_sales'] / total_jp_sales_rating) * 100
build for graph representing regional rating market share.
rating_sales_melted = pd.melt(rating_sales, id_vars='rating',
                              value_vars=['market_share_na', 'market_share_eu', 'market_share_jp'],
                              var_name='region', value_name='market_share')

rating_sales_melted['region'] = rating_sales_melted['region'].replace({
    'market_share_na': 'NA',
    'market_share_eu': 'EU',
    'market_share_jp': 'JP'
})

fig = px.bar(rating_sales_melted, x='rating', y='market_share', color='region', barmode='group',
             color_discrete_map={
                 'NA': '#003f5c',
                 'EU': '#bc5090',
                 'JP': '#ffa600'
             },
             title="Market Share of ESRB Ratings by Region", labels={'market_share': 'Market Share (%)'})

fig.update_layout(title_font_variant="all-small-caps",
                  title=dict(x=0.50,
                             y=1.0))

fig.show()
The E rating dominates the global space for video games, while AO, EC, K-A, and RP are hardly represented, especially when delineated across separate regions. M and T are more broadly expressed globally, with JP holding a lead on T-rated game sales.
Hypothesis Tests
Average user ratings of the Xbox One and PC platforms.
Null Hypothesis: There is no significant difference between the average user ratings of the Xbox One and PC platforms.
Alternative Hypothesis: The average user ratings of the Xbox One and PC platforms are different.
Average user ratings for the Action and Sports genres.
Null Hypothesis: There is no significant difference between the average user ratings of the Action and Sports genres.
Alternative Hypothesis: The average user ratings of the Action and Sports genres are different.
# Filter for Xbox One and PC platforms
xbox_one_ratings = games[games['platform'] == 'XOne']['user_score']
pc_ratings = games[games['platform'] == 'PC']['user_score']

# Filter for Action and Sports genres
action_ratings = games[games['genre'] == 'Action']['user_score']
sports_ratings = games[games['genre'] == 'Sports']['user_score']
First we must determine whether the variances of these groups are equal, in order to conduct the proper hypothesis tests.
# Levene's test for equality of variances
levene_stat_xbox_pc, levene_p_value_xbox_pc = levene(xbox_one_ratings, pc_ratings)
levene_stat_action_sports, levene_p_value_action_sports = levene(action_ratings, sports_ratings)
# For Xbox One vs. PC
if levene_p_value_xbox_pc < 0.05:
    # Unequal variances
    t_stat_xbox_pc, p_value_xbox_pc = ttest_ind(xbox_one_ratings, pc_ratings, equal_var=False)
    print("Levene's test indicates unequal variances. Using Welch's t-test.")
else:
    # Equal variances
    t_stat_xbox_pc, p_value_xbox_pc = ttest_ind(xbox_one_ratings, pc_ratings, equal_var=True)
    print("Levene's test indicates equal variances. Using standard t-test.")

# For Action vs. Sports
if levene_p_value_action_sports < 0.05:
    # Unequal variances
    t_stat_action_sports, p_value_action_sports = ttest_ind(action_ratings, sports_ratings, equal_var=False)
    print("Levene's test indicates unequal variances. Using Welch's t-test.")
else:
    # Equal variances
    t_stat_action_sports, p_value_action_sports = ttest_ind(action_ratings, sports_ratings, equal_var=True)
    print("Levene's test indicates equal variances. Using standard t-test.")
Levene's test indicates equal variances. Using standard t-test.
Levene's test indicates equal variances. Using standard t-test.
The variances were determined to be equal between the groups of interest, so a standard t-test could be carried out.
alpha = 0.025

# Results for Hypothesis 1
if p_value_xbox_pc < alpha:
    print("Reject the null hypothesis: Average user ratings for Xbox One and PC are different.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average user ratings for Xbox One and PC.")

# Results for Hypothesis 2
if p_value_action_sports < alpha:
    print("Reject the null hypothesis: Average user ratings for Action and Sports genres are different.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average user ratings for Action and Sports genres.")
Reject the null hypothesis: Average user ratings for Xbox One and PC are different.
Fail to reject the null hypothesis: No significant difference in average user ratings for Action and Sports genres.
According to the above tests:
Xbox One and PC - Average user ratings for Xbox One and PC are different.
Action and Sports genres - There is no significant difference in average user ratings between the Action and Sports genres.
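For the record, the underlying statistics behind those printed decisions can be shown directly (a small addition; the values themselves are omitted here):

# Report the t-statistics and p-values supporting the two decisions above
print(f"Xbox One vs PC: t = {t_stat_xbox_pc:.3f}, p = {p_value_xbox_pc:.3g}")
print(f"Action vs Sports: t = {t_stat_action_sports:.3f}, p = {p_value_action_sports:.3g}")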
Conclusion
When looking at the data as a whole, there is much that can be derived. In terms of data preparation and preprocessing, handling the implicit duplicates was an essential step in ensuring accurate analysis, especially when it came to differentiating games that shared the same name yet were released at different times on separate platforms. Additionally, handling the data types to ensure a reasonable representation of the year data was required given the original form of the data.
Continuing with the analysis, most of the analytical structures and directions revolved around the total_sales variable, created via the summation of the regional sales values for every entry. The further grouping and aggregation by platform, genre, and rating were all done separately to avoid doubling up aggregations. The zero-sales analysis from the scope of total_sales yielded no further value, since on a global scale there is not a single platform without a sale. Perhaps further analysis on a regional scale could be completed on the same topic.
On the same thread, reviewing the top 5 and top 6 performing platforms revealed a consistent range of time, 2000-2016, in which all of the top performing platforms by global sales were expressed. This allowed the previous years to be disregarded in further analysis.
On the whole, there is a clear trend across the PlayStation platforms and their respective global sales performance. Given their leading global market share as a platform family, there is little doubt that investment in title development for future PlayStation platforms would be the most profitable. Additionally, investment in PlayStation titles in the Action genre would be most advisable considering its global market share. Targeted development of Role-Playing titles can be accomplished successfully in Japan.
In addition to the main analysis for 2017 campaign strategies, the hypothesis test regarding potential parity between Xbox One and PC user ratings showed that the averages are in fact different. Also, there is no significant difference in the average user rating between the Action and Sports genres; therefore, one could conclude that user ratings had little to no effect on the clear and significant difference between those genres' market shares.