OilyGiant Mining Company - Simulated Regional Profit Analysis & Recommendations
Author
Umberto Fasci
Introduction
The primary goal of this project is to build a machine learning model capable of accurately predicting the volume of oil reserves for new wells in each region. This model will help select the top-performing oil wells based on predicted reserves and identify the region with the highest profit margin. The model’s output will guide the decision on where to drill, considering both profitability and financial risk.
For the purpose of understanding the main structure of the following analysis, I have taken a modular approach to each of the sections and subsections, creating functions that perform the modeling and analysis with a defined output structure so the resulting information can be reused downstream.
Library Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Data Ingest and Initial Exploration
Data Ingest and Duplicate Handling Code
file_paths = ["../datasets/geo_data_0.csv", "../datasets/geo_data_1.csv", "../datasets/geo_data_2.csv"]
geodata = []
duplicates_df = pd.DataFrame()  # Create empty DataFrame for duplicates

for path in file_paths:
    # Load the data
    data = pd.read_csv(path)

    # Find duplicates
    duplicates = data[data.duplicated(subset='id', keep=False)].copy()  # Use copy() to avoid SettingWithCopyWarning
    if not duplicates.empty:
        # Add a column to indicate which file the duplicates came from
        duplicates['source_file'] = path.split('/')[-1]
        duplicates_df = pd.concat([duplicates_df, duplicates], ignore_index=True)

    # Remove duplicates based on the 'id' column and add to the list
    geodata.append(data.drop_duplicates(subset='id'))

# Sort duplicates_df by ID to group duplicates together
duplicates_df = duplicates_df.sort_values(['id']).reset_index(drop=True)

# Unpack the data into separate variables if needed
geo_data_0, geo_data_1, geo_data_2 = geodata
display(duplicates_df)
|    | id    | f0        | f1         | f2        | product    | source_file    |
|----|-------|-----------|------------|-----------|------------|----------------|
| 0  | 5ltQ6 | -3.435401 | -12.296043 | 1.999796  | 57.085625  | geo_data_1.csv |
| 1  | 5ltQ6 | 18.213839 | 2.191999   | 3.993869  | 107.813044 | geo_data_1.csv |
| 2  | 74z30 | 1.084962  | -0.312358  | 6.990771  | 127.643327 | geo_data_0.csv |
| 3  | 74z30 | 0.741456  | 0.459229   | 5.153109  | 140.771492 | geo_data_0.csv |
| 4  | A5aEY | -0.039949 | 0.156872   | 0.209861  | 89.249364  | geo_data_0.csv |
| 5  | A5aEY | -0.180335 | 0.935548   | -2.094773 | 33.020205  | geo_data_0.csv |
| 6  | AGS9W | -0.933795 | 0.116194   | -3.655896 | 19.230453  | geo_data_0.csv |
| 7  | AGS9W | 1.454747  | -0.479651  | 0.683380  | 126.370504 | geo_data_0.csv |
| 8  | HZww2 | 0.755284  | 0.368511   | 1.863211  | 30.681774  | geo_data_0.csv |
| 9  | HZww2 | 1.061194  | -0.373969  | 10.430210 | 158.828695 | geo_data_0.csv |
| 10 | KUPhW | 0.231846  | -1.698941  | 4.990775  | 11.716299  | geo_data_2.csv |
| 11 | KUPhW | 1.211150  | 3.176408   | 5.543540  | 132.831802 | geo_data_2.csv |
| 12 | LHZR0 | -8.989672 | -4.286607  | 2.009139  | 57.085625  | geo_data_1.csv |
| 13 | LHZR0 | 11.170835 | -1.945066  | 3.002872  | 80.859783  | geo_data_1.csv |
| 14 | QcMuo | 0.506563  | -0.323775  | -2.215583 | 75.496502  | geo_data_0.csv |
| 15 | QcMuo | 0.635635  | -0.473422  | 0.862670  | 64.578675  | geo_data_0.csv |
| 16 | Tdehs | 0.829407  | 0.298807   | -0.049563 | 96.035308  | geo_data_0.csv |
| 17 | Tdehs | 0.112079  | 0.430296   | 3.218993  | 60.964018  | geo_data_0.csv |
| 18 | TtcGQ | 0.569276  | -0.104876  | 6.440215  | 85.350186  | geo_data_0.csv |
| 19 | TtcGQ | 0.110711  | 1.022689   | 0.911381  | 101.318008 | geo_data_0.csv |
| 20 | VF7Jo | 2.122656  | -0.858275  | 5.746001  | 181.716817 | geo_data_2.csv |
| 21 | VF7Jo | -0.883115 | 0.560537   | 0.723601  | 136.233420 | geo_data_2.csv |
| 22 | Vcm5J | -1.229484 | -2.439204  | 1.222909  | 137.968290 | geo_data_2.csv |
| 23 | Vcm5J | 2.587702  | 1.986875   | 2.482245  | 92.327572  | geo_data_2.csv |
| 24 | bfPNe | -6.202799 | -4.820045  | 2.995107  | 84.038886  | geo_data_1.csv |
| 25 | bfPNe | -9.494442 | -5.463692  | 4.006042  | 110.992147 | geo_data_1.csv |
| 26 | bsk9y | 0.378429  | 0.005837   | 0.160827  | 160.637302 | geo_data_0.csv |
| 27 | bsk9y | 0.398908  | -0.400253  | 10.122376 | 163.433078 | geo_data_0.csv |
| 28 | bxg6G | -0.823752 | 0.546319   | 3.630479  | 93.007798  | geo_data_0.csv |
| 29 | bxg6G | 0.411645  | 0.856830   | -3.653440 | 73.604260  | geo_data_0.csv |
| 30 | fiKDv | 0.157341  | 1.028359   | 5.585586  | 95.817889  | geo_data_0.csv |
| 31 | fiKDv | 0.049883  | 0.841313   | 6.394613  | 137.346586 | geo_data_0.csv |
| 32 | wt4Uk | -9.091098 | -8.109279  | -0.002314 | 3.179103   | geo_data_1.csv |
| 33 | wt4Uk | 10.259972 | -9.376355  | 4.994297  | 134.766305 | geo_data_1.csv |
| 34 | xCHr8 | 1.633027  | 0.368135   | -2.378367 | 6.120525   | geo_data_2.csv |
| 35 | xCHr8 | -0.847066 | 2.101796   | 5.597130  | 184.388641 | geo_data_2.csv |
All duplicates across the datasets are duplicated only on the id column. Looking closely, none of the feature or product values within this duplicate set are themselves duplicated, which suggests these rows are additional measurements taken at the same well id.
In the end, to avoid any potential issues, and considering the negligible amount of data involved, these duplicated rows were dropped.
Regions 1 and 3 show similar product-distribution patterns and scales, suggesting they may share similar underlying characteristics. Region 2 stands out dramatically, with a distinct spike pattern and much higher maximum counts. The x-axis (product) range is similar across all regions (0-175). The variance in distribution patterns suggests different operational or measurement approaches across regions.
Region 1 Feature Distributions:
f0: Shows a multimodal distribution with several peaks, roughly symmetric but with clear separations between clusters
f1: Similar to f0, displays multimodal behavior with 3-4 distinct peaks and valleys
f2: Appears more unimodal and normally distributed, centered around 0-2 with some slight right skew
Region 2 Feature Distributions:
f0: Bimodal distribution with two prominent peaks, fairly symmetric around 0
f1: Single normal/Gaussian distribution, unimodal and symmetric
f2: Appears to be a discrete-looking distribution with regularly spaced spikes, though noted as continuous. This could indicate heavily quantized or binned data.
Region 3 Feature Distributions:
f0: Single peaked, normal distribution with slight asymmetry
f1: Normal distribution, very symmetric and unimodal
f2: Normal distribution with slight heavy tails, symmetric and unimodal
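The histograms behind the observations above are not reproduced in this extract. A minimal sketch that would regenerate equivalent target and feature distribution plots (region_frames is a hypothetical local name; the similar regions_data mapping is only defined later):

region_frames = {'Region 1': geo_data_0, 'Region 2': geo_data_1, 'Region 3': geo_data_2}  # hypothetical helper mapping

fig, axes = plt.subplots(3, 4, figsize=(20, 12))
for row, (region_name, frame) in enumerate(region_frames.items()):
    for col, column in enumerate(['f0', 'f1', 'f2', 'product']):
        # One histogram per region/column pair, mirroring the plots described above
        sns.histplot(data=frame, x=column, ax=axes[row, col], color='#121314')
        axes[row, col].set_title(f'{region_name} - {column}')
plt.tight_layout()
plt.show()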
Base Model
def train_and_evaluate_region(data, region_name):
    # Separate features and target
    features = data[['f0', 'f1', 'f2']]
    target = data['product']

    # Split data into training and validation sets (75:25)
    features_train, features_val, target_train, target_val = train_test_split(
        features, target, test_size=0.25, random_state=12345)

    # Initialize and train model
    model = LinearRegression()
    model.fit(features_train, target_train)

    # Make predictions on validation set
    target_pred = model.predict(features_val)

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(target_val, target_pred))
    r2 = r2_score(target_val, target_pred)

    # Save validation results
    validation_results = pd.DataFrame({
        'Actual': target_val,
        'Predicted': target_pred,
        'Error': target_val - target_pred
    })

    # Get feature coefficients
    feature_coefficients = dict(zip(features.columns, model.coef_))

    model_params = {
        'param_fit_intercept': model.fit_intercept,
        'param_copy_X': model.copy_X,
        'param_positive': model.positive,
        'param_n_features_in': model.n_features_in_
    }

    return {
        'region_name': region_name,
        'model': model,
        'rmse': rmse,
        'r2': r2,
        'avg_predicted': np.mean(target_pred),
        'avg_actual': np.mean(target_val),
        'validation_results': validation_results,
        'feature_coefficients': feature_coefficients,
        'intercept': model.intercept_,
        'parameters': model_params
    }
For the purposes of this analysis, Linear Regression is the chosen algorithm for modeling the regional product target. RMSE is used as the main metric to measure model performance.
Additional metrics are used in tandem to inform model performance.
Model parameters are collected and stored in the model_params dictionary, in accordance with my output-structure workflow. Working with stored data in this fashion allows for much more versatility in later functionality, such as outputting a DataFrame showcasing the model results.
# Process all regions
regions_data = {
    'Region 1': geo_data_0,
    'Region 2': geo_data_1,
    'Region 3': geo_data_2
}
Taking advantage of the dictionary structure in this case allows me to not only call upon this data when I need to but also take advantage of the keys for the output structure from each milestone in the analysis.
all_results = {}     # Dictionary storage for model results and coefficients
master_results = []  # List to store rows for the result DataFrame

for region_name, data in regions_data.items():
    results = train_and_evaluate_region(data, region_name)
    all_results[region_name] = results

    # Append results for DataFrame
    master_results.append({
        'Region': region_name,
        'RMSE': results['rmse'],
        'R2 Score': results['r2'],
        'Avg Predicted Volume': results['avg_predicted'],
        'Avg Actual Volume': results['avg_actual'],
        **results['feature_coefficients'],  # feature coefficients
        'Intercept': results['intercept'],
        **results['parameters']             # model params
    })
Base Model Results
# Create master result DataFrame
master_result_df = pd.DataFrame(master_results)
display(master_result_df)
|   | Region   | RMSE      | R2 Score | Avg Predicted Volume | Avg Actual Volume | f0        | f1         | f2        | Intercept | param_fit_intercept | param_copy_X | param_positive | param_n_features_in |
|---|----------|-----------|----------|----------------------|-------------------|-----------|------------|-----------|-----------|---------------------|--------------|----------------|---------------------|
| 0 | Region 1 | 37.853527 | 0.272392 | 92.789156            | 92.158205         | 3.781966  | -13.892695 | 6.634141  | 77.637060 | True                | True         | False          | 3                   |
| 1 | Region 2 | 0.892059  | 0.999622 | 69.178320            | 69.186044         | -0.144811 | -0.022002  | 26.950821 | 1.652857  | True                | True         | False          | 3                   |
| 2 | Region 3 | 40.075851 | 0.195562 | 94.865725            | 94.785109         | 0.052104  | -0.061638  | 5.772026  | 80.616568 | True                | True         | False          | 3                   |
Using the outputs from the structure designed in the previous functions allows me to display a clean DataFrame of the results for each of the regional models.
Additionally, I use the validation_results from the model runs to build the analytical plots describing the performance of the Linear Regression models.
Base Model Performance Analysis Graph Function
def plot_results(all_results):
    # Set the style for all plots
    sns.set_style("whitegrid")

    for region_name, results in all_results.items():
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        fig.suptitle(f'Base Model Performance Analysis - {region_name}', fontsize=14, y=1.00)

        # Actual vs Predicted Plot
        sns.scatterplot(data=results['validation_results'], x='Actual', y='Predicted',
                        alpha=0.6, ax=ax1, color='#121314')

        # Add diagonal line (representing perfect predictions)
        min_val = min(results['validation_results']['Actual'].min(),
                      results['validation_results']['Predicted'].min())
        max_val = max(results['validation_results']['Actual'].max(),
                      results['validation_results']['Predicted'].max())
        ax1.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect Prediction')

        # Customize first plot
        ax1.set_xlabel('Actual Values', fontsize=12)
        ax1.set_ylabel('Predicted Values', fontsize=12)
        ax1.set_title('Actual vs Predicted Values', pad=20)
        ax1.legend(loc='lower center')

        # Add R^2 and RMSE annotations
        text = f'R² = {results["r2"]:.3f}\nRMSE = {results["rmse"]:.2f}'
        ax1.text(0.05, 0.95, text, transform=ax1.transAxes, verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='white', edgecolor='#121314', alpha=0.8))

        # Error Distribution Plot
        sns.histplot(data=results['validation_results'], x='Error', kde=True,
                     ax=ax2, color='#890A0A')

        # Error distribution plot aesthetics
        ax2.set_xlabel('Prediction Error', fontsize=12)
        ax2.set_ylabel('Count', fontsize=12)
        ax2.set_title('Error Distribution', pad=20)

        # Add mean, median, and std annotations
        mean_error = results['validation_results']['Error'].mean()
        std_error = results['validation_results']['Error'].std()
        med_error = results['validation_results']['Error'].median()
        text = f'Mean = {mean_error:.2f}\nMedian = {med_error:.2f}\nStd = {std_error:.2f}'
        ax2.text(0.95, 0.95, text, transform=ax2.transAxes, horizontalalignment='right',
                 verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='white', edgecolor='#121314', alpha=0.8))

        plt.tight_layout()
        plt.show()

plot_results(all_results)
Base Model Analysis
Region 2 shows excellent performance:
- Very high R² (0.999), indicating the model explains nearly all variance in the data
- Very low RMSE (0.89), showing high prediction accuracy
- The scatter plot shows points tightly clustered along the perfect prediction line
- Error distribution is narrow and normally distributed, with a small standard deviation
- Predicted vs actual volumes are nearly identical (69.18 vs 69.19)
- The highly quantized data exhibited by both the f2 feature and the target potentially holds strong influence on this model's results
Regions 1 and 3 show poor performance:
- Low R² values (0.27 and 0.20, respectively), indicating the model explains very little of the variance
- High RMSE values (37.85 and 40.08), showing large prediction errors
- Scatter plots show wide dispersion from the perfect prediction line
- Error distributions are much wider, with larger standard deviations
- While average predicted volumes are close to actuals, individual predictions vary greatly
Reviewing the assumptions made by Linear Regression modeling, one assumption is that the errors are normally distributed; this is why I have included the error distributions in the model performance analysis.
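As a numeric complement to the error histograms, the residuals can also be checked for normality directly; a small sketch, assuming scipy is available as an extra dependency:

from scipy import stats  # extra dependency, not part of the imports above

# D'Agostino-Pearson normality test on the validation residuals of each region
for region_name, results in all_results.items():
    errors = results['validation_results']['Error']
    stat, p_value = stats.normaltest(errors)
    print(f"{region_name}: skew={errors.skew():.3f}, "
          f"excess kurtosis={errors.kurt():.3f}, normaltest p-value={p_value:.4f}")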
Hypertuned Model
For LinearRegression's n_jobs parameter, None means 1 unless in a joblib parallel backend context; for that reason, 1 is not included for n_jobs in the parameter space.
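The param_space grid consumed by the tuning function below is not defined in this extract. Based on the param_* columns reported in the hypertuned results table, a plausible reconstruction looks like this (the exact n_jobs candidates are an assumption):

# Reconstructed search grid for LinearRegression; the keys match the param_*
# columns in the results table, but the candidate values are assumed
param_space = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'positive': [True, False],
    'n_jobs': [5, 10]  # 1/None deliberately excluded, per the note above
}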
def train_and_hypertune_region(data, region_name):
    # Separate features and target
    features = data[['f0', 'f1', 'f2']]
    target = data['product']

    # Split data into training and validation sets (75:25)
    features_train, features_val, target_train, target_val = train_test_split(
        features, target, test_size=0.25, random_state=12345)

    # Initialize model
    model = LinearRegression()

    # Hypertune with GridSearch
    grid_search = GridSearchCV(model, param_space, cv=5, scoring='neg_root_mean_squared_error')
    grid_search.fit(features_train, target_train)
    best_model = grid_search.best_estimator_

    # Make predictions on validation set
    target_pred = best_model.predict(features_val)

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(target_val, target_pred))
    r2 = r2_score(target_val, target_pred)

    # Save validation results
    validation_results = pd.DataFrame({
        'Actual': target_val,
        'Predicted': target_pred,
        'Error': target_val - target_pred
    })

    # Get feature coefficients
    feature_coefficients = dict(zip(features.columns, best_model.coef_))

    # Get hyperparameters as a flattened dictionary
    hyperparams = {f'param_{key}': value for key, value in grid_search.best_params_.items()}

    return {
        'region_name': region_name,
        'model': best_model,
        'rmse': rmse,
        'r2': r2,
        'avg_predicted': np.mean(target_pred),
        'avg_actual': np.mean(target_val),
        'validation_results': validation_results,
        'feature_coefficients': feature_coefficients,
        'intercept': best_model.intercept_,
        'hyperparameters': hyperparams
    }
neg_root_mean_squared_error is used for the grid search scoring parameter; this was chosen after reviewing the documentation and investigating the scoring methods available.
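For reference, the complete set of built-in scoring strings accepted by GridSearchCV can be listed directly (get_scorer_names is available in scikit-learn 1.0 and later):

from sklearn.metrics import get_scorer_names

# Print every built-in scoring string accepted by GridSearchCV's `scoring` argument
print(sorted(get_scorer_names()))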
hyper_results = {}          # Dictionary storage for hypertuned model results and coefficients
master_hyper_results = []   # List to store rows for the master result DataFrame

for region_name, data in regions_data.items():
    results = train_and_hypertune_region(data, region_name)
    hyper_results[region_name] = results

    # Append results for master DataFrame
    master_hyper_results.append({
        'Region': region_name,
        'RMSE': results['rmse'],
        'R2 Score': results['r2'],
        'Avg Predicted Volume': results['avg_predicted'],
        'Avg Actual Volume': results['avg_actual'],
        **results['feature_coefficients'],  # feature coefficients
        'Intercept': results['intercept'],
        **results['hyperparameters']        # best hyperparameters
    })
# Create master result DataFrame
master_hyper_result_df = pd.DataFrame(master_hyper_results)
display(master_hyper_result_df)
|   | Region   | RMSE      | R2 Score | Avg Predicted Volume | Avg Actual Volume | f0        | f1         | f2        | Intercept | param_copy_X | param_fit_intercept | param_n_jobs | param_positive |
|---|----------|-----------|----------|----------------------|-------------------|-----------|------------|-----------|-----------|--------------|---------------------|--------------|----------------|
| 0 | Region 1 | 37.853527 | 0.272392 | 92.789156            | 92.158205         | 3.781966  | -13.892695 | 6.634141  | 77.637060 | True         | True                | 5            | False          |
| 1 | Region 2 | 0.892059  | 0.999622 | 69.178320            | 69.186044         | -0.144811 | -0.022002  | 26.950821 | 1.652857  | True         | True                | 5            | False          |
| 2 | Region 3 | 40.075516 | 0.195576 | 94.866021            | 94.785109         | 0.051994  | 0.000000   | 5.772001  | 80.616838 | True         | True                | 5            | True           |
Hypertuned Model Performance Analysis Graph Function
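The plotting cell itself is not reproduced in this extract; presumably the base-model helper is simply reused on the hypertuned results, along these lines:

# Reuse the base-model plotting helper on the hypertuned results
# (assumption: the original notebook called an equivalent cell here;
# the hard-coded 'Base Model' figure title would also need a small tweak)
plot_results(hyper_results)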
Region 2 has the highest model accuracy, with an RMSE of 0.892 and an R2 Score of 0.9996, suggesting almost perfect prediction.
Region 1 and Region 3 have higher RMSE values (37.85 and 40.08, respectively) and lower R2 scores, indicating weaker predictive accuracy in these regions.
The average predicted volumes are very close to the actual volumes across all regions.
It seems that even after model hypertuning, there is no change in the regional performance. Without any significant data transformations implemented, these are the best models available.
Profit Calculations
Conditions:
- When exploring a region, a study of 500 points is carried out, and the best 200 points are picked for the profit calculation.
- The budget for development of 200 oil wells is 100 million USD.
- One barrel of raw materials brings 4.5 USD of revenue, so the revenue from one unit of product is 4,500 USD (the volume of reserves is expressed in thousands of barrels).
- After the risk evaluation, keep only the regions with a risk of losses lower than 2.5%. From the ones that fit the criteria, the region with the highest average profit should be selected.
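The calculate_profit and bootstrap_profit_analysis helpers called below are not defined in this extract. A minimal sketch consistent with the conditions above, and with the result keys consumed later, might look like the following; the 1,000-iteration bootstrap, the 95% percentile interval, and the random seed are assumptions:

# Sketch of the profit helpers (assumed: 1,000 bootstrap iterations,
# a 95% percentile confidence interval, and seed 12345)
WELLS_STUDIED = 500        # wells sampled per study
WELLS_SELECTED = 200       # best wells kept for development
BUDGET = 100_000_000       # development budget, USD
REVENUE_PER_UNIT = 4_500   # USD per unit of product (thousand barrels)
# Break-even volume per well: 100M / 200 / 4,500 ≈ 111.1 thousand barrels

def calculate_profit(actual, predicted, n_selected=WELLS_SELECTED):
    # Pick the wells with the highest *predicted* reserves,
    # then value them at their *actual* reserves
    top_idx = predicted.sort_values(ascending=False).index[:n_selected]
    selected_actual = actual.loc[top_idx]
    total_volume = selected_actual.sum()
    return {
        'profit': total_volume * REVENUE_PER_UNIT - BUDGET,
        'average_volume': selected_actual.mean(),
        'total_volume': total_volume
    }

def bootstrap_profit_analysis(actual, predicted, n_samples=1000,
                              sample_size=WELLS_STUDIED):
    state = np.random.RandomState(12345)
    profits, volumes = [], []
    for _ in range(n_samples):
        # Draw a 500-well study sample (with replacement), keeping the
        # actual/predicted pairs aligned on a fresh positional index
        idx = state.choice(len(actual), size=sample_size, replace=True)
        sample_actual = pd.Series(actual.to_numpy()[idx])
        sample_pred = pd.Series(np.asarray(predicted)[idx])
        result = calculate_profit(sample_actual, sample_pred)
        profits.append(result['profit'])
        volumes.append(result['total_volume'])
    profits, volumes = pd.Series(profits), pd.Series(volumes)
    return {
        'mean_profit': profits.mean(),
        'std_profit': profits.std(),
        'prob_loss': (profits < 0).mean() * 100,
        'confidence_interval': (profits.quantile(0.025), profits.quantile(0.975)),
        'mean_volume': volumes.mean(),
        'std_volume': volumes.std()
    }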
region_analysis = {}

for region_name, results in hyper_results.items():
    # Get validation results from the previously trained models
    val_results = results['validation_results']

    # Calculate base profit using all validation data
    base_profit = calculate_profit(val_results['Actual'], val_results['Predicted'])

    # Perform bootstrap analysis
    bootstrap_results = bootstrap_profit_analysis(val_results['Actual'], val_results['Predicted'])

    region_analysis[region_name] = {
        'base_profit': base_profit,
        'bootstrap_results': bootstrap_results
    }
region_analysis_results = []

# Populate results for each region
for region_name, analysis in region_analysis.items():
    base_results = analysis['base_profit']
    bootstrap_results = analysis['bootstrap_results']
    region_analysis_results.append({
        'Region': region_name,
        'Base Profit ($M)': base_results['profit'] / 1_000_000,
        'Average Volume (thousand barrels)': base_results['average_volume'],
        'Total Volume (thousand barrels)': base_results['total_volume'],
        'Mean Bootstrap Profit ($M)': bootstrap_results['mean_profit'] / 1_000_000,
        'Profit Std Dev ($M)': bootstrap_results['std_profit'] / 1_000_000,
        'Probability of Loss (%)': bootstrap_results['prob_loss'],
        'CI Lower Bound ($M)': bootstrap_results['confidence_interval'][0] / 1_000_000,
        'CI Upper Bound ($M)': bootstrap_results['confidence_interval'][1] / 1_000_000,
        'Mean Bootstrap Volume (thousand barrels)': bootstrap_results['mean_volume'],
        'Volume Std Dev (thousand barrels)': bootstrap_results['std_volume']
    })

# Create and display the DataFrame
region_analysis_df = pd.DataFrame(region_analysis_results)

# Round numerical columns
numeric_columns = region_analysis_df.select_dtypes(include=[np.number]).columns
region_analysis_df[numeric_columns] = region_analysis_df[numeric_columns].round(2)

# Sort by Mean Bootstrap Profit
region_analysis_df = region_analysis_df.sort_values('Mean Bootstrap Profit ($M)', ascending=False)

# Display the DataFrame
display(region_analysis_df)
|   | Region   | Base Profit ($M) | Average Volume (thousand barrels) | Total Volume (thousand barrels) | Mean Bootstrap Profit ($M) | Profit Std Dev ($M) | Probability of Loss (%) | CI Lower Bound ($M) | CI Upper Bound ($M) | Mean Bootstrap Volume (thousand barrels) | Volume Std Dev (thousand barrels) |
|---|----------|------------------|-----------------------------------|---------------------------------|----------------------------|---------------------|-------------------------|---------------------|---------------------|------------------------------------------|-----------------------------------|
| 1 | Region 2 | 24.15            | 137.95                            | 27589.08                        | 4.78                       | 2.05                | 1.2                     | 0.90                | 8.67                | 23285.52                                 | 456.10                            |
| 0 | Region 1 | 33.65            | 148.50                            | 29700.42                        | 3.81                       | 2.59                | 7.7                     | -1.43               | 8.91                | 23068.03                                 | 574.58                            |
| 2 | Region 3 | 25.12            | 139.02                            | 27803.77                        | 3.30                       | 2.61                | 11.3                    | -1.89               | 8.38                | 22954.56                                 | 579.02                            |
Region 1
Highest Base Profit: Region 1 has the highest base profit of $33.65 million, suggesting it is the most profitable under initial calculations.
Profit Variability: With a standard deviation of $2.59 million, this region has moderate profit variability, but a 7.7% probability of loss, higher than the other regions and above the 2.5% risk threshold.
Profit Confidence Interval: Ranges from -$1.43 million to $8.91 million, suggesting possible but limited risk of loss.
Volume Statistics: The average volume per selected well is the highest at 148.50 thousand barrels, with a substantial total volume of 29,700.42 thousand barrels.
Region 2
Stable Profit Estimates: Region 2 has a base profit of $24.15 million, the highest mean bootstrap profit at $4.78 million, and the lowest profit variability (std dev: $2.05 million).
Low Loss Probability: Only a 1.2% chance of incurring a loss, making it the least risky region.
Profit Confidence Interval: Between $0.90 million and $8.67 million, indicating positive expectations.
Consistent Volume: The mean bootstrap volume is 23,285.52 thousand barrels, with a low standard deviation of 456.10 thousand barrels, suggesting consistent output.
Region 3
Moderate Base Profit: Region 3 has a base profit of $25.12 million, close to Region 2 but with slightly higher risk.
High Loss Probability: A notable 11.3% probability of loss, the highest among the regions, indicating elevated financial risk.
Widest Confidence Interval: Ranges from -$1.89 million to $8.38 million, implying a significant chance of lower profit.
Volume Variability: Has a volume standard deviation of 579.02 thousand barrels, the highest among the regions, indicating greater variability in production.
Conclusion
According to the current analysis, the most profitable region with the best risk balance is Region 2: its simulated mean bootstrap profit is the highest, its confidence interval is the narrowest, and it is the only region whose risk of loss (1.2%) falls below the 2.5% threshold. According to the hypertuned models and the bootstrapping simulation, Regions 1 and 3 were found to have relatively decent bootstrapped profit as well; however, they carry considerably higher loss risk (7.7% and 11.3%, respectively).
Considering these results, recommending Region 2 for continued and immediate production is a given. Further investigation and review of the measurement, efficiency, and overall efficacy of production in Regions 1 and 3 must be conducted in order to finalize production decisions for those particular regions.