OilyGiant Mining Company - Simulated Regional Profit Analysis & Recommendations

Author

Umberto Fasci

Introduction

The primary goal of this project is to build a machine learning model capable of accurately predicting the volume of oil reserves for new wells in each region. This model will help select the top-performing oil wells based on predicted reserves and identify the region with the highest profit margin. The model’s output will guide the decision on where to drill, considering both profitability and financial risk.

To clarify the structure of the following analysis, I have taken a modular approach to each section and subsection, creating functions that perform the modeling and analysis with a defined output structure so the resulting information can be reused downstream.

Library Import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Data Ingest and Initial Exploration

Data Ingest and Duplicate Handling Code
file_paths = ["../datasets/geo_data_0.csv", "../datasets/geo_data_1.csv", "../datasets/geo_data_2.csv"]

geodata = []
duplicates_df = pd.DataFrame()  # Create empty DataFrame for duplicates

for path in file_paths:
    # Load the data
    data = pd.read_csv(path)
    
    # Find duplicates
    duplicates = data[data.duplicated(subset='id', keep=False)].copy()  # Use copy() to avoid SettingWithCopyWarning
    if not duplicates.empty:
        # Add a column to indicate which file the duplicates came from
        duplicates['source_file'] = path.split('/')[-1]
        duplicates_df = pd.concat([duplicates_df, duplicates], ignore_index=True)
    
    # Remove duplicates based on the 'id' column and add to the list
    geodata.append(data.drop_duplicates(subset='id'))

# Sort duplicates_df by ID to group duplicates together
duplicates_df = duplicates_df.sort_values(['id']).reset_index(drop=True)

# Unpack the data into separate variables if needed
geo_data_0, geo_data_1, geo_data_2 = geodata

display(duplicates_df)
id f0 f1 f2 product source_file
0 5ltQ6 -3.435401 -12.296043 1.999796 57.085625 geo_data_1.csv
1 5ltQ6 18.213839 2.191999 3.993869 107.813044 geo_data_1.csv
2 74z30 1.084962 -0.312358 6.990771 127.643327 geo_data_0.csv
3 74z30 0.741456 0.459229 5.153109 140.771492 geo_data_0.csv
4 A5aEY -0.039949 0.156872 0.209861 89.249364 geo_data_0.csv
5 A5aEY -0.180335 0.935548 -2.094773 33.020205 geo_data_0.csv
6 AGS9W -0.933795 0.116194 -3.655896 19.230453 geo_data_0.csv
7 AGS9W 1.454747 -0.479651 0.683380 126.370504 geo_data_0.csv
8 HZww2 0.755284 0.368511 1.863211 30.681774 geo_data_0.csv
9 HZww2 1.061194 -0.373969 10.430210 158.828695 geo_data_0.csv
10 KUPhW 0.231846 -1.698941 4.990775 11.716299 geo_data_2.csv
11 KUPhW 1.211150 3.176408 5.543540 132.831802 geo_data_2.csv
12 LHZR0 -8.989672 -4.286607 2.009139 57.085625 geo_data_1.csv
13 LHZR0 11.170835 -1.945066 3.002872 80.859783 geo_data_1.csv
14 QcMuo 0.506563 -0.323775 -2.215583 75.496502 geo_data_0.csv
15 QcMuo 0.635635 -0.473422 0.862670 64.578675 geo_data_0.csv
16 Tdehs 0.829407 0.298807 -0.049563 96.035308 geo_data_0.csv
17 Tdehs 0.112079 0.430296 3.218993 60.964018 geo_data_0.csv
18 TtcGQ 0.569276 -0.104876 6.440215 85.350186 geo_data_0.csv
19 TtcGQ 0.110711 1.022689 0.911381 101.318008 geo_data_0.csv
20 VF7Jo 2.122656 -0.858275 5.746001 181.716817 geo_data_2.csv
21 VF7Jo -0.883115 0.560537 0.723601 136.233420 geo_data_2.csv
22 Vcm5J -1.229484 -2.439204 1.222909 137.968290 geo_data_2.csv
23 Vcm5J 2.587702 1.986875 2.482245 92.327572 geo_data_2.csv
24 bfPNe -6.202799 -4.820045 2.995107 84.038886 geo_data_1.csv
25 bfPNe -9.494442 -5.463692 4.006042 110.992147 geo_data_1.csv
26 bsk9y 0.378429 0.005837 0.160827 160.637302 geo_data_0.csv
27 bsk9y 0.398908 -0.400253 10.122376 163.433078 geo_data_0.csv
28 bxg6G -0.823752 0.546319 3.630479 93.007798 geo_data_0.csv
29 bxg6G 0.411645 0.856830 -3.653440 73.604260 geo_data_0.csv
30 fiKDv 0.157341 1.028359 5.585586 95.817889 geo_data_0.csv
31 fiKDv 0.049883 0.841313 6.394613 137.346586 geo_data_0.csv
32 wt4Uk -9.091098 -8.109279 -0.002314 3.179103 geo_data_1.csv
33 wt4Uk 10.259972 -9.376355 4.994297 134.766305 geo_data_1.csv
34 xCHr8 1.633027 0.368135 -2.378367 6.120525 geo_data_2.csv
35 xCHr8 -0.847066 2.101796 5.597130 184.388641 geo_data_2.csv

The duplicates in each dataset are duplicated only on the id column: none of the feature or product values within this duplicate set are repeated themselves. This suggests that these are additional measurements taken at the same well id.

In the end, to avoid any potential issues, and considering the negligible amount of data involved, these duplicated rows were dropped.
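A quick way to confirm that the duplication is confined to the id column (i.e., there are no fully repeated rows) is to compare duplicated() with and without the id subset; a small sketch against the raw files:

for path in file_paths:
    raw = pd.read_csv(path)
    # Full-row duplicates would indicate repeated records rather than repeated ids
    id_dupes = raw.duplicated(subset='id').sum()
    full_dupes = raw.duplicated().sum()
    print(f"{path.split('/')[-1]}: {id_dupes} id duplicates, {full_dupes} full-row duplicates")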

Regional Target Distributions

Product Distribution Graph Code
sns.set_style("whitegrid")

fig, axes = plt.subplots(1, 3, figsize=(18, 4))

sns.histplot(geo_data_0['product'], color='#48c9b0', alpha=0.6, ax=axes[0])
axes[0].set_title('Region 1 - Product Distribution', fontsize=15)

sns.histplot(geo_data_1['product'], color='#48c9b0', alpha=0.6, ax=axes[1])
axes[1].set_title('Region 2 - Product Distribution', fontsize=15)

sns.histplot(geo_data_2['product'], color='#48c9b0', alpha=0.6, ax=axes[2])
axes[2].set_title('Region 3 - Product Distribution', fontsize=15)

plt.tight_layout()
plt.show()

Regions 1 and 3 show similar patterns and scales, suggesting they might have similar underlying characteristics. Region 2 stands out dramatically with its distinct spike pattern and much higher maximum counts. The x-axis (product) range is similar across all regions (roughly 0-175). The variance in distribution patterns suggests different operational or measurement approaches across regions.
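One quick way to probe Region 2's spike pattern is to count distinct target values per region; a small sketch (exact counts depend on the data, but heavy quantization would surface as a much smaller count for Region 2):

for name, frame in [('Region 1', geo_data_0), ('Region 2', geo_data_1), ('Region 3', geo_data_2)]:
    print(f"{name}: {frame['product'].nunique()} unique product values out of {len(frame)} wells")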

Regional Feature Distributions

Feature Distribution Graph Code
# Set up the figure and axes
fig, axes = plt.subplots(3, 3, figsize=(18, 12))

# Plot each region/feature histogram with Seaborn
region_frames = [geo_data_0, geo_data_1, geo_data_2]
colors = ['#121314', '#2D6E62', '#48C9B0']

for row, feature in enumerate(['f0', 'f1', 'f2']):
    for col, (frame, color) in enumerate(zip(region_frames, colors)):
        sns.histplot(frame[feature], color=color, alpha=0.6, ax=axes[row, col])
        axes[row, col].set_title(f'Region {col + 1} - {feature} Distribution', fontsize=15)

plt.tight_layout()
plt.show()

Region 1 Feature Distributions:

  • f0: Shows a multimodal distribution with several peaks, roughly symmetric but with clear separations between clusters
  • f1: Similar to f0, displays multimodal behavior with 3-4 distinct peaks and valleys
  • f2: Appears more unimodal and normally distributed, centered around 0-2 with some slight right skew

Region 2 Feature Distributions:

  • f0: Bimodal distribution with two prominent peaks, fairly symmetric around 0
  • f1: Single normal/Gaussian distribution, unimodal and symmetric
  • f2: Appears to be a discrete-looking distribution with regularly spaced spikes, though the feature is nominally continuous. This could indicate heavily quantized or binned data (see the quick check after the Region 3 list).

Region 3 Feature Distributions:

  • f0: Single peaked, normal distribution with slight asymmetry
  • f1: Normal distribution, very symmetric and unimodal
  • f2: Normal distribution with slight heavy tails, symmetric and unimodal
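To follow up on the apparent quantization of Region 2's f2 noted above, value counts of lightly rounded values give a quick signal; a small sketch (the actual spike locations are to be read off the output, not assumed here):

# If the spikes reflect discrete levels, mass should concentrate on few rounded values
f2_rounded = geo_data_1['f2'].round(1)
print(f2_rounded.nunique(), "distinct f2 values at one-decimal precision")
print(f2_rounded.value_counts().head(10))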

Base Model

def train_and_evaluate_region(data, region_name):
    # Separate features and target
    features = data[['f0', 'f1', 'f2']]
    target = data['product']
    
    # Split data into training and validation sets (75:25)
    features_train, features_val, target_train, target_val = train_test_split(features, target, test_size=0.25, random_state=12345)
    
    # Initialize and train model
    model = LinearRegression()
    model.fit(features_train, target_train)
    
    # Make predictions on validation set
    target_pred = model.predict(features_val)
    
    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(target_val, target_pred))
    r2 = r2_score(target_val, target_pred)
    
    # Save validation results
    validation_results = pd.DataFrame({
        'Actual': target_val,
        'Predicted': target_pred,
        'Error': target_val - target_pred
    })
    
    # Get feature coefficients
    feature_coefficients = dict(zip(features.columns, model.coef_))
    
    model_params = {
        'param_fit_intercept': model.fit_intercept,
        'param_copy_X': model.copy_X,
        'param_positive': model.positive,
        'param_n_features_in': model.n_features_in_
    }
    
    return {
        'region_name': region_name,
        'model': model,
        'rmse': rmse,
        'r2': r2,
        'avg_predicted': np.mean(target_pred),
        'avg_actual': np.mean(target_val),
        'validation_results': validation_results,
        'feature_coefficients': feature_coefficients,
        'intercept': model.intercept_,
        'parameters': model_params
    }

For the purposes of this analysis, Linear Regression is the chosen algorithm for modeling the regional product target, and RMSE is used as the main metric of model performance.
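For reference, the RMSE reported throughout is the usual root mean squared error over the $n$ validation wells, with $y_i$ the actual product volume and $\hat{y}_i$ the prediction:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$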

Additional metrics are utilized in tandem to inform model performance.

Model parameters are collected and stored in the model_params dictionary in accordance with my output-structure workflow. Working with stored data in this fashion allows for much more versatility in later steps, such as outputting a DataFrame showcasing the results of the model.


# Process all regions
regions_data = {
    'Region 1': geo_data_0,
    'Region 2': geo_data_1,
    'Region 3': geo_data_2
}

Using a dictionary here lets me not only retrieve each region's data when needed but also reuse the keys to label the output structure at each milestone in the analysis.


all_results = {} # Dictionary storage for model results and coefficients
master_results = [] # List to store rows for the result DataFrame

for region_name, data in regions_data.items():
    results = train_and_evaluate_region(data, region_name)
    all_results[region_name] = results
    
    # Append results for DataFrame
    master_results.append({
        'Region': region_name,
        'RMSE': results['rmse'],
        'R2 Score': results['r2'],
        'Avg Predicted Volume': results['avg_predicted'],
        'Avg Actual Volume': results['avg_actual'],
        **results['feature_coefficients'],  # feature coefficients
        'Intercept': results['intercept'],
        **results['parameters'] # model params
    })

Base Model Results

# Create master result DataFrame
master_result_df = pd.DataFrame(master_results)
display(master_result_df)
Region RMSE R2 Score Avg Predicted Volume Avg Actual Volume f0 f1 f2 Intercept param_fit_intercept param_copy_X param_positive param_n_features_in
0 Region 1 37.853527 0.272392 92.789156 92.158205 3.781966 -13.892695 6.634141 77.637060 True True False 3
1 Region 2 0.892059 0.999622 69.178320 69.186044 -0.144811 -0.022002 26.950821 1.652857 True True False 3
2 Region 3 40.075851 0.195562 94.865725 94.785109 0.052104 -0.061638 5.772026 80.616568 True True False 3

Using the outputs from the structure designed in the previous functions allows me to display a decent-looking DataFrame of the results for each of the regional models.

Additionally, I utilize the validation_results from the model runs to formulate the analytical plots describing the performance of the Linear Regression models.


Base Model Performance Analysis Graph Function
def plot_results(all_results, title_prefix='Base Model'):
    # Set the style for all plots
    sns.set_style("whitegrid")
    
    for region_name, results in all_results.items():
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        fig.suptitle(f'{title_prefix} Performance Analysis - {region_name}', fontsize=14, y=1.00)
        
        # Actual vs Predicted Plot
        sns.scatterplot(
            data=results['validation_results'],
            x='Actual',
            y='Predicted',
            alpha=0.6,
            ax=ax1,
            color='#121314'
        )
        
        # Add diagonal line (representing perfect predictions)
        min_val = min(results['validation_results']['Actual'].min(), 
                     results['validation_results']['Predicted'].min())
        max_val = max(results['validation_results']['Actual'].max(), 
                     results['validation_results']['Predicted'].max())
        ax1.plot([min_val, max_val], [min_val, max_val], 
                 'r--', label='Perfect Prediction')
        
        # Customize first plot
        ax1.set_xlabel('Actual Values', fontsize=12)
        ax1.set_ylabel('Predicted Values', fontsize=12)
        ax1.set_title('Actual vs Predicted Values', pad=20)
        ax1.legend(loc='lower center')
        
        # Add R^2 and RMSE annotations
        text = f'R² = {results["r2"]:.3f}\nRMSE = {results["rmse"]:.2f}'
        ax1.text(0.05, 0.95, text,
                transform=ax1.transAxes,
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', edgecolor='#121314', alpha=0.8))
        
        # Error Distribution Plot
        sns.histplot(
            data=results['validation_results'],
            x='Error',
            kde=True,
            ax=ax2,
            color='#890A0A'
        )
        
        # Error Distribution plot aes
        ax2.set_xlabel('Prediction Error', fontsize=12)
        ax2.set_ylabel('Count', fontsize=12)
        ax2.set_title('Error Distribution', pad=20)
        
        # Add mean and std annotations
        mean_error = results['validation_results']['Error'].mean()
        std_error = results['validation_results']['Error'].std()
        med_error = results['validation_results']['Error'].median()
        text = f'Mean = {mean_error:.2f}\nMedian = {med_error:.2f}\nStd = {std_error:.2f}'
        ax2.text(0.95, 0.95, text,
                transform=ax2.transAxes,
                horizontalalignment='right',
                verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', edgecolor='#121314', alpha=0.8))
        
        plt.tight_layout()
        plt.show()
        
plot_results(all_results)

Base Model Analysis

Region 2 shows excellent performance:

  • Very high R² (0.999), indicating the model explains nearly all variance in the data
  • Very low RMSE (0.89), showing high prediction accuracy
  • The scatter plot shows points tightly clustered along the perfect prediction line
  • Error distribution is narrow and normally distributed, with a small standard deviation
  • Predicted and actual average volumes are nearly identical (69.18 vs 69.19)
  • The heavily quantized data exhibited by both the f2 feature and the target potentially holds strong influence on the results of this model

Regions 1 and 3 show poor performance:

  • Low R² values (0.27 and 0.20, respectively), indicating the model explains very little of the variance
  • High RMSE values (37.85 and 40.08), showing large prediction errors
  • Scatter plots show wide dispersion around the perfect prediction line
  • Error distributions are much wider, with larger standard deviations
  • While average predicted volumes are close to actuals, individual predictions vary greatly

Reviewing the assumptions of linear regression modeling, one key assumption is that the errors are normally distributed; this is why the error distributions are included in the model performance analysis above.
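As a complementary check on that assumption, Q-Q plots of the validation errors can be drawn from the stored all_results structure; a minimal sketch using SciPy (an additional dependency not imported above):

from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(18, 4))
for ax, (region_name, results) in zip(axes, all_results.items()):
    # Points hugging the reference line support the normal-errors assumption
    stats.probplot(results['validation_results']['Error'], dist='norm', plot=ax)
    ax.set_title(f'{region_name} - Error Q-Q Plot', fontsize=15)
plt.tight_layout()
plt.show()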

Model Hypertuning

param_space = {'copy_X': [True,False], 
               'fit_intercept': [True,False], 
               'n_jobs': [5,10,15,None],
               'positive': [True,False]}

Per the scikit-learn documentation, n_jobs=None means 1 unless in a joblib parallel backend context, so 1 is not included separately in the parameter space. Note that for LinearRegression, copy_X and n_jobs affect only memory use and parallelism, not the fitted coefficients; only fit_intercept and positive can change the predictions.


def train_and_hypertune_region(data, region_name):
    # Separate features and target
    features = data[['f0', 'f1', 'f2']]
    target = data['product']
    
    # Split data into training and validation sets (75:25)
    features_train, features_val, target_train, target_val = train_test_split(features, target, test_size=0.25, random_state=12345)
    
    # Initialize and train model
    model = LinearRegression()
    
    # Hypertune with GridSearch
    grid_search = GridSearchCV(model, param_space, cv=5, scoring='neg_root_mean_squared_error')
    grid_search.fit(features_train, target_train)
    
    best_model = grid_search.best_estimator_
    
    # Make predictions on validation set
    target_pred = best_model.predict(features_val)
    
    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(target_val, target_pred))
    r2 = r2_score(target_val, target_pred)
    
    # Save validation results
    validation_results = pd.DataFrame({
        'Actual': target_val,
        'Predicted': target_pred,
        'Error': target_val - target_pred
    })
    
    # Get feature coefficients
    feature_coefficients = dict(zip(features.columns, best_model.coef_))
    
    # Get hyperparameters as a flattened dictionary
    hyperparams = {f'param_{key}': value for key, value in grid_search.best_params_.items()}
    
    return {
        'region_name': region_name,
        'model': best_model,
        'rmse': rmse,
        'r2': r2,
        'avg_predicted': np.mean(target_pred),
        'avg_actual': np.mean(target_val),
        'validation_results': validation_results,
        'feature_coefficients': feature_coefficients,
        'intercept': best_model.intercept_,
        'hyperparameters': hyperparams
    }

neg_root_mean_squared_error is used for the grid search scoring parameter; this was implemented after consulting the documentation and investigating the available scoring methods.
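Because GridSearchCV maximizes its scoring function, error metrics are exposed negated; the built-in scorer names can be listed directly, as a quick sketch (get_scorer_names assumes scikit-learn >= 1.0):

from sklearn.metrics import get_scorer_names

# Error-based scorers carry a "neg_" prefix so that higher always means better
print([name for name in get_scorer_names() if 'root_mean_squared' in name])

The cross-validated RMSE of the best configuration is then recovered as -grid_search.best_score_.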

hyper_results = {} # Dictionary storage for hypertuned model results and coefficients
master_hyper_results = [] # List to store rows for the master result DataFrame

for region_name, data in regions_data.items():
    results = train_and_hypertune_region(data, region_name)
    hyper_results[region_name] = results
    
    # Append results for master DataFrame
    master_hyper_results.append({
        'Region': region_name,
        'RMSE': results['rmse'],
        'R2 Score': results['r2'],
        'Avg Predicted Volume': results['avg_predicted'],
        'Avg Actual Volume': results['avg_actual'],
        **results['feature_coefficients'],  # feature coefficients
        'Intercept': results['intercept'],
        **results['hyperparameters']
    })
# Create master result DataFrame
master_hyper_result_df = pd.DataFrame(master_hyper_results)
display(master_hyper_result_df)
Region RMSE R2 Score Avg Predicted Volume Avg Actual Volume f0 f1 f2 Intercept param_copy_X param_fit_intercept param_n_jobs param_positive
0 Region 1 37.853527 0.272392 92.789156 92.158205 3.781966 -13.892695 6.634141 77.637060 True True 5 False
1 Region 2 0.892059 0.999622 69.178320 69.186044 -0.144811 -0.022002 26.950821 1.652857 True True 5 False
2 Region 3 40.075516 0.195576 94.866021 94.785109 0.051994 0.000000 5.772001 80.616838 True True 5 True

Hypertuned Model Performance Analysis Graph Code
# The hypertuned plots reuse plot_results, overriding only the figure title
plot_results(hyper_results, title_prefix='Hypertuned Model')

Hypertuned model analysis

  • Region 2 has the highest model accuracy, with an RMSE of 0.892 and an R2 Score of 0.9996, suggesting almost perfect prediction.
  • Region 1 and Region 3 have higher RMSE values (37.85 and 40.08, respectively) and lower R2 scores, indicating weaker predictive accuracy in these regions.
  • The average predicted volumes are very close to the actual volumes across all regions.

It seems that even after model hypertuning, the regional performance is unchanged. Without any significant data transformations implemented, these are the best models available.

Profit Calculations

Conditions:

  • When exploring a region, a study of 500 points is carried out, and the best 200 points are picked for the profit calculation.
  • The budget for development of 200 oil wells is 100 million USD.
  • One barrel of raw materials brings 4.5 USD of revenue, so the revenue from one unit of product is 4,500 USD (volume of reserves is in thousand barrels).
  • After the risk evaluation, keep only the regions with a risk of losses lower than 2.5%. From those that fit the criteria, the region with the highest average profit should be selected.
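Before any model-based selection, these conditions already imply a break-even volume per well; a quick worked check (plain arithmetic, no project data involved):

# Break-even reserves per selected well under the stated conditions
budget = 100_000_000        # USD budget for developing 200 wells
n_wells = 200
revenue_per_unit = 4_500    # USD per unit of product (thousand barrels)

break_even_per_well = budget / n_wells / revenue_per_unit
print(f"Break-even volume: {break_even_per_well:.2f} thousand barrels per well")  # ~111.11

Since every region's average actual volume (92.2, 69.2, and 94.8 thousand barrels) sits below this threshold, drilling average wells loses money, which is exactly why selecting the top 200 predicted wells matters.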
def calculate_profit(targets, predictions, n_wells=200, budget=100_000_000, revenue_per_unit=4_500):
    # Reset indices so positional selection lines up
    targets = pd.Series(targets).reset_index(drop=True)
    predictions = pd.Series(predictions).reset_index(drop=True)
    
    # Sort predictions and take the indices of the top n_wells
    top_indices = predictions.sort_values(ascending=False).index[:n_wells]
    
    # Get the actual volumes for these wells
    selected_volumes = targets.iloc[top_indices]
    
    # Calculate profit metrics
    total_volume = selected_volumes.sum()
    total_revenue = total_volume * revenue_per_unit
    profit = total_revenue - budget
    
    # Output structure
    return {
        'profit': profit,
        'total_volume': total_volume,
        'average_volume': total_volume / n_wells,
        'selected_indices': top_indices
    }
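As a quick smoke test (on synthetic volumes, not the project data), perfect predictions should select exactly the 200 largest wells; a hypothetical sketch:

# Hypothetical sanity check with synthetic volumes
rng = np.random.default_rng(12345)
demo_volumes = pd.Series(rng.uniform(0, 180, size=500))

perfect = calculate_profit(demo_volumes, demo_volumes)
assert np.isclose(perfect['total_volume'], demo_volumes.nlargest(200).sum())
print(f"Profit with perfect predictions: ${perfect['profit'] / 1e6:.2f}M")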

Bootstrapped Regional Analysis
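Each bootstrap iteration below draws 500 wells with replacement from the validation set (mirroring the 500-point study condition), applies calculate_profit to the best 200 predicted wells of that sample, and repeats this 1,000 times to estimate the mean profit, its 95% confidence interval, and the probability of loss.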

def bootstrap_profit_analysis(targets, predictions, n_bootstrap=1000, sample_size=500,
                            n_wells=200, budget=100_000_000, revenue_per_unit=4_500,
                            random_state=12345):
    np.random.seed(random_state)
    profits = []
    volumes = []
    
    for i in range(n_bootstrap):
        # Sample indices with replacement
        sample_indices = np.random.choice(len(targets), size=sample_size, replace=True)
        
        # Calculate profit for this sample
        sample_result = calculate_profit(
            targets.iloc[sample_indices],
            predictions.iloc[sample_indices],
            n_wells=n_wells,
            budget=budget,
            revenue_per_unit=revenue_per_unit
        )
        
        profits.append(sample_result['profit'])
        volumes.append(sample_result['total_volume'])
    
    # Calculate risk metrics
    profits_array = np.array(profits)
    prob_loss = (profits_array < 0).mean() * 100
    
    return {
        'mean_profit': np.mean(profits),
        'std_profit': np.std(profits),
        'mean_volume': np.mean(volumes),
        'std_volume': np.std(volumes),
        'prob_loss': prob_loss,
        'confidence_interval': np.percentile(profits, [2.5, 97.5])
    }
region_analysis = {}
for region_name, results in hyper_results.items():
    # Get validation results from the previously trained models
    val_results = results['validation_results']
    
    # Calculate base profit using all validation data
    base_profit = calculate_profit(
        val_results['Actual'],
        val_results['Predicted']
    )
    
    # Perform bootstrap analysis
    bootstrap_results = bootstrap_profit_analysis(
        val_results['Actual'],
        val_results['Predicted']
    )
    
    region_analysis[region_name] = {
        'base_profit': base_profit,
        'bootstrap_results': bootstrap_results
    }
region_analysis_results = []

# Populate results for each region
for region_name, analysis in region_analysis.items():
    base_results = analysis['base_profit']
    bootstrap_results = analysis['bootstrap_results']
    
    region_analysis_results.append({
        'Region': region_name,
        'Base Profit ($M)': base_results['profit'] / 1_000_000,
        'Average Volume (thousand barrels)': base_results['average_volume'],
        'Total Volume (thousand barrels)': base_results['total_volume'],
        'Mean Bootstrap Profit ($M)': bootstrap_results['mean_profit'] / 1_000_000,
        'Profit Std Dev ($M)': bootstrap_results['std_profit'] / 1_000_000,
        'Probability of Loss (%)': bootstrap_results['prob_loss'],
        'CI Lower Bound ($M)': bootstrap_results['confidence_interval'][0] / 1_000_000,
        'CI Upper Bound ($M)': bootstrap_results['confidence_interval'][1] / 1_000_000,
        'Mean Bootstrap Volume (thousand barrels)': bootstrap_results['mean_volume'],
        'Volume Std Dev (thousand barrels)': bootstrap_results['std_volume']
    })

# Create and display the DataFrame
region_analysis_df = pd.DataFrame(region_analysis_results)

# Round numerical columns
numeric_columns = region_analysis_df.select_dtypes(include=[np.number]).columns
region_analysis_df[numeric_columns] = region_analysis_df[numeric_columns].round(2)

# Sort by Mean Bootstrap Profit
region_analysis_df = region_analysis_df.sort_values('Mean Bootstrap Profit ($M)', ascending=False)

# Display the DataFrame
display(region_analysis_df)
Region Base Profit ($M) Average Volume (thousand barrels) Total Volume (thousand barrels) Mean Bootstrap Profit ($M) Profit Std Dev ($M) Probability of Loss (%) CI Lower Bound ($M) CI Upper Bound ($M) Mean Bootstrap Volume (thousand barrels) Volume Std Dev (thousand barrels)
1 Region 2 24.15 137.95 27589.08 4.78 2.05 1.2 0.90 8.67 23285.52 456.10
0 Region 1 33.65 148.50 29700.42 3.81 2.59 7.7 -1.43 8.91 23068.03 574.58
2 Region 3 25.12 139.02 27803.77 3.30 2.61 11.3 -1.89 8.38 22954.56 579.02

Region 1

  • Highest Base Profit: Region 1 has the highest base profit of $33.65 million, suggesting it is the most profitable under initial calculations.
  • Profit Variability: With a standard deviation of $2.59 million, this region has moderate profit variability, but a significant 7.7% probability of loss—higher than the other regions.
  • Profit Confidence Interval: Ranges from -$1.43 million to $8.91 million, suggesting possible but limited risk of loss.
  • Volume Statistics: The average volume per selected well is the highest at 148.50 thousand barrels, with a substantial total volume of 29,700.42 thousand barrels.

Region 2

  • Stable Profit Estimates: Region 2 has a base profit of $24.15 million and the highest mean bootstrap profit at $4.78 million, with the lowest profit variability (std dev: $2.05 million).
  • Low Loss Probability: Only a 1.2% chance of incurring a loss, making it the least risky region.
  • Profit Confidence Interval: Between $0.90 million and $8.67 million, indicating positive expectations.
  • Consistent Volume: The mean bootstrap volume is 23,285.52 thousand barrels, with a low standard deviation of 456.10 thousand barrels, suggesting consistent output.

Region 3

  • Moderate Base Profit: Region 3 has a base profit of $25.12 million, close to Region 2's, but with considerably higher risk.
  • High Loss Probability: A notable 11.3% probability of loss, the highest among the regions, indicating elevated financial risk.
  • Widest Confidence Interval: Ranges from -$1.89 million to $8.38 million, implying a significant chance of lower profit.
  • Volume Variability: Has a volume standard deviation of 579.02 thousand barrels, the highest among the regions, indicating greater variability in production.
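Applying the 2.5% risk-of-losses criterion programmatically to the table above (a short sketch over region_analysis_df):

# Keep only regions meeting the risk threshold, then rank by mean bootstrap profit
eligible = region_analysis_df[region_analysis_df['Probability of Loss (%)'] < 2.5]
cols = ['Region', 'Mean Bootstrap Profit ($M)', 'Probability of Loss (%)']
display(eligible.sort_values('Mean Bootstrap Profit ($M)', ascending=False)[cols])

Only Region 2 (1.2% probability of loss) passes the threshold, which drives the recommendation below.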

Conclusion

According to the current analysis, Region 2 is the most profitable region with the best risk balance: its simulated mean profit is the highest of the three, its confidence interval is the narrowest, and its risk of loss is the lowest. The hypertuned models and bootstrapping simulation found Regions 1 and 3 to have reasonably decent bootstrapped profits as well, but both carry a considerably higher risk of loss.

Considering these results, Region 2 is the clear recommendation for immediate development. Further investigation and review of the measurement, efficiency, and overall efficacy of production in Regions 1 and 3 must be conducted in order to finalize production decisions for those regions.