Optimizing Oil Well Selection with Linear Regression
A modeling procedure designed to accurately predict the volume of oil reserves for new wells across three candidate regions.
The primary goal of this project is to build a machine learning model capable of accurately predicting the volume of oil reserves for new wells in each region. This model will help select the top-performing oil wells based on predicted reserves and identify the region with the highest profit margin. The model’s output will guide the decision on where to drill, considering both profitability and financial risk.
To make the structure of the following analysis easy to follow, I have taken a modular approach to each section and subsection: the modeling and analysis are performed by functions with defined output structures, so the resulting information can be reused later in the workflow.
All duplicates found across the datasets are duplicated only on the id column: on close inspection, none of the feature or product values within these duplicate groups are themselves repeated. This suggests they are additional measurements taken at the same well id.
In the end, to avoid any potential issues, and considering the negligible amount of data involved, these duplicated rows were dropped.
This procedure is further detailed in the following code block…
import pandas as pd

file_paths = ["../datasets/geo_data_0.csv", "../datasets/geo_data_1.csv", "../datasets/geo_data_2.csv"]
geodata = []
duplicates_df = pd.DataFrame()  # Create empty DataFrame for duplicates

for path in file_paths:
    # Load the data
    data = pd.read_csv(path)

    # Find all rows whose well id appears more than once
    duplicates = data[data.duplicated(subset='id', keep=False)].copy()
    if not duplicates.empty:
        # Add a column to indicate which file the duplicates came from
        duplicates['source_file'] = path.split('/')[-1]
        duplicates_df = pd.concat([duplicates_df, duplicates], ignore_index=True)

    # Remove duplicates based on the 'id' column and add to the list
    geodata.append(data.drop_duplicates(subset='id'))

# Sort duplicates_df by id to group duplicate measurements together
duplicates_df = duplicates_df.sort_values(['id']).reset_index(drop=True)

# Unpack the data into separate variables
geo_data_0, geo_data_1, geo_data_2 = geodata
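As a quick sanity check, the cleaned DataFrames can be verified to contain no remaining duplicate ids. This is a minimal sketch using the variables defined above; the printed counts depend on the actual datasets:

```python
# Confirm that no duplicate well ids remain after the cleanup
for name, df in zip(['geo_data_0', 'geo_data_1', 'geo_data_2'],
                    [geo_data_0, geo_data_1, geo_data_2]):
    print(f"{name}: shape={df.shape}, duplicated ids={df['id'].duplicated().sum()}")

# Inspect the collected duplicate measurements, grouped by well id
print(duplicates_df.head())
```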
The following section describes the data distributions for each of the regions.
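The summaries below are based on per-feature histograms for each region. A minimal sketch of how such plots can be produced, assuming matplotlib is available and using the cleaned DataFrames from above:

```python
import matplotlib.pyplot as plt

# Plot a histogram of each feature for each region, one row per region
regions = [geo_data_0, geo_data_1, geo_data_2]
fig, axes = plt.subplots(len(regions), 3, figsize=(12, 9))
for i, region in enumerate(regions):
    for j, col in enumerate(['f0', 'f1', 'f2']):
        axes[i, j].hist(region[col], bins=50)
        axes[i, j].set_title(f"Region {i + 1}: {col}")
plt.tight_layout()
plt.show()
```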
Region 1:
- f0: Multimodal distribution with several peaks, roughly symmetric but with clear separations between clusters
- f1: Similar to f0, displays multimodal behavior with 3-4 distinct peaks and valleys
- f2: Appears more unimodal and normally distributed, centered around 0-2 with a slight right skew

Region 2:
- f0: Bimodal distribution with two prominent peaks, fairly symmetric around 0
- f1: Single normal/Gaussian distribution, unimodal and symmetric
- f2: Appears to be a discrete-looking distribution with regularly spaced spikes, though noted as continuous; this could indicate heavily quantized or binned data

Region 3:
- f0: Single-peaked, normal distribution with slight asymmetry
- f1: Normal distribution, very symmetric and unimodal
- f2: Normal distribution with slight heavy tails, symmetric and unimodal

For the purposes of this analysis, Linear Regression is the chosen algorithm to model the regional product target, and RMSE is utilized as the main metric to measure model performance. Additional metrics are used in tandem to inform model performance. Model parameters are collected and stored in the model_params dictionary in accordance with a custom output-structure workflow. Storing results in this fashion allows for much more versatility in future functionality, such as a DataFrame showcasing the results of each model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def train_and_evaluate_region(data, region_name):
    # Separate features and target
    features = data[['f0', 'f1', 'f2']]
    target = data['product']

    # Split data into training and validation sets (75:25)
    features_train, features_val, target_train, target_val = train_test_split(
        features, target, test_size=0.25, random_state=12345)

    # Initialize and train model
    model = LinearRegression()
    model.fit(features_train, target_train)

    # Make predictions on validation set
    target_pred = model.predict(features_val)

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(target_val, target_pred))
    r2 = r2_score(target_val, target_pred)

    # Save validation results
    validation_results = pd.DataFrame({
        'Actual': target_val,
        'Predicted': target_pred,
        'Error': target_val - target_pred
    })

    # Get feature coefficients
    feature_coefficients = dict(zip(features.columns, model.coef_))

    # Collect the model's parameters into the custom output structure
    model_params = {
        'param_fit_intercept': model.fit_intercept,
        'param_copy_X': model.copy_X,
        'param_positive': model.positive,
        'param_n_features_in': model.n_features_in_
    }

    return {
        'region_name': region_name,
        'model': model,
        'rmse': rmse,
        'r2': r2,
        'avg_predicted': np.mean(target_pred),
        'avg_actual': np.mean(target_val),
        'validation_results': validation_results,
        'feature_coefficients': feature_coefficients,
        'intercept': model.intercept_,
        'parameters': model_params
    }
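A minimal usage sketch of how the stored outputs can be collected into the summary DataFrame mentioned above. The names region_results and summary are illustrative, not from the original notebook:

```python
# Train and evaluate one model per region
region_results = [
    train_and_evaluate_region(geo_data_0, 'Region 1'),
    train_and_evaluate_region(geo_data_1, 'Region 2'),
    train_and_evaluate_region(geo_data_2, 'Region 3'),
]

# Collect the stored outputs into a single summary DataFrame
summary = pd.DataFrame([
    {
        'Region': r['region_name'],
        'RMSE': r['rmse'],
        'R2 Score': r['r2'],
        'Avg Predicted Volume': r['avg_predicted'],
        'Avg Actual Volume': r['avg_actual'],
        **r['feature_coefficients'],   # expands to f0, f1, f2 columns
        'Intercept': r['intercept'],
    }
    for r in region_results
])
print(summary.round(2))
```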
Region | RMSE | R² Score | Avg Predicted Volume | Avg Actual Volume | Coef f0 | Coef f1 | Coef f2 | Intercept |
---|---|---|---|---|---|---|---|---|
1 | 37.85 | 0.27 | 92.78 | 92.15 | 3.78 | -13.89 | 6.63 | 77.63 |
2 | 0.89 | 0.99 | 69.17 | 69.18 | -0.14 | -0.022 | 26.95 | 1.65 |
3 | 40.07 | 0.19 | 94.86 | 94.78 | 0.052 | -0.061 | 5.77 | 80.61 |
Region 2 shows excellent performance:
- Very high R² (0.99), indicating the model explains nearly all variance in the data
- Very low RMSE (0.89), showing high prediction accuracy
- The scatter plot shows points tightly clustered along the perfect prediction line
- The error distribution is narrow and normally distributed, with a small standard deviation
- Predicted and actual volumes are nearly identical (69.17 vs 69.18)
- The highly quantized data exhibited by both the f2 feature and the target potentially holds strong influence on the results of this model
Regions 1 and 3 show poor performance:
- Low R² values (0.27 and 0.19, respectively), indicating the model explains very little of the variance
- High RMSE values (37.85 and 40.07), showing large prediction errors
- Scatter plots show wide dispersion from the perfect prediction line
- Error distributions are much wider, with larger standard deviations
- While average predicted volumes are close to actuals, individual predictions vary greatly
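The quantization noted for Region 2 can be verified directly. A minimal sketch, assuming geo_data_1 holds Region 2's cleaned data, counts the distinct values of f2 and product:

```python
# A genuinely continuous column over ~100k rows would have roughly as many
# unique values as rows; a heavily quantized one collapses to a few levels
for col in ['f2', 'product']:
    print(f"{col}: {geo_data_1[col].nunique()} unique values "
          f"out of {len(geo_data_1)} rows")
```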
Reviewing the assumptions made by linear regression, one assumption is that the errors are normally distributed, which is why the error distributions are included in the model performance analysis.
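A minimal sketch of how this assumption could be checked on the stored validation results, reusing the illustrative region_results list from the usage sketch above. The use of scipy.stats is an assumption on my part; the original analysis may have relied on the error histograms alone:

```python
from scipy import stats

# D'Agostino's K^2 test of normality on each region's validation errors;
# a very small p-value suggests the errors deviate from a normal distribution
for r in region_results:
    errors = r['validation_results']['Error']
    stat, p_value = stats.normaltest(errors)
    print(f"{r['region_name']}: statistic={stat:.2f}, p-value={p_value:.4f}")
```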
Utilizing the GridSearchCV module along with a defined parameter space, we can further optimize the present model. The parameter space is defined below. As a note, n_jobs=None means 1 unless in a parallel backend context, so an explicit 1 is not included in the n_jobs options.
param_space = {'copy_X': [True, False],
               'fit_intercept': [True, False],
               'n_jobs': [5, 10, 15, None],
               'positive': [True, False]}
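As a sanity check on the search's cost: the grid above expands to 2 × 2 × 4 × 2 = 32 candidate parameter combinations, so with 5-fold cross-validation each region requires 160 model fits. This can be confirmed with sklearn's ParameterGrid:

```python
from sklearn.model_selection import ParameterGrid

# 2 copy_X * 2 fit_intercept * 4 n_jobs * 2 positive = 32 combinations
print(len(ParameterGrid(param_space)))  # 32
```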
from sklearn.model_selection import GridSearchCV

def train_and_hypertune_region(data, region_name):
    # Separate features and target
    features = data[['f0', 'f1', 'f2']]
    target = data['product']

    # Split data into training and validation sets (75:25)
    features_train, features_val, target_train, target_val = train_test_split(
        features, target, test_size=0.25, random_state=12345)

    # Initialize the base model
    model = LinearRegression()

    # Hypertune with GridSearch, scoring by (negated) RMSE across 5 folds
    grid_search = GridSearchCV(model, param_space, cv=5, scoring='neg_root_mean_squared_error')
    grid_search.fit(features_train, target_train)
    best_model = grid_search.best_estimator_

    # Make predictions on validation set
    target_pred = best_model.predict(features_val)

    # Calculate metrics
    rmse = np.sqrt(mean_squared_error(target_val, target_pred))
    r2 = r2_score(target_val, target_pred)

    # Save validation results
    validation_results = pd.DataFrame({
        'Actual': target_val,
        'Predicted': target_pred,
        'Error': target_val - target_pred
    })

    # Get feature coefficients
    feature_coefficients = dict(zip(features.columns, best_model.coef_))

    # Get the best hyperparameters as a flattened dictionary
    hyperparams = {f'param_{key}': value for key, value in grid_search.best_params_.items()}

    return {
        'region_name': region_name,
        'model': best_model,
        'rmse': rmse,
        'r2': r2,
        'avg_predicted': np.mean(target_pred),
        'avg_actual': np.mean(target_val),
        'validation_results': validation_results,
        'feature_coefficients': feature_coefficients,
        'intercept': best_model.intercept_,
        'hyperparameters': hyperparams
    }
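A minimal usage sketch comparing the tuned models against the baseline results. The name tuned_results is illustrative, and region_results comes from the earlier baseline sketch:

```python
# Tune one model per region and compare RMSE against the baseline
tuned_results = [
    train_and_hypertune_region(geo_data_0, 'Region 1'),
    train_and_hypertune_region(geo_data_1, 'Region 2'),
    train_and_hypertune_region(geo_data_2, 'Region 3'),
]

for base, tuned in zip(region_results, tuned_results):
    print(f"{base['region_name']}: baseline RMSE={base['rmse']:.2f}, "
          f"tuned RMSE={tuned['rmse']:.2f}, best params={tuned['hyperparameters']}")
```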