Overview

Python R Kaggle

Data Overview

Accelerated Medicine Partnership® - Parkinson’s Disease Progression Data

File Name Description Utilization Status
train_peptides.csv Mass spectromety data at the peptide level.
train_proteins.csv Protein expression frequencies aggregated from the peptide level data.
train_clinical_data.csv Clinical data which includes patient identification and Unified Parkinson’s Disease Rating Scale questionnaire data.
amp_pd_peptide/ API deliverables for model result testing.

Competition Overview

The primary goal of this kaggle competition was to predict the course of Parkinson’s disease (PD) using protein abundance data. The competition specified the utilization of a custom sMAPE loss function as the given metric for measuring the performance of the models.

Top 59% of competition out of nearly 2,000 individuals and teams.

Algorithms Explored

Utilized Algorithm Definition Characteristic
Deep neural network (DNN) An artificial neural network consisting of many hidden layers between an input and output layer. This algorithm can model complex nonlinear relationships, and it contains multiple hidden layers.

Statistical Programming Setup

Language Usage Specifics
Python Data processing and algorithm deployment. Python was utilized to conduct data preparation which includes feature processing for neural network deployment, and data preparation for regression model deployment. Scikit-Learn and TensorFlow libraries were utilized to conduct the model deployments. Pandas was utilized to conduct data preprocessing
R Data visualization and document rendering. The reticulate package was utilized to communicate between python and R environments. The Highcharts extension for R was utilized for interactive HTML plotting. RMarkdown and Rmdformats package was utilized with the material template for rendering.

Data Preprocessing and Preparation

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn import model_selection
from sklearn import metrics

The clinical data is composed of 8 columns and 2,615 entries.

The clinic in this case has also recorded the patient’s NPX (normalized protein expression) value for all the proteins relevant to Parkinson’s disease during each visit.

The Protein data is composed of 5 columns and 232,741 entries.

Proteins are long molecules made up of multiple peptides. The clinic recorded the Peptide Abundance of each peptide in proteins relevant to Parkinson’s disease. It shows the peptide concentration, similar to NPX for proteins.

The Peptide data is composed of 6 columns and 981834 entries.

Plotting Clinical Data

Plotting Protein Data

Target Preparation

patients = {}
for e in range(1,5):
    for m in [0,6,12,24]:
        df_train_clin_DNN[f'updrs_{e}_plus_{m}_months'] = 0

for patient in df_train_clin_DNN.patient_id.unique():
    temp = df_train_clin_DNN[df_train_clin_DNN.patient_id == patient]
    month_list = []
    month_windows = [0,6,12,24]
    for month in temp.visit_month.values:
        month_list.append([month, month + 6, month + 12, month + 24])
    for month in range(len(month_list)):
        for x in range(1,5):
            arr = temp[temp.visit_month.isin(month_list[month])][f'updrs_{x}'].fillna(0).to_list()
            if len(arr) == 4:
                for e, i in enumerate(arr):
                    m = month_list[month][0]
                    temp.loc[temp.visit_month == m,[f'updrs_{x}_plus_{month_windows[e]}_months']] = i
            else:
                temp = temp[~temp.visit_month.isin(month_list[month])]
    patients[patient] = temp
formatted_clin = pd.concat(patients.values(), ignore_index=True).set_index('visit_id').iloc[:,7:]

Feature Preparation

protfeatures = df_train_prot_DNN.pivot(index='visit_id', columns='UniProt',
                                                         values='NPX')
df = protfeatures.merge(formatted_clin, left_index=True,right_index=True,how='right')
df['visit_month'] = df.reset_index().visit_id.str.split('_').apply(lambda x: int(x[1])).values
visit_month_list = df.reset_index().visit_id.str.split('_').apply(lambda x: int(x[1])).unique().tolist()
protein_list = protfeatures.columns.to_list()
X = df[protfeatures.columns.to_list() + ["visit_month"]]
y = df[formatted_clin.columns]
print('\nX and y shapes:')
## 
## X and y shapes:
X.shape, y.shape
## ((954, 228), (954, 16))

Preprocessing

from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

X.visit_month = X.visit_month.astype('float')
## <string>:2: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
y = y.astype('float')

feature_trans = ColumnTransformer([
    (
        'numerical',
        make_pipeline(IterativeImputer(), StandardScaler()),
        make_column_selector(dtype_include='number')
    ),
])

X_transformed = feature_trans.fit_transform(X)
import tensorflow.keras.backend as K

def smape_loss(y_true, y_pred):
    epsilon = 0.1
    numer = K.abs(y_pred - y_true)
    denom = K.maximum(K.abs(y_true) + K.abs(y_pred) + epsilon, 0.5 + epsilon)
    smape = numer / (denom/2)
    smape = tf.where(tf.math.is_nan(smape), tf.zeros_like(smape), smape)
    return smape


def calculate_smape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    
    numer = np.round(np.abs(y_pred-y_true),0)
    denom = np.round(np.abs(y_true) + np.abs(y_pred),0)

    return 1/len(y_true) * np.sum(np.nan_to_num(numer / (denom/2))) *100

Predicting Parkinson Progression with Dense Neural Network

import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

input_shape = [X.shape[1]]

model = tf.keras.Sequential()
model.add(Dense(256, input_shape = input_shape, activation = 'relu'))
model.add(Dropout(.6))
model.add(Dense(150, activation = 'relu'))
model.add(Dropout(.6))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(.6))
model.add(Dense(y.shape[1]))
from keras.utils.vis_utils import plot_model

model.compile(optimizer='adam', loss = smape_loss)
history = model.fit(
    X_transformed,
    y,
    epochs = 800,
    verbose = False,
    validation_split = .2
)
pd.DataFrame(history.history).plot()

This model got me to the top 59% of the entire competition, ahead of 747 submissions. What is shown are the testing results based on the sMAPE loss which was at 0.58. At the time of this model completion to top competitors had managed testing results of ~0.35.

Hyperparameter Tuning (DNN)

Status

Starting with a quote from my Great Grandfather…

“When you are taking a break from building the house, build the bricks.”

Jose Mounetou

The project for most is over, but I am not satisfied. What did I not explore? How can I implement advanced MLOps procedures?

If I had to complete this project today, I would utilize an automated sweep of the learning rate parameter (One parameter I neglected to explore during the competition). The first time I started this project the architecture displayed before is the one that resulted in the best results in testing. However, I never explored the learning rate parameter for the Adam optimizer during the compilation of the model.

I am exploring the following ranges of learning rate values displayed by this YAML.

sweep_config = {
    'method': 'random',
    'metric': {
        'name': 'smape_loss',
        'goal': 'minimize'
    },
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5
    },
    'parameters': {
        'learning_rate': {
            'values': [0.01, 0.005, 0.001, 0.0005, 0.0001]
        }
    }
}

This represents a good base range of values that can show the best model performance will lie depending on the learning rate value. Implementing this will be a great learning experience for me as I have only been introduced to the Weights and Biases library which will provide this functionality.

With this we will see if my current model architecture is as good as it will get. The default learning rate of the Adam architecture is 0.001. This is included in the sweep procedure, however we can take the steps to make sure it is the first model that is built. This can be done at the beginning of a sweep function where we can define the defaults.

In this case:

def sweep_function(config_defaults=None):
  config_defaults = {
    'learning_rate': 0.001
}
#
# ...