Accelerating Medicines Partnership® - Parkinson’s Disease Progression Data
File Name | Description | Utilization Status |
---|---|---|
train_peptides.csv | Mass spectrometry data at the peptide level. | |
train_proteins.csv | Protein expression frequencies aggregated from the peptide-level data. | ✓ |
train_clinical_data.csv | Clinical data, including patient identification and Unified Parkinson’s Disease Rating Scale (UPDRS) questionnaire data. | ✓ |
amp_pd_peptide/ | API deliverables for model result testing. | ✓ |
The primary goal of this Kaggle competition was to predict the course of Parkinson’s disease (PD) using protein abundance data. The competition specified a custom sMAPE (symmetric mean absolute percentage error) loss function as the metric for measuring model performance.
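For reference, the standard sMAPE formula is

$$\text{sMAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \frac{\left|F_t - A_t\right|}{\left(\left|A_t\right| + \left|F_t\right|\right)/2}$$

where $A_t$ is the actual UPDRS score and $F_t$ the forecast. The competition’s custom variant, implemented later in this post, adds a small epsilon to the denominator for numerical stability.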
My final submission placed in the top 59% of nearly 2,000 competing individuals and teams.
Utilized | Algorithm | Definition | Characteristic |
---|---|---|---|
✓ | Deep neural network (DNN) | An artificial neural network consisting of many hidden layers between an input and an output layer. | Its multiple hidden layers allow it to model complex nonlinear relationships. |
Language | Usage | Specifics |
---|---|---|
Python | Data processing and algorithm deployment. | Python was used for data preparation, including feature processing for the neural network and data preparation for the regression models. The Scikit-Learn and TensorFlow libraries were used for model deployment, and Pandas for data preprocessing. |
R | Data visualization and document rendering. | The reticulate package was used to communicate between the Python and R environments. The Highcharts extension for R provided interactive HTML plotting, and RMarkdown with the rmdformats material template was used for rendering. |
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn import metrics
The clinical data is composed of 8 columns and 2,615 entries.
During each visit, the clinic also recorded the patient’s NPX (normalized protein expression) value for every protein relevant to Parkinson’s disease.
The protein data is composed of 5 columns and 232,741 entries.
Proteins are long molecules made up of multiple peptides. The clinic recorded the peptide abundance of each peptide in the proteins relevant to Parkinson’s disease; this value reflects the peptide’s concentration, analogous to NPX for proteins.
The peptide data is composed of 6 columns and 981,834 entries.
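As a quick orientation, here is a minimal sketch of loading and inspecting the three training files (assuming they sit in the working directory; the file names come from the table at the top of this post):

# Load the competition's training files.
df_clin = pd.read_csv('train_clinical_data.csv')
df_prot = pd.read_csv('train_proteins.csv')
df_pep = pd.read_csv('train_peptides.csv')

# Confirm the dimensions described above.
for name, df in [('clinical', df_clin), ('proteins', df_prot), ('peptides', df_pep)]:
    print(f'{name}: {df.shape[0]:,} rows x {df.shape[1]} columns')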
# Build one target column per UPDRS part (1-4) and forecast window
# (0, 6, 12, and 24 months ahead of the anchor visit).
patients = {}
for e in range(1, 5):
    for m in [0, 6, 12, 24]:
        df_train_clin_DNN[f'updrs_{e}_plus_{m}_months'] = 0

for patient in df_train_clin_DNN.patient_id.unique():
    # Copy so the per-patient edits below do not trip pandas'
    # chained-assignment warning.
    temp = df_train_clin_DNN[df_train_clin_DNN.patient_id == patient].copy()
    month_list = []
    month_windows = [0, 6, 12, 24]
    for month in temp.visit_month.values:
        month_list.append([month, month + 6, month + 12, month + 24])
    for month in range(len(month_list)):
        for x in range(1, 5):
            arr = temp[temp.visit_month.isin(month_list[month])][f'updrs_{x}'].fillna(0).to_list()
            if len(arr) == 4:
                # All four follow-up visits exist: store the future scores
                # on the anchor visit's row.
                for e, i in enumerate(arr):
                    m = month_list[month][0]
                    temp.loc[temp.visit_month == m, [f'updrs_{x}_plus_{month_windows[e]}_months']] = i
            else:
                # Incomplete window: drop the anchor visit for this patient.
                temp = temp[~temp.visit_month.isin(month_list[month])]
    patients[patient] = temp
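The merge below relies on two frames whose construction is not shown: formatted_clin, the per-patient target frames reassembled and indexed by visit_id, and protfeatures, the protein NPX values pivoted to one column per UniProt ID. A plausible reconstruction (the column names visit_id, UniProt, and NPX come from the competition files; df_prot and the rest are my assumptions) might look like:

# Reassemble the per-patient frames, keeping the 16 generated target columns.
target_cols = [f'updrs_{x}_plus_{m}_months' for x in range(1, 5) for m in [0, 6, 12, 24]]
formatted_clin = pd.concat(patients.values()).set_index('visit_id')[target_cols]

# Pivot the long protein table to one NPX column per protein, one row per visit.
protfeatures = df_prot.pivot_table(index='visit_id', columns='UniProt', values='NPX')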
# Right-join the pivoted protein features onto the formatted clinical
# targets, keeping every visit that has target values.
df = protfeatures.merge(formatted_clin, left_index=True, right_index=True, how='right')
# visit_id is formatted '<patient_id>_<visit_month>'; recover the month.
df['visit_month'] = df.reset_index().visit_id.str.split('_').apply(lambda x: int(x[1])).values
visit_month_list = df.reset_index().visit_id.str.split('_').apply(lambda x: int(x[1])).unique().tolist()
protein_list = protfeatures.columns.to_list()
X = df[protfeatures.columns.to_list() + ["visit_month"]]
y = df[formatted_clin.columns]
print('\nX and y shapes:')
print((X.shape, y.shape))
##
## X and y shapes:
## ((954, 228), (954, 16))
The 228 features are one NPX column per protein plus visit_month; the 16 targets are the 4 UPDRS scores at each of the 4 forecast windows.
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Copy before mutating to avoid pandas' SettingWithCopyWarning, then cast
# visit_month to float for the numeric pipeline below.
X = X.copy()
X.visit_month = X.visit_month.astype('float')
y = y.astype('float')
# Impute missing protein values, then standardize every numeric feature.
feature_trans = ColumnTransformer([
    (
        'numerical',
        make_pipeline(IterativeImputer(), StandardScaler()),
        make_column_selector(dtype_include='number')
    ),
])
X_transformed = feature_trans.fit_transform(X)
import tensorflow as tf
import tensorflow.keras.backend as K

def smape_loss(y_true, y_pred):
    # Competition-style sMAPE with a small epsilon keeping the denominator
    # away from zero when both target and prediction are near zero.
    epsilon = 0.1
    numer = K.abs(y_pred - y_true)
    denom = K.maximum(K.abs(y_true) + K.abs(y_pred) + epsilon, 0.5 + epsilon)
    smape = numer / (denom / 2)
    # Replace any NaNs so they do not poison the gradient.
    smape = tf.where(tf.math.is_nan(smape), tf.zeros_like(smape), smape)
    return smape
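A quick tensor-level sanity check of the loss on toy values (my own example, not from the competition):

# |11 - 10| / ((|10| + |11| + 0.1) / 2) is roughly 0.095 per element.
print(smape_loss(tf.constant([[10.0]]), tf.constant([[11.0]])).numpy())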
def calculate_smape(y_true, y_pred):
    # NumPy version used for evaluation: values are rounded to whole UPDRS
    # points before scoring, and 0/0 terms are treated as zero error.
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    numer = np.round(np.abs(y_pred - y_true), 0)
    denom = np.round(np.abs(y_true) + np.abs(y_pred), 0)
    return 1 / len(y_true) * np.sum(np.nan_to_num(numer / (denom / 2))) * 100
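And the same check for the NumPy evaluation metric, which reports a percentage:

# Perfect predictions score 0; a +1 miss on a true value of 10 scores
# 1 / ((10 + 11) / 2) * 100, about 9.52.
print(calculate_smape([10, 20], [10, 20]))  # 0.0
print(calculate_smape([10], [11]))          # ~9.52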
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

# Three hidden layers with heavy (60%) dropout to fight overfitting on the
# small 954-row training set; the output layer predicts all 16 targets at once.
input_shape = [X.shape[1]]
model = tf.keras.Sequential()
model.add(Dense(256, input_shape=input_shape, activation='relu'))
model.add(Dropout(.6))
model.add(Dense(150, activation='relu'))
model.add(Dropout(.6))
model.add(Dense(128, activation='relu'))
model.add(Dropout(.6))
model.add(Dense(y.shape[1]))
# Compile before fitting; as discussed later in this post, the Adam
# optimizer (at its default learning rate) was used at compile time,
# with the custom sMAPE loss defined above.
model.compile(optimizer='adam', loss=smape_loss)

history = model.fit(
    X_transformed,
    y,
    epochs=800,
    verbose=False,
    validation_split=.2
)
# Plot the training and validation loss curves across the 800 epochs.
pd.DataFrame(history.history).plot()
This model got me to the top 59% of the entire competition, ahead of 747 submissions. What is shown are the testing results based on the sMAPE loss, which came in at 0.58. At the time this model was completed, the top competitors had managed testing results of ~0.35.
Starting with a quote from my great-grandfather…
“When you are taking a break from building the house, build the bricks.”
For most, the project would be over, but I am not satisfied. What did I not explore? How can I implement advanced MLOps procedures?
If I had to complete this project today, I would utilize an automated sweep of the learning rate parameter (one parameter I neglected to explore during the competition). When I first built this project, the architecture shown above was the one that produced the best testing results. However, I never explored the learning rate parameter of the Adam optimizer during model compilation.
I am exploring the following range of learning rate values, shown in this sweep configuration.
sweep_config = {
    'method': 'random',
    'metric': {
        'name': 'smape_loss',
        'goal': 'minimize'
    },
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5
    },
    'parameters': {
        'learning_rate': {
            'values': [0.01, 0.005, 0.001, 0.0005, 0.0001]
        }
    }
}
This represents a good base range for revealing where the best model performance lies as a function of the learning rate. Implementing this will also be a great learning experience for me, as I have only recently been introduced to the Weights and Biases library, which provides this functionality.
With this we will see whether my current model architecture is as good as it will get. The default learning rate of the Adam optimizer is 0.001, and it is included in the sweep; however, we can take steps to make sure it is the first model built. This can be done at the beginning of the sweep function, where we define the defaults.
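In this case, a minimal sketch of such a sweep function, assuming the Weights and Biases (wandb) library and reusing the architecture above (the function name train, the project name, and the per-epoch logging are my own placeholders):

import wandb

def train():
    # Seed the run with Adam's default learning rate so the baseline
    # configuration is the first model built.
    config_defaults = {'learning_rate': 0.001}
    wandb.init(config=config_defaults)
    config = wandb.config

    # Rebuild the architecture above, compiling with the swept learning rate.
    model = tf.keras.Sequential([
        Dense(256, input_shape=input_shape, activation='relu'),
        Dropout(.6),
        Dense(150, activation='relu'),
        Dropout(.6),
        Dense(128, activation='relu'),
        Dropout(.6),
        Dense(y.shape[1]),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=config.learning_rate),
        loss=smape_loss
    )
    # Log validation loss each epoch under the name that the sweep's
    # hyperband early termination watches.
    model.fit(
        X_transformed, y, epochs=800, verbose=False, validation_split=.2,
        callbacks=[tf.keras.callbacks.LambdaCallback(
            on_epoch_end=lambda epoch, logs: wandb.log({'smape_loss': logs['val_loss']}))]
    )

# Register the sweep and launch an agent to run the configurations.
sweep_id = wandb.sweep(sweep_config, project='amp-pd-progression')
wandb.agent(sweep_id, function=train, count=5)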