This data set provides supply chain health commodity shipment and pricing data. Specifically, the data set identifies Antiretroviral (ARV) and HIV lab shipments to supported countries. In addition, the data set provides the commodity pricing and associated supply chain expenses necessary to move the commodities to countries for use. The data set has similar fields to the Global Fund’s Price, Quality and Reporting (PQR) data. PEPFAR and the Global Fund represent the two largest procurers of HIV health commodities. This data set, when analyzed in conjunction with the PQR data, provides a more complete picture of global spending on specific health commodities. The data are particularly valuable for understanding ranges and trends in pricing as well as volumes delivered by country. The US Government believes this data will help stakeholders make better, data-driven decisions. Care should be taken to consider contextual factors when using the database. Conclusions related to costs associated with moving specific line items or products to specific countries and lead times by product/country will not be accurate.
Utilized | Algorithm | Definition | Characteristic
---|---|---|---
✓ | Multiple linear regression (MLR) | A statistical technique that estimates a single target variable using a linear relationship with two or more predictor variables. | Predicts a dependent variable using multiple independent variables.
✓ | Deep neural network (DNN) | An artificial neural network consisting of many hidden layers between an input and output layer. | Can model complex nonlinear relationships through its multiple hidden layers.
Language | Usage | Specifics
---|---|---
Python | Data processing and algorithm deployment. | Python was utilized to conduct data preparation, which includes feature processing for neural network deployment and data preparation for regression model deployment. The scikit-learn and TensorFlow libraries were utilized to conduct the model deployments. Pandas was utilized to conduct data preprocessing.
R | Data visualization and document rendering. | The reticulate package was utilized to communicate between the Python and R environments. The Highcharts extension for R was utilized for interactive HTML plotting. RMarkdown and the rmdformats package were utilized with the material template for rendering.
Note: the Line Item Insurance page may take a while to load when selected.
import pandas as pd
import numpy as np

# Load the raw supply chain shipment pricing data set.
dataset_df = pd.read_csv("~/Price_Prediction/Supply_Chain_Shipment_Pricing_Data.csv")
The raw data set has many "missing" values associated with weight and freight cost. The weight values are either "captured separately" or referenced via `asn/dn #`. The freight cost values are likewise either referenced via `asn/dn #` or "included in commodity cost". Since we only have access to this data set on its own, these references are not very helpful. So the question is: how do we deal with this?
The first method to explore is to handle these abstracted data by removing them from the data set, coercing the `dtype` of the two variables of interest to numeric. This coercion turns any non-numeric data into missing values, which can then be removed, along with their associated rows, from the entire data set.
First we can prepare a copy of the raw data set, `dataset_df_processed`, on which we can then conduct the coercion to a numeric data type.
dataset_df_processed = dataset_df.copy()

# Coerce weight to numeric; non-numeric references become NaN and their rows are dropped.
dataset_df_processed['weight (kilograms)'] = pd.to_numeric(dataset_df_processed['weight (kilograms)'], errors='coerce')
dataset_df_processed.dropna(inplace=True)

# Repeat the coercion and row drop for freight cost.
dataset_df_processed['freight cost (usd)'] = pd.to_numeric(dataset_df_processed['freight cost (usd)'], errors='coerce')
dataset_df_processed.dropna(inplace=True)
To understand how the data set was affected by this initial processing, we can visualize the shipment counts by country for the unprocessed and processed data sets.
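The original comparison plot is rendered with Highcharts in R; a minimal pandas sketch of the underlying counts, assuming the column is named `country` (consistent with the other lowercase column names), would be:

# Shipment counts by country before and after the numeric coercion.
counts_raw = dataset_df['country'].value_counts()
counts_processed = dataset_df_processed['country'].value_counts()

comparison = pd.DataFrame({'unprocessed': counts_raw,
                           'processed': counts_processed}).fillna(0)
print(comparison.head(10))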
Removing all of the value references has a large effect on the structure of the data. We can confidently say that any analysis derived from this process won't be accurate. However, there are other ways of handling this data problem. For the purposes of future analysis, I will remove these troublesome features entirely.
The two features to remove are `weight (kilograms)` and `freight cost (usd)`.
Above shows the first 10 rows of data after removing the `weight` and `freight cost` features. Notably, `dosage`, `shipment mode`, and `line item insurance` appear to be full of missing values. The `dosage` feature has the most missing data at 1736 missing values, while `shipment mode` and `line item insurance (usd)` have 360 and 287 missing values respectively.
There are several ways to handle missing values in numerical and categorical features, but for the purposes of this demonstration I will apply the common method of removing all rows containing missing values, as sketched below.
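A minimal sketch of both steps, assuming the cleaned frame is stored as `dataset_dropna` (the name used later in the regression section); the exact construction is an assumption:

# Drop the two reference-ridden features, then remove rows with any missing values.
dataset_dropna = dataset_df.drop(columns=['weight (kilograms)',
                                          'freight cost (usd)'])
dataset_dropna = dataset_dropna.dropna()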
Handling the troublesome features and missing values this way allowed the data to maintain its general structure in terms of per-country record counts. As expected, the feature drop process had no major effect on the structure of the data.
Now that we have dealt with the obvious issues with the data, we can continue with an in-depth data cleaning and processing procedure. Most of this will be handled by defining a single function, which is shown and explained below:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_inputs(df, label_mapping):
    # Create a copy of df with dropped features and no missing values.
    df_base = df.copy()

    # Drop the ID column because it won't serve any purpose in modeling.
    df_base = df_base.drop('id', axis=1)

    # Drop date columns with similar referencing issues as `weight` and `freight cost`.
    df_base = df_base.drop(['pq first sent to client date',
                            'po sent to vendor date'], axis=1)

    # Extract year, month, and day features from the remaining date columns.
    for column in ['scheduled delivery date',
                   'delivered to client date',
                   'delivery recorded date']:
        df_base[column] = pd.to_datetime(df_base[column])
        df_base[column + ' year'] = df_base[column].apply(lambda x: x.year)
        df_base[column + ' month'] = df_base[column].apply(lambda x: x.month)
        df_base[column + ' day'] = df_base[column].apply(lambda x: x.day)
        df_base = df_base.drop(column, axis=1)

    # Drop high-cardinality columns.
    df_base = df_base.drop(['pq #', 'po / so #', 'asn/dn #'], axis=1)

    # Binary encoding.
    df_base['fulfill via'] = df_base['fulfill via'].replace({'Direct Drop': 0,
                                                             'From RDC': 1})
    df_base['first line designation'] = df_base['first line designation'].replace({'no': 0,
                                                                                   'yes': 1})

    # One-hot encoding of the remaining categorical columns.
    for column in df_base.select_dtypes('object').columns.drop('shipment mode'):
        dummies = pd.get_dummies(df_base[column], prefix=column)
        df_base = pd.concat([df_base, dummies], axis=1)
        df_base = df_base.drop(column, axis=1)

    # Split the DataFrame into X and y.
    y = df_base['shipment mode']
    X = df_base.drop('shipment mode', axis=1)

    # Encode the labels in shipment mode.
    y = y.replace(label_mapping)

    # Train/test split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                        shuffle=True,
                                                        random_state=1)

    # Scale X.
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train),
                           index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test),
                          index=X_test.index, columns=X_test.columns)

    return X_train, X_test, y_train, y_test
Firstly, the `id` column was dropped completely from the data set due to its lack of usefulness in any modeling scenario; having completely unique values for each shipment lends to this lack of usefulness. Secondly, two of the date columns, `PQ First Sent to Client Date` and `PO Sent to Vendor Date`, have similar issues to the previous features `weight (kilograms)` and `freight cost (usd)`, where many of the expected values are references. These two date features are removed from the data set accordingly.
The remaining date columns, `Scheduled Delivery Date`, `Delivered to Client Date`, and `Delivery Recorded Date`, were then converted to `datetime` format. From each of these, `year`, `month`, and `day` features were extracted, creating separate features.
The three features which express high cardinality, `PQ #`, `PO / SO #`, and `ASN/DN #`, were dropped. Essentially, these features are parallel to an `ID`-type feature, which makes them not very useful in a modeling process.
Two features, `Fulfill Via` and `First Line Designation`, have two categories each and can be encoded to either 0 or 1. For the `Fulfill Via` feature, `Direct Drop` and `From RDC` are encoded 0 and 1 respectively. The `First Line Designation` feature's `no` and `yes` are encoded 0 and 1 respectively.
This is where the big changes occur! A one-hot encoding procedure is applied to every column of type `object` except for `Shipment Mode`, the focus of this first model. After the one-hot encoding is done, the original features are removed from the data set.
First, the DataFrame is split on `Shipment Mode` to create the target `y` and predictor `X` sets. A `replace` method is then applied to the target set to encode the shipment modes with labels. We then split the data into `X_train`, `X_test`, `y_train`, and `y_test` with a split size of 70% training and 30% testing. Lastly, the `X` sets are standardized via scikit-learn's `StandardScaler`.
This model will utilize a deep neural network architecture with TensorFlow. Before we start, we must instantiate the processed data from the previous section.
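A minimal sketch of that call. The label mapping below is partly an assumption: the text only confirms that `Air` maps to 0 and `Ocean` maps to 3, so the middle two labels are hypothetical, as is passing the cleaned `dataset_dropna` frame.

# Hypothetical label mapping; only Air -> 0 and Ocean -> 3 are confirmed later in the text.
label_mapping = {'Air': 0, 'Truck': 1, 'Air Charter': 2, 'Ocean': 3}

X_train, X_test, y_train, y_test = preprocess_inputs(dataset_dropna, label_mapping)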
Above is one of the results of the previous function call: `X_train`. Let's see how `y_train` looks:
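The rendered output is not reproduced here; a one-line peek would be:

print(y_train.head())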
It seems the label replacement on the shipment mode feature has worked and has successfully transferred to this new `y_train` variable. Let's see the distribution of these labels:
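A quick sketch of that count:

# Distribution of the encoded shipment mode labels in the training set.
print(y_train.value_counts())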
The shipment mode `Air`, represented here by 0, seems to dominate the data with over 3000 records. The `Ocean` shipment mode, represented by 3, has the fewest occurrences in this data. This fact is not at all surprising, but it will be interesting to see how this imbalance affects the model, if any effect can be noted.
import tensorflow as tf

# Functional API model: the input layer is sized to the processed feature count (529).
inputs = tf.keras.Input(shape=(529,))
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(x)
# Four output units: one per shipment mode class.
outputs = tf.keras.layers.Dense(4, activation='softmax')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train with early stopping on the validation loss.
history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)
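The original does not show an evaluation step at this point, but a minimal follow-up on the held-out 30% split would be:

# Evaluate the trained network on the test split.
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"test loss: {test_loss:.4f}, test accuracy: {test_acc:.4f}")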
Multiple linear regression is a very simple technique and will only require predictors which express a linear relationship with a target feature. In this case the target feature will be `Line Item Insurance (USD)`. I will first extract the variables of interest: the target feature and the potential predictors.
MLR_potential = dataset_dropna[['line item quantity', 'line item value', 'pack price', 'unit price', 'line item insurance (usd)']]
With these we can examine potential linear relationships with `Line Item Insurance (USD)`.
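As a quick screen before the plots below (an addition, using the column names from the extraction above), the pairwise correlations against the target can be computed:

# Correlation of each candidate predictor with the insurance target.
print(MLR_potential.corr()['line item insurance (usd)'].sort_values(ascending=False))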
Below shows the linear regression result between `Line Item Quantity` and `Line Item Insurance`:

Similarly, below is the result of a linear regression fitted between `Line Item Value` and `Line Item Insurance`:
Two of the four potential predictor variables express a linear relationship of sorts with the `Line Item Insurance` target feature. `Line Item Quantity` and `Line Item Value` can now be used together with multiple linear regression to obtain a more comprehensive picture of this relationship.
# Predictor and target sets for the multiple linear regression.
MLR_Preds = MLR_potential[['line item quantity', 'line item value']]
MLR_Target = MLR_potential['line item insurance (usd)']

from sklearn import linear_model

# Fit an ordinary least squares regression.
regr = linear_model.LinearRegression()
regr.fit(MLR_Preds, MLR_Target)
The following is an insurance prediction where `Line Item Quantity` is 5500 and `Line Item Value` is $140,000.
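A minimal sketch of that prediction call, using the values quoted above:

# Predict insurance for a line item with quantity 5500 and value $140,000.
new_shipment = pd.DataFrame({'line item quantity': [5500],
                             'line item value': [140000]})
print(regr.predict(new_shipment))  # the text reports roughly $218.96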
The `Line Item Insurance` result from this prediction is $218.96. Additionally, the measured coefficients for this relationship are -0.0003384 and 0.00140536. This suggests a slight negative relationship between `Line Item Quantity` and `Line Item Insurance`, while the relationship between `Line Item Value` and `Line Item Insurance` has the more dominant effect within this set. This shows the usefulness of multiple linear regression: you can tease out relationships among predictors considered together rather than in isolation.