Overview


Data Overview

Supply Chain Data Source

This data set provides supply chain health commodity shipment and pricing data. Specifically, the data set identifies Antiretroviral (ARV) and HIV lab shipments to supported countries. In addition, the data set provides the commodity pricing and associated supply chain expenses necessary to move the commodities to countries for use. The data set has similar fields to the Global Fund’s Price, Quality and Reporting (PQR) data. PEPFAR and the Global Fund represent the two largest procurers of HIV health commodities. This data set, when analyzed in conjunction with the PQR data, provides a more complete picture of global spending on specific health commodities. The data are particularly valuable for understanding ranges and trends in pricing as well as volumes delivered by country. The US Government believes this data will help stakeholders make better, data-driven decisions. Care should be taken to consider contextual factors when using the database. Conclusions related to costs associated with moving specific line items or products to specific countries and lead times by product/country will not be accurate.

Algorithms Used

| Utilized Algorithm | Definition | Characteristic |
|---|---|---|
| Multiple linear regression (MLR) | A statistical technique that estimates a single target variable using a linear relationship between two or more predictor variables. | Predicts a dependent variable using multiple independent variables. |
| Deep neural network (DNN) | An artificial neural network consisting of many hidden layers between an input and an output layer. | Models complex nonlinear relationships. |

Statistical Programming Setup

| Language | Usage | Specifics |
|---|---|---|
| Python | Data processing and algorithm deployment. | Python was utilized to conduct data preparation, including feature processing for neural network deployment and data preparation for regression model deployment. The Scikit-Learn and TensorFlow libraries were utilized to conduct the model deployments, and Pandas was utilized for data preprocessing. |
| R | Data visualization and document rendering. | The reticulate package was utilized to communicate between the Python and R environments. The Highcharts extension for R was utilized for interactive HTML plotting. RMarkdown and the rmdformats package were utilized with the material template for rendering. |

Note: the Line Item Insurance page takes a while to load when opened.

Data Collection and Preprocessing


Data Collection

The data set used here is the supply chain health commodity shipment and pricing data described in the Overview above. It is loaded directly from a local CSV file:

import pandas as pd
import numpy as np

# Load the raw shipment and pricing data set.
dataset_df = pd.read_csv("~/Price_Prediction/Supply_Chain_Shipment_Pricing_Data.csv")

Data Abstraction Issues

The raw data set has many “missing” values associated with weight and freight cost. The weight values are either “captured separately” or referenced via asn/dn #. The freight cost values are likewise either referenced via asn/dn # or “included in commodity cost”. Since we only have access to this single data set, those references are not very helpful. So the question is: how do we deal with this?

Data Preprocessing

Abstraction Handling

The first method to explore is removing the abstracted entries from the data set by coercing the dtype of the two variables of interest to numeric.

This coercion converts any non-numeric entries to NaN, so their associated rows can then be dropped from the entire data set.

First we prepare a copy of the raw data set, dataset_df_processed, on which we can then perform the numeric type coercion.

dataset_df_processed = dataset_df.copy()

# Coerce weight to numeric; non-numeric references become NaN and their rows are dropped.
dataset_df_processed['weight (kilograms)'] = pd.to_numeric(dataset_df_processed['weight (kilograms)'], errors='coerce')
dataset_df_processed.dropna(inplace=True)

# Repeat the coercion and drop for freight cost.
dataset_df_processed['freight cost (usd)'] = pd.to_numeric(dataset_df_processed['freight cost (usd)'], errors='coerce')
dataset_df_processed.dropna(inplace=True)

Abstraction Handling Results

To understand how the data set was affected by this initial processing, we can visualize the shipment counts by country for the unprocessed and processed data sets.
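A minimal sketch of how those per-country counts can be computed on the Python side, assuming the country column from the raw data (the original renders the comparison as interactive charts via R):

# Shipment counts by country, before and after the coercion-based drop.
print(dataset_df['country'].value_counts())
print(dataset_df_processed['country'].value_counts())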

Removing all of the value references has a large effect on the structure of the data. We can confidently say that any analysis derived from this process won't be accurate. However, there are other ways of handling this data problem. For the purposes of future analysis, I will remove these troublesome features entirely.

Feature Removal

The two features to remove are weight (kilograms) and freight cost (usd), as sketched below.
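A minimal sketch of the drop; the name dataset_dropped is hypothetical:

# Remove the two reference-riddled features entirely.
dataset_dropped = dataset_df.drop(columns=['weight (kilograms)', 'freight cost (usd)'])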

Missing Value Handling

Above shows the first 10 rows of data after removing the weight and freight cost features. Notably, dosage, shipment mode and line item insurance still contain missing values. The dosage feature has the most missing data at 1,736 missing values, while shipment mode and line item insurance (usd) have 360 and 287 missing values respectively.

There are several ways to handle missing values in numerical and categorical features, but for the purposes of this demonstration I will apply the common method of removing all rows with missing occurrences, as sketched below.
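A sketch of that drop, building on the hypothetical dataset_dropped frame above; the name dataset_dropna matches the frame referenced later in the MLR section:

# Drop every row that still contains a missing value.
dataset_dropna = dataset_dropped.dropna()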

Handling the troublesome features and missing values this way allowed the data to maintain its general structure in terms of per-country record counts. As expected, dropping the two features had no major effect on the structure of the data.

Feature Preprocessing

Now that we have dealt with the obvious issues in the data, we can continue with an in-depth cleaning and processing procedure. Most of this is handled by a single function, shown and explained below:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_inputs(df, label_mapping):
    # Create copy of df with dropped features and no missing values.
    df_base = df.copy()
    
    # Dropping the ID column because it won't serve any purpose in modeling.
    df_base = df_base.drop('id', axis=1)
    
    # Drop Date Columns with similar referencing issues as `weight` and `freight cost`
    df_base = df_base.drop(['pq first sent to client date',
                            'po sent to vendor date'], axis=1)
    
    # Extract Date Features
    for column in ['scheduled delivery date',
                   'delivered to client date',
                   'delivery recorded date']:
        df_base[column] = pd.to_datetime(df_base[column])
        df_base[column + ' year'] = df_base[column].dt.year
        df_base[column + ' month'] = df_base[column].dt.month
        df_base[column + ' day'] = df_base[column].dt.day
        df_base = df_base.drop(column, axis=1)
        
    # Drop high cardinality columns.
    df_base = df_base.drop(['pq #','po / so #','asn/dn #'], axis=1)
    
    # Binary Encoding
    df_base['fulfill via'] = df_base['fulfill via'].replace({'Direct Drop':0,
                                                             'From RDC':1})
                                                   
    df_base['first line designation'] = df_base['first line designation'].replace({'no': 0,
                                                                                   'yes': 1})
    
    # One-Hot Encoding
    for column in df_base.select_dtypes('object').columns.drop('shipment mode'):
        dummies = pd.get_dummies(df_base[column], prefix=column)
        df_base = pd.concat([df_base, dummies], axis = 1)
        df_base = df_base.drop(column, axis=1)
        
    # Splitting The DataFrame into X and y
    y = df_base['shipment mode']
    X = df_base.drop('shipment mode', axis=1)
    
    # Encoding The Labels In Shipment Mode
    y = y.replace(label_mapping)
    
    # Train Test Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                              shuffle=True,
                                                              random_state=1)
    
    # Scale X
    scaler = StandardScaler()
    scaler.fit(X_train)
    
    X_train = pd.DataFrame(scaler.transform(X_train),
                           index=X_train.index, columns=X_train.columns)
                           
    X_test = pd.DataFrame(scaler.transform(X_test),
                          index=X_test.index, columns=X_test.columns)
    
    return X_train, X_test, y_train, y_test

Further Feature Drops

Firstly, the id column was dropped completely from the data set due to its lack of usefulness in any modeling scenario; having a completely unique value for each shipment is what makes it unhelpful. Secondly, two of the date columns, PQ First Sent to Client Date and PO Sent to Vendor Date, have issues similar to the previous features weight (kilograms) and freight cost (usd), where many of the expected values are references. These two date features are removed from the data set accordingly.

Date Feature Processing

The remaining date columns, Scheduled Delivery Date, Delivered to Client Date and Delivery Recorded Date, were then converted to datetime format. From each, a year, month and day feature were extracted, creating three separate features per column; the original datetime columns were then dropped.

High Cardinality Feature Drop

The three features which express high cardinality, PQ #, PO / SO #, and ASN/DN #, were dropped. Essentially, these features are parallel to an ID-type feature, which makes them of little use in a modeling process.

Binary Encoding

Two features, Fulfill Via and First Line Designation, have two categories each and can be encoded to either 0 or 1. For the Fulfill Via feature, Direct Drop and From RDC are encoded as 0 and 1 respectively. The First Line Designation feature's no and yes are encoded as 0 and 1 respectively.

One-Hot Encoding

This is where the big changes occur! For every column of type object except Shipment Mode (the target of this first model), a one-hot encoding procedure is applied. After the one-hot encoding is done, the original categorical features are removed from the data set.

Prepare for Model 1 (Shipment Mode Prediction)

First, the DataFrame is split on Shipment Mode to create the target y and predictor X sets. A replace call then encodes the shipment modes in the target set with integer labels. We then split the data into X_train, X_test, y_train and y_test with 70% of the data for training and 30% for testing. Lastly, the X sets are standardized via scikit-learn's StandardScaler; the scaler is fit on the training set only and applied to both sets, which avoids leaking test-set statistics into training.

Shipment Mode Prediction (DNN)


This model will utilize a deep neural network architecture built with TensorFlow. Before we start, we must instantiate the processed data from the previous section.
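Below is a sketch of that call. The codes for Air (0) and Ocean (3) are taken from the label distribution discussed further down; the other two assignments are assumptions for illustration:

# Hypothetical label mapping for the four shipment modes.
label_mapping = {'Air': 0, 'Air Charter': 1, 'Truck': 2, 'Ocean': 3}

X_train, X_test, y_train, y_test = preprocess_inputs(dataset_dropna, label_mapping)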

Above is one of the results of the previous function’s call: X_train. Let’s see how y_train looks:

It seems the label replacement on the shipment mode feature has worked and has successfully carried over to this new y_train variable. Let's see the distribution of these labels:
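One quick way to inspect that distribution in Pandas (the original renders it as an interactive chart on the R side):

print(y_train.value_counts())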

The Air shipment mode, represented here by 0, dominates the data with over 3,000 records. The Ocean shipment mode, represented by 3, has the fewest occurrences. This fact is not at all surprising, but it will be interesting to see whether any effect on the model can be noted.

import tensorflow as tf
from sklearn.metrics import confusion_matrix, classification_report

# Input layer sized to the 529 features produced by preprocessing.
inputs = tf.keras.Input(shape=(529,))
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(128, activation='relu')(x)
# Softmax output over the four shipment mode classes.
outputs = tf.keras.layers.Dense(4, activation='softmax')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)
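The confusion_matrix and classification_report imports above point to a held-out evaluation. A minimal sketch of one, assuming the trained model and the test split produced earlier (np comes from the numpy import at the top of the document):

# Convert the softmax probabilities into predicted class labels.
y_pred = np.argmax(model.predict(X_test), axis=1)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))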

Predict Line Item Insurance (MLR)


Prepare Data for MLR

Multiple linear regression is a simple technique, and it only requires predictors which express a linear relationship with a target feature. In this case the target feature will be Line Item Insurance (USD). First, I will extract the variables of interest: the target feature and the potential predictors.

# Candidate predictors alongside the target feature.
MLR_potential = dataset_dropna[['line item quantity', 'line item value', 'pack price', 'unit price', 'line item insurance (usd)']]

With these we can examine potential linear relationships with Line Item Insurance (USD).
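One quick way to screen for such relationships is a correlation check (an assumption on my part; the original relies on the fitted regression plots shown below):

# Pearson correlation of each candidate predictor with the target.
print(MLR_potential.corr()['line item insurance (usd)'])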

Isolated Linear Regression

Below shows the linear regression result between Line Item Quantity and Line Item Insurance:

Similarly, below is the result of a linear regression fitted between Line Item Value and Line Item Insurance:

Two of the four potential predictor variables express a linear relationship of sorts with the Line Item Insurance target feature. Line Item Quantity and Line Item Value can now be used together in a multiple linear regression to obtain a more comprehensive picture of this relationship.

Multiple Linear Regression

from sklearn import linear_model

# Predictors and target for the multiple linear regression.
MLR_Preds = MLR_potential[['line item quantity', 'line item value']]
MLR_Target = MLR_potential['line item insurance (usd)']

regr = linear_model.LinearRegression()
regr.fit(MLR_Preds, MLR_Target)

Insurance Prediction

The following is an insurance prediction where Line Item Quantity is 5500, and Line Item Value is $140,000.

# Predict insurance for a line item of 5,500 units valued at $140,000.
predict_insurance = regr.predict([[5500, 140000]])

The Line Item Insurance result from this prediction is $218.96. Additionally, the measured coefficients for this relationship are -0.0003384 and 0.00140536. This suggests a slight negative relationship between Line Item Quantity and Line Item Insurance, while Line Item Value has the dominant influence in this pair. This shows the usefulness of multiple linear regression: you can tease out how predictors behave jointly rather than examining each in isolation.
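A short sketch of how those fitted values can be read back off the model, using standard scikit-learn attributes:

print(regr.coef_)       # per-predictor slopes: quantity, then value
print(regr.intercept_)  # baseline insurance when both predictors are zero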