The World Health Organization's (WHO) Global Health Observatory (GHO) data repository tracks life expectancy for countries worldwide, along with health status and many other related factors.
Although many past studies of the factors affecting life expectancy have considered demographic variables, income composition, and mortality rates, the effects of immunization and the Human Development Index have often not been taken into account.
This dataset covers a variety of health-related indicators for all countries from 2000 to 2015; all of the indicators are described in the next section.
Ideally, this data will eventually inform countries about which factors to change in order to improve the life expectancy of their populations. If we can predict life expectancy well from all of these factors, that is a good sign that the data contains meaningful patterns. Life expectancy is expressed in years, so it is a continuous numerical quantity; building a predictive model for it is therefore a regression task.
In this project, the main focus is to design, train, and evaluate a neural network model performing the task of regression to predict the life expectancy of countries using the dataset.
This dataset comes from the Life Expectancy Kaggle dataset, submitted by KumarRajarshi and originally scraped from the WHO and United Nations websites with the help of Deeksha Russell and Duan Wang.
First, I load the life_expectancy.csv dataset into a pandas DataFrame: after importing pandas, I use the pandas.read_csv() function to read the file and assign the resulting DataFrame to a variable called df.
import pandas as pd
import numpy as np
# Set options for printing
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Load the dataset
df = pd.read_csv("life_expectancy.csv")
The next step is to inspect the data by printing the first entries of the DataFrame df with the df.head() method.
df.head()
# Sanity check whether the dataset has any missing values
print(df.isnull().values.any())
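For a more detailed view than a single True/False answer, the missing values can also be counted per column (a small optional check, not part of the original pipeline):
# Optional: count missing values in each column of df
print(df.isnull().sum().sort_values(ascending=False))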
From here, we know what these columns represent:
Country: the name of the country.
Year: the year of observation in that country.
Status: categorical, either Developing or Developed.
Adult Mortality: the probability of dying between 15 and 60 years for both sexes (per 1000 population).
infant deaths: the number of infant deaths per 1000 population.
Alcohol: alcohol consumption per capita (age 15+), in litres of pure alcohol.
percentage expenditure: expenditure on health as a percentage of Gross Domestic Product per capita (%).
Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%).
Measles: the number of reported measles cases (per 1000 population).
BMI: the average Body Mass Index of the entire population of that country and year.
under-five deaths: the number of under-five deaths (per 1000 population).
Polio: Polio (Pol3) immunization coverage among 1-year-olds (%).
Total expenditure: general government expenditure on health as a percentage of total government expenditure (%).
Diphtheria: diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%).
HIV/AIDS: deaths due to HIV/AIDS among 0-4 year-olds (per 1000 live births).
GDP: Gross Domestic Product per capita (in USD).
Population: the population of the country.
thinness 1-19 years: the prevalence of thinness among children and adolescents aged 10 to 19 (%).
thinness 5-9 years: the prevalence of thinness among children aged 5 to 9 (%).
Income composition of resources: the Human Development Index in terms of income composition of resources (index ranging from 0 to 1).
Schooling: the number of years of schooling (years).
Life expectancy: the life expectancy in years.
Next, I drop the Country and Year columns from the DataFrame using the DataFrame drop method. Why? Knowing which country a row comes from can confuse a predictive model, and it is not a column we can generalize over. The goal is to learn a pattern that holds across all countries, not one tied to a specific country and/or year.
df.drop(['Country', 'Year'], axis=1, inplace=True)
df
After dropping the aforementioned columns, I'll split the data into labels and features. The labels are contained in the Life expectancy column, which is the final column of the DataFrame, so one way to extract them is to use iloc indexing and assign the final column to the labels variable.
labels = df.iloc[:, -1]
labels
The features span from the first column up to, but not including, the last column (since Life expectancy is the label). As before, the features are selected with iloc indexing over a subset of columns.
features = df.iloc[:, np.r_[0:19]]
features
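An equivalent, slightly simpler selection would drop just the last (label) column; this alternative is only a sketch and is not used below:
# Alternative: select every column except the last one (the label)
features = df.iloc[:, :-1]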
One column in this dataset is categorical, namely the Status column. Categorical columns need to be converted into numerical columns, for example by one-hot encoding. A convenient way to do this is pandas.get_dummies(); I apply it to the categorical column and assign the result back to the features variable.
features = pd.get_dummies(data = features, columns= ['Status'])
features # Check whether one-hot encoding works as intended
Now that the data has been cleaned, it is split into a training set and a test set using the sklearn.model_selection.train_test_split() function.
Variables to assign:
features_train
labels_train
features_test
labels_test
For this project, the data is split randomly into 33% test data and 67% training data.
Note that the 67% training portion will itself be split again into an 80% training set and a 20% validation set when the model is fitted.
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.33, random_state=42)
features_train
The next step is to normalize the data's numerical features.
Z-score normalization (standardization) is wrapped in a sklearn.compose.ColumnTransformer object; the variable ct will be used to set up the normalization procedure.
Keep in mind that only numerical columns should be passed to the StandardScaler inside the ColumnTransformer. Columns not listed when the object is created are not scaled; with remainder='passthrough' specified, sklearn passes them through unchanged instead of dropping them.
print("Feature columns are: ", features.columns.tolist())
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([
(
'scaler',
StandardScaler(),
['Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness 1-19 years', ' thinness 5-9 years', 'Income composition of resources', 'Schooling', 'Status_Developed', 'Status_Developing']
)
], remainder='passthrough')
After instantiating the ColumnTransformer object ct, fit and transform the training data with the ColumnTransformer.fit_transform() method and assign the result to a variable called features_train_scaled.
Then transform the test data features_test with the already fitted instance ct, using the ColumnTransformer.transform() method (not fit_transform, so that the test set is scaled with the statistics learned from the training set). The result is assigned to a variable called features_test_scaled.
features_train_scaled = pd.DataFrame(ct.fit_transform(features_train), columns = features_train.columns)
features_test_scaled = pd.DataFrame(ct.transform(features_test), columns = features_test.columns)
features_test_scaled
In this project, I'll be building and training a neural network from scratch using the Sequential() class from tensorflow.keras.models. Afterwards, if the evaluation metrics are satisfactory, I'll export the model with its computed weights (the save_model helper below uses TensorFlow's SavedModel format, which stores the graph in a .pb file; the architecture can additionally be exported as .json).
I start with a dummy model before performing the regression task. This provides a baseline that I'll use to evaluate whether the model I build performs reasonably or not.
For this project, the baseline is computed with the mean strategy and evaluated with the Mean Absolute Error metric.
Feel free to use the median or quantile strategies instead when reproducing the result.
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(features_train_scaled, labels_train)
y_pred = dummy_regr.predict(features_test_scaled)
MAE_baseline = mean_absolute_error(labels_test, y_pred)
print("The model must have a validation error lower than", round(MAE_baseline, 3))
For reproducibility, I'll wrap the model-building and training code in functions instead of creating new instances by hand every time. These functions are free to tweak for your own purposes and/or datasets, but for this project the model specifications are as follows:
relu activation in the hidden layers;
Verbose set to True;
EarlyStopping enabled, monitoring validation loss with patience set to 5.
In the case of EarlyStopping, the patience parameter is kept small despite the large number of initial epochs in order to prevent overfitting.
The model itself is built inside the design_model function.
## Define functions as wrappers
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
def design_model(X, learning_rate):
    """Function to instantiate and compile the model.
    Input: 2-dimensional NumPy array or pandas DataFrame, and the specified learning rate.
    Returns: Compiled (untrained) model with 20% dropout between hidden layers"""
    model = Sequential(name="second_model")
    model.add(tf.keras.Input(shape=(X.shape[1],)))  # Number of input nodes matches the number of features
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(32, activation='relu'))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(1))  # Output layer: a single continuous value
    opt = Adam(learning_rate=learning_rate)
    model.compile(loss='mse', metrics=['mae'], optimizer=opt)
    return model
def plot(history, path=None):
    """Function to plot learning curves.
    Input: Keras History object holding the recorded metrics per epoch
    Output: Learning curves (loss and MAE) for both training and validation sets"""
    fig, axs = plt.subplots(1, 2, gridspec_kw={'hspace': 1, 'wspace': 0.5})
    (ax1, ax2) = axs
    ax1.plot(history.history['loss'], label='train')
    ax1.plot(history.history['val_loss'], label='validation')
    ax1.legend(loc="upper right")
    ax1.set_xlabel("# of epochs")
    ax1.set_ylabel("loss (mse)")
    ax2.plot(history.history['mae'], label='train')
    ax2.plot(history.history['val_mae'], label='validation')
    ax2.legend(loc="upper right")
    ax2.set_xlabel("# of epochs")
    ax2.set_ylabel("MAE")
def fit_model(model, f_train, l_train, learning_rate, num_epochs):
    """Function to train the model with stochastic gradient descent (batch size 1) and early stopping.
    Input: 2-dimensional NumPy array or pandas DataFrame with the training features and labels
    Returns: Trained model with calculated weights"""
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
    model.fit(f_train, l_train, epochs=num_epochs, batch_size=1, verbose=1, validation_split=0.2, callbacks=[es])
    return model
def save_model(model):
    """Function to save the pre-trained model in TensorFlow's SavedModel format.
    Input: pre-trained model"""
    return model.save('model')
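Since the introduction also mentions a .json export, the architecture alone (without weights) could be written out with Keras's to_json(); this is an optional sketch, and the filename model_architecture.json is my own choice:
def save_architecture_json(model, path="model_architecture.json"):
    """Write only the model architecture (no weights) as a JSON file."""
    with open(path, "w") as f:
        f.write(model.to_json())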
Before starting training, I'll double check that the layers are set up correctly by calling the model.summary() method.
learning_rate = 0.01 # Specify learning rate
num_epochs = 200 # Maximum number of epochs, i.e. full passes over the training data
model = design_model(features_train_scaled, learning_rate)
model.summary()
fit_model(model, features_train_scaled, labels_train, learning_rate, num_epochs)
print("------ TRAINING FINISHED! ------".center(110))
mse, mae = model.evaluate(features_test_scaled, labels_test)
print("------------------ EVALUATION FINISHED! ------------------".center(115))
print("Final Mean Squared Error (loss func.) is {}\nFinal Mean Absolute Error (eval. metric) is {}".format(mse, mae))
The observed baseline for the evaluation metric is 7.833. After training, this simple yet effective network reaches an error of 3.54 on the test dataset.
The model therefore performs more than twice as well as the baseline, so the result can be considered satisfactory.
We can use this model to predict the life expectancy of unseen data points from this dataset, regardless of country and/or year, with a small expected error.
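As a quick illustration (my own addition, reusing the variables defined above), the model's predictions can be compared with the true labels for a few test rows:
# Compare predictions and true labels for the first five test samples
preds = model.predict(features_test_scaled.iloc[:5])
print("Predicted:", preds.flatten().round(1))
print("Actual:   ", labels_test.iloc[:5].values)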
# Save the model
save_model(model)