Ilha SolteiraIlha Solteira UNIVERSIDADE ESTADUAL PAULISTA "JÚLIO DE MESQUITA FILHO" School of Engineering of Ilha Solteira - SP Post-Graduate Program in Electrical Engineering SELENE LEYA CERNA ÑAHUIS Comparative Analysis of XGBoost, MLP and LSTM Techniques for The Problem of Predicting Fire Brigade Interventions Análise Comparativa das Técnicas XGBoost, MLP e LSTM para o Problema de Prever Intervenções de Bombeiros Ilha Solteira 2019 SELENE LEYA CERNA ÑAHUIS Comparative Analysis of XGBoost, MLP and LSTM Techniques for The Problem of Predicting Fire Brigade Interventions Análise Comparativa das Técnicas XGBoost, MLP e LSTM para o Problema de Prever Intervenções de Bombeiros Dissertation presented to the São Paulo State University (UNESP) - School of En- gineering - Campus of Ilha Solteira, in ful- filment of one of the requirements for ob- taining the Master’s degree in Electrical Engineering. Field of Expertise: Automation. Prof. Dr. Anna Diva Plasencia Lotufo Advisor Prof. PhD. Christophe Guyeux Co-Advisor Ilha Solteira 2019 A mi Oshito bonita, por su amor y paciencia. ACKNOWLEDGEMENTS I would like to express my thanks to my advisor, Anna Diva, for accepting me in the master and giving me the chance to meet this beautiful country, Brazil. For guiding me and encouraging me throughout my research with numerous advice and for the friendship and patience show up these years. I thank my co-advisor Christophe Guyeux, for the trust placed in me and for giving me the precious opportunity to contribute to firefighters during this research. For being an example to me as a researcher with passion and commitment, for sharing his knowledge and teaching me with patience. I am also grateful to Pierre-Cyrille Heam and Raphaël Couturier, professors in The Uni- versity of Franche-Comté, for their welcome and support in the AND team, Femto-ST. And a special thanks to Cap. Guillaume Royer from SDIS25, for giving me the honour of work with such a great organization, the fire brigade of Doubs, whom I admire for their noble effort and dedication to the community good. I would like to infinitely thank my parents, Eulalia and Fernando, who never measured efforts to give studies to their children and always encouraged us to be persevering and to strive for what we want. To my brother Edwin, for all his effort working and helping the family since his childhood. To my sister Liesel, for taking care of me and giving me a profession. And to my brother Fernando, for sharing his knowledge with me since I was a child and for showing me the scientific way. I thank Héber, my special person, who patiently guided me and cheered me in the writing of my scientific texts, for accompanying me in long discussions about research and incoherences, with whom I discovered Brazil. I thank my good friends Tania and Monara, who welcomed me affectionately in SINTEL, they understood my zero or poorly spoken Portuguese and helped me so kindly during the master’s degree. Finally, this study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001". ABSTRACT Many environmental, economic and societal factors are leading fire brigades to be increasingly solicited, and, as a result, they face an ever-increasing number of interventions, most of the time on constant resource. On the other hand, these interventions are directly related to human activ- ity, which itself is predictable: swimming pool drownings occur in summer while road accidents due to ice storms occur in winter. 
One solution to improve the response of firefighters on con- stant resource is therefore to predict their workload, i.e., their number of interventions per hour, based on explanatory variables conditioning human activity. The present work aims to develop three models that are compared to determine if they can predict the firefighters’ response load in a reasonable way. The tools chosen are the most representative from their respective categories in Machine Learning, such as XGBoost having as core a decision tree, a classic method such as Multi-Layer Perceptron and a more advanced algorithm like Long Short-Term Memory both with neurons as a base. The entire process is detailed, from data collection to obtaining the predictions. The results obtained prove a reasonable quality prediction that can be improved by data science techniques such as feature selection and adjustment of hyperparameters. Keywords: Firefighters. Prediction. XGBoost. Long Short-Term Memory. Multi-Layer Per- ceptron. Mutual Information Regression. Principal Component Analysis. RESUMO Muitos fatores ambientais, econômicos e sociais estão levando as brigadas de incêndio a serem cada vez mais solicitadas e, como consequência, enfrentam um número cada vez maior de inter- venções, na maioria das vezes com recursos constantes. Por outro lado, essas intervenções estão diretamente relacionadas à atividade humana, o que é previsível: os afogamentos em piscina ocorrem no verão, enquanto os acidentes de tráfego, devido a tempestades de gelo, ocorrem no inverno. Uma solução para melhorar a resposta dos bombeiros com recursos constantes é pre- ver sua carga de trabalho, isto é, seu número de intervenções por hora, com base em variáveis explicativas que condicionam a atividade humana. O presente trabalho visa desenvolver três modelos que são comparados para determinar se eles podem prever a carga de respostas dos bombeiros de uma maneira razoável. As ferramentas escolhidas são as mais representativas de suas respectivas categorias em Machine Learning, como o XGBoost que tem como núcleo uma árvore de decisão, um método clássico como o Multi-Layer Perceptron e um algoritmo mais avançado como Long Short-Term Memory ambos com neurônios como base. Todo o processo é detalhado, desde a coleta de dados até a obtenção de previsões. Os resultados obtidos demon- stram uma previsão de qualidade razoável que pode ser melhorada por técnicas de ciência de dados, como seleção de características e ajuste de hiperparâmetros. Palavras-chave: Bombeiros. Previsão. XGBoost. Long Short-Term Memory. Multi-Layer Perceptron. Mutual Information Regression. Principal Component Analysis. LIST OF FIGURES Figure 1 Bias-Variance tradeoff in machine learning . . . . . . . . . . . . . . . 23 Figure 2 Learning a tree on single variable . . . . . . . . . . . . . . . . . . . . 25 Figure 3 A Multi-Layer Perceptron. The S-shaped curves in the hidden and out- put layers indicate the application of “sigmoidal” nonlinear activation functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 4 A recurrent neuron (left), unrolled through time (right). . . . . . . . . 30 Figure 5 LSTM cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 6 Standard scaler representation . . . . . . . . . . . . . . . . . . . . . . 33 Figure 7 One hot encoder representation . . . . . . . . . . . . . . . . . . . . . 34 Figure 8 Procedure for estimating the MI . . . . . . . . . . . . . . . . . . . . . 35 Figure 9 Transformation with PCA . . . . . . 
. . . . . . . . . . . . . . . . . . 36 Figure 10 Number of interventions per year . . . . . . . . . . . . . . . . . . . . 41 Figure 11 Frequency of number of interventions per hour . . . . . . . . . . . . . 41 Figure 12 Outliers in the occurrence of interventions . . . . . . . . . . . . . . . 41 Figure 13 Maximum number of interventions . . . . . . . . . . . . . . . . . . . 42 Figure 14 Precipitation each 1h . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Figure 15 L’Est Republicain reports on the storms . . . . . . . . . . . . . . . . . 43 Figure 16 Precipitations peaks recorded from Basilea Station . . . . . . . . . . . 43 Figure 17 Storm area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 18 Precipitation each 3h . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 19 Ten minutes average wind speed . . . . . . . . . . . . . . . . . . . . 44 Figure 20 Features structure before modelling . . . . . . . . . . . . . . . . . . . 48 Figure 21 XGBoost - Feature importance . . . . . . . . . . . . . . . . . . . . . 54 Figure 22 XGBoost - Tree construction . . . . . . . . . . . . . . . . . . . . . . 55 Figure 23 MLP model architecture with 3 past hours . . . . . . . . . . . . . . . 56 Figure 24 LSTM model architecture with 24 time steps . . . . . . . . . . . . . . 57 Figure 25 XGBoost with three past hours for predicting one hour . . . . . . . . . 58 Figure 26 XGBoost with twelve past hours for predicting one hour . . . . . . . . 59 Figure 27 XGBoost with twenty-four past hours for predicting one hour . . . . . 60 Figure 28 MLP with three past hours for predicting one hour . . . . . . . . . . . 61 Figure 29 MLP with twelve past hours for predicting one hour . . . . . . . . . . 62 Figure 30 MLP with twenty-four past hours for predicting one hour . . . . . . . 63 Figure 31 LSTM with three past hours for predicting one hour . . . . . . . . . . 64 Figure 32 LSTM with twelve past hours for predicting one hour . . . . . . . . . 65 Figure 33 LSTM with twenty-four past hours for predicting one hour . . . . . . 66 Figure 34 XGBoost, MLP and LSTM with three past hours for predicting one hour 67 Figure 35 XGBoost, MLP and LSTM with twelve past hours for predicting one hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Figure 36 XGBoost, MLP and LSTM with twenty-four past hours for predicting one hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 LIST OF TABLES Table 1 Initialization parameters for each type of activation function . . . . . . 29 Table 2 Illustrated example of the dictionary . . . . . . . . . . . . . . . . . . . 40 Table 3 Numerical variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Table 4 Categorical variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Table 5 List of features after applying mutual information regression selection . 47 Table 6 Differences between MIR and PCA . . . . . . . . . . . . . . . . . . . 47 Table 7 XGBoost hyperparameters settings . . . . . . . . . . . . . . . . . . . . 49 Table 8 MLP hyperparameters Settings . . . . . . . . . . . . . . . . . . . . . . 50 Table 9 LSTM hyperparameters Settings . . . . . . . . . . . . . . . . . . . . . 50 Table 10 XGBoost prediction results for next one hour on test data 2017 . . . . . 52 Table 11 MLP prediction results for next one hour on test data 2017 . . . . . . . 52 Table 12 LSTM prediction results for next one hour on test data 2017 . . . . . . 
52 LIST OF ABBREVIATIONS & ACRONYMS XGBoost Extreme Gradient Boosting ANN Artificial Neural Network FNN Feedforward Neural Network MLP Multi-Layer Perceptron RNN Recurrent Neural Network LSTM Long Short-Term Memory SVM Support Vector Machine GA Genetic Algorithm PSO Particle Swarm Optimization KDE Kernel Density Estimation MIR Mutual Information Regression PCA Principal Component Analysis RMSE Root Mean Squared Error MAE Mean Absolute Error SDIS25 Service Départemental d’Incendie et de Secours 25 ACC Accuracy DBN Deep Belief Network RBM Restricted Boltzmann Machine CART Classification and Regression Trees GBM Gradient Boosting Machine MSE Mean Squared Error DNN Deep Neural Network Tanh Hyperbolic Tangent function ReLU Rectified Linear Unit function RTRL Real Time Recurrent Learning BPTT Back-Propagation Through Time OHE One-Hot Encoding MI Mutual Information SVD Singular Value Decomposition ACC0E Accuracy with margin of error 0 ACC1E Accuracy with margin of error 1 ACC2E Accuracy with margin of error 2 GRU Gated Recurrent Unit TABLE OF CONTENTS 1 INTRODUCTION 13 1.1 MOTIVATION 15 1.2 CONTRIBUTIONS 15 1.3 OUTLINE OF THE THESIS 16 2 MACHINE LEARNING TECHNIQUES 18 2.1 INTRODUCTION 18 2.2 GRADIENT BOOSTING MACHINES 21 2.2.1 OVERVIEW 21 2.2.2 EXTREME GRADIENT BOOSTING 21 2.3 ARTIFICIAL NEURAL NETWORKS 25 2.3.1 OVERVIEW 25 2.3.2 MULTI-LAYER PERCEPTRON 26 2.3.3 LONG SHORT-TERM MEMORY NEURAL NETWORK (LSTM) 29 2.4 ADVANCED FEATURES 32 2.4.1 FEATURE EXTRACTION 32 2.4.2 FEATURE SELECTION 34 2.4.3 MODEL SELECTION 36 3 DATA PREPROCESSING 38 3.1 DATA COLLECTION 38 3.2 DATA CLEANING 40 3.3 VARIABLES ENCODING 44 3.4 DATA SET DIMENSIONALITY REDUCTION 45 4 BUILDING PREDICTION MODELS 48 4.1 MODELLING WITH XGBOOST 49 4.2 CONSTRUCTING A MULTI-LAYER PERCEPTRON 49 4.3 CONSTRUCTING A LONG SHORT-TERM MEMORY NEURAL NETWORK 50 4.4 METRICS 50 4.5 EXPERIMENTAL RESULTS AND DISCUSSION 51 5 CONCLUSIONS AND SUGGESTIONS FOR FURTHER RESEARCH 70 5.1 CONCLUSIONS 70 5.2 SUGGESTIONS FOR FURTHER RESEARCH 71 5.3 SCIENTIFIC PUBLICATIONS 71 REFERENCES 73 13 1 INTRODUCTION In the past decade, machine learning has provided effective web search, practical speech and image recognition, efficient statistical arbitrage, more accurate medical diagnostic systems, self-driving cars, and an immense improvement in understanding the human genome. Machine learning is so widespread that people use it every day without knowing it, changing lives and providing an irreplaceable service in the industry (GUPTA; GUSAIN; POPLI, 2016). Numer- ous researchers believe that is the best way to get to human level artificial intelligence. Machine learning aims to spot a function based in a great set of data. The function is created with a first training group of data in order to forecast outputs of a previously unseen data set called testing set, always searching the highest possible accuracy. This could be seen as classification problems in the form of labels and classes or as regression problems in the form of continuous values (SJARDIN; MASSARON; BOSCHETTI, 2016). Nevertheless, it is often hard to find enough training data available, i.e., lack of suitable data, lack of access to the data, privacy problems, badly chosen tasks and algorithms, and a lack of resources; as a result, often several machine learning programs fail to deliver the expected value. The goal of building these systems is that they could be adaptable, learn from their expe- rience and being enough simple. 
With this purpose, two approaches stand out in machine learning: decision tree learning and artificial neural networks.

Decision trees have long been used for classification purposes. A highly optimized classification system was obtained when Friedman et al. (FRIEDMAN; HASTIE; TIBSHIRANI, 1998) applied the boosting technique to build decision tree ensembles. Later, the gradient descent approach was introduced to boosting (MASON et al., 1999) and, along with its stochastic variant (FRIEDMAN, 2002), the boosted tree model increased robustness against the overcapacity of the base learner by adding trees sequentially while fitting a simple parameterized function. Despite all its improvements and wide use, it was still not efficient in terms of processing time and model complexity. In order to reduce the latter, Johnson et al. (JOHNSON; ZHANG, 2014) proposed a modification of gradient boosting, called the Fully Corrective Greedy Algorithm, achieving higher accuracies and smaller models. Nevertheless, this was not enough. In order to cope with this complexity, the Extreme Gradient Boosting (XGBoost) system was created, a highly scalable and accurate learning system (CHEN; GUESTRIN, 2016), which has gained a lot of attention in recent years for its simplicity and its great success in competitions such as Kaggle.

Alternatively, Artificial Neural Networks (ANNs) are versatile, powerful and scalable, excellent for coping with large and tremendously complex machine learning tasks such as categorizing billions of images as Google Images does, or DeepMind's AlphaGo, which defeated the world champion at the game of Go by learning from and analyzing millions of past games (GÉRON, 2017). Originally, ANNs were developed as a mathematical representation of the capabilities of biological brains when information is processed (MCCULLOCH; PITTS, 1988). Despite the fact that nowadays it is known that ANNs bear little resemblance to real biological neurons, they remain popular as applied methods for pattern classification. A large number of ANN varieties have appeared over the years, with very diverse properties. A relevant distinction is between ANNs with acyclic connections and those with cyclic ones. ANNs without cycles are identified as Feedforward Neural Networks (FNNs), whose most recognized example is the Multi-Layer Perceptron (MLP), which has its origin in (ROSENBLATT, 1958). ANNs with cycles are described as recursive or, more commonly, Recurrent Neural Networks (RNNs). The main idea of RNNs is that they can manage contextual information when input and output sequences are mapped. Unfortunately, in practice, conventional RNN architectures have quite limited access to the range of context, because of the vanishing gradient problem, in which the impact of an input on the hidden layer(s) and, consequently, on the output decays or explodes exponentially as it travels through the recurrent connections of the network (GRAVES, 2012). One attempt to overcome this problem was proposed in 1997 by Hochreiter and Schmidhuber (HOCHREITER; SCHMIDHUBER, 1997) with Long Short-Term Memory (LSTM), which has emerged as an effective and scalable model for several learning problems related to sequential data (GREFF et al., 2017).

Around the world, fire brigades seek strategies to cope with the need to respond to incidents immediately.
Researches based on the forecast of the number of interventions, response speed, materials and engine used by fire departments are scarce in the literature. However, we can find machine learning techniques applied to forecasting occupational accidents using Support Vec- tor Machine (SVM) and Multi-Layer Perceptron (MLP) optimized by Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), where the aim was to predict the accident outcomes such as injury and property damage using occupational accident data (SARKAR et al., 2018). Another example, Ren et al. (REN et al., 2018) worked with a deep learning approach, using LSTM neural networks to predict the traffic accident risk, taking into consideration weather variables, periodical patterns and getting from the traffic accidents the spatial distribution pat- terns. The traffic accident data was discretized in space and time, and used as inputs to the deep model during training. Then, the model was provided with recent historical data to even- tually obtain the real-time traffic accident risk prediction. Gerber (GERBER, 2014) developed a research using spatio-temporally tagged tweets with a Kernel Density Estimation (KDE) tech- nique in order to predict crimes where the crime data collected belonged to the Chicago Police Department and the selected tweets were the ones tagged with the geographical positioning sys- 1.1 MOTIVATION 15 tem coordinates corresponding to the limits of Chicago and Illinois cities. In the work of (SHI et al., 2017), the author proposed an approach based on XGBoost to predict urban fire acci- dents using ten million samples as data set, an algorithm based on association rules to select features, as well as the Box-Cox transformation to clean outliers. As a result, they obtained 90% of accuracy. Finally, Pettet et al. (PETTET et al., 2017) developed a data-driven toolchain to forecast the likelihood of vehicular incidents in given time and location using incident data from the Nashville Fire Department of the United States of America, considering weather and road type information as added variables. The authors combined a Similarity Based Agglomer- ative Clustering to categorize incidents with similar characteristics, a survival analysis to learn the probability of incidents per cluster and a Bayesian Network inference technique to map the clusters to the spatial locations. Thereby, most of the problems that firefighters face on are related to the management of the budget, rising call volume, and personnel and equipment shortages. On the other side, fire departments have been collecting data on their interventions; notwithstanding, few of them use data science to elaborate a decision making approach. Knowing the number of interventions in the next hours can help to decrease the response time of firefighters and improve their ability to save lives and rescue people. In this research, it is developed three models for this purpose: one with XGBoost, another with MLP and a last one with LSTM. Finally, their performances are compared in order to find the most efficient in terms of prediction accuracy. 1.1 MOTIVATION There is a need to build an accurate model that is capable of evaluating the possible oc- currence of an incident using information that happened in the past. Therefore, it could help fire departments to establish well planned strategies with their allocated mobile and personnel resources. Thus, the response time of firefighters would be reduced and more lives would be safe. 
In the literature, there are few researches related to the forecasts of interventions carried out by firefighters, that is the reason for which in this work three paradigms are constructed using different techniques of machine learning, and their performances are evaluated to discover which fits better to the considered data. 1.2 CONTRIBUTIONS In this research, the aim is to develop, evaluate and compare the performance of three regression models applied to the prediction of the number of interventions that the fire brigade 1.3 OUTLINE OF THE THESIS 16 SDIS25 in Doubs, France, would face on. The techniques used are: • XGBoost, a software library which provides a gradient boosting framework; • MLP, a classical machine learning method; • LSTM, considered a popular and more advanced neural network algorithm. The procedure establishes the following objectives: i) Collect and clean data from various sources. The initial data set is provided by the fire brigade “Service Départemental d’Incendie et de Secours” of Doubs. For complementary sources, it will be considered weather-related data, traffic hours, epidemiological data, academic vacations and holidays, astronomical variables as moon phase and others. ii) Encode and select data. As the initial data are not defined by float value pairs, it is necessary to applied feature extraction techniques such as Standard Scaler and One-Hot-Encoding from Scikit-Learn library in Python. And in order to recognize the relationship of all vari- ables and if they all are needed for the modelling, it will be tested and compared two tech- niques: Mutual Information Regression (MIR) and Principal Component Analysis (PCA), also from Scikit-Learn library. iii) Build models with XGBoost, MLP and LSTM. During the construction of the models and to improve the predictions, it will be made a research to find the best configuration of hyperparameters. For XGBoost, it will be used a grid search and for MLP and LSTM, a random search. iv) Compare the models’ performances. This activity is made taking into account the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Accuracy (ACC) as eval- uation metrics. Furthermore, this research shows the feasibility of the paradigms constructed, illustrates several variables not considered yet in the prediction and initializes discussions about these issues. Finally, it provides a path to the further development of predictive models for firefighters interventions considering machine learning techniques. 1.3 OUTLINE OF THE THESIS The rest of the thesis is organized as follows: 1.3 OUTLINE OF THE THESIS 17 • Chapter 2: gives an introduction to machine learning, explaining learning algorithms, its advantages, challenges and limitations considered in this research. • Chapter 3: describes how the data were collected, cleaned, encoded and selected before feeding the models, and depicts the tuning made to each model to improve their hyperpa- rameters. • Chapter 4: exhibits the predicting models developed, the results obtained and a discussion interpreting the highlighting study results. • Chapter 5: concludes about the subject by offering important insights about the research problem and makes suggestions for future studies. 18 2 MACHINE LEARNING TECHNIQUES 2.1 INTRODUCTION The Internet, the enormous computational power of the recently electronic devices and the fact that practically all process in the world make use of any kind of software are providing a great quantity of data every second. 
There are many things to do with this data such as analyze it to later take a decision, visualize it in a lot of ways or consider it as a source of experience to enhance the performance of the algorithms. These algorithms, which can learn from previous data, comprehends the field of Machine Learning (GARRETA; MONCECCHI, 2013). Some definitions that try to encapsulate the ideal objective or ultimate aim of machine learning are expressed as follows: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Mitchell, 1997, p.2) Machine learning algorithms can figure out how to perform important tasks by gen- eralizing from examples. This is often feasible and cost-effective where manual program- ming is not. As more data becomes available, more ambitious problems can be tack- led. As a result, machine learning is widely used in computer science and other fields. (DOMINGOS, 2012, p.1) As it is described in (GÉRON, 2017), there are many different types of machine learning algorithms and it is useful to classify them in broad categories based on: • Training with or without human supervision. – Supervised Learning. During the training phase, the algorithm is fed by data called Predictors indicating the desired results named Labels. The task types could be classification, like classify new emails as spam or non-spam; or regression where the aim is to predict a numerical value, such as the price of a car. This last sort of prediction is also used for algorithmic trading. Some examples of supervised learning algorithms are: K-Nearest Neighbors, Linear Regression, Support Vector Machine, Decision Trees, Random Forests and several Neural Networks architec- tures (GÉRON, 2017; GOODFELLOW; BENGIO; COURVILLE, 2016). 2.1 INTRODUCTION 19 – Unsupervised learning. Samples of the training data are unlabeled; therefore, the system attempts to learn without a guide. The model is trained with normal instances to later, during the prediction, recognizes abnormal ones. However, two main diffi- culties are known: the generalization that refers to the exponential growth of config- urations during the learning process and the computational challenge where many algorithms imply intractable computations. As examples: Clustering, Visualization and dimensionality reduction, and Association rule learning are referred (GÉRON, 2017; GOODFELLOW; BENGIO; COURVILLE, 2016). – Semi-supervised. The algorithms can handle a lot of unlabeled data with a few of labeled data. An example of these algorithms is the Deep Belief Networks (DBNs) that have the component Restricted Boltzmann Machines (RBMs) which is trained sequentially in an unsupervised manner and next fine-tuned using supervised learn- ing techniques (GÉRON, 2017; GARRETA; MONCECCHI, 2013). – Reinforcement learning: The system is named Agent and it has the ability to ob- serve the environment, choose and execute actions to get rewards in return or penal- ties. Hence, the agent learns by itself the best strategy. (GÉRON, 2017; GARRETA; MONCECCHI, 2013). • Learning incrementally on the fly. When the system is trained using all data available and then launched into production without learning anything else, the process is called Batch learning. This is done offline and consumes plenty of time and resources. 
To update the system, it requires to be trained a new version of itself, including the new data and the old data; then, the old system would be stopped and replaced by the new one. In order to make the computation faster and more space efficient, it can be used another method called Online learning, where the system is trained sequentially by small groups called Mini-batches, learning new data on the fly (GÉRON, 2017; BURLUTSKIY et al., 2016). • Learning by generalization of examples. It is divided into two types: Instance based-learning, where samples are learned by heart and the generalization process is using a similarity criterion; and Model-based learning, where a model is built using a set of samples and the predictions are made based on this model (GÉRON, 2017). For example, a state-of-the-art spam filter may learn on the fly using a deep neural network model trained; this makes it an online, model-based, supervised learning system. Regardless of learning style or function, all combinations of machine learning algorithms consist of three components according to (DOMINGOS, 2012): 2.1 INTRODUCTION 20 • Representation. The classifier must be depicted in a language that computer could un- derstand. • Evaluation. An evaluation function, also called objective function, is required to distin- guish good classifiers from the least efficient ones. The objective function used locally by the algorithm may diverge from the external one that it is desired the classifier to optimize. • Optimization. It is needed a method to optimize the classifiers and obtain the highest- scoring one. The selection of an optimization technique is essential to the efficiency of the learner, helping to discover more than one optimum. As it was described before, there are different approaches to getting machines to learn. With all of their processing power, they are able to more quickly highlight or find patterns in big data that would have otherwise been missed by human beings. Machine learning is a tool that can be used to enhance humans’ abilities to solve problems and make informed inferences on a wide range of situations. However, there are some challenges and limitations to overcome, such as the historical overfitting and underfitting. This could be explained by dividing the generalization error into bias and variance, where the bias is a learner’s tendency to constantly learn the same erroneous thing, resulting in underfitting, and variance is the proneness to learn random things regardless the real signal, resulting in overfitting. After overfitting, the most crucial problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to men- tion the fact that several algorithms that work well in low dimensions becomes unmanageable when the input is high-dimensional, consequently, it is more difficult to understand the data. Another characteristic to regard is the fact that when a learning algorithm is not giving good results, often the quicker path to successful outcomes is to feed the machine more data, which is the primary driver of progress in machine and deep learning algorithms in recent years. Nev- ertheless, this can lead to issues with scalability, in which more data is available but time to learn that data remains an issue (DOMINGOS, 2012). A well-known approach that have made great gains over the past decade is Deep Learning, which is the study and design of machine learning algorithms for learning good representation of data at multiple levels of abstraction. 
Recent publicity of deep learning through DeepMind, Facebook, and other institutions has highlighted it as the “next frontier” of machine learning. 2.2 GRADIENT BOOSTING MACHINES 21 2.2 GRADIENT BOOSTING MACHINES 2.2.1 OVERVIEW A Decision Tree is an analytical method with a schematic representation of alternatives that facilitates making better decisions. The goal of a decision tree is to learn a series of decision rules to infer the target labels. Utilizing an iterative algorithm, the procedure starts at the root level of the tree and splits the data recursively on the feature that gets the lowest impurity. The majority of scalable tree-based applications are based on Classification and Regression Trees (CART) introduced by Breiman, Friedman Stone and Ohlson in 1984. CART works building binary trees for classes or continuous variables, operating iteratively on given features and improving in a greedy way the results of an error metric expressed as impurity in order to finally obtain a prediction (SJARDIN; MASSARON; BOSCHETTI, 2016). Decision trees are the basis for ensemble methods, such as bagging and boosting. Bagging creates several subsets of data from training sample chosen randomly, where each collection generates its own decision tree. All of these produce an ensemble of different models averaged, i.e., an aggregated predictor (BREIMAN, 1996). Boosting fits consecutive trees generated by a random sample, its goal is to solve for net error from the prior tree. When an input is incorrectly classified by a hypothesis, its weight is increased and therefore the next hypothesis will try to classify it in the most precise way. Thus, weak learners, which performs just slightly better than random guessing can be “boosted” into an accurate “strong” learner (FRIEDMAN; HASTIE; TIBSHIRANI, 1998; FREUND; SCHAPIRE, 1999). When gradient descent procedure was combined with boosting technique, gradient boosting was created. Traditionally, gradient descent is used to minimize a set of parameters, such as the coefficients in a regression equation or weights in a neural network. After calculating the error or loss, the weights are updated to minimize that error. Instead of parameters, weak learner sub- models or more specifically decision trees are considered. After calculating the loss, to perform the gradient descent procedure, a tree must be added to the model to reduce the loss. This is done by parameterizing the tree, then the parameters of the tree are modified and the algorithm moves in the right direction by reducing the residual loss and ameliorating the prediction. A specific number of trees are added, or the training terminates, once loss reaches a desirable level or when the results no longer improve on an external validation data set (MASON et al., 1999; FRIEDMAN, 2001; CHIO; FREEMAN, 2018). 2.2.2 EXTREME GRADIENT BOOSTING XGBoost is a well engineered, distributed machine learning system to scale up tree boost- ing algorithms. It makes use of a novel regularization approach over the conventional Gradient 2.2 GRADIENT BOOSTING MACHINES 22 Boosting Machines (GBMs) to significantly decrease the complexity. The system is optimized for quick parallel tree construction and adapted to be fault tolerant under the distributed setting. It can handle millions of samples on a single node and manage billions of them with distributed computing (CHEN; GUESTRIN, 2016). 
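Before detailing the mathematics of the method, a minimal usage sketch is given below of how such a booster can be trained through the scikit-learn-style interface of the xgboost library. The synthetic data, the hyperparameter values and the Poisson objective shown here are illustrative placeholders only, not the configuration adopted in this work (the actual settings are presented in Chapter 4).

```python
import numpy as np
from xgboost import XGBRegressor

# Placeholder data: X holds explanatory variables, y holds hourly count targets.
X = np.random.rand(1000, 10)
y = np.random.poisson(lam=2.0, size=1000)

# Illustrative hyperparameters only; in this work they are chosen by grid search.
model = XGBRegressor(
    n_estimators=100,           # number of boosted trees added sequentially
    max_depth=4,                # maximum depth of each tree
    learning_rate=0.1,          # step size applied to each new tree
    objective="count:poisson",  # one suitable loss for non-negative count data
)
model.fit(X, y)
predictions = model.predict(X[:5])
```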
Furthermore, unlike conventional GBMs, Extreme Gradient Boosting makes use of a regularization method to considerably diminish the complexity of the model. Another notable advantage is the availability of a set of parameters that can be tuned during training on the input data to enhance the model being constructed (GUPTA; GUSAIN; POPLI, 2016). The way XGBoost works is described as follows.

a) Scoring function

In order to measure the performance of a model on a given data set, XGBoost defines an objective function composed of a training loss term $L(\theta)$ and a regularization term $\Omega(\theta)$, where the former measures how predictive the model is and the latter penalizes the complexity of the model and prevents overfitting. This is represented in equation (1), where $\theta$ denotes the parameters to be discovered during training. For the mathematical structure, XGBoost uses a model based on decision tree ensembles, where the prediction $\hat{y}_i$ for the $i$-th training example, out of a total of $n$, is the sum of the predictions of multiple trees. Equation (2) shows $K$ as the number of trees being combined, $f_k$ as a function in the functional space $\mathcal{F}$ and $x_i$ as the input. Thus, the evaluation function to be optimized becomes equation (3) (CHEN, 2014).

$$Obj(\theta) = L(\theta) + \Omega(\theta) \qquad (1)$$

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (2)$$

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad (3)$$

The aim is to find a trade-off between bias and variance; in other words, the model needs to be both simple and predictive. Simple, because it then tends to have smaller variance, making the prediction stable; and predictive, because it must fit the training data well. This can be visualized in Figure 1, where the upper-left image shows the distribution of a variable to be modelled, the upper-right image presents a model with high complexity, which results in overfitting, the bottom-left image illustrates a model that does not learn enough patterns, i.e., underfitting, and the bottom-right image shows a model with a good balance between overfitting and underfitting.

Figure 1 - Bias-Variance tradeoff in machine learning. Source: (CHEN, 2014)

b) Boosting

An additive strategy is applied during training: one new tree that improves the system is added to the model at a time. In equation (4), $\hat{y}_i^{(t)}$ denotes the model at training round $t$, $\hat{y}_i^{(t-1)}$ is the model kept from the previous round and $f_t(x_i)$ is the new function (CHEN, 2014).

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (4)$$

c) Regularization

To determine the complexity of a tree, $\Omega(f)$, (CHEN; GUESTRIN, 2016) proposed the definition in equation (5), where the first term $\gamma T$ accounts for the number of leaves $T$, with $\gamma$ a constant, and the second term computes the squared $L_2$ norm of the leaf scores $w_j$, with $\lambda$ a small constant value (CHEN, 2014).

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \qquad (5)$$

d) Final structure outcome

As an example of an estimator, the Mean Squared Error (MSE) is considered for the loss term; then, taking the second-order Taylor expansion of the loss, the objective function becomes equation (6), which determines how the splitting of the nodes is done.
$G_j$ and $H_j$ are defined in equations (7) and (8) respectively, where $g_i$ and $h_i$ are the first- and second-order partial derivatives obtained from the Taylor expansion, $I_j = \{i \,|\, q(x_i) = j\}$ is the set of indices of the data points assigned to the $j$-th leaf and $q(x)$ is the structure of the tree (CHEN, 2014).

$$obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T \qquad (6)$$

$$G_j = \sum_{i \in I_j} g_i \qquad (7)$$

$$H_j = \sum_{i \in I_j} h_i \qquad (8)$$

Finally, in the objective function, the argument of the minimum and the minimum of the quadratic function are taken for the single variable $w_j$, considering $q(x)$ as fixed. The outcomes are equations (9) and (10), the latter of which assesses how good a tree structure is, i.e., the smaller the score, the better the structure (CHEN, 2014).

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \qquad (9)$$

$$obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \qquad (10)$$

e) Greedy learning of the tree

In practice, it would be unmanageable to enumerate all possible tree structures and select the best one. Because of this, trees are built in a greedy way, starting from a tree with depth zero; then, one level is optimized at a time until the maximum depth is reached, that is, a split is added for each leaf node of the tree. The gain is expressed as follows:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \qquad (11)$$

In equation (11), the first term refers to the score of the left child; the second is the score of the right child; the third expresses the score if the split is not taken; and the last term accounts for the complexity cost of adding a leaf. From this, it follows that it is better not to add a split if the gain is smaller than $\gamma$. This technique is called pruning in tree-based models. In order to find the optimal split:

i) For each node, all possible features are enumerated.
ii) For each feature, instances are sorted by feature value.
iii) A linear scan is used to decide the best split along that feature. The scan is taken from left to right and the split is made where the values $g_i$ and $h_i$ give the best gain, calculated with equation (11).
iv) The best split solution is taken over all the features.

Finally, the $f_t(x)$ obtained with the new tree is added to the model, as shown in equation (12):

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \epsilon f_t(x_i) \qquad (12)$$

The $\epsilon$ is called the step size or learning rate and usually takes values around 0.1. A complete optimization is not done in each step; in this way, a chance of improvement is kept for future rounds, which helps prevent overfitting.

Figure 2 illustrates an example of learning a tree on a single variable, where the input variable is $t$ (time) and the goal is to predict a person's preference for romantic music at a given time. The first image shows the construction of the tree and the second one shows how the data are modelled through time in order to obtain the final function.

Figure 2 - Learning a tree on single variable. Source: (CHEN, 2014)

For more details about the power of the XGBoost method, its features and its algorithmic optimizations, (CHEN, 2014; CHEN; GUESTRIN, 2016; GUPTA; GUSAIN; POPLI, 2016) are strongly suggested to interested readers.

2.3 ARTIFICIAL NEURAL NETWORKS

2.3.1 OVERVIEW

ANNs were first mentioned in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts (MCCULLOCH; PITTS, 1988). They introduced an elementary computational model of how biological neurons in animal brains could work together to perform tasks through the use of propositional logic.
This representation was known as the first artificial neural network. During ANNs’ history, they passed for a long dark period in the course of the 1960s and 1970’s where they were set aside. At the beginning of the 1980’s, a revival interest in them returned allowing the development of new architectures and 2.3 ARTIFICIAL NEURAL NETWORKS 26 better training techniques. For the next decade, better results and theoretical foundations were achieved by powerful machine learning techniques. Nowadays, with the tremendous increase in computing power, the enormous amount of data available, finer training algorithms and some theoretical limitations that have become advantageous in practice have permitted ANNs to be in a continuous evolution with accelerated progress and more fundings (GÉRON, 2017). The basic structure of an ANN is compounded by small processing units or nodes with weighted associations which in the biological model represent neurons and the strength of the synapses between the neurons respectively. The network is stimulated by given inputs to some or all nodes, this effect is spread throughout the weighted connections in the network. The activation of an ANN node tries to model the average firing of the spikes observed as a result of the electrical activity in biological neurons (GRAVES, 2012). From among all ANNs architectures, there is a special distinction ANNs whose connections are acyclic called Feedforward Neural Networks (FNNs), and those whose associations form cycles called Recurrent Neural Networks (RNNs). In the case of FNNs, the most representative is the Multi-Layer Perceptron (MLP) and for RNNs is the Long Short-Term Memory (LSTM) that in last years has demonstrated a highlighted performance. These two representations are described as follows (GRAVES, 2012). 2.3.2 MULTI-LAYER PERCEPTRON An MLP works with a set of units called neurons organized in an input layer, one or more hidden layers and an output layer, all fully connected; each layer contains a bias neuron. Thus, inputs are propagated from the first layer, passing through hidden layers until reaching the output layer, this is called the forward pass of the network and could be observed in Figure 3. When a ANN has two or more hidden layers is called Deep Neural Network (DNN) (GRAVES, 2012; GÉRON, 2017). A brief description of how Multi-Layer Perceptron works is given in (GRAVES, 2012) as follows: An MLP with a particular set of weight values defines a function from input to output vectors. By altering the weights, a single MLP is capable of instantiating many different functions. Indeed it has been proven that an MLP with a single hidden layer containing a sufficient number of nonlinear units can approximate any continuous function on a com- pact input domain to arbitrary precision. For this reason, MLPs are said to be “universal function approximators”. (GRAVES, 2012, p.13) There are some principal characteristics that were born with MLP and must be considered in the modelling of an ANN: 2.3 ARTIFICIAL NEURAL NETWORKS 27 Figure 3 - A Multi-Layer Perceptron. The S-shaped curves in the hidden and output layers indicate the application of “sigmoidal” nonlinear activation functions. Source: (GÉRON, 2017) a) Loss function. Also called "Error Function", it is the function that is going to be minimized or maximized during the training phase, its selection depends on the particular application. 
For classification problems, the goal is to model the posterior probabilities of class membership conditioned on the input variables; the most utilized losses are the categorical cross-entropy and the binary cross-entropy. For regression problems, the basic goal is to model the distribution of the output variables, also conditioned on the input variables; the most used are the Mean Squared Error (MSE) and the Mean Absolute Error (MAE). In the case of modelling count data, whose characteristics are a discrete distribution and non-negative predicted values, the Poisson distribution is widely used (BISHOP, 1995; GRAVES, 2012; CHOLLET et al., 2015; LONG, 1997). In Keras, an open source neural network library written in Python, the Poisson loss function is calculated by equation (13):

$$Loss = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \log(\hat{y}^{(i)}) \right) \qquad (13)$$

Where:
n: number of samples.
$\hat{y}^{(i)}$: prediction for the $i$-th sample.
$y^{(i)}$: target of the $i$-th sample.

b) Backpropagation.

In 1986, D. E. Rumelhart et al. found a way to train MLPs by introducing the backpropagation algorithm in (RUMELHART; HINTON; WILLIAMS, 1986). During the training phase, each instance is used by the algorithm to feed the network and calculate the output of each neuron in each successive layer. The network's error is computed as the difference between the desired output and the actual output of the network; then, the contribution of each neuron in the last hidden layer to the output neurons' error is determined. This process continues backwards through each previous hidden layer until the input layer is reached. In this manner, the error gradient is measured efficiently during the reverse pass through all the connection weights and, finally, with the gradients calculated, the algorithm performs a gradient descent step on all the connection weights (GÉRON, 2017). According to (HAYKIN, 1998), at each iteration the weights can be updated following the generalized delta rule:

$$\Delta w_{ij}(n) = \alpha \, \Delta w_{ij}(n-1) + \varepsilon \, \delta_j(n) \, y_i(n) \qquad (14)$$

Where:
n: current iteration.
n-1: previous iteration.
$\Delta w_{ij}$: correction applied to the synaptic weight connecting neuron $i$ to neuron $j$.
$\alpha$: momentum.
$\varepsilon$: learning rate.
$\delta_j$: local gradient.
$y_i$: input signal of neuron $j$.

c) Problem of vanishing/exploding gradients.

During backpropagation in architectures with two or more layers, gradients often get smaller and smaller as the algorithm progresses down to the first layers. Consequently, the lower-layer connection weights are left practically unchanged by gradient descent and the training phase does not converge to a good solution; or the opposite happens, the gradients become bigger and bigger, the last layers receive very large weight updates and the algorithm diverges. In the literature, this is known as the vanishing or exploding gradients problem (GÉRON, 2017).

This problem was analyzed in (GLOROT; BENGIO, 2010). Ideally, to avoid the vanishing gradient, the variance of each layer's outputs should be equal to the variance of its inputs, and the gradients should also have equivalent variance before and after flowing through a layer in the reverse direction along the network. However, it is not possible to guarantee both; instead, what has worked very well in practice is a suitable random initialization of the connection weights. This is described in Table 1, where $n_{inputs}$ is the number of input connections and $n_{outputs}$ the number of output connections of the layer whose weights are being initialized (GÉRON, 2017).
Using the Xavier initialization procedure can accelerate training considerably; moreover, this method has been applied to deep learning with excellent results (GÉRON, 2017).

Table 1 - Initialization parameters for each type of activation function

Activation Function | Uniform Distribution [-r, r] | Normal Distribution
Logistic | $r = \sqrt{6/(n_{inputs}+n_{outputs})}$ | $\sigma = \sqrt{6/(n_{inputs}+n_{outputs})}$
Hyperbolic Tangent | $r = 4\sqrt{6/(n_{inputs}+n_{outputs})}$ | $\sigma = 4\sqrt{6/(n_{inputs}+n_{outputs})}$
ReLU (and its variants) | $r = \sqrt{2}\,\sqrt{6/(n_{inputs}+n_{outputs})}$ | $\sigma = \sqrt{2}\,\sqrt{6/(n_{inputs}+n_{outputs})}$

Source: (GÉRON, 2017)

d) Activation functions.

In order to work with backpropagation, the step activation function had to be replaced by the logistic function: gradient descent needs a surface with a usable slope, which the flat segments of the step function do not provide, while the logistic function has a nonzero derivative everywhere, allowing some progress at every step. Two other widespread activation functions are detailed in (GÉRON, 2017) as follows:

• The hyperbolic tangent function (Tanh). It is S-shaped, continuous and differentiable; its output values lie between -1 and 1, making the layer's output more or less normalized, which helps convergence.

• The rectified linear unit function (ReLU). It is continuous but not differentiable at zero; its output ranges from zero to infinity and it tends to be fast to compute.

For more details about the MLP architecture, (GRAVES, 2012; GÉRON, 2017; GLOROT; BENGIO, 2010; BISHOP, 1995) are recommended to interested readers.

2.3.3 LONG SHORT-TERM MEMORY NEURAL NETWORK (LSTM)

Unlike MLPs, which map input vectors to output vectors, the connections of an RNN can save a memory of previous inputs and keep it in the network's internal state, which impacts the output (GRAVES, 2012). This is similar to understanding a context based on previous situations, or completing a phrase using its first words as a basis. The recurrent process is illustrated in Figure 4. On the left side, a recurrent neural network with a single neuron is shown, which receives an input x, produces y and returns the outcome to itself. Generally, the activation function used is the hyperbolic tangent. On the right side of the picture, the recurrent network is unrolled, showing the recurrent neuron being fed by the inputs at different time steps; at each time step a result is obtained that comes back internally as an input. The part of the network that holds some state over time steps is called a Memory Cell. A single neuron or a layer of recurrent neurons is a Basic Cell (GRAVES, 2012; GÉRON, 2017).

Figure 4 - A recurrent neuron (left), unrolled through time (right). Source: (GÉRON, 2017)

There are four kinds of input and output sequence configurations in recurrent neural networks:

a) The network can receive a sequence of inputs and generate a sequence of outputs;
b) The network can take a sequence of inputs and ignore all outputs except for the last one;
c) The network can admit a single entry at the first time step, pad the remaining time steps with zeros and return a sequence; and
d) A sequence-to-vector network called an Encoder can be followed by a vector-to-sequence network called a Decoder.

However, standard recurrent networks can only access a limited range of context due to the vanishing or exploding gradient problem explained in section 2.3.2.
In (HOCHREITER; SCHMIDHUBER, 1997), it was presented an improved version of RNN called Long Short-Term Memory, which is an effective and scalable model for solving problems related to sequential data such as handwriting recognition, speech recognition, lan- guage modelling, human activity recognition and traffic prediction. LSTM was designed to get over error backflow problems by truncating the gradient without affecting the training process. It is described by the authors in the following: LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through “constant error carrousels” within special units. Multiplicative gate units learn to open and close access to the constant error flow. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms. (HOCHREITER; SCHMIDHUBER, 1997, p.1) The LSTM architecture presents a set of subnets connected in a recurrent way, named mem- ory blocks, each block has one or more auto-connected memory cells and three multiplicative 2.3 ARTIFICIAL NEURAL NETWORKS 31 units called “gates” for input, output and forget, which represents operations such as writing, reading and restart for the cells (GRAVES, 2012). In Figure 5, it is shown an LSTM cell, its state is divided into two: the short-term state h(t) and the long-term state c(t), the latter helps the network to recognize what to store, what to discard and what to read from it. As the long-term state c(t−1) travels the network from left to right, for each time step some memories are dropped when passing through the forget gate and new ones are added selected by the input gate via the addition operation, the resulting long-term state c(t) is duplicated, one copy is sent directly without any further change and another one is delivered to the tanh activation function and filtered by the output gate, producing the short-term state h(t). In this example, the resulting h(t) is equal to the cell’s output for this time step y(t) (GÉRON, 2017). Figure 5 - LSTM cell Source: (GÉRON, 2017) But, where do new memories come from and how do the gates affect them? i) An input vector x(t) and the previous short-term state h(t) are entries to four fully connected layers. ii) The principal layer is the one that outputs g(t) which analyzes the inputs and the previous short-term states. It is partially stored in the c(t). iii) The input, forget and output layers are known as “gate controllers”. They make use of the logistic activation function, where their output values are between 0 and 1, “0” means closed gate and “1” opened gate. The forget gate f(t) manages which memories should be dropped from the long-term state. The input gate i(t) controls the resultant flow of g(t) that 2.4 ADVANCED FEATURES 32 should be partially added to the long-term state. Finally, at a specific time step, the output gate o(t) determines which memories of the long-term state should be resulting. 
The equations to compute the whole LSTM process are:

$$i_{(t)} = \sigma(W_{xi}^{T} \cdot x_{(t)} + W_{hi}^{T} \cdot h_{(t-1)} + b_i) \qquad (15)$$

$$f_{(t)} = \sigma(W_{xf}^{T} \cdot x_{(t)} + W_{hf}^{T} \cdot h_{(t-1)} + b_f) \qquad (16)$$

$$o_{(t)} = \sigma(W_{xo}^{T} \cdot x_{(t)} + W_{ho}^{T} \cdot h_{(t-1)} + b_o) \qquad (17)$$

$$g_{(t)} = \tanh(W_{xg}^{T} \cdot x_{(t)} + W_{hg}^{T} \cdot h_{(t-1)} + b_g) \qquad (18)$$

$$c_{(t)} = f_{(t)} \otimes c_{(t-1)} + i_{(t)} \otimes g_{(t)} \qquad (19)$$

$$y_{(t)} = h_{(t)} = o_{(t)} \otimes \tanh(c_{(t)}) \qquad (20)$$

Where:
$W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$: weight matrices for the connections to the input vector $x_{(t)}$.
$W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$: weight matrices for the connections to the previous short-term state $h_{(t-1)}$.
$b_i$, $b_f$, $b_o$, $b_g$: bias terms for each of the four layers.

For more details about LSTM and its architecture, (HOCHREITER; SCHMIDHUBER, 1997; GRAVES, 2012; GÉRON, 2017; KALCHBRENNER; DANIHELKA; GRAVES, 2015) are suggested.

2.4 ADVANCED FEATURES

2.4.1 FEATURE EXTRACTION

In supervised learning, which is the type of learning used in this research, the samples are organized in feature/value pairs called predictors and one or more separate features, called targets or classes, which will be predicted based on the remaining features. Nevertheless, the source data are originally not structured like this; that is why a feature extraction process is needed to obtain the most useful features, whose values can be integer, float, string, categorical, etc., and to transform them into a format suitable for the learning process, for example by converting categorical features into numerical values with the One-Hot Encoding (OHE) method, where one attribute is set to "1" (hot) while the others are "0" (cold). Consequently, the training time is reduced (GARRETA; MONCECCHI, 2013).

In this research, the StandardScaler method from the Scikit-learn library (PEDREGOSA et al., 2011) and one-hot encoding from the Pandas library (MCKINNEY, 2010) are used to convert features.

• Standard scaler. It is a method that standardizes features by rescaling the distribution of values to zero mean and unit variance. The standardized outcome for each sample x is:

$$z = \frac{x - u}{s} \qquad (21)$$

Where:
u: mean of the training samples; when the parameter with_mean=False, u = 0.
s: standard deviation of the training samples; when the parameter with_std=False, s = 1.

After centering and scaling, the mean and standard deviation are stored so they can be applied to later data with the transform method. In Figure 6, as an example, a data set for identifying the type of iris plant is considered, using sepal length, sepal width, petal length and petal width as features and the type of the plant as the target. The initial set of features is transformed using the standard scaler method, resulting in the second data set.

Figure 6 - Standard scaler representation. Source: Developed by the author

• One hot encoder. It converts categorical variables into indicator features. For instance, the categorical variable "country" with the values ['Peru', 'Brazil', 'France'] can be encoded into a binary vector with 3 positions, having the value "1" in the position of the country that is present and "0" in the remaining positions. For a better visualization, see Figure 7.

Figure 7 - One hot encoder representation. Source: Developed by the author
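As a minimal sketch of these two transformations with Scikit-learn and Pandas, the snippet below standardizes two numerical columns and one-hot encodes a categorical one. The column names and values are hypothetical examples, not the actual features of the data set described in Chapter 3.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample: two numerical features and one categorical feature.
data = pd.DataFrame({
    "temperature": [2.5, 7.0, 11.3],
    "wind_speed": [10.0, 3.2, 5.5],
    "country": ["Peru", "Brazil", "France"],
})

# Standardize the numerical columns to zero mean and unit variance.
scaler = StandardScaler()
numeric = scaler.fit_transform(data[["temperature", "wind_speed"]])

# One-hot encode the categorical column with Pandas.
encoded = pd.get_dummies(data["country"], prefix="country")

print(numeric)
print(encoded)
```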
2.4.2 FEATURE SELECTION

Usually, it is tempting to use every available feature in the learning data set, since it is believed that using as much information as possible would build a better model. However, there are two main reasons why the number of considered features should be restricted. First, unimportant features may establish correlations between features and the target that arise just by chance and do not correctly model the problem; this may lead to poor generalization or add redundant information. Second, a large number of features can enormously increase the computation time without any improvement (GARRETA; MONCECCHI, 2013). As a result, working with a smaller set of features may lead to better results.

For this reason, an algorithmic way to discover the best features is required. Thus, in this work, the “Mutual Information Regression” and “Principal Component Analysis” (PCA) methods from the open source library Scikit-learn (PEDREGOSA et al., 2011) are used to select features. Both of them are evaluated and compared in order to find which method best captures the necessary features to be used as inputs for the models developed with XGBoost, MLP and LSTM.

• Mutual information regression. It is a non-parametric method that evaluates the mutual dependence between two variables: when the two variables are independent the output is zero, and higher outputs mean higher dependency (PEDREGOSA et al., 2011). The method is based on information theory, where the Mutual Information (MI) is the amount of uncertainty about a target variable that is removed by knowing a random variable, i.e., the MI is the amount of shared information, usually measured in units called bits (SHANNON, 1948). However, the way MI is calculated depends on the kind of variables, i.e., whether they are discrete or continuous. Continuous values are more difficult to deal with because they are sparsely sampled. In the special case where one variable is discrete and the other is continuous, which is the case of the data set used in this work, the method described in (ROSS, 2014), based on k-nearest neighbors, is used to overcome this problem. The approach computes the mutual information I(X,Y) as the average of Ii, where Ii is the Jensen-Shannon divergence (GROSSE et al., 2002) applied to each data point i. The data point i belongs to a list of pairs (x, y), where x ∈ X and y ∈ Y, considering X as the discrete variable and Y as the real-valued variable. Initially, the kth-nearest neighbor to the data point i is determined among the Nxi points, where the parameter k is a fixed integer and Nxi is the number of data points whose value of the discrete variable equals xi. Then, the total number of neighbors mi is counted, including those whose value of the discrete variable does not equal xi but that lie within the distance to the kth-nearest neighbor. N is the total number of data points and ψ(·) is the digamma function taken from (ABRAMOWITZ, 1974).

Ii = ψ(N) − ψ(Nxi) + ψ(k) − ψ(mi) (22)

I(X,Y) = ⟨Ii⟩ = ψ(N) − ⟨ψ(Nx)⟩ + ψ(k) − ⟨ψ(m)⟩ (23)

In Figure 8, the first column of the iris plant data set is extracted to illustrate how the k-nearest neighbors are taken in order to sample the distributions. Then, equation (23) is applied and the resulting mutual information is shown in the third part of the picture, where each iris category is represented by an integer number for the computation. For more details about the mathematical analysis of the procedure, see (ROSS, 2014). A usage sketch of this selection method with Scikit-learn is given below.

Figure 8 - Procedure for estimating the MI
Source: Developed by the author
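As an illustration of how this score can be obtained with Scikit-learn, the sketch below uses a small synthetic data set; the rescaling to [0, 1] and the 0.01 threshold mirror the procedure adopted later in this work, but the data, the feature count and the number of neighbors are invented for the example.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic example: 3 candidate features, only the first two carry information about y.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + 0.1 * rng.normal(size=500)

# Scores are non-negative; features independent of y score close to zero.
scores = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
scores = scores / scores.max()           # rescale to [0, 1], as done later in this work
selected = np.where(scores > 0.01)[0]    # keep features above the 0.01 threshold
print(np.round(scores, 3), selected)
```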
• Principal component analysis. It is a data compression technique. It makes use of linear algebra, through the Singular Value Decomposition (SVD), to find the orthogonal directions of greatest variance, i.e., it discovers combinations of the data that retain the most information. SVD decomposes any matrix A into three pieces, U, Σ and V^t, where the matrix U contains the eigenvectors giving the most meaningful directions of the data and the matrix Σ describes the variance in the data. This is represented in equation (24). In the Scikit-learn library (PEDREGOSA et al., 2011), the attribute n_components_ is the number of components selected and the attribute explained_variance_ratio_ represents the percentage of variance explained by each selected component.

A = U Σ V^t (24)

Where:
U: left singular vectors (orthogonal matrix).
V^t: right singular vectors (orthogonal matrix).
Σ: singular values (diagonal matrix).

Thus, the data dimension can be reduced by generating transformed variables, called components, from the original ones. As an example, and continuing with the iris plant data set, the second data set in Figure 9 shows that 90% of the variance is concentrated in two principal components, which can be taken as a new data set. For more details about Singular Value Decomposition, see (STRANG, 2016), and about using the PCA technique, see (PEDREGOSA et al., 2011; GÉRON, 2017).

Figure 9 - Transformation with PCA
Source: Developed by the author

2.4.3 MODEL SELECTION

Selecting the model parameters, known as hyperparameters, is another important step during training. In this research, the GridSearchCV method from Scikit-learn is used to select the best XGBoost model, while for the MLP and LSTM neural networks a random search is implemented. A brief usage sketch of both search strategies is given after the following descriptions.

• Grid search cross-validation. It is a method that performs an exhaustive search, through cross-validation, over specified parameter values for an estimator. This means that the classifier or estimator is trained for each combination of values and a cross-validation score is obtained for each one; the results can then be compared and the best parameters identified (PEDREGOSA et al., 2011; GARRETA; MONCECCHI, 2013). Commonly, it is used with three or fewer hyperparameters, since its computational cost grows exponentially as the number of hyperparameters increases (GOODFELLOW; BENGIO; COURVILLE, 2016).

• Random search. Another method to optimize hyperparameters is random search, which samples each hyperparameter from a marginal distribution, for example a discrete distribution for binary or categorical hyperparameters, or a uniform distribution on a log scale for positive real-valued ones. The main reason why random search converges to good values much faster than grid search is that it does not waste experimental executions: sampled configurations usually differ in the value of every hyperparameter, unlike the repeated values explored by a grid (GOODFELLOW; BENGIO; COURVILLE, 2016).
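The sketch below illustrates both strategies with Scikit-learn on a synthetic regression problem. The estimator (a GradientBoostingRegressor used as a stand-in) and the parameter ranges are illustrative only; they are not the grids and distributions actually used in Chapter 4.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=300)

est = GradientBoostingRegressor(random_state=0)

# Exhaustive search: every combination of the listed values is cross-validated.
grid = GridSearchCV(est, param_grid={"max_depth": [3, 5], "learning_rate": [0.05, 0.1]}, cv=3)
grid.fit(X, y)

# Random search: a fixed number of configurations is sampled from distributions.
rand = RandomizedSearchCV(
    est,
    param_distributions={"max_depth": [3, 4, 5, 6], "learning_rate": loguniform(1e-3, 1e-1)},
    n_iter=10, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```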
3 DATA PREPROCESSING

3.1 DATA COLLECTION

The departmental fire and rescue service, SDIS 25, in the Doubs region of France, has provided a set of information collected from 2012 to 2017 containing 195 628 interventions. This data file contains: the identifier code of an intervention, the date and time of the start and end of an intervention, the date and time at which the first engine arrives, the community where the intervention happens, the classification of the incident, the response time and the duration of the intervention.

From the list of interventions gathered, the date and time were extracted in order to detect tendencies correlated with these parameters. For instance, the number of car accidents increases on Saturday night because young people tend to drink more alcohol during this period of time. Other factors considered in the occurrence of incidents are the weather conditions, which significantly affect the number of road accidents, fires and casualties; traffic hours; the height of the main rivers in Doubs; epidemiological data; academic vacations; holidays; and moonrise, moonset and moon phase. All of these are used to predict the number of interventions that will occur in the next hour. Therefore, a dictionary was first created gathering the data extracted from the fire department together with supplementary data imported from other sources. The process is explained as follows:

• The dictionary is initialized with keys ranging from “01/01/2012 00:00:00” until “31/12/2017 23:00:00” in the form “YYYYMMJJhhmmss”. The keys are generated by blocks of one hour.

• The following weather-related data reported by “Meteo France” (FRANCE, 2019) were imported from three stations located in Dijon-Longvic, Bale-Mulhouse and Nancy-Ochey: temperature, pressure, pressure variation over one hour, barometric trend type, total cloudiness, humidity, dew point, precipitation in the last hour, precipitation in the last three hours, average wind speed over ten minutes, average wind direction over ten minutes, gusts over a period, horizontal visibility and, finally, the present weather conditions. However, the data were not complete and some fields were missing; for this reason, linear interpolation was applied to fill the gaps. Finally, the meteorological data were added to the dictionary.

• Various temporal information was introduced: hour, day, day of the week, day of the year, month and year.

• The height of twelve rivers, the most representative of the Doubs department, was considered. The data reported by “Hydro” (HYDRO, 2015) come from the stations: L’Allan à Courcelles-lès-Montbéliard, Le Doubs à Voujeaucourt, Le Doubs à Besançon, La Loue à Ornans, L’Ognon à Montessaux, L’Ognon à Bonnal, Le Dessoubre à Saint-Hippolyte, Le Doubs à Mouthe, Le Doubs à Mathay, Le Drugeon à Rivière Drugeon, Le Gland à Meslières and La Loue à Vuillafans. The dictionary was filled with the average of the readings closest to the time of the block considered, the standard deviation of the variation of the water height in this block, the number of readings during this block, the maximum height reached during this block of time, and an alert flag equal to 1 (true) if the height of the river exceeds a limit established as a flood alert, or 0 (false) otherwise.

• From the list of interventions given by the fire brigade department, the interventions were organized according to their date of occurrence, grouping them by periods of one hour before adding them to the dictionary.

• Holidays were considered as a binary variable initialized with 0 (false) and set to 1 (true) for any one-hour block within an academic holiday period. The start and end of vacations were also contemplated as a binary variable, equal to 1 (true) for the days corresponding to the beginning and end of holiday periods and 0 (false) otherwise.

• Public holidays were added with values 1 or 0, for true or false respectively, as well as a second key that is set to 1 on the days before public holidays, for the hours ranging from 3:00 pm to 11:00 pm (otherwise 0).
• Information related to “Bison Futé” (FUTÉ, 2015) was included. It is a system put in place in France to communicate to motorists the recommendations of the public authorities regarding traffic, traffic jams, bad weather, accidents, advice, etc. It classifies the days at risk according to several colors: green = fluid traffic, orange = dense traffic, red = difficult traffic, black = to be avoided because of traffic jams and slow traffic. This information was integrated through two additional keys, related to departures and returns, whose values are 0, 1, 2 or 3 depending on whether the traffic forecast corresponds to green, orange, red or black.

• Weekly epidemiological information related to the incidence of chickenpox, influenza and acute diarrhea, collected from the “Sentinelles” network (SENTINELLES, 2007), was incorporated and organized by each given hour.

• Finally, a boolean variable was added to the dictionary to indicate whether it is day (0) or night (1) for each given hour. Moreover, another boolean variable was added to indicate whether the moon had already risen, determined considering the given hour plus thirty minutes, together with its phase (an integer from 0 to 7, namely 0 for new moon, 2 for the first quarter, 4 for the full moon and 6 for the last quarter).

Table 2 shows what the dictionary looks like. It contains the information extracted from the list of interventions (number of interventions, hour, day, etc.) and the information imported from external sources (meteorological data, river heights, ephemeris, traffic, diseases, vacations, etc.). Each line represents a block of one hour, so over the period of six years each day contributes twenty-four lines, one per hour.

Table 2 - Illustrated example of the dictionary

ID    | year | startEndHolidays | ... | windDirectionBasilea | humidityDijon | temperatureNancy | nbInterventions
0     | 2012 | 0                | ... | 240                  | 97            | 283.35           | 7
1     | 2012 | 1                | ... | 216                  | 96            | 283.38           | 10
2     | 2012 | 0                | ... | 193                  | 95            | 283.45           | 9
...   | ...  | ...              | ... | ...                  | ...           | ...              | ...
52560 | 2017 | 0                | ... | 200                  | 96            | 277.45           | 10

Source: Developed by the author
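As an illustration of the construction described above, the following sketch builds a small hourly-indexed table with Pandas, fills a hypothetical weather column by linear interpolation and counts interventions per one-hour block; all column names and values are invented for the example and are not the actual keys of the dictionary.

```python
import numpy as np
import pandas as pd

# Hourly keys from 01/01/2012 00:00 to 31/12/2017 23:00, one row per one-hour block.
hours = pd.date_range("2012-01-01 00:00", "2017-12-31 23:00", freq="H")
table = pd.DataFrame(index=hours)

# Hypothetical weather series with gaps, filled by linear interpolation as in the text.
rng = np.random.default_rng(3)
temperature = pd.Series(rng.normal(283.0, 5.0, len(hours)), index=hours)
temperature[rng.random(len(hours)) < 0.05] = np.nan   # simulate missing readings
table["temperatureDijon"] = temperature.interpolate(method="linear")

# Temporal variables derived from the key itself.
table["hour"] = table.index.hour
table["dayOfWeek"] = table.index.dayofweek

# Hypothetical intervention timestamps, grouped by one-hour blocks.
interventions = pd.to_datetime(["2012-01-01 00:15", "2012-01-01 00:40", "2012-01-01 02:05"])
counts = pd.Series(1, index=interventions).resample("H").sum()
table["nbInterventions"] = counts.reindex(hours, fill_value=0)
print(table.head())
```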
3.2 DATA CLEANING

This subsection explains how outliers, which can negatively affect the final results, were detected and removed. First, the number of interventions per year was analyzed in Figure 10, where a large increase over the six years can be observed, possibly related to population ageing and growth. Then, the mean number of interventions per hour was calculated: the average is 3.59 interventions/h. Moreover, the minimum number of interventions per hour is 0, while the maximum is 85. Finally, in 75% of cases the number of interventions is less than 5.

Figure 10 - Number of interventions per year
Source: Developed by the author

Figure 11 - Frequency of number of interventions per hour
Source: Developed by the author

Figure 12 - Outliers in the occurrence of interventions
Source: Developed by the author

Leap years have an impact on the day-of-the-year variable. For instance, June 21st (Music Day in France) or July 14th (the National Day of France) are not the same day of the year when February has 29 days. For this reason, February 29th of 2012 and 2016 were removed.

On the other hand, looking at Figure 11 and, in more detail, Figure 12, there seem to be some very particular situations that generated a large number of interventions (more than 80). These events can affect the learning phase; therefore, they were analyzed in more detail to decide whether or not they should be discarded. The hours, with IDs ranging from 0 to 52 559, were sorted in descending order according to their corresponding number of interventions, and it was noticed that the 7 IDs with the highest number of interventions correspond to the same period of time.

The ID number 39243 has the maximum number of interventions, that is 85, as illustrated in Figure 13. The year, month, day and hour of these 7 IDs were listed, and they all belong to the night of the 24th to the 25th of June 2016. In the list of interventions given by the fire department, the following main causes were noted for these days: exhaustion, floods, protection of miscellaneous property, and accidents. It was found that there were very violent storms that night, which were recognized as a natural disaster in the region of Doubs. Therefore, it is necessary to evaluate whether these outliers should be considered a consequence of exceptional weather and be artificially smoothed, or whether they should be kept so that they can be predicted with the weather variables. In what follows, a forecasting analysis of these peaks is made with meteorological data from Basilea, Dijon and Nancy.

Figure 13 - Maximum number of interventions
Source: Developed by the author

An interval of one hundred hours (approximately four days) centered around this storm is taken. When checking the precipitation in the last hour (Figure 14), a rainfall peak of almost four millimeters can be observed, far from the eighty millimeters mentioned in the article from the “Est Republicain” (REPUBLICAIN, 2016) shown in Figure 15. If the IDs of the intervention peaks are compared with the IDs of the maximum precipitations recorded by the weather station in Basilea (Figure 16), they do not appear to be unusual values. Although Basilea is closer to the storm location (between Sancey and L’Isle-sur-le-Doubs, Figure 17) than Dijon or Nancy, it is possible that its precipitation values are not very representative.

Nevertheless, it is possible that a large amount of water fell over a relatively long time. Therefore, Figure 18 shows the precipitation over a period of three hours. A thunderstorm peak appears clearly, and the twenty millimeters dropped are closer to the eighty millimeters mentioned above. It was then checked whether such a quantity, twenty millimeters, is frequent, and it turns out to be the fifth highest three-hour rainfall recorded from 2012 to 2016.

These precipitation data are very important for the prediction model, but they do not allow the prediction of the extreme situation of June 25th, 2016, because the weather measurements are not sufficiently localized. Another variable analyzed was the wind speed; however, the maximum value in the defined time interval (Figure 19) is below 15.9 m/s, which is the maximum wind speed recorded from 2012 to 2017. With the weather data considered at the moment, forecasting such a peak of interventions looks complicated. This kind of situation is very exceptional and has a big impact on the data studied; for instance, the number of incidents in June might be considerably overestimated. For this reason, the chosen option is to artificially smooth the data on this date by assigning the number of interventions observed at the same time on the following day, as sketched below.
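A minimal Pandas sketch of this smoothing step is shown below, assuming an hourly intervention series; the dates correspond to the storm night discussed above, but the values and the exact window are illustrative.

```python
import numpy as np
import pandas as pd

# Hourly series with a hypothetical intervention count.
hours = pd.date_range("2016-06-23 00:00", "2016-06-27 23:00", freq="H")
rng = np.random.default_rng(4)
nb = pd.Series(rng.poisson(3.6, len(hours)), index=hours, name="nbInterventions")
nb["2016-06-24 22:00":"2016-06-25 03:00"] += 60   # simulate the storm peak

# Smooth the outlier hours by copying the counts observed 24 hours later.
outlier_hours = nb["2016-06-24 22:00":"2016-06-25 03:00"].index
nb.loc[outlier_hours] = nb.loc[outlier_hours + pd.Timedelta(hours=24)].to_numpy()
print(nb["2016-06-24 20:00":"2016-06-25 05:00"])
```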
Figure 14 - Precipitation each 1h
Source: Developed by the author

Figure 15 - L’Est Republicain reports on the storms
Source: (REPUBLICAIN, 2016)

Figure 16 - Precipitation peaks recorded at the Basilea station
Source: Developed by the author

Figure 17 - Storm area
Source: Developed by the author

Figure 18 - Precipitation each 3h
Source: Developed by the author

Figure 19 - Ten minutes average wind speed
Source: Developed by the author

3.3 VARIABLES ENCODING

First, the data were divided into two sets: learning (years 2012-2016) and testing (year 2017). The StandardScaler method was used to transform the numerical variables: the mean and the standard deviation were computed on the learning set, and the standardization by centering and scaling was then applied to the complete data set, i.e., to both the learning and testing sets. This is done in order to capture, during the training process, changes in the data through time. The numerical variables considered were: year, hour, wind direction, humidity, nebulosity, moon phase, dew point, precipitation, gusts, temperature, visibility, wind speed, chickenpox statistics, influenza statistics, acute diarrhea statistics, and the river height variables except for the alert variable. More details are presented in Table 3.

In the case of the categorical variables, they were encoded with the OneHotEncoder method. OHE was fitted to the complete data set in order to recognize all the categories of each variable, and the data were then transformed with the identified pattern. The categorical variables considered were: the Bison Futé variables, time variables such as day, day of the week, day of the year and month, the holiday indicator, the night indicator, the barometric trend, and the river height alert variables. These variables are shown in Table 4.

Finally, the target feature “nbInterventions” was not standardized, since, after several preliminary tests, better results were obtained with the original values. In total, 831 features were extracted. A sketch of this encoding procedure is given below.
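The sketch below outlines this encoding procedure, assuming the hourly dictionary is available as a Pandas DataFrame; the column names passed to the function are illustrative and cover only a small subset of the real variables.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def encode(data: pd.DataFrame, numeric_cols, categorical_cols):
    """Standardize numeric columns and one-hot encode categorical ones as described in the text."""
    learning = data[data["annee"] <= 2016]          # learning set: years 2012-2016

    # Scaler statistics are computed on the learning set only...
    scaler = StandardScaler().fit(learning[numeric_cols])
    # ...but the standardization is applied to the complete data set (learning + testing).
    scaled = pd.DataFrame(scaler.transform(data[numeric_cols]),
                          columns=numeric_cols, index=data.index)

    # One-hot encoding is fitted on the complete set so every category is represented.
    dummies = pd.get_dummies(data[categorical_cols].astype("category"))

    # The target is concatenated without standardization.
    return pd.concat([scaled, dummies, data[["nbInterventions"]]], axis=1)

# Illustrative call, assuming 'table' holds the hourly dictionary with these columns:
# encoded = encode(table, ["temperatureDijon", "humiditeDijon"], ["jourSemaine", "nuit"])
```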
Table 3 - Numerical variables Variable : Description annee : year varicelle_inc : number of registered incidents for chickenpox heure : hour varicelle_inc100 : number of registered incidents for chickenpox directionVentBale : wind direction in Basilea varicelle_inc100_low : number of registered incidents for chickenpox directionVentDijon : wind direction in Dijon varicelle_inc100_up : number of registered incidents for chickenpox directionVentNancy : wind direction in Nancy varicelle_inc_low : number of registered incidents for chickenpox humiditeBale : humidity in Basilea hLoueVuillafansMax : maximum records value of Loue river’s height at Vuillafans humiditeDijon : humidity in Dijon hAllanCourcellesMean : records average of Allan river’s height at Courcelles humiditeNancy : humidity in Nancy hDoubsBesanconMean : records average of Doubs river’s height at Besançon nebulositeBale : nebulosity in Basilea hDoubsVoujeaucourtMean : records average of Doubs river’s height at Voujeaucourt phaseLune : moon phase hLoueOrnansMean : records average of Loue river’s height at Ornans pointRoseeBale : dew point in Basilea hOgnonBonnalMean : records average of Ognon river’s height at Bonnal pointRoseeDijon : dew point in Dijon hOgnonMontessauxMean : records average of Ognon river’s height Montessaux pointRoseeNancy : dew point in Nancy hAllanCourcellesStd : records std of Allan river’s height at Courcelles precipitations1hBale : precipitation in the last 1h in Basilea hDoubsBesanconStd : records std of Doubs river’s height at Courcelles precipitations1hDijon : precipitation in the last 1h in Dijon hDoubsVoujeaucourtStd : records std of Doubs river’s height at Voujeaucourt precipitations1hNancy : precipitation in the last 1h 1h in Nancy hLoueOrnansStd : records std of Loue river’s height at Ornans precipitations3hBale : precipitation in the last 3h in Basilea hOgnonBonnalStd : records std of Ognon river’s height at Bonnal precipitations3hDijon : precipitation in the last 3h in Dijon hOgnonMontessauxStd : records std of Ognon river’s height at Montessaux precipitations3hNancy : precipitation in the last 3h in Nancy hAllanCourcellesNb : records number of Allan river’s height at Courcelles pressionBale : pressure in Basilea hDoubsBesanconNb : records number of Doubs river’s height at Besancon pressionDijon : pressure in Dijon hDoubsVoujeaucourtNb : records number of Doubs river’s height at Voujeaucourt pressionNancy : pressure in Nancy hLoueOrnansNb : records number of Loue river’s height at Ornans pressionMerBale : pressure at sea level in Basilea hOgnonBonnalNb : records number of Ognon river’s height at Bonnal pressionMerDijon : pressure at sea level in Dijon hOgnonMontessauxNb : records number of Ognon river’s height at Montessaux pressionMerNancy : pressure at sea level in Nancy hAllanCourcellesMax : maximum records value of Allan river’s height at Courcelles pressionVar3hBale : pressure variation in the last 3h in Basilea hDoubsBesanconMax : maximum records value of Doubs river’s height at Besancon pressionVar3hDijon : pressure variation in the last 3h in Dijon hDoubsVoujeaucourtMax : maximum records value of Doubs river’s height at Voujeaucourt pressionVar3hNancy : pressure variation in the last 3h in Nancy hLoueOrnansMax : maximum records value of Loue river’s height at Ornans rafalesSur1perBale : gusts in Basilea hOgnonBonnalMax : maximum records value of Doubs river’s height at Besancon rafalesSur1perDijon : gusts in Dijon hOgnonMontessauxMax : maximum records value of Doubs river’s height at 
Besancon rafalesSur1perNancy : gusts in Nancy hDessoubreHippoMean : records average of Dessoubre river’s height at Saint-Hippolyte temperatureBale : temperature in Basilea hDoubsMathayMean : records average of Doubs river’s height at Mathay temperatureDijon : temperature in Dijon hDoubsMoutheMean : records average of Doubs river’s height at Mouthe temperatureNancy : temperature in Nancy hDrugeonRiviereMean : records average of Drugeon river’s height at La Rivière visibiliteBale : horizontal visibility in Basilea hGlandMeslieresMean : records average of Gland river’s height at Meslières visibiliteDijon : horizontal visibility in Dijon hLoueVuillafansMean : records average of Loue river’s height at Vuillafans visibiliteNancy : horizontal visibility in Nancy hDessoubreHippoStd : records std of Dessoubre river’s height at Saint-Hippolyte vitesseVentBale : 10 min average wind speed in Basilea hDoubsMathayStd : records std of Doubs river’s height at Mathay vitesseVentDijon : 10 min average wind speed in Dijon hDoubsMoutheStd : records std of Doubs river’s height at Mouthe vitesseVentNancy : 10 min average wind speed in Nancy hDrugeonRiviereStd : records std of Drugeon river’s height at La Rivière diarrhee_inc : number of registered incidents for diarrhea hGlandMeslieresStd : records std of Gland river’s height at Meslières diarrhee_inc100 : number of registered incidents for diarrhea hLoueVuillafansStd : records std of Loue river’s height at Vuillafans diarrhee_inc100_low : number of registered incidents for diarrhea hDessoubreHippoNb : records number of Dessoubre river’s height at Saint-Hippolyte diarrhee_inc100_up : number of registered incidents for diarrhea hDoubsMathayNb : records number of Doubs river’s height at Mathay diarrhee_inc_low : number of registered incidents for diarrhea hDoubsMoutheNb : records number of Doubs river’s height at Mouthe diarrhee_inc_up : number of registered incidents for diarrhea hDrugeonRiviereNb : records number of Drugeon river’s height at La Rivière grippe_inc : number of registered incidents for influenza hGlandMeslieresNb : records numberof Gland river’s height at Meslières grippe_inc100 : number of registered incidents for influenza hLoueVuillafansNb : records number of Loue river’s height at Vuillafans grippe_inc100_low : number of registered incidents for influenza hDessoubreHippoMax : maximum records value of Dessoubre river’s height at Saint-Hippolyte grippe_inc100_up : number of registered incidents for influenza hDoubsMathayMax : maximum records value of Doubs river’s height at Mathay grippe_inc_low : number of registered incidents for influenza hDoubsMoutheMax : maximum records value of Doubs river’s height at Mouthe grippe_inc_up : number of registered incidents for influenza hDrugeonRiviereMax : maximum records value of Drugeon river’s height at La Rivière varicelle_inc_up : number of registered incidents for chickenpox hGlandMeslieresMax : maximum records value of Gland river’s height at Meslières Source: Developed by the author 3.4 DATA SET DIMENSIONALITY REDUCTION The standardized data set used presents a high dimension: 831 features. For this reason, the methods: Mutual information regression and PCA were tested in a parallel way to evaluate and record the most relevant characteristics and use the regarded variables (outcomes from MI) or components (outcomes from PCA) to optimize the modelling process in time and memory without missing the integrity of the data. 
Naturally, it is necessary to bear in mind that features that look unimportant in isolation, i.e., when applying Mutual Information Regression or PCA, might be important in combination, i.e., during the modelling.

Table 4 - Categorical variables

Variable : Description
bisonFuteDepart : departure traffic level
bisonFuteRetour : return traffic level
debutFinVacances : start/end of vacations
ferie : public holidays
jourSemaine : day of the week
jour : day
jourAnnee : day of the year
mois : month
luneApparente : whether the moon has risen
nuit : day or night
tempsPresentBale : weather conditions in Basilea
tempsPresentDijon : weather conditions in Dijon
tempsPresentNancy : weather conditions in Nancy
tendanceBaromBale : barometric trend type in Basilea
tendanceBaromDijon : barometric trend type in Dijon
tendanceBaromNancy : barometric trend type in Nancy
vacances : vacations
veilleFerie : days before public holidays
hAllanCourcellesAle : warning for Allan river’s height at Courcelles
hDoubsBesanconAle : warning for Doubs river’s height at Besançon
hDoubsVoujeaucourtAle : warning for Doubs river’s height at Voujeaucourt
hLoueOrnansAle : warning for Loue river’s height at Ornans
hOgnonBonnalAle : warning for Ognon river’s height at Bonnal
hOgnonMontessauxAle : warning for Ognon river’s height at Montessaux
hDessoubreHippoAle : warning for Dessoubre river’s height at Saint-Hippolyte
hDoubsMathayAle : warning for Doubs river’s height at Mathay
hDoubsMoutheAle : warning for Doubs river’s height at Mouthe
hDrugeonRiviereAle : warning for Drugeon river’s height at Rivière
hGlandMeslieresAle : warning for Gland river’s height at Meslières
hLoueVuillafansAle : warning for Loue river’s height at Vuillafans

Source: Developed by the author

To obtain the Mutual Information Regression scores, the entire data set with the 831 extracted features was processed. The variables were divided into features and target, and 800 neighbors were taken into account to calculate the scores. The outcomes were rescaled between 0 and 1. Finally, a selection threshold of 0.01 was established, i.e., all features with scores higher than 0.01 are considered as inputs to the models developed later. Hence, 56 features plus the target were taken to train the models. The selected features with their respective scores are shown in Table 5.

PCA was used as an alternative method to reduce the dimension of the standardized data set. The model was fitted on the learning set, retaining 95% of the variance. The transformation was then applied to both data sets (learning and testing); applying this dimensionality reduction to the data set with the 831 extracted features resulted in 83 principal components plus the target. The division into learning and testing sets during the PCA fitting process corresponds to a real-world scenario, where initially the testing set would not exist and the model would be fitted with the learning set only; eventually, the future data (the testing set) would be transformed with the fitted model. In contrast, with Mutual Information Regression it is possible to use the complete data to quantify the information shared between variables (because samples are taken within the process only for the selection of variables), and there is no need to transform their values. As a summary of this part of the process, the differences between both methods are listed in Table 6, and a sketch of both reduction procedures is given below.
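The following sketch outlines both reduction procedures as described above, on synthetic matrices standing in for the learning and testing sets; the 0.01 threshold and the 95% retained variance follow the text, while the feature count and the neighbor count are reduced for the toy data (800 neighbors and 831 features in the real case).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(5)
X_learning = rng.normal(size=(1000, 20))   # stands in for years 2012-2016
X_testing = rng.normal(size=(200, 20))     # stands in for year 2017
X_all = np.vstack([X_learning, X_testing])
y_all = rng.poisson(np.exp(1.0 + 0.3 * X_all[:, 0]))   # toy hourly intervention counts

# Mutual Information Regression: scored on the complete data set, rescaled to [0, 1],
# then features kept if their score exceeds the 0.01 threshold.
scores = mutual_info_regression(X_all, y_all, n_neighbors=3, random_state=0)
scores = scores / scores.max()
X_mir = X_all[:, scores > 0.01]

# PCA: fitted on the learning set only, retaining 95% of the variance,
# then applied to both the learning and testing sets.
pca = PCA(n_components=0.95).fit(X_learning)
X_learning_pca = pca.transform(X_learning)
X_testing_pca = pca.transform(X_testing)
print(X_mir.shape[1], pca.n_components_)
```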
3.4 DATA SET DIMENSIONALITY REDUCTION 47 Table 5 - List of features after applying mutual information regression selection Feature : Mutual information regression score heure : 1.000000 luneApparente_1 : 0.027064 humiditeBale : 0.316402 nuit_1 : 0.026946 humiditeDijon : 0.230820 luneApparente_0 : 0.023712 humiditeNancy : 0.195237 pressionVar3hDijon : 0.021664 temperatureBale : 0.148263 ferie_0 : 0.018265 temperatureDijon : 0.144453 directionVentNancy : 0.018011 temperatureNancy : 0.122189 pointRoseeDijon : 0.014356 rafalesSur1perBale : 0.114104 tempsPresentNancy_0 : 0.014349 tempsPresentBale_0 : 0.091858 ferie_1 : 0.013768 rafalesSur1perNancy : 0.087239 hDrugeonRiviereMean : 0.013719 rafalesSur1perDijon : 0.080465 pointRoseeBale : 0.013154 veilleFerie_1 : 0.078794 tempsPresentDijon_0 : 0.013146 veilleFerie_0 : 0.077291 hDrugeonRiviereMax : 0.013036 vitesseVentBale : 0.066838 hDessoubreHippoMax : 0.012802 directionVentDijon : 0.064387 tempsPresentDijon_10 : 0.012493 visibiliteBale : 0.059371 tempsPresentBale_10 : 0.012339 vitesseVentNancy : 0.057049 precipitations3hBale : 0.012093 vitesseVentDijon : 0.047521 hDessoubreHippoMean : 0.011828 hDoubsBesanconNb : 0.046924 hDoubsBesanconMean : 0.011741 directionVentBale : 0.045515 hDoubsBesanconMax : 0.011203 visibiliteNancy : 0.042089 tendanceBaromBale_1 : 0.011116 visibiliteDijon : 0.039114 hOgnonBonnalMax : 0.010492 hDoubsBesanconStd : 0.036746 pointRoseeNancy : 0.010385 annee : 0.034604 hOgnonBonnalMean : 0.010360 tempsPresentBale_2 : 0.032631 pressionVar3hNancy : 0.010296 nebulositeBale : 0.032262 hDoubsMoutheMax : 0.010232 nuit_0 : 0.028023 tendanceBaromDijon_1 : 0.010222 pressionVar3hBale : 0.027969 tempsPresentNancy_10 : 0.010109 Source: Developed by the author Table 6 - Differences between MIR and PCA Mutual Information Regression Principal Component Analysis 1. It is a non-parametric method, i.e, it does not need for previ- ous standardization. However, in order to keep the same treat- ment of data to both methods, the standardization was made. 1. Standardization is needed to avoid emphasizing variables with higher variances than variables with low variances while searching for principal components. 2. Its base is the calculus of kth-neighbors and mutual infor- mation. 2. Its base is the calculus of SVD (linear algebra). 3. Relevant variables are choosen. 3. The variables and its values are transformed. 4. The new data set is reduced with the selected variables. 4. The new data set is made up of the new values (principal components). Source: Developed by the author 48 4 BUILDING PREDICTION MODELS After collecting, cleaning, encoding and selecting the features, data were split into three sets for the three kind of techniques: the training set (2011-2015), the validation set (2016) and the testing set (2017). The first and second sets are used to build a model that will predict the number of interventions for each one hour block since it is needed an immediate response to firefighters, and the last set is used to verify the accuracy of these predictions. LSTM works with time steps, in this way, it was used three, twelve and twenty-four hours as time steps with the goal of discovering which quantity of past hours give us better results. On the other hand, it was built a "similar" structure for XGBoost and MLP techniques which consists in defining a number of past interventions as new features for the next hour prediction in order to try to give an artificial memory to the models. 
The organization of the features for each technique is depicted in Figure 20. Figure 20 - Features structure before modelling Source: Developed by the author The programming language used was Python, the gradient boosting models and the artifi- cial neural networks were developed with XGBoost (CHEN; HE; KHOTILOVICH, 2016) and Keras (CHOLLET et al., 2015) libraries respectively. In order to run the codes, it was employed a GeForce GTX TITAN X, Intel® Xeon® CPU E5-2623 v4 @2.60GHz with an architecture 4.1 MODELLING WITH XGBOOST 49 x86_64. 4.1 MODELLING WITH XGBOOST In the training phase of the XGBoost model, the hyperparameters specified for the Grid- SearchCV were "max_depth": [3, 4, 5, 6, 7, 8], "learning_rate": [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08], "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1], "n_estimators": [100, 200, 300, 400] and "objective": "count:poisson". The "count:poisson" corresponds to poisson regression for count data and uses "poisson-nloglik" as an internal evaluation metric where the prediction will be better if the probability of occurrence is higher. After finishing the execution on the server, the hyperparameters of the best model were tuned by hand in order to enhance the result. The values used to model our data are specified in Table 7. In (CHEN; HE; KHOTILOVICH, 2016) can be found more details about the hyperparameters specifications. Table 7 - XGBoost hyperparameters settings Hyperparameters Description Value base_score The initial prediction score of all instances 0.5 booster It produces a tree based model tree based model colsample_bylevel Subsample ratio of columns for each level 1 colsample_bytree Subsample ratio of columns when constructing each tree [0.6, 1] gamma Minimum loss reduction required to make a partition on a leaf 0 learning_rate Step size shrinkage used in update [0.02, 0.08] max_delta_step Absolute regularization that restricts weights 0 max_depth Maximum depth of a tree [3, 8] min_child_weight Minimum sum of instance weight to continue splitting 1 n_estimators Number of trees [100, 400] n_jobs Number of parallel threads 1 objective Learning task and the corresponding learning objective count:poisson reg_alpha L1 regularization term on weights 0 reg_lambda L2 regularization term on weights 1 scale_pos_weight It controls the balance of weights 1 subsample Subsample ratio of the training instances 1 Source: Developed by the author 4.2 CONSTRUCTING A MULTI-LAYER PERCEPTRON During the training phase of the MLP models, a random search with 50 iterations was performed with a range of values detailed in Table 8. The optimizer used was "Adam", which is an extension to stochastic gradient descent, and the loss function was "Poisson". The maximum number of epochs is 5000, but the callback "early stopping", with a value of fifteen epochs, was used to end training if the validation set loss had stopped improving. 4.3 CONSTRUCTING A LONG SHORT-TERM MEMORY NEURAL NETWORK 50 Table 8 - MLP hyperparameters Settings Hyperparameter Values Neurons first layer [300, 400] Neurons second layer [420, 520] Neurons third layer [520, 620] Neurons fourth layer 1 Epochs 5000 Batch size [60, 80] Learning rate [0.0001, 0.008] 4.3 CONSTRUCTING A LONG SHORT-TERM MEMORY NEURAL NETWORK In order to model with LSTM, the hyperparameters were chosen within a range as described in Table 9, using "Adam" as optimizer and "Poisson" as a loss function. Fifty iterations were performed in a random search to discover which combination of hyperparameters produce the best model. 
The maximum number of epochs was 5000, but the "early stopping" method was set to fifteen epochs, i.e., if the loss on the validation set had not improved after fifteen epochs, the LSTM training stopped.

Table 9 - LSTM hyperparameters settings

Hyperparameter     | Values
Units first layer  | [3, 10]
Units second layer | [50, 70]
Units third layer  | [50, 100]
Units fourth layer | 1
Epochs             | 5000
Batch size         | [55, 160]
Learning rate      | [0.000001, 0.001]

Source: Developed by the author

4.4 METRICS

The metrics used to measure the performance of the models are described below:

• RMSE: The Root Mean Square Error calculates the standard deviation of the errors obtained during the predictions, highlighting the variance of the frequency distribution of errors by squaring them before they are averaged. A lower value is better (GÉRON, 2017).

RMSE = sqrt( (1/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i))^2 ) (25)

• MAE: The Mean Absolute Error evaluates the average magnitude of the prediction errors, where each error contributes in proportion to its absolute value. A lower value is better (GÉRON, 2017).

MAE = (1/m) Σ_{i=1}^{m} | h(x^(i)) − y^(i) | (26)

Where, in both equations:
m is the number of instances.
x^(i) is the predictor vector of the ith instance.
h is the prediction function.
y^(i) is the label of the ith instance.

• ACC0E: The accuracy of the predictions with a margin of error of 0, i.e., the percentage of exact predictions.

ACC0E = (#predictions_margin_0 / #total_predictions) × 100% (27)

• ACC1E: The accuracy of the predictions with a margin of error equal to or less than 1.

ACC1E = (#predictions_margin_1 / #total_predictions) × 100% (28)

• ACC2E: The accuracy of the predictions with a margin of error equal to or less than 2.

ACC2E = (#predictions_margin_2 / #total_predictions) × 100% (29)

4.5 EXPERIMENTAL RESULTS AND DISCUSSION

In this section, the eighteen results are presented and analyzed. Tables 10, 11 and 12 present the results for the XGBoost, MLP and LSTM models with the two types of reduction techniques, respectively. Different numbers of past hours (three, twelve and twenty-four) were tested to discover which one improves the prediction of the number of interventions that firefighters will face in the next hour. The best performance is marked in bold considering the accuracy metrics ACC0E, ACC1E and ACC2E as the most representative in real life. In this context, the best XGBoost model has an MAE of 1.6787, corresponding to twenty-four past hours with the MIR reduction method, while for MLP and LSTM the MAE is 1.7325 with three past hours and 1.6960 with twenty-four past hours, respectively, both with the MIR reduction method. Ta