Ilha SolteiraIlha Solteira UNIVERSIDADE ESTADUAL PAULISTA "JÚLIO DE MESQUITA FILHO" School of Engineering of Ilha Solteira - SP Post-Graduate Program in Electrical Engineering SELENE LEYA CERNA ÑAHUIS Comparative Analysis of XGBoost, MLP and LSTM Techniques for The Problem of Predicting Fire Brigade Interventions Análise Comparativa das Técnicas XGBoost, MLP e LSTM para o Problema de Prever Intervenções de Bombeiros Ilha Solteira 2019 SELENE LEYA CERNA ÑAHUIS Comparative Analysis of XGBoost, MLP and LSTM Techniques for The Problem of Predicting Fire Brigade Interventions Análise Comparativa das Técnicas XGBoost, MLP e LSTM para o Problema de Prever Intervenções de Bombeiros Dissertation presented to the São Paulo State University (UNESP) - School of En- gineering - Campus of Ilha Solteira, in ful- filment of one of the requirements for ob- taining the Master’s degree in Electrical Engineering. Field of Expertise: Automation. Prof. Dr. Anna Diva Plasencia Lotufo Advisor Prof. PhD. Christophe Guyeux Co-Advisor Ilha Solteira 2019 A mi Oshito bonita, por su amor y paciencia. ACKNOWLEDGEMENTS I would like to express my thanks to my advisor, Anna Diva, for accepting me in the master and giving me the chance to meet this beautiful country, Brazil. For guiding me and encouraging me throughout my research with numerous advice and for the friendship and patience show up these years. I thank my co-advisor Christophe Guyeux, for the trust placed in me and for giving me the precious opportunity to contribute to firefighters during this research. For being an example to me as a researcher with passion and commitment, for sharing his knowledge and teaching me with patience. I am also grateful to Pierre-Cyrille Heam and Raphaël Couturier, professors in The Uni- versity of Franche-Comté, for their welcome and support in the AND team, Femto-ST. And a special thanks to Cap. Guillaume Royer from SDIS25, for giving me the honour of work with such a great organization, the fire brigade of Doubs, whom I admire for their noble effort and dedication to the community good. I would like to infinitely thank my parents, Eulalia and Fernando, who never measured efforts to give studies to their children and always encouraged us to be persevering and to strive for what we want. To my brother Edwin, for all his effort working and helping the family since his childhood. To my sister Liesel, for taking care of me and giving me a profession. And to my brother Fernando, for sharing his knowledge with me since I was a child and for showing me the scientific way. I thank Héber, my special person, who patiently guided me and cheered me in the writing of my scientific texts, for accompanying me in long discussions about research and incoherences, with whom I discovered Brazil. I thank my good friends Tania and Monara, who welcomed me affectionately in SINTEL, they understood my zero or poorly spoken Portuguese and helped me so kindly during the master’s degree. Finally, this study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001". ABSTRACT Many environmental, economic and societal factors are leading fire brigades to be increasingly solicited, and, as a result, they face an ever-increasing number of interventions, most of the time on constant resource. On the other hand, these interventions are directly related to human activ- ity, which itself is predictable: swimming pool drownings occur in summer while road accidents due to ice storms occur in winter. 
One solution to improve the response of firefighters on con- stant resource is therefore to predict their workload, i.e., their number of interventions per hour, based on explanatory variables conditioning human activity. The present work aims to develop three models that are compared to determine if they can predict the firefighters’ response load in a reasonable way. The tools chosen are the most representative from their respective categories in Machine Learning, such as XGBoost having as core a decision tree, a classic method such as Multi-Layer Perceptron and a more advanced algorithm like Long Short-Term Memory both with neurons as a base. The entire process is detailed, from data collection to obtaining the predictions. The results obtained prove a reasonable quality prediction that can be improved by data science techniques such as feature selection and adjustment of hyperparameters. Keywords: Firefighters. Prediction. XGBoost. Long Short-Term Memory. Multi-Layer Per- ceptron. Mutual Information Regression. Principal Component Analysis. RESUMO Muitos fatores ambientais, econômicos e sociais estão levando as brigadas de incêndio a serem cada vez mais solicitadas e, como consequência, enfrentam um número cada vez maior de inter- venções, na maioria das vezes com recursos constantes. Por outro lado, essas intervenções estão diretamente relacionadas à atividade humana, o que é previsível: os afogamentos em piscina ocorrem no verão, enquanto os acidentes de tráfego, devido a tempestades de gelo, ocorrem no inverno. Uma solução para melhorar a resposta dos bombeiros com recursos constantes é pre- ver sua carga de trabalho, isto é, seu número de intervenções por hora, com base em variáveis explicativas que condicionam a atividade humana. O presente trabalho visa desenvolver três modelos que são comparados para determinar se eles podem prever a carga de respostas dos bombeiros de uma maneira razoável. As ferramentas escolhidas são as mais representativas de suas respectivas categorias em Machine Learning, como o XGBoost que tem como núcleo uma árvore de decisão, um método clássico como o Multi-Layer Perceptron e um algoritmo mais avançado como Long Short-Term Memory ambos com neurônios como base. Todo o processo é detalhado, desde a coleta de dados até a obtenção de previsões. Os resultados obtidos demon- stram uma previsão de qualidade razoável que pode ser melhorada por técnicas de ciência de dados, como seleção de características e ajuste de hiperparâmetros. Palavras-chave: Bombeiros. Previsão. XGBoost. Long Short-Term Memory. Multi-Layer Perceptron. Mutual Information Regression. Principal Component Analysis. LIST OF FIGURES Figure 1 Bias-Variance tradeoff in machine learning . . . . . . . . . . . . . . . 23 Figure 2 Learning a tree on single variable . . . . . . . . . . . . . . . . . . . . 25 Figure 3 A Multi-Layer Perceptron. The S-shaped curves in the hidden and out- put layers indicate the application of “sigmoidal” nonlinear activation functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 4 A recurrent neuron (left), unrolled through time (right). . . . . . . . . 30 Figure 5 LSTM cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 6 Standard scaler representation . . . . . . . . . . . . . . . . . . . . . . 33 Figure 7 One hot encoder representation . . . . . . . . . . . . . . . . . . . . . 34 Figure 8 Procedure for estimating the MI . . . . . . . . . . . . . . . . . . . . . 35 Figure 9 Transformation with PCA . . . . . . 
. . . . . . . . . . . . . . . . . . 36 Figure 10 Number of interventions per year . . . . . . . . . . . . . . . . . . . . 41 Figure 11 Frequency of number of interventions per hour . . . . . . . . . . . . . 41 Figure 12 Outliers in the occurrence of interventions . . . . . . . . . . . . . . . 41 Figure 13 Maximum number of interventions . . . . . . . . . . . . . . . . . . . 42 Figure 14 Precipitation each 1h . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Figure 15 L’Est Republicain reports on the storms . . . . . . . . . . . . . . . . . 43 Figure 16 Precipitations peaks recorded from Basilea Station . . . . . . . . . . . 43 Figure 17 Storm area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 18 Precipitation each 3h . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 19 Ten minutes average wind speed . . . . . . . . . . . . . . . . . . . . 44 Figure 20 Features structure before modelling . . . . . . . . . . . . . . . . . . . 48 Figure 21 XGBoost - Feature importance . . . . . . . . . . . . . . . . . . . . . 54 Figure 22 XGBoost - Tree construction . . . . . . . . . . . . . . . . . . . . . . 55 Figure 23 MLP model architecture with 3 past hours . . . . . . . . . . . . . . . 56 Figure 24 LSTM model architecture with 24 time steps . . . . . . . . . . . . . . 57 Figure 25 XGBoost with three past hours for predicting one hour . . . . . . . . . 58 Figure 26 XGBoost with twelve past hours for predicting one hour . . . . . . . . 59 Figure 27 XGBoost with twenty-four past hours for predicting one hour . . . . . 60 Figure 28 MLP with three past hours for predicting one hour . . . . . . . . . . . 61 Figure 29 MLP with twelve past hours for predicting one hour . . . . . . . . . . 62 Figure 30 MLP with twenty-four past hours for predicting one hour . . . . . . . 63 Figure 31 LSTM with three past hours for predicting one hour . . . . . . . . . . 64 Figure 32 LSTM with twelve past hours for predicting one hour . . . . . . . . . 65 Figure 33 LSTM with twenty-four past hours for predicting one hour . . . . . . 66 Figure 34 XGBoost, MLP and LSTM with three past hours for predicting one hour 67 Figure 35 XGBoost, MLP and LSTM with twelve past hours for predicting one hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Figure 36 XGBoost, MLP and LSTM with twenty-four past hours for predicting one hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 LIST OF TABLES Table 1 Initialization parameters for each type of activation function . . . . . . 29 Table 2 Illustrated example of the dictionary . . . . . . . . . . . . . . . . . . . 40 Table 3 Numerical variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Table 4 Categorical variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Table 5 List of features after applying mutual information regression selection . 47 Table 6 Differences between MIR and PCA . . . . . . . . . . . . . . . . . . . 47 Table 7 XGBoost hyperparameters settings . . . . . . . . . . . . . . . . . . . . 49 Table 8 MLP hyperparameters Settings . . . . . . . . . . . . . . . . . . . . . . 50 Table 9 LSTM hyperparameters Settings . . . . . . . . . . . . . . . . . . . . . 50 Table 10 XGBoost prediction results for next one hour on test data 2017 . . . . . 52 Table 11 MLP prediction results for next one hour on test data 2017 . . . . . . . 52 Table 12 LSTM prediction results for next one hour on test data 2017 . . . . . . 
52 LIST OF ABBREVIATIONS & ACRONYMS XGBoost Extreme Gradient Boosting ANN Artificial Neural Network FNN Feedforward Neural Network MLP Multi-Layer Perceptron RNN Recurrent Neural Network LSTM Long Short-Term Memory SVM Support Vector Machine GA Genetic Algorithm PSO Particle Swarm Optimization KDE Kernel Density Estimation MIR Mutual Information Regression PCA Principal Component Analysis RMSE Root Mean Squared Error MAE Mean Absolute Error SDIS25 Service Départemental d’Incendie et de Secours 25 ACC Accuracy DBN Deep Belief Network RBM Restricted Boltzmann Machine CART Classification and Regression Trees GBM Gradient Boosting Machine MSE Mean Squared Error DNN Deep Neural Network Tanh Hyperbolic Tangent function ReLU Rectified Linear Unit function RTRL Real Time Recurrent Learning BPTT Back-Propagation Through Time OHE One-Hot Encoding MI Mutual Information SVD Singular Value Decomposition ACC0E Accuracy with margin of error 0 ACC1E Accuracy with margin of error 1 ACC2E Accuracy with margin of error 2 GRU Gated Recurrent Unit TABLE OF CONTENTS 1 INTRODUCTION 13 1.1 MOTIVATION 15 1.2 CONTRIBUTIONS 15 1.3 OUTLINE OF THE THESIS 16 2 MACHINE LEARNING TECHNIQUES 18 2.1 INTRODUCTION 18 2.2 GRADIENT BOOSTING MACHINES 21 2.2.1 OVERVIEW 21 2.2.2 EXTREME GRADIENT BOOSTING 21 2.3 ARTIFICIAL NEURAL NETWORKS 25 2.3.1 OVERVIEW 25 2.3.2 MULTI-LAYER PERCEPTRON 26 2.3.3 LONG SHORT-TERM MEMORY NEURAL NETWORK (LSTM) 29 2.4 ADVANCED FEATURES 32 2.4.1 FEATURE EXTRACTION 32 2.4.2 FEATURE SELECTION 34 2.4.3 MODEL SELECTION 36 3 DATA PREPROCESSING 38 3.1 DATA COLLECTION 38 3.2 DATA CLEANING 40 3.3 VARIABLES ENCODING 44 3.4 DATA SET DIMENSIONALITY REDUCTION 45 4 BUILDING PREDICTION MODELS 48 4.1 MODELLING WITH XGBOOST 49 4.2 CONSTRUCTING A MULTI-LAYER PERCEPTRON 49 4.3 CONSTRUCTING A LONG SHORT-TERM MEMORY NEURAL NETWORK 50 4.4 METRICS 50 4.5 EXPERIMENTAL RESULTS AND DISCUSSION 51 5 CONCLUSIONS AND SUGGESTIONS FOR FURTHER RESEARCH 70 5.1 CONCLUSIONS 70 5.2 SUGGESTIONS FOR FURTHER RESEARCH 71 5.3 SCIENTIFIC PUBLICATIONS 71 REFERENCES 73 13 1 INTRODUCTION In the past decade, machine learning has provided effective web search, practical speech and image recognition, efficient statistical arbitrage, more accurate medical diagnostic systems, self-driving cars, and an immense improvement in understanding the human genome. Machine learning is so widespread that people use it every day without knowing it, changing lives and providing an irreplaceable service in the industry (GUPTA; GUSAIN; POPLI, 2016). Numer- ous researchers believe that is the best way to get to human level artificial intelligence. Machine learning aims to spot a function based in a great set of data. The function is created with a first training group of data in order to forecast outputs of a previously unseen data set called testing set, always searching the highest possible accuracy. This could be seen as classification problems in the form of labels and classes or as regression problems in the form of continuous values (SJARDIN; MASSARON; BOSCHETTI, 2016). Nevertheless, it is often hard to find enough training data available, i.e., lack of suitable data, lack of access to the data, privacy problems, badly chosen tasks and algorithms, and a lack of resources; as a result, often several machine learning programs fail to deliver the expected value. The goal of building these systems is that they could be adaptable, learn from their expe- rience and being enough simple. 
With this purpose, two approaches stand out in machine learning: decision tree learning and artificial neural networks.

Decision trees have long been used for classification purposes. A highly optimized classification system was obtained when Friedman et al. (FRIEDMAN; HASTIE; TIBSHIRANI, 1998) applied the boosting technique to build decision tree ensembles. Later, the gradient descent approach was introduced to boosting (MASON et al., 1999) and, along with its stochastic variant (FRIEDMAN, 2002), the boosted tree model increased robustness against the overcapacity of the base learner by adding trees sequentially while fitting a simple parameterized function. Despite all its improvements and wide use, it was still not efficient in terms of processing time and model complexity. In order to reduce the latter, Johnson et al. (JOHNSON; ZHANG, 2014) proposed a modification of gradient boosting, called the Fully Corrective Greedy Algorithm, achieving higher accuracies and smaller models. Nevertheless, this was not enough. In order to cope with this complexity, the Extreme Gradient Boosting (XGBoost) system was created, a highly scalable and accurate learning system (CHEN; GUESTRIN, 2016), which has gained a lot of attention in recent years for its simplicity and its great success in competitions such as Kaggle.

Alternatively, Artificial Neural Networks (ANNs) are versatile, powerful and scalable, excellent for coping with large and tremendously complex machine learning tasks such as categorizing billions of images as Google Images does, or DeepMind's AlphaGo, which defeated the world champion at the game of Go by learning from and analyzing millions of past games (GÉRON, 2017). Originally, ANNs were developed as a mathematical representation of the capabilities of biological brains when information is processed (MCCULLOCH; PITTS, 1988). Despite the fact that nowadays it is known that ANNs bear little resemblance to real biological neurons, they remain popular as applied methods for pattern classification. A large number of ANN varieties have appeared over the years, with very diverse properties. A relevant distinction is between ANNs with acyclic connections and those with cyclic ones. ANNs without cycles are identified as Feedforward Neural Networks (FNNs), whose most recognized example is the Multi-Layer Perceptron (MLP), which has its origin in (ROSENBLATT, 1958). ANNs with cycles are described as recursive or, more commonly, Recurrent Neural Networks (RNNs). The main idea of RNNs is that they can manage contextual information when input and output sequences are mapped. Unfortunately, in practice, conventional RNN architectures have quite limited access to the range of context, because of the vanishing gradient problem, in which the impact of an input on the hidden layer(s) and, consequently, on the output decays or explodes exponentially as it travels through the recurrent connections of the network (GRAVES, 2012). One attempt to overcome this problem was proposed in 1997 by Hochreiter and Schmidhuber (HOCHREITER; SCHMIDHUBER, 1997) with Long Short-Term Memory (LSTM), which has emerged as an effective and scalable model for several learning problems related to sequential data (GREFF et al., 2017).

Around the world, fire brigades seek strategies to cope with the need to respond to incidents immediately.
Researches based on the forecast of the number of interventions, response speed, materials and engine used by fire departments are scarce in the literature. However, we can find machine learning techniques applied to forecasting occupational accidents using Support Vec- tor Machine (SVM) and Multi-Layer Perceptron (MLP) optimized by Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), where the aim was to predict the accident outcomes such as injury and property damage using occupational accident data (SARKAR et al., 2018). Another example, Ren et al. (REN et al., 2018) worked with a deep learning approach, using LSTM neural networks to predict the traffic accident risk, taking into consideration weather variables, periodical patterns and getting from the traffic accidents the spatial distribution pat- terns. The traffic accident data was discretized in space and time, and used as inputs to the deep model during training. Then, the model was provided with recent historical data to even- tually obtain the real-time traffic accident risk prediction. Gerber (GERBER, 2014) developed a research using spatio-temporally tagged tweets with a Kernel Density Estimation (KDE) tech- nique in order to predict crimes where the crime data collected belonged to the Chicago Police Department and the selected tweets were the ones tagged with the geographical positioning sys- 1.1 MOTIVATION 15 tem coordinates corresponding to the limits of Chicago and Illinois cities. In the work of (SHI et al., 2017), the author proposed an approach based on XGBoost to predict urban fire acci- dents using ten million samples as data set, an algorithm based on association rules to select features, as well as the Box-Cox transformation to clean outliers. As a result, they obtained 90% of accuracy. Finally, Pettet et al. (PETTET et al., 2017) developed a data-driven toolchain to forecast the likelihood of vehicular incidents in given time and location using incident data from the Nashville Fire Department of the United States of America, considering weather and road type information as added variables. The authors combined a Similarity Based Agglomer- ative Clustering to categorize incidents with similar characteristics, a survival analysis to learn the probability of incidents per cluster and a Bayesian Network inference technique to map the clusters to the spatial locations. Thereby, most of the problems that firefighters face on are related to the management of the budget, rising call volume, and personnel and equipment shortages. On the other side, fire departments have been collecting data on their interventions; notwithstanding, few of them use data science to elaborate a decision making approach. Knowing the number of interventions in the next hours can help to decrease the response time of firefighters and improve their ability to save lives and rescue people. In this research, it is developed three models for this purpose: one with XGBoost, another with MLP and a last one with LSTM. Finally, their performances are compared in order to find the most efficient in terms of prediction accuracy. 1.1 MOTIVATION There is a need to build an accurate model that is capable of evaluating the possible oc- currence of an incident using information that happened in the past. Therefore, it could help fire departments to establish well planned strategies with their allocated mobile and personnel resources. Thus, the response time of firefighters would be reduced and more lives would be safe. 
In the literature, there are few researches related to the forecasts of interventions carried out by firefighters, that is the reason for which in this work three paradigms are constructed using different techniques of machine learning, and their performances are evaluated to discover which fits better to the considered data. 1.2 CONTRIBUTIONS In this research, the aim is to develop, evaluate and compare the performance of three regression models applied to the prediction of the number of interventions that the fire brigade 1.3 OUTLINE OF THE THESIS 16 SDIS25 in Doubs, France, would face on. The techniques used are: • XGBoost, a software library which provides a gradient boosting framework; • MLP, a classical machine learning method; • LSTM, considered a popular and more advanced neural network algorithm. The procedure establishes the following objectives: i) Collect and clean data from various sources. The initial data set is provided by the fire brigade “Service Départemental d’Incendie et de Secours” of Doubs. For complementary sources, it will be considered weather-related data, traffic hours, epidemiological data, academic vacations and holidays, astronomical variables as moon phase and others. ii) Encode and select data. As the initial data are not defined by float value pairs, it is necessary to applied feature extraction techniques such as Standard Scaler and One-Hot-Encoding from Scikit-Learn library in Python. And in order to recognize the relationship of all vari- ables and if they all are needed for the modelling, it will be tested and compared two tech- niques: Mutual Information Regression (MIR) and Principal Component Analysis (PCA), also from Scikit-Learn library. iii) Build models with XGBoost, MLP and LSTM. During the construction of the models and to improve the predictions, it will be made a research to find the best configuration of hyperparameters. For XGBoost, it will be used a grid search and for MLP and LSTM, a random search. iv) Compare the models’ performances. This activity is made taking into account the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Accuracy (ACC) as eval- uation metrics. Furthermore, this research shows the feasibility of the paradigms constructed, illustrates several variables not considered yet in the prediction and initializes discussions about these issues. Finally, it provides a path to the further development of predictive models for firefighters interventions considering machine learning techniques. 1.3 OUTLINE OF THE THESIS The rest of the thesis is organized as follows: 1.3 OUTLINE OF THE THESIS 17 • Chapter 2: gives an introduction to machine learning, explaining learning algorithms, its advantages, challenges and limitations considered in this research. • Chapter 3: describes how the data were collected, cleaned, encoded and selected before feeding the models, and depicts the tuning made to each model to improve their hyperpa- rameters. • Chapter 4: exhibits the predicting models developed, the results obtained and a discussion interpreting the highlighting study results. • Chapter 5: concludes about the subject by offering important insights about the research problem and makes suggestions for future studies. 18 2 MACHINE LEARNING TECHNIQUES 2.1 INTRODUCTION The Internet, the enormous computational power of the recently electronic devices and the fact that practically all process in the world make use of any kind of software are providing a great quantity of data every second. 
There are many things to do with this data such as analyze it to later take a decision, visualize it in a lot of ways or consider it as a source of experience to enhance the performance of the algorithms. These algorithms, which can learn from previous data, comprehends the field of Machine Learning (GARRETA; MONCECCHI, 2013). Some definitions that try to encapsulate the ideal objective or ultimate aim of machine learning are expressed as follows: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Mitchell, 1997, p.2) Machine learning algorithms can figure out how to perform important tasks by gen- eralizing from examples. This is often feasible and cost-effective where manual program- ming is not. As more data becomes available, more ambitious problems can be tack- led. As a result, machine learning is widely used in computer science and other fields. (DOMINGOS, 2012, p.1) As it is described in (GÉRON, 2017), there are many different types of machine learning algorithms and it is useful to classify them in broad categories based on: • Training with or without human supervision. – Supervised Learning. During the training phase, the algorithm is fed by data called Predictors indicating the desired results named Labels. The task types could be classification, like classify new emails as spam or non-spam; or regression where the aim is to predict a numerical value, such as the price of a car. This last sort of prediction is also used for algorithmic trading. Some examples of supervised learning algorithms are: K-Nearest Neighbors, Linear Regression, Support Vector Machine, Decision Trees, Random Forests and several Neural Networks architec- tures (GÉRON, 2017; GOODFELLOW; BENGIO; COURVILLE, 2016). 2.1 INTRODUCTION 19 – Unsupervised learning. Samples of the training data are unlabeled; therefore, the system attempts to learn without a guide. The model is trained with normal instances to later, during the prediction, recognizes abnormal ones. However, two main diffi- culties are known: the generalization that refers to the exponential growth of config- urations during the learning process and the computational challenge where many algorithms imply intractable computations. As examples: Clustering, Visualization and dimensionality reduction, and Association rule learning are referred (GÉRON, 2017; GOODFELLOW; BENGIO; COURVILLE, 2016). – Semi-supervised. The algorithms can handle a lot of unlabeled data with a few of labeled data. An example of these algorithms is the Deep Belief Networks (DBNs) that have the component Restricted Boltzmann Machines (RBMs) which is trained sequentially in an unsupervised manner and next fine-tuned using supervised learn- ing techniques (GÉRON, 2017; GARRETA; MONCECCHI, 2013). – Reinforcement learning: The system is named Agent and it has the ability to ob- serve the environment, choose and execute actions to get rewards in return or penal- ties. Hence, the agent learns by itself the best strategy. (GÉRON, 2017; GARRETA; MONCECCHI, 2013). • Learning incrementally on the fly. When the system is trained using all data available and then launched into production without learning anything else, the process is called Batch learning. This is done offline and consumes plenty of time and resources. 
To update the system, it requires to be trained a new version of itself, including the new data and the old data; then, the old system would be stopped and replaced by the new one. In order to make the computation faster and more space efficient, it can be used another method called Online learning, where the system is trained sequentially by small groups called Mini-batches, learning new data on the fly (GÉRON, 2017; BURLUTSKIY et al., 2016). • Learning by generalization of examples. It is divided into two types: Instance based-learning, where samples are learned by heart and the generalization process is using a similarity criterion; and Model-based learning, where a model is built using a set of samples and the predictions are made based on this model (GÉRON, 2017). For example, a state-of-the-art spam filter may learn on the fly using a deep neural network model trained; this makes it an online, model-based, supervised learning system. Regardless of learning style or function, all combinations of machine learning algorithms consist of three components according to (DOMINGOS, 2012): 2.1 INTRODUCTION 20 • Representation. The classifier must be depicted in a language that computer could un- derstand. • Evaluation. An evaluation function, also called objective function, is required to distin- guish good classifiers from the least efficient ones. The objective function used locally by the algorithm may diverge from the external one that it is desired the classifier to optimize. • Optimization. It is needed a method to optimize the classifiers and obtain the highest- scoring one. The selection of an optimization technique is essential to the efficiency of the learner, helping to discover more than one optimum. As it was described before, there are different approaches to getting machines to learn. With all of their processing power, they are able to more quickly highlight or find patterns in big data that would have otherwise been missed by human beings. Machine learning is a tool that can be used to enhance humans’ abilities to solve problems and make informed inferences on a wide range of situations. However, there are some challenges and limitations to overcome, such as the historical overfitting and underfitting. This could be explained by dividing the generalization error into bias and variance, where the bias is a learner’s tendency to constantly learn the same erroneous thing, resulting in underfitting, and variance is the proneness to learn random things regardless the real signal, resulting in overfitting. After overfitting, the most crucial problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to men- tion the fact that several algorithms that work well in low dimensions becomes unmanageable when the input is high-dimensional, consequently, it is more difficult to understand the data. Another characteristic to regard is the fact that when a learning algorithm is not giving good results, often the quicker path to successful outcomes is to feed the machine more data, which is the primary driver of progress in machine and deep learning algorithms in recent years. Nev- ertheless, this can lead to issues with scalability, in which more data is available but time to learn that data remains an issue (DOMINGOS, 2012). A well-known approach that have made great gains over the past decade is Deep Learning, which is the study and design of machine learning algorithms for learning good representation of data at multiple levels of abstraction. 
Recent publicity of deep learning through DeepMind, Facebook, and other institutions has highlighted it as the “next frontier” of machine learning. 2.2 GRADIENT BOOSTING MACHINES 21 2.2 GRADIENT BOOSTING MACHINES 2.2.1 OVERVIEW A Decision Tree is an analytical method with a schematic representation of alternatives that facilitates making better decisions. The goal of a decision tree is to learn a series of decision rules to infer the target labels. Utilizing an iterative algorithm, the procedure starts at the root level of the tree and splits the data recursively on the feature that gets the lowest impurity. The majority of scalable tree-based applications are based on Classification and Regression Trees (CART) introduced by Breiman, Friedman Stone and Ohlson in 1984. CART works building binary trees for classes or continuous variables, operating iteratively on given features and improving in a greedy way the results of an error metric expressed as impurity in order to finally obtain a prediction (SJARDIN; MASSARON; BOSCHETTI, 2016). Decision trees are the basis for ensemble methods, such as bagging and boosting. Bagging creates several subsets of data from training sample chosen randomly, where each collection generates its own decision tree. All of these produce an ensemble of different models averaged, i.e., an aggregated predictor (BREIMAN, 1996). Boosting fits consecutive trees generated by a random sample, its goal is to solve for net error from the prior tree. When an input is incorrectly classified by a hypothesis, its weight is increased and therefore the next hypothesis will try to classify it in the most precise way. Thus, weak learners, which performs just slightly better than random guessing can be “boosted” into an accurate “strong” learner (FRIEDMAN; HASTIE; TIBSHIRANI, 1998; FREUND; SCHAPIRE, 1999). When gradient descent procedure was combined with boosting technique, gradient boosting was created. Traditionally, gradient descent is used to minimize a set of parameters, such as the coefficients in a regression equation or weights in a neural network. After calculating the error or loss, the weights are updated to minimize that error. Instead of parameters, weak learner sub- models or more specifically decision trees are considered. After calculating the loss, to perform the gradient descent procedure, a tree must be added to the model to reduce the loss. This is done by parameterizing the tree, then the parameters of the tree are modified and the algorithm moves in the right direction by reducing the residual loss and ameliorating the prediction. A specific number of trees are added, or the training terminates, once loss reaches a desirable level or when the results no longer improve on an external validation data set (MASON et al., 1999; FRIEDMAN, 2001; CHIO; FREEMAN, 2018). 2.2.2 EXTREME GRADIENT BOOSTING XGBoost is a well engineered, distributed machine learning system to scale up tree boost- ing algorithms. It makes use of a novel regularization approach over the conventional Gradient 2.2 GRADIENT BOOSTING MACHINES 22 Boosting Machines (GBMs) to significantly decrease the complexity. The system is optimized for quick parallel tree construction and adapted to be fault tolerant under the distributed setting. It can handle millions of samples on a single node and manage billions of them with distributed computing (CHEN; GUESTRIN, 2016). 
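Before detailing the mathematics of the method, a minimal usage sketch is given below of how such a booster can be trained through the scikit-learn-style interface of the xgboost library. The synthetic data, the hyperparameter values and the Poisson objective shown here are illustrative placeholders only, not the configuration adopted in this work (the actual settings are presented in Chapter 4).

```python
import numpy as np
from xgboost import XGBRegressor

# Placeholder data: X holds explanatory variables, y holds hourly count targets.
X = np.random.rand(1000, 10)
y = np.random.poisson(lam=2.0, size=1000)

# Illustrative hyperparameters only; in this work they are chosen by grid search.
model = XGBRegressor(
    n_estimators=100,           # number of boosted trees added sequentially
    max_depth=4,                # maximum depth of each tree
    learning_rate=0.1,          # step size applied to each new tree
    objective="count:poisson",  # one suitable loss for non-negative count data
)
model.fit(X, y)
predictions = model.predict(X[:5])
```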
Furthermore, unlike conventional GBMs, Extreme Gradient Boosting makes use of a regularization method to considerably diminish the complexity of the model. Another notable advantage is the availability of a set of parameters that can be tuned during training on the input data to enhance the model being constructed (GUPTA; GUSAIN; POPLI, 2016). The way XGBoost works is described as follows.

a) Scoring function

In order to measure the performance of a model on a given data set, XGBoost defines an objective function composed of a training loss term $L(\theta)$ and a regularization term $\Omega(\theta)$, where the former measures how predictive the model is and the latter penalizes the complexity of the model and prevents overfitting. This is represented in equation (1), where $\theta$ denotes the parameters to be discovered during training. For the mathematical structure, XGBoost uses a model based on decision tree ensembles, where the prediction $\hat{y}_i$ for the $i$-th training example, out of a total of $n$, is the sum of the predictions of multiple trees. Equation (2) shows $K$ as the number of trees being combined, $f_k$ as a function in the functional space $\mathcal{F}$ and $x_i$ as the input. Thus, the evaluation function to be optimized becomes equation (3) (CHEN, 2014).

$$Obj(\theta) = L(\theta) + \Omega(\theta) \qquad (1)$$

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \qquad (2)$$

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad (3)$$

The aim is to find a trade-off between bias and variance; in other words, the model needs to be both simple and predictive. Simple, because it then tends to have smaller variance, making the prediction stable; and predictive, because it must fit the training data well. This can be visualized in Figure 1, where the upper-left image shows the distribution of a variable to be modelled, the upper-right image presents a model with high complexity, which results in overfitting, the bottom-left image illustrates a model that does not learn enough patterns, i.e., underfitting, and the bottom-right image shows a model with a good balance between overfitting and underfitting.

Figure 1 - Bias-Variance tradeoff in machine learning. Source: (CHEN, 2014)

b) Boosting

An additive strategy is applied during training: one new tree that improves the system is added to the model at a time. In equation (4), $\hat{y}_i^{(t)}$ denotes the model at training round $t$, $\hat{y}_i^{(t-1)}$ is the model kept from the previous round and $f_t(x_i)$ is the new function (CHEN, 2014).

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (4)$$

c) Regularization

To determine the complexity of a tree, $\Omega(f)$, (CHEN; GUESTRIN, 2016) proposed the definition in equation (5), where the first term $\gamma T$ accounts for the number of leaves $T$, with $\gamma$ a constant, and the second term computes the squared $L_2$ norm of the leaf scores $w_j$, with $\lambda$ a small constant value (CHEN, 2014).

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \qquad (5)$$

d) Final structure outcome

As an example of an estimator, the Mean Squared Error (MSE) is considered for the loss term; then, taking the second-order Taylor expansion of the loss, the objective function becomes equation (6), which determines how the splitting of the nodes is done.
$G_j$ and $H_j$ are defined in equations (7) and (8) respectively, where $g_i$ and $h_i$ are the first- and second-order partial derivatives obtained from the Taylor expansion, $I_j = \{i \,|\, q(x_i) = j\}$ is the set of indices of the data points assigned to the $j$-th leaf and $q(x)$ is the structure of the tree (CHEN, 2014).

$$obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T \qquad (6)$$

$$G_j = \sum_{i \in I_j} g_i \qquad (7)$$

$$H_j = \sum_{i \in I_j} h_i \qquad (8)$$

Finally, in the objective function, the argument of the minimum and the minimum of the quadratic function are taken for the single variable $w_j$, considering $q(x)$ as fixed. The outcomes are equations (9) and (10), the latter of which assesses how good a tree structure is, i.e., the smaller the score, the better the structure (CHEN, 2014).

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \qquad (9)$$

$$obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \qquad (10)$$

e) Greedy learning of the tree

In practice, it would be unmanageable to enumerate all possible tree structures and select the best one. Because of this, trees are built in a greedy way, starting from a tree with depth zero; then, one level is optimized at a time until the maximum depth is reached, that is, a split is added for each leaf node of the tree. The gain is expressed as follows:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \qquad (11)$$

In equation (11), the first term refers to the score of the left child; the second is the score of the right child; the third expresses the score if the split is not taken; and the last term accounts for the complexity cost of adding a leaf. From this, it follows that it is better not to add a split if the gain is smaller than $\gamma$. This technique is called pruning in tree-based models. In order to find the optimal split:

i) For each node, all possible features are enumerated.
ii) For each feature, instances are sorted by feature value.
iii) A linear scan is used to decide the best split along that feature. The scan is taken from left to right and the split is made where the values $g_i$ and $h_i$ give the best gain, calculated with equation (11).
iv) The best split solution is taken over all the features.

Finally, the $f_t(x)$ obtained with the new tree is added to the model, as shown in equation (12):

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \epsilon f_t(x_i) \qquad (12)$$

The $\epsilon$ is called the step size or learning rate and usually takes values around 0.1. A complete optimization is not done in each step; in this way, a chance of improvement is kept for future rounds, which helps prevent overfitting.

Figure 2 illustrates an example of learning a tree on a single variable, where the input variable is $t$ (time) and the goal is to predict a person's preference for romantic music at a given time. The first image shows the construction of the tree and the second one shows how the data are modelled through time in order to obtain the final function.

Figure 2 - Learning a tree on single variable. Source: (CHEN, 2014)

For more details about the power of the XGBoost method, its features and its algorithmic optimizations, (CHEN, 2014; CHEN; GUESTRIN, 2016; GUPTA; GUSAIN; POPLI, 2016) are strongly suggested to interested readers.

2.3 ARTIFICIAL NEURAL NETWORKS

2.3.1 OVERVIEW

ANNs were first mentioned in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts (MCCULLOCH; PITTS, 1988). They introduced an elementary computational model of how biological neurons in animal brains could work together to perform tasks through the use of propositional logic.
This representation was known as the first artificial neural network. During ANNs’ history, they passed for a long dark period in the course of the 1960s and 1970’s where they were set aside. At the beginning of the 1980’s, a revival interest in them returned allowing the development of new architectures and 2.3 ARTIFICIAL NEURAL NETWORKS 26 better training techniques. For the next decade, better results and theoretical foundations were achieved by powerful machine learning techniques. Nowadays, with the tremendous increase in computing power, the enormous amount of data available, finer training algorithms and some theoretical limitations that have become advantageous in practice have permitted ANNs to be in a continuous evolution with accelerated progress and more fundings (GÉRON, 2017). The basic structure of an ANN is compounded by small processing units or nodes with weighted associations which in the biological model represent neurons and the strength of the synapses between the neurons respectively. The network is stimulated by given inputs to some or all nodes, this effect is spread throughout the weighted connections in the network. The activation of an ANN node tries to model the average firing of the spikes observed as a result of the electrical activity in biological neurons (GRAVES, 2012). From among all ANNs architectures, there is a special distinction ANNs whose connections are acyclic called Feedforward Neural Networks (FNNs), and those whose associations form cycles called Recurrent Neural Networks (RNNs). In the case of FNNs, the most representative is the Multi-Layer Perceptron (MLP) and for RNNs is the Long Short-Term Memory (LSTM) that in last years has demonstrated a highlighted performance. These two representations are described as follows (GRAVES, 2012). 2.3.2 MULTI-LAYER PERCEPTRON An MLP works with a set of units called neurons organized in an input layer, one or more hidden layers and an output layer, all fully connected; each layer contains a bias neuron. Thus, inputs are propagated from the first layer, passing through hidden layers until reaching the output layer, this is called the forward pass of the network and could be observed in Figure 3. When a ANN has two or more hidden layers is called Deep Neural Network (DNN) (GRAVES, 2012; GÉRON, 2017). A brief description of how Multi-Layer Perceptron works is given in (GRAVES, 2012) as follows: An MLP with a particular set of weight values defines a function from input to output vectors. By altering the weights, a single MLP is capable of instantiating many different functions. Indeed it has been proven that an MLP with a single hidden layer containing a sufficient number of nonlinear units can approximate any continuous function on a com- pact input domain to arbitrary precision. For this reason, MLPs are said to be “universal function approximators”. (GRAVES, 2012, p.13) There are some principal characteristics that were born with MLP and must be considered in the modelling of an ANN: 2.3 ARTIFICIAL NEURAL NETWORKS 27 Figure 3 - A Multi-Layer Perceptron. The S-shaped curves in the hidden and output layers indicate the application of “sigmoidal” nonlinear activation functions. Source: (GÉRON, 2017) a) Loss function. Also called "Error Function", it is the function that is going to be minimized or maximized during the training phase, its selection depends on the particular application. 
For classification problems, the goal is to model the posterior probabilities of class membership conditioned on the input variables; the most utilized losses are the categorical cross-entropy and the binary cross-entropy. For regression problems, the basic goal is to model the distribution of the output variables, also conditioned on the input variables; the most used are the Mean Squared Error (MSE) and the Mean Absolute Error (MAE). In the case of modelling count data, whose characteristics are a discrete distribution and non-negative predicted values, the Poisson distribution is widely used (BISHOP, 1995; GRAVES, 2012; CHOLLET et al., 2015; LONG, 1997). In Keras, an open source neural network library written in Python, the Poisson loss function is calculated by equation (13):

$$Loss = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}^{(i)} - y^{(i)} \log(\hat{y}^{(i)}) \right) \qquad (13)$$

Where:
n: number of samples.
$\hat{y}^{(i)}$: prediction for the $i$-th sample.
$y^{(i)}$: target of the $i$-th sample.

b) Backpropagation.

In 1986, D. E. Rumelhart et al. found a way to train MLPs by introducing the backpropagation algorithm in (RUMELHART; HINTON; WILLIAMS, 1986). During the training phase, each instance is used by the algorithm to feed the network and calculate the output of each neuron in each successive layer. The network's error is computed as the difference between the desired output and the actual output of the network; then, the contribution of each neuron in the last hidden layer to the output neurons' error is determined. This process continues backwards through each previous hidden layer until the input layer is reached. In this manner, the error gradient is measured efficiently during the reverse pass through all the connection weights and, finally, with the gradients calculated, the algorithm performs a gradient descent step on all the connection weights (GÉRON, 2017). According to (HAYKIN, 1998), at each iteration the weights can be updated following the generalized delta rule:

$$\Delta w_{ij}(n) = \alpha \, \Delta w_{ij}(n-1) + \varepsilon \, \delta_j(n) \, y_i(n) \qquad (14)$$

Where:
n: current iteration.
n-1: previous iteration.
$\Delta w_{ij}$: correction applied to the synaptic weight connecting neuron $i$ to neuron $j$.
$\alpha$: momentum.
$\varepsilon$: learning rate.
$\delta_j$: local gradient.
$y_i$: input signal of neuron $j$.

c) Problem of vanishing/exploding gradients.

During backpropagation in architectures with two or more layers, gradients often get smaller and smaller as the algorithm progresses down to the first layers. Consequently, the lower-layer connection weights are left practically unchanged by gradient descent and the training phase does not converge to a good solution; or the opposite happens, the gradients become bigger and bigger, the last layers receive very large weight updates and the algorithm diverges. In the literature, this is known as the vanishing or exploding gradients problem (GÉRON, 2017).

This problem was analyzed in (GLOROT; BENGIO, 2010). Ideally, to avoid the vanishing gradient, the variance of each layer's outputs should be equal to the variance of its inputs, and the gradients should also have equivalent variance before and after flowing through a layer in the reverse direction along the network. However, it is not possible to guarantee both; instead, what has worked very well in practice is a suitable random initialization of the connection weights. This is described in Table 1, where $n_{inputs}$ is the number of input connections and $n_{outputs}$ the number of output connections of the layer whose weights are being initialized (GÉRON, 2017).
Using the Xavier initialization procedure can accelerate training considerably; moreover, this method has been applied to deep learning with excellent results (GÉRON, 2017).

Table 1 - Initialization parameters for each type of activation function

Activation Function | Uniform Distribution [-r, r] | Normal Distribution
Logistic | $r = \sqrt{6/(n_{inputs}+n_{outputs})}$ | $\sigma = \sqrt{6/(n_{inputs}+n_{outputs})}$
Hyperbolic Tangent | $r = 4\sqrt{6/(n_{inputs}+n_{outputs})}$ | $\sigma = 4\sqrt{6/(n_{inputs}+n_{outputs})}$
ReLU (and its variants) | $r = \sqrt{2}\,\sqrt{6/(n_{inputs}+n_{outputs})}$ | $\sigma = \sqrt{2}\,\sqrt{6/(n_{inputs}+n_{outputs})}$

Source: (GÉRON, 2017)

d) Activation functions.

In order to work with backpropagation, the step activation function had to be replaced by the logistic function: gradient descent needs a surface with a usable slope, which the flat segments of the step function do not provide, while the logistic function has a nonzero derivative everywhere, allowing some progress at every step. Two other widespread activation functions are detailed in (GÉRON, 2017) as follows:

• The hyperbolic tangent function (Tanh). It is S-shaped, continuous and differentiable; its output values lie between -1 and 1, making the layer's output more or less normalized, which helps convergence.

• The rectified linear unit function (ReLU). It is continuous but not differentiable at zero; its output ranges from zero to infinity and it tends to be fast to compute.

For more details about the MLP architecture, (GRAVES, 2012; GÉRON, 2017; GLOROT; BENGIO, 2010; BISHOP, 1995) are recommended to interested readers.

2.3.3 LONG SHORT-TERM MEMORY NEURAL NETWORK (LSTM)

Unlike MLPs, which map input vectors to output vectors, the connections of an RNN can save a memory of previous inputs and keep it in the network's internal state, which impacts the output (GRAVES, 2012). This is similar to understanding a context based on previous situations, or completing a phrase using its first words as a basis. The recurrent process is illustrated in Figure 4. On the left side, a recurrent neural network with a single neuron is shown, which receives an input x, produces y and returns the outcome to itself. Generally, the activation function used is the hyperbolic tangent. On the right side of the picture, the recurrent network is unrolled, showing the recurrent neuron being fed by the inputs at different time steps; at each time step a result is obtained that comes back internally as an input. The part of the network that holds some state over time steps is called a Memory Cell. A single neuron or a layer of recurrent neurons is a Basic Cell (GRAVES, 2012; GÉRON, 2017).

Figure 4 - A recurrent neuron (left), unrolled through time (right). Source: (GÉRON, 2017)

There are four kinds of input and output sequence configurations in recurrent neural networks:

a) The network can receive a sequence of inputs and generate a sequence of outputs;
b) The network can take a sequence of inputs and ignore all outputs except for the last one;
c) The network can admit a single entry at the first time step, pad the remaining time steps with zeros and return a sequence; and
d) A sequence-to-vector network called an Encoder can be followed by a vector-to-sequence network called a Decoder.

However, standard recurrent networks can only access a limited range of context due to the vanishing or exploding gradient problem explained in section 2.3.2.
In (HOCHREITER; SCHMIDHUBER, 1997), it was presented an improved version of RNN called Long Short-Term Memory, which is an effective and scalable model for solving problems related to sequential data such as handwriting recognition, speech recognition, lan- guage modelling, human activity recognition and traffic prediction. LSTM was designed to get over error backflow problems by truncating the gradient without affecting the training process. It is described by the authors in the following: LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through “constant error carrousels” within special units. Multiplicative gate units learn to open and close access to the constant error flow. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms. (HOCHREITER; SCHMIDHUBER, 1997, p.1) The LSTM architecture presents a set of subnets connected in a recurrent way, named mem- ory blocks, each block has one or more auto-connected memory cells and three multiplicative 2.3 ARTIFICIAL NEURAL NETWORKS 31 units called “gates” for input, output and forget, which represents operations such as writing, reading and restart for the cells (GRAVES, 2012). In Figure 5, it is shown an LSTM cell, its state is divided into two: the short-term state h(t) and the long-term state c(t), the latter helps the network to recognize what to store, what to discard and what to read from it. As the long-term state c(t−1) travels the network from left to right, for each time step some memories are dropped when passing through the forget gate and new ones are added selected by the input gate via the addition operation, the resulting long-term state c(t) is duplicated, one copy is sent directly without any further change and another one is delivered to the tanh activation function and filtered by the output gate, producing the short-term state h(t). In this example, the resulting h(t) is equal to the cell’s output for this time step y(t) (GÉRON, 2017). Figure 5 - LSTM cell Source: (GÉRON, 2017) But, where do new memories come from and how do the gates affect them? i) An input vector x(t) and the previous short-term state h(t) are entries to four fully connected layers. ii) The principal layer is the one that outputs g(t) which analyzes the inputs and the previous short-term states. It is partially stored in the c(t). iii) The input, forget and output layers are known as “gate controllers”. They make use of the logistic activation function, where their output values are between 0 and 1, “0” means closed gate and “1” opened gate. The forget gate f(t) manages which memories should be dropped from the long-term state. The input gate i(t) controls the resultant flow of g(t) that 2.4 ADVANCED FEATURES 32 should be partially added to the long-term state. Finally, at a specific time step, the output gate o(t) determines which memories of the long-term state should be resulting. 
The equations to compute the whole LSTM process are:

$$i_{(t)} = \sigma(W_{xi}^{T} \cdot x_{(t)} + W_{hi}^{T} \cdot h_{(t-1)} + b_i) \qquad (15)$$

$$f_{(t)} = \sigma(W_{xf}^{T} \cdot x_{(t)} + W_{hf}^{T} \cdot h_{(t-1)} + b_f) \qquad (16)$$

$$o_{(t)} = \sigma(W_{xo}^{T} \cdot x_{(t)} + W_{ho}^{T} \cdot h_{(t-1)} + b_o) \qquad (17)$$

$$g_{(t)} = \tanh(W_{xg}^{T} \cdot x_{(t)} + W_{hg}^{T} \cdot h_{(t-1)} + b_g) \qquad (18)$$

$$c_{(t)} = f_{(t)} \otimes c_{(t-1)} + i_{(t)} \otimes g_{(t)} \qquad (19)$$

$$y_{(t)} = h_{(t)} = o_{(t)} \otimes \tanh(c_{(t)}) \qquad (20)$$

Where:
$W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$: weight matrices for the connections to the input vector $x_{(t)}$.
$W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$: weight matrices for the connections to the previous short-term state $h_{(t-1)}$.
$b_i$, $b_f$, $b_o$, $b_g$: bias terms for each of the four layers.

For more details about LSTM and its architecture, (HOCHREITER; SCHMIDHUBER, 1997; GRAVES, 2012; GÉRON, 2017; KALCHBRENNER; DANIHELKA; GRAVES, 2015) are suggested.

2.4 ADVANCED FEATURES

2.4.1 FEATURE EXTRACTION

In supervised learning, which is the type of learning used in this research, the samples are organized in feature/value pairs called predictors and one or more separate features, called targets or classes, which will be predicted based on the remaining features. Nevertheless, the source data are originally not structured like this; that is why a feature extraction process is needed to obtain the most useful features, whose values can be integer, float, string, categorical, etc., and to transform them into a format suitable for the learning process, for example by converting categorical features into numerical values with the One-Hot Encoding (OHE) method, where one attribute is set to "1" (hot) while the others are "0" (cold). Consequently, the training time is reduced (GARRETA; MONCECCHI, 2013).

In this research, the StandardScaler method from the Scikit-learn library (PEDREGOSA et al., 2011) and one-hot encoding from the Pandas library (MCKINNEY, 2010) are used to convert features.

• Standard scaler. It is a method that standardizes features by rescaling the distribution of values to zero mean and unit variance. The standardized outcome for each sample x is:

$$z = \frac{x - u}{s} \qquad (21)$$

Where:
u: mean of the training samples; when the parameter with_mean=False, u = 0.
s: standard deviation of the training samples; when the parameter with_std=False, s = 1.

After centering and scaling, the mean and standard deviation are stored so they can be applied to later data with the transform method. In Figure 6, as an example, a data set for identifying the type of iris plant is considered, using sepal length, sepal width, petal length and petal width as features and the type of the plant as the target. The initial set of features is transformed using the standard scaler method, resulting in the second data set.

Figure 6 - Standard scaler representation. Source: Developed by the author

• One hot encoder. It converts categorical variables into indicator features. For instance, the categorical variable "country" with the values ['Peru', 'Brazil', 'France'] can be encoded into a binary vector with 3 positions, having the value "1" in the position of the country that is present and "0" in the remaining positions. For a better visualization, see Figure 7.

Figure 7 - One hot encoder representation. Source: Developed by the author
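As a minimal sketch of these two transformations with Scikit-learn and Pandas, the snippet below standardizes two numerical columns and one-hot encodes a categorical one. The column names and values are hypothetical examples, not the actual features of the data set described in Chapter 3.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample: two numerical features and one categorical feature.
data = pd.DataFrame({
    "temperature": [2.5, 7.0, 11.3],
    "wind_speed": [10.0, 3.2, 5.5],
    "country": ["Peru", "Brazil", "France"],
})

# Standardize the numerical columns to zero mean and unit variance.
scaler = StandardScaler()
numeric = scaler.fit_transform(data[["temperature", "wind_speed"]])

# One-hot encode the categorical column with Pandas.
encoded = pd.get_dummies(data["country"], prefix="country")

print(numeric)
print(encoded)
```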
2.4.2 FEATURE SELECTION

Usually, it is tempting to use every available feature in the learning data set, since it is believed that using as much information as possible would build a better model. However, there are two main reasons why the number of considered features should be restricted. First, unimportant features may establish correlations between features and the target that arise just by chance and do not correctly model the problem; this may lead to poor generalization or add redundant information. Second, a large number of features can enormously increase the computation time without any improvement (GARRETA; MONCECCHI, 2013). As a result, working with a smaller set of features may lead to better results.

For this reason, an algorithmic way to discover the best features is required. Thus, in this work, the “Mutual Information Regression” and “Principal Component Analysis” (PCA) methods from the open source library Scikit-learn (PEDREGOSA et al., 2011) are used to select features. Both of them are evaluated and compared in order to find which method best captures the necessary features to be used as inputs for the models developed with XGBoost, MLP and LSTM.

• Mutual information regression. It is a non-parametric method that evaluates the mutual dependence between two variables: when the two variables are independent the output is zero, and higher outputs mean higher dependency (PEDREGOSA et al., 2011). The method is based on information theory, where the Mutual Information (MI) is the amount of uncertainty about a target variable that is removed by knowing a random variable, i.e., the MI is the amount of shared information, usually measured in units called bits (SHANNON, 1948). However, the way MI is calculated depends on the kind of variables, i.e., whether they are discrete or continuous. Continuous values are more difficult to deal with because they are sparsely sampled. In the special case where one variable is discrete and the other is continuous, which is the case of the data set used in this work, the method described in (ROSS, 2014), based on k-nearest neighbors, is used to overcome this problem. The approach computes the mutual information I(X,Y) as the average of Ii, where Ii is the Jensen-Shannon divergence (GROSSE et al., 2002) applied to each data point i. The data point i belongs to a list of pairs (x, y), where x ∈ X and y ∈ Y, considering X as the discrete variable and Y as the real-valued variable. Initially, the kth-nearest neighbor to the data point i is determined among the Nxi points, where the parameter k is a fixed integer and Nxi is the number of data points whose value of the discrete variable equals xi. Then, the total number of neighbors mi is counted, including those whose value of the discrete variable does not equal xi but that lie within the distance to the kth-nearest neighbor. N is the total number of data points and ψ(·) is the digamma function taken from (ABRAMOWITZ, 1974).

Ii = ψ(N) − ψ(Nxi) + ψ(k) − ψ(mi) (22)

I(X,Y) = ⟨Ii⟩ = ψ(N) − ⟨ψ(Nx)⟩ + ψ(k) − ⟨ψ(m)⟩ (23)

In Figure 8, the first column of the iris plant data set is extracted to illustrate how the k-nearest neighbors are taken in order to sample the distributions. Then, equation (23) is applied and the resulting mutual information is shown in the third part of the picture, where each iris category is represented by an integer number for the computation. For more details about the mathematical analysis of the procedure, see (ROSS, 2014). A usage sketch of this selection method with Scikit-learn is given below.

Figure 8 - Procedure for estimating the MI
Source: Developed by the author
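As an illustration of how this score can be obtained with Scikit-learn, the sketch below uses a small synthetic data set; the rescaling to [0, 1] and the 0.01 threshold mirror the procedure adopted later in this work, but the data, the feature count and the number of neighbors are invented for the example.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic example: 3 candidate features, only the first two carry information about y.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + 0.1 * rng.normal(size=500)

# Scores are non-negative; features independent of y score close to zero.
scores = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
scores = scores / scores.max()           # rescale to [0, 1], as done later in this work
selected = np.where(scores > 0.01)[0]    # keep features above the 0.01 threshold
print(np.round(scores, 3), selected)
```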
• Principal component analysis. It is a data compression technique. It makes use of linear algebra, through the Singular Value Decomposition (SVD), to find the orthogonal directions of greatest variance, i.e., it discovers combinations of the data that retain the most information. SVD decomposes any matrix A into three pieces, U, Σ and V^t, where the matrix U contains the eigenvectors giving the most meaningful directions of the data and the matrix Σ describes the variance in the data. This is represented in equation (24). In the Scikit-learn library (PEDREGOSA et al., 2011), the attribute n_components_ is the number of components selected and the attribute explained_variance_ratio_ represents the percentage of variance explained by each selected component.

A = U Σ V^t (24)

Where:
U: left singular vectors (orthogonal matrix).
V^t: right singular vectors (orthogonal matrix).
Σ: singular values (diagonal matrix).

Thus, the data dimension can be reduced by generating transformed variables, called components, from the original ones. As an example, and continuing with the iris plant data set, the second data set in Figure 9 shows that 90% of the variance is concentrated in two principal components, which can be taken as a new data set. For more details about Singular Value Decomposition, see (STRANG, 2016), and about using the PCA technique, see (PEDREGOSA et al., 2011; GÉRON, 2017).

Figure 9 - Transformation with PCA
Source: Developed by the author

2.4.3 MODEL SELECTION

Selecting the model parameters, known as hyperparameters, is another important step during training. In this research, the GridSearchCV method from Scikit-learn is used to select the best XGBoost model, while for the MLP and LSTM neural networks a random search is implemented. A brief usage sketch of both search strategies is given after the following descriptions.

• Grid search cross-validation. It is a method that performs an exhaustive search, through cross-validation, over specified parameter values for an estimator. This means that the classifier or estimator is trained for each combination of values and a cross-validation score is obtained for each one; the results can then be compared and the best parameters identified (PEDREGOSA et al., 2011; GARRETA; MONCECCHI, 2013). Commonly, it is used with three or fewer hyperparameters, since its computational cost grows exponentially as the number of hyperparameters increases (GOODFELLOW; BENGIO; COURVILLE, 2016).

• Random search. Another method to optimize hyperparameters is random search, which samples each hyperparameter from a marginal distribution, for example a discrete distribution for binary or categorical hyperparameters, or a uniform distribution on a log scale for positive real-valued ones. The main reason why random search converges to good values much faster than grid search is that it does not waste experimental executions: sampled configurations usually differ in the value of every hyperparameter, unlike the repeated values explored by a grid (GOODFELLOW; BENGIO; COURVILLE, 2016).
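The sketch below illustrates both strategies with Scikit-learn on a synthetic regression problem. The estimator (a GradientBoostingRegressor used as a stand-in) and the parameter ranges are illustrative only; they are not the grids and distributions actually used in Chapter 4.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=300)

est = GradientBoostingRegressor(random_state=0)

# Exhaustive search: every combination of the listed values is cross-validated.
grid = GridSearchCV(est, param_grid={"max_depth": [3, 5], "learning_rate": [0.05, 0.1]}, cv=3)
grid.fit(X, y)

# Random search: a fixed number of configurations is sampled from distributions.
rand = RandomizedSearchCV(
    est,
    param_distributions={"max_depth": [3, 4, 5, 6], "learning_rate": loguniform(1e-3, 1e-1)},
    n_iter=10, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```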
3 DATA PREPROCESSING

3.1 DATA COLLECTION

The departmental fire and rescue service, SDIS 25, in the Doubs region of France, has provided a set of information collected from 2012 to 2017 containing 195 628 interventions. This data file contains: the identifier code of an intervention, the date and time of the start and end of an intervention, the date and time at which the first engine arrives, the community where the intervention happens, the classification of the incident, the response time and the duration of the intervention.

From the list of interventions gathered, the date and time were extracted in order to detect tendencies correlated with these parameters. For instance, the number of car accidents increases on Saturday night because young people tend to drink more alcohol during this period of time. Other factors considered in the occurrence of incidents are the weather conditions, which significantly affect the number of road accidents, fires and casualties; traffic hours; the height of the main rivers in Doubs; epidemiological data; academic vacations; holidays; and moonrise, moonset and moon phase. All of these are used to predict the number of interventions that will occur in the next hour. Therefore, a dictionary was first created gathering the data extracted from the fire department together with supplementary data imported from other sources. The process is explained as follows:

• The dictionary is initialized with keys ranging from “01/01/2012 00:00:00” until “31/12/2017 23:00:00” in the form “YYYYMMJJhhmmss”. The keys are generated by blocks of one hour.

• The following weather-related data reported by “Meteo France” (FRANCE, 2019) were imported from three stations located in Dijon-Longvic, Bale-Mulhouse and Nancy-Ochey: temperature, pressure, pressure variation over one hour, barometric trend type, total cloudiness, humidity, dew point, precipitation in the last hour, precipitation in the last three hours, average wind speed over ten minutes, average wind direction over ten minutes, gusts over a period, horizontal visibility and, finally, the present weather conditions. However, the data were not complete and some fields were missing; for this reason, linear interpolation was applied to fill the gaps. Finally, the meteorological data were added to the dictionary.

• Various temporal information was introduced: hour, day, day of the week, day of the year, month and year.

• The height of twelve rivers, the most representative of the Doubs department, was considered. The data reported by “Hydro” (HYDRO, 2015) come from the stations: L’Allan à Courcelles-lès-Montbéliard, Le Doubs à Voujeaucourt, Le Doubs à Besançon, La Loue à Ornans, L’Ognon à Montessaux, L’Ognon à Bonnal, Le Dessoubre à Saint-Hippolyte, Le Doubs à Mouthe, Le Doubs à Mathay, Le Drugeon à Rivière Drugeon, Le Gland à Meslières and La Loue à Vuillafans. The dictionary was filled with the average of the readings closest to the time of the block considered, the standard deviation of the variation of the water height in this block, the number of readings during this block, the maximum height reached during this block of time, and an alert flag equal to 1 (true) if the height of the river exceeds a limit established as a flood alert, or 0 (false) otherwise.

• From the list of interventions given by the fire brigade department, the interventions were organized according to their date of occurrence, grouping them by periods of one hour before adding them to the dictionary.

• Holidays were considered as a binary variable initialized with 0 (false) and set to 1 (true) for any one-hour block within an academic holiday period. The start and end of vacations were also contemplated as a binary variable, equal to 1 (true) for the days corresponding to the beginning and end of holiday periods and 0 (false) otherwise.

• Public holidays were added with values 1 or 0, for true or false respectively, as well as a second key that is set to 1 on the days before public holidays, for the hours ranging from 3:00 pm to 11:00 pm (otherwise 0).
• Information related to “Bison Futé” (FUTÉ, 2015) was included. It is a system put in place in France to communicate to motorists the recommendations of the public authorities regarding traffic, traffic jams, bad weather, accidents, advice, etc. It classifies the days at risk according to several colors: green = fluid traffic, orange = dense traffic, red = difficult traffic, black = to be avoided because of traffic jams and slow traffic. This information was integrated through two additional keys, related to departures and returns, whose values are 0, 1, 2 or 3 depending on whether the traffic forecast corresponds to green, orange, red or black.

• Weekly epidemiological information related to the incidence of chickenpox, influenza and acute diarrhea, collected from the “Sentinelles” network (SENTINELLES, 2007), was incorporated and organized by each given hour.

• Finally, a boolean variable was added to the dictionary to indicate whether it is day (0) or night (1) for each given hour. Moreover, another boolean variable was added to indicate whether the moon had already risen, determined considering the given hour plus thirty minutes, together with its phase (an integer from 0 to 7, namely 0 for new moon, 2 for the first quarter, 4 for the full moon and 6 for the last quarter).

Table 2 shows what the dictionary looks like. It contains the information extracted from the list of interventions (number of interventions, hour, day, etc.) and the information imported from external sources (meteorological data, river heights, ephemeris, traffic, diseases, vacations, etc.). Each line represents a block of one hour, so over the period of six years each day contributes twenty-four lines, one per hour.

Table 2 - Illustrated example of the dictionary

ID    | year | startEndHolidays | ... | windDirectionBasilea | humidityDijon | temperatureNancy | nbInterventions
0     | 2012 | 0                | ... | 240                  | 97            | 283.35           | 7
1     | 2012 | 1                | ... | 216                  | 96            | 283.38           | 10
2     | 2012 | 0                | ... | 193                  | 95            | 283.45           | 9
...   | ...  | ...              | ... | ...                  | ...           | ...              | ...
52560 | 2017 | 0                | ... | 200                  | 96            | 277.45           | 10

Source: Developed by the author
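As an illustration of the construction described above, the following sketch builds a small hourly-indexed table with Pandas, fills a hypothetical weather column by linear interpolation and counts interventions per one-hour block; all column names and values are invented for the example and are not the actual keys of the dictionary.

```python
import numpy as np
import pandas as pd

# Hourly keys from 01/01/2012 00:00 to 31/12/2017 23:00, one row per one-hour block.
hours = pd.date_range("2012-01-01 00:00", "2017-12-31 23:00", freq="H")
table = pd.DataFrame(index=hours)

# Hypothetical weather series with gaps, filled by linear interpolation as in the text.
rng = np.random.default_rng(3)
temperature = pd.Series(rng.normal(283.0, 5.0, len(hours)), index=hours)
temperature[rng.random(len(hours)) < 0.05] = np.nan   # simulate missing readings
table["temperatureDijon"] = temperature.interpolate(method="linear")

# Temporal variables derived from the key itself.
table["hour"] = table.index.hour
table["dayOfWeek"] = table.index.dayofweek

# Hypothetical intervention timestamps, grouped by one-hour blocks.
interventions = pd.to_datetime(["2012-01-01 00:15", "2012-01-01 00:40", "2012-01-01 02:05"])
counts = pd.Series(1, index=interventions).resample("H").sum()
table["nbInterventions"] = counts.reindex(hours, fill_value=0)
print(table.head())
```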
3.2 DATA CLEANING

This subsection explains how outliers, which can negatively affect the final results, were detected and removed. First, the number of interventions per year was analyzed in Figure 10, where a large increase over the six years can be observed, possibly related to population ageing and growth. Then, the mean number of interventions per hour was calculated: the average is 3.59 interventions/h. Moreover, the minimum number of interventions per hour is 0, while the maximum is 85. Finally, in 75% of cases the number of interventions is less than 5.

Figure 10 - Number of interventions per year
Source: Developed by the author

Figure 11 - Frequency of number of interventions per hour
Source: Developed by the author

Figure 12 - Outliers in the occurrence of interventions
Source: Developed by the author

Leap years have an impact on the day-of-the-year variable. For instance, June 21st (Music Day in France) or July 14th (the National Day of France) are not the same day of the year when February has 29 days. For this reason, February 29th of 2012 and 2016 were removed.

On the other hand, looking at Figure 11 and, in more detail, Figure 12, there seem to be some very particular situations that generated a large number of interventions (more than 80). These events can affect the learning phase; therefore, they were analyzed in more detail to decide whether or not they should be discarded. The hours, with IDs ranging from 0 to 52 559, were sorted in descending order according to their corresponding number of interventions, and it was noticed that the 7 IDs with the highest number of interventions correspond to the same period of time.

The ID number 39243 has the maximum number of interventions, that is 85, as illustrated in Figure 13. The year, month, day and hour of these 7 IDs were listed, and they all belong to the night of the 24th to the 25th of June 2016. In the list of interventions given by the fire department, the following main causes were noted for these days: exhaustion, floods, protection of miscellaneous property, and accidents. It was found that there were very violent storms that night, which were recognized as a natural disaster in the region of Doubs. Therefore, it is necessary to evaluate whether these outliers should be considered a consequence of exceptional weather and be artificially smoothed, or whether they should be kept so that they can be predicted with the weather variables. In what follows, a forecasting analysis of these peaks is made with meteorological data from Basilea, Dijon and Nancy.

Figure 13 - Maximum number of interventions
Source: Developed by the author

An interval of one hundred hours (approximately four days) centered around this storm is taken. When checking the precipitation in the last hour (Figure 14), a rainfall peak of almost four millimeters can be observed, far from the eighty millimeters mentioned in the article from the “Est Republicain” (REPUBLICAIN, 2016) shown in Figure 15. If the IDs of the intervention peaks are compared with the IDs of the maximum precipitations recorded by the weather station in Basilea (Figure 16), they do not appear to be unusual values. Although Basilea is closer to the storm location (between Sancey and L’Isle-sur-le-Doubs, Figure 17) than Dijon or Nancy, it is possible that its precipitation values are not very representative.

Nevertheless, it is possible that a large amount of water fell over a relatively long time. Therefore, Figure 18 shows the precipitation over a period of three hours. A thunderstorm peak appears clearly, and the twenty millimeters dropped are closer to the eighty millimeters mentioned above. It was then checked whether such a quantity, twenty millimeters, is frequent, and it turns out to be the fifth highest three-hour rainfall recorded from 2012 to 2016.

These precipitation data are very important for the prediction model, but they do not allow the prediction of the extreme situation of June 25th, 2016, because the weather measurements are not sufficiently localized. Another variable analyzed was the wind speed; however, the maximum value in the defined time interval (Figure 19) is below 15.9 m/s, which is the maximum wind speed recorded from 2012 to 2017. With the weather data considered at the moment, forecasting such a peak of interventions looks complicated. This kind of situation is very exceptional and has a big impact on the data studied; for instance, the number of incidents in June might be considerably overestimated. For this reason, the chosen option is to artificially smooth the data on this date by assigning the number of interventions observed at the same time on the following day, as sketched below.
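A minimal Pandas sketch of this smoothing step is shown below, assuming an hourly intervention series; the dates correspond to the storm night discussed above, but the values and the exact window are illustrative.

```python
import numpy as np
import pandas as pd

# Hourly series with a hypothetical intervention count.
hours = pd.date_range("2016-06-23 00:00", "2016-06-27 23:00", freq="H")
rng = np.random.default_rng(4)
nb = pd.Series(rng.poisson(3.6, len(hours)), index=hours, name="nbInterventions")
nb["2016-06-24 22:00":"2016-06-25 03:00"] += 60   # simulate the storm peak

# Smooth the outlier hours by copying the counts observed 24 hours later.
outlier_hours = nb["2016-06-24 22:00":"2016-06-25 03:00"].index
nb.loc[outlier_hours] = nb.loc[outlier_hours + pd.Timedelta(hours=24)].to_numpy()
print(nb["2016-06-24 20:00":"2016-06-25 05:00"])
```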
Figure 14 - Precipitation each 1h
Source: Developed by the author

Figure 15 - L’Est Republicain reports on the storms
Source: (REPUBLICAIN, 2016)

Figure 16 - Precipitation peaks recorded at the Basilea station
Source: Developed by the author

Figure 17 - Storm area
Source: Developed by the author

Figure 18 - Precipitation each 3h
Source: Developed by the author

Figure 19 - Ten minutes average wind speed
Source: Developed by the author

3.3 VARIABLES ENCODING

First, the data were divided into two sets: learning (years 2012-2016) and testing (year 2017). The StandardScaler method was used to transform the numerical variables: the mean and the standard deviation were computed on the learning set, and the standardization by centering and scaling was then applied to the complete data set, i.e., to both the learning and testing sets. This is done in order to capture, during the training process, changes in the data through time. The numerical variables considered were: year, hour, wind direction, humidity, nebulosity, moon phase, dew point, precipitation, gusts, temperature, visibility, wind speed, chickenpox statistics, influenza statistics, acute diarrhea statistics, and the river height variables except for the alert variable. More details are presented in Table 3.

In the case of the categorical variables, they were encoded with the OneHotEncoder method. OHE was fitted to the complete data set in order to recognize all the categories of each variable, and the data were then transformed with the identified pattern. The categorical variables considered were: the Bison Futé variables, time variables such as day, day of the week, day of the year and month, the holiday indicator, the night indicator, the barometric trend, and the river height alert variables. These variables are shown in Table 4.

Finally, the target feature “nbInterventions” was not standardized, since, after several preliminary tests, better results were obtained with the original values. In total, 831 features were extracted. A sketch of this encoding procedure is given below.
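The sketch below outlines this encoding procedure, assuming the hourly dictionary is available as a Pandas DataFrame; the column names passed to the function are illustrative and cover only a small subset of the real variables.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def encode(data: pd.DataFrame, numeric_cols, categorical_cols):
    """Standardize numeric columns and one-hot encode categorical ones as described in the text."""
    learning = data[data["annee"] <= 2016]          # learning set: years 2012-2016

    # Scaler statistics are computed on the learning set only...
    scaler = StandardScaler().fit(learning[numeric_cols])
    # ...but the standardization is applied to the complete data set (learning + testing).
    scaled = pd.DataFrame(scaler.transform(data[numeric_cols]),
                          columns=numeric_cols, index=data.index)

    # One-hot encoding is fitted on the complete set so every category is represented.
    dummies = pd.get_dummies(data[categorical_cols].astype("category"))

    # The target is concatenated without standardization.
    return pd.concat([scaled, dummies, data[["nbInterventions"]]], axis=1)

# Illustrative call, assuming 'table' holds the hourly dictionary with these columns:
# encoded = encode(table, ["temperatureDijon", "humiditeDijon"], ["jourSemaine", "nuit"])
```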
Table 3 - Numerical variables Variable : Description annee : year varicelle_inc : number of registered incidents for chickenpox heure : hour varicelle_inc100 : number of registered incidents for chickenpox directionVentBale : wind direction in Basilea varicelle_inc100_low : number of registered incidents for chickenpox directionVentDijon : wind direction in Dijon varicelle_inc100_up : number of registered incidents for chickenpox directionVentNancy : wind direction in Nancy varicelle_inc_low : number of registered incidents for chickenpox humiditeBale : humidity in Basilea hLoueVuillafansMax : maximum records value of Loue river’s height at Vuillafans humiditeDijon : humidity in Dijon hAllanCourcellesMean : records average of Allan river’s height at Courcelles humiditeNancy : humidity in Nancy hDoubsBesanconMean : records average of Doubs river’s height at Besançon nebulositeBale : nebulosity in Basilea hDoubsVoujeaucourtMean : records average of Doubs river’s height at Voujeaucourt phaseLune : moon phase hLoueOrnansMean : records average of Loue river’s height at Ornans pointRoseeBale : dew point in Basilea hOgnonBonnalMean : records average of Ognon river’s height at Bonnal pointRoseeDijon : dew point in Dijon hOgnonMontessauxMean : records average of Ognon river’s height Montessaux pointRoseeNancy : dew point in Nancy hAllanCourcellesStd : records std of Allan river’s height at Courcelles precipitations1hBale : precipitation in the last 1h in Basilea hDoubsBesanconStd : records std of Doubs river’s height at Courcelles precipitations1hDijon : precipitation in the last 1h in Dijon hDoubsVoujeaucourtStd : records std of Doubs river’s height at Voujeaucourt precipitations1hNancy : precipitation in the last 1h 1h in Nancy hLoueOrnansStd : records std of Loue river’s height at Ornans precipitations3hBale : precipitation in the last 3h in Basilea hOgnonBonnalStd : records std of Ognon river’s height at Bonnal precipitations3hDijon : precipitation in the last 3h in Dijon hOgnonMontessauxStd : records std of Ognon river’s height at Montessaux precipitations3hNancy : precipitation in the last 3h in Nancy hAllanCourcellesNb : records number of Allan river’s height at Courcelles pressionBale : pressure in Basilea hDoubsBesanconNb : records number of Doubs river’s height at Besancon pressionDijon : pressure in Dijon hDoubsVoujeaucourtNb : records number of Doubs river’s height at Voujeaucourt pressionNancy : pressure in Nancy hLoueOrnansNb : records number of Loue river’s height at Ornans pressionMerBale : pressure at sea level in Basilea hOgnonBonnalNb : records number of Ognon river’s height at Bonnal pressionMerDijon : pressure at sea level in Dijon hOgnonMontessauxNb : records number of Ognon river’s height at Montessaux pressionMerNancy : pressure at sea level in Nancy hAllanCourcellesMax : maximum records value of Allan river’s height at Courcelles pressionVar3hBale : pressure variation in the last 3h in Basilea hDoubsBesanconMax : maximum records value of Doubs river’s height at Besancon pressionVar3hDijon : pressure variation in the last 3h in Dijon hDoubsVoujeaucourtMax : maximum records value of Doubs river’s height at Voujeaucourt pressionVar3hNancy : pressure variation in the last 3h in Nancy hLoueOrnansMax : maximum records value of Loue river’s height at Ornans rafalesSur1perBale : gusts in Basilea hOgnonBonnalMax : maximum records value of Doubs river’s height at Besancon rafalesSur1perDijon : gusts in Dijon hOgnonMontessauxMax : maximum records value of Doubs river’s height at 
Besancon rafalesSur1perNancy : gusts in Nancy hDessoubreHippoMean : records average of Dessoubre river’s height at Saint-Hippolyte temperatureBale : temperature in Basilea hDoubsMathayMean : records average of Doubs river’s height at Mathay temperatureDijon : temperature in Dijon hDoubsMoutheMean : records average of Doubs river’s height at Mouthe temperatureNancy : temperature in Nancy hDrugeonRiviereMean : records average of Drugeon river’s height at La Rivière visibiliteBale : horizontal visibility in Basilea hGlandMeslieresMean : records average of Gland river’s height at Meslières visibiliteDijon : horizontal visibility in Dijon hLoueVuillafansMean : records average of Loue river’s height at Vuillafans visibiliteNancy : horizontal visibility in Nancy hDessoubreHippoStd : records std of Dessoubre river’s height at Saint-Hippolyte vitesseVentBale : 10 min average wind speed in Basilea hDoubsMathayStd : records std of Doubs river’s height at Mathay vitesseVentDijon : 10 min average wind speed in Dijon hDoubsMoutheStd : records std of Doubs river’s height at Mouthe vitesseVentNancy : 10 min average wind speed in Nancy hDrugeonRiviereStd : records std of Drugeon river’s height at La Rivière diarrhee_inc : number of registered incidents for diarrhea hGlandMeslieresStd : records std of Gland river’s height at Meslières diarrhee_inc100 : number of registered incidents for diarrhea hLoueVuillafansStd : records std of Loue river’s height at Vuillafans diarrhee_inc100_low : number of registered incidents for diarrhea hDessoubreHippoNb : records number of Dessoubre river’s height at Saint-Hippolyte diarrhee_inc100_up : number of registered incidents for diarrhea hDoubsMathayNb : records number of Doubs river’s height at Mathay diarrhee_inc_low : number of registered incidents for diarrhea hDoubsMoutheNb : records number of Doubs river’s height at Mouthe diarrhee_inc_up : number of registered incidents for diarrhea hDrugeonRiviereNb : records number of Drugeon river’s height at La Rivière grippe_inc : number of registered incidents for influenza hGlandMeslieresNb : records numberof Gland river’s height at Meslières grippe_inc100 : number of registered incidents for influenza hLoueVuillafansNb : records number of Loue river’s height at Vuillafans grippe_inc100_low : number of registered incidents for influenza hDessoubreHippoMax : maximum records value of Dessoubre river’s height at Saint-Hippolyte grippe_inc100_up : number of registered incidents for influenza hDoubsMathayMax : maximum records value of Doubs river’s height at Mathay grippe_inc_low : number of registered incidents for influenza hDoubsMoutheMax : maximum records value of Doubs river’s height at Mouthe grippe_inc_up : number of registered incidents for influenza hDrugeonRiviereMax : maximum records value of Drugeon river’s height at La Rivière varicelle_inc_up : number of registered incidents for chickenpox hGlandMeslieresMax : maximum records value of Gland river’s height at Meslières Source: Developed by the author 3.4 DATA SET DIMENSIONALITY REDUCTION The standardized data set used presents a high dimension: 831 features. For this reason, the methods: Mutual information regression and PCA were tested in a parallel way to evaluate and record the most relevant characteristics and use the regarded variables (outcomes from MI) or components (outcomes from PCA) to optimize the modelling process in time and memory without missing the integrity of the data. 
Naturally, it is necessary to bear in mind that features that look unimportant in isolation, i.e., when applying Mutual Information Regression or PCA, might be important in combination, i.e., during the modelling.

Table 4 - Categorical variables

Variable : Description
bisonFuteDepart : departure traffic level
bisonFuteRetour : return traffic level
debutFinVacances : start/end of vacations
ferie : public holidays
jourSemaine : day of the week
jour : day
jourAnnee : day of the year
mois : month
luneApparente : whether the moon has risen
nuit : day or night
tempsPresentBale : weather conditions in Basilea
tempsPresentDijon : weather conditions in Dijon
tempsPresentNancy : weather conditions in Nancy
tendanceBaromBale : barometric trend type in Basilea
tendanceBaromDijon : barometric trend type in Dijon
tendanceBaromNancy : barometric trend type in Nancy
vacances : vacations
veilleFerie : days before public holidays
hAllanCourcellesAle : warning for Allan river’s height at Courcelles
hDoubsBesanconAle : warning for Doubs river’s height at Besançon
hDoubsVoujeaucourtAle : warning for Doubs river’s height at Voujeaucourt
hLoueOrnansAle : warning for Loue river’s height at Ornans
hOgnonBonnalAle : warning for Ognon river’s height at Bonnal
hOgnonMontessauxAle : warning for Ognon river’s height at Montessaux
hDessoubreHippoAle : warning for Dessoubre river’s height at Saint-Hippolyte
hDoubsMathayAle : warning for Doubs river’s height at Mathay
hDoubsMoutheAle : warning for Doubs river’s height at Mouthe
hDrugeonRiviereAle : warning for Drugeon river’s height at Rivière
hGlandMeslieresAle : warning for Gland river’s height at Meslières
hLoueVuillafansAle : warning for Loue river’s height at Vuillafans

Source: Developed by the author

To obtain the Mutual Information Regression scores, the entire data set with the 831 extracted features was processed. The variables were divided into features and target, and 800 neighbors were taken into account to calculate the scores. The outcomes were rescaled between 0 and 1. Finally, a selection threshold of 0.01 was established, i.e., all features with scores higher than 0.01 are considered as inputs to the models developed later. Hence, 56 features plus the target were taken to train the models. The selected features with their respective scores are shown in Table 5.

PCA was used as an alternative method to reduce the dimension of the standardized data set. The model was fitted on the learning set, retaining 95% of the variance. The transformation was then applied to both data sets (learning and testing); applying this dimensionality reduction to the data set with the 831 extracted features resulted in 83 principal components plus the target. The division into learning and testing sets during the PCA fitting process corresponds to a real-world scenario, where initially the testing set would not exist and the model would be fitted with the learning set only; eventually, the future data (the testing set) would be transformed with the fitted model. In contrast, with Mutual Information Regression it is possible to use the complete data to quantify the information shared between variables (because samples are taken within the process only for the selection of variables), and there is no need to transform their values. As a summary of this part of the process, the differences between both methods are listed in Table 6, and a sketch of both reduction procedures is given below.
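The following sketch outlines both reduction procedures as described above, on synthetic matrices standing in for the learning and testing sets; the 0.01 threshold and the 95% retained variance follow the text, while the feature count and the neighbor count are reduced for the toy data (800 neighbors and 831 features in the real case).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(5)
X_learning = rng.normal(size=(1000, 20))   # stands in for years 2012-2016
X_testing = rng.normal(size=(200, 20))     # stands in for year 2017
X_all = np.vstack([X_learning, X_testing])
y_all = rng.poisson(np.exp(1.0 + 0.3 * X_all[:, 0]))   # toy hourly intervention counts

# Mutual Information Regression: scored on the complete data set, rescaled to [0, 1],
# then features kept if their score exceeds the 0.01 threshold.
scores = mutual_info_regression(X_all, y_all, n_neighbors=3, random_state=0)
scores = scores / scores.max()
X_mir = X_all[:, scores > 0.01]

# PCA: fitted on the learning set only, retaining 95% of the variance,
# then applied to both the learning and testing sets.
pca = PCA(n_components=0.95).fit(X_learning)
X_learning_pca = pca.transform(X_learning)
X_testing_pca = pca.transform(X_testing)
print(X_mir.shape[1], pca.n_components_)
```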
3.4 DATA SET DIMENSIONALITY REDUCTION 47 Table 5 - List of features after applying mutual information regression selection Feature : Mutual information regression score heure : 1.000000 luneApparente_1 : 0.027064 humiditeBale : 0.316402 nuit_1 : 0.026946 humiditeDijon : 0.230820 luneApparente_0 : 0.023712 humiditeNancy : 0.195237 pressionVar3hDijon : 0.021664 temperatureBale : 0.148263 ferie_0 : 0.018265 temperatureDijon : 0.144453 directionVentNancy : 0.018011 temperatureNancy : 0.122189 pointRoseeDijon : 0.014356 rafalesSur1perBale : 0.114104 tempsPresentNancy_0 : 0.014349 tempsPresentBale_0 : 0.091858 ferie_1 : 0.013768 rafalesSur1perNancy : 0.087239 hDrugeonRiviereMean : 0.013719 rafalesSur1perDijon : 0.080465 pointRoseeBale : 0.013154 veilleFerie_1 : 0.078794 tempsPresentDijon_0 : 0.013146 veilleFerie_0 : 0.077291 hDrugeonRiviereMax : 0.013036 vitesseVentBale : 0.066838 hDessoubreHippoMax : 0.012802 directionVentDijon : 0.064387 tempsPresentDijon_10 : 0.012493 visibiliteBale : 0.059371 tempsPresentBale_10 : 0.012339 vitesseVentNancy : 0.057049 precipitations3hBale : 0.012093 vitesseVentDijon : 0.047521 hDessoubreHippoMean : 0.011828 hDoubsBesanconNb : 0.046924 hDoubsBesanconMean : 0.011741 directionVentBale : 0.045515 hDoubsBesanconMax : 0.011203 visibiliteNancy : 0.042089 tendanceBaromBale_1 : 0.011116 visibiliteDijon : 0.039114 hOgnonBonnalMax : 0.010492 hDoubsBesanconStd : 0.036746 pointRoseeNancy : 0.010385 annee : 0.034604 hOgnonBonnalMean : 0.010360 tempsPresentBale_2 : 0.032631 pressionVar3hNancy : 0.010296 nebulositeBale : 0.032262 hDoubsMoutheMax : 0.010232 nuit_0 : 0.028023 tendanceBaromDijon_1 : 0.010222 pressionVar3hBale : 0.027969 tempsPresentNancy_10 : 0.010109 Source: Developed by the author Table 6 - Differences between MIR and PCA Mutual Information Regression Principal Component Analysis 1. It is a non-parametric method, i.e, it does not need for previ- ous standardization. However, in order to keep the same treat- ment of data to both methods, the standardization was made. 1. Standardization is needed to avoid emphasizing variables with higher variances than variables with low variances while searching for principal components. 2. Its base is the calculus of kth-neighbors and mutual infor- mation. 2. Its base is the calculus of SVD (linear algebra). 3. Relevant variables are choosen. 3. The variables and its values are transformed. 4. The new data set is reduced with the selected variables. 4. The new data set is made up of the new values (principal components). Source: Developed by the author 48 4 BUILDING PREDICTION MODELS After collecting, cleaning, encoding and selecting the features, data were split into three sets for the three kind of techniques: the training set (2011-2015), the validation set (2016) and the testing set (2017). The first and second sets are used to build a model that will predict the number of interventions for each one hour block since it is needed an immediate response to firefighters, and the last set is used to verify the accuracy of these predictions. LSTM works with time steps, in this way, it was used three, twelve and twenty-four hours as time steps with the goal of discovering which quantity of past hours give us better results. On the other hand, it was built a "similar" structure for XGBoost and MLP techniques which consists in defining a number of past interventions as new features for the next hour prediction in order to try to give an artificial memory to the models. 
The organization of the features for each technique is depicted in Figure 20. Figure 20 - Features structure before modelling Source: Developed by the author The programming language used was Python, the gradient boosting models and the artifi- cial neural networks were developed with XGBoost (CHEN; HE; KHOTILOVICH, 2016) and Keras (CHOLLET et al., 2015) libraries respectively. In order to run the codes, it was employed a GeForce GTX TITAN X, Intel® Xeon® CPU E5-2623 v4 @2.60GHz with an architecture 4.1 MODELLING WITH XGBOOST 49 x86_64. 4.1 MODELLING WITH XGBOOST In the training phase of the XGBoost model, the hyperparameters specified for the Grid- SearchCV were "max_depth": [3, 4, 5, 6, 7, 8], "learning_rate": [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08], "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1], "n_estimators": [100, 200, 300, 400] and "objective": "count:poisson". The "count:poisson" corresponds to poisson regression for count data and uses "poisson-nloglik" as an internal evaluation metric where the prediction will be better if the probability of occurrence is higher. After finishing the execution on the server, the hyperparameters of the best model were tuned by hand in order to enhance the result. The values used to model our data are specified in Table 7. In (CHEN; HE; KHOTILOVICH, 2016) can be found more details about the hyperparameters specifications. Table 7 - XGBoost hyperparameters settings Hyperparameters Description Value base_score The initial prediction score of all instances 0.5 booster It produces a tree based model tree based model colsample_bylevel Subsample ratio of columns for each level 1 colsample_bytree Subsample ratio of columns when constructing each tree [0.6, 1] gamma Minimum loss reduction required to make a partition on a leaf 0 learning_rate Step size shrinkage used in update [0.02, 0.08] max_delta_step Absolute regularization that restricts weights 0 max_depth Maximum depth of a tree [3, 8] min_child_weight Minimum sum of instance weight to continue splitting 1 n_estimators Number of trees [100, 400] n_jobs Number of parallel threads 1 objective Learning task and the corresponding learning objective count:poisson reg_alpha L1 regularization term on weights 0 reg_lambda L2 regularization term on weights 1 scale_pos_weight It controls the balance of weights 1 subsample Subsample ratio of the training instances 1 Source: Developed by the author 4.2 CONSTRUCTING A MULTI-LAYER PERCEPTRON During the training phase of the MLP models, a random search with 50 iterations was performed with a range of values detailed in Table 8. The optimizer used was "Adam", which is an extension to stochastic gradient descent, and the loss function was "Poisson". The maximum number of epochs is 5000, but the callback "early stopping", with a value of fifteen epochs, was used to end training if the validation set loss had stopped improving. 4.3 CONSTRUCTING A LONG SHORT-TERM MEMORY NEURAL NETWORK 50 Table 8 - MLP hyperparameters Settings Hyperparameter Values Neurons first layer [300, 400] Neurons second layer [420, 520] Neurons third layer [520, 620] Neurons fourth layer 1 Epochs 5000 Batch size [60, 80] Learning rate [0.0001, 0.008] 4.3 CONSTRUCTING A LONG SHORT-TERM MEMORY NEURAL NETWORK In order to model with LSTM, the hyperparameters were chosen within a range as described in Table 9, using "Adam" as optimizer and "Poisson" as a loss function. Fifty iterations were performed in a random search to discover which combination of hyperparameters produce the best model. 
The maximum number of epochs was 5000, but the "early stopping" method was set to fifteen epochs, i.e., if the loss on the validation set had not improved after fifteen epochs, the LSTM training stopped.

Table 9 - LSTM hyperparameters settings

Hyperparameter     | Values
Units first layer  | [3, 10]
Units second layer | [50, 70]
Units third layer  | [50, 100]
Units fourth layer | 1
Epochs             | 5000
Batch size         | [55, 160]
Learning rate      | [0.000001, 0.001]

Source: Developed by the author

4.4 METRICS

The metrics used to measure the performance of the models are described below:

• RMSE: The Root Mean Square Error calculates the standard deviation of the errors obtained during the predictions, highlighting the variance of the frequency distribution of errors by squaring them before they are averaged. A lower value is better (GÉRON, 2017).

RMSE = sqrt( (1/m) Σ_{i=1}^{m} (h(x^(i)) − y^(i))^2 ) (25)

• MAE: The Mean Absolute Error evaluates the average magnitude of the prediction errors, where each error contributes in proportion to its absolute value. A lower value is better (GÉRON, 2017).

MAE = (1/m) Σ_{i=1}^{m} | h(x^(i)) − y^(i) | (26)

Where, in both equations:
m is the number of instances.
x^(i) is the predictor vector of the ith instance.
h is the prediction function.
y^(i) is the label of the ith instance.

• ACC0E: The accuracy of the predictions with a margin of error of 0, i.e., the percentage of exact predictions.

ACC0E = (#predictions_margin_0 / #total_predictions) × 100% (27)

• ACC1E: The accuracy of the predictions with a margin of error equal to or less than 1.

ACC1E = (#predictions_margin_1 / #total_predictions) × 100% (28)

• ACC2E: The accuracy of the predictions with a margin of error equal to or less than 2.

ACC2E = (#predictions_margin_2 / #total_predictions) × 100% (29)

4.5 EXPERIMENTAL RESULTS AND DISCUSSION

In this section, the eighteen results are presented and analyzed. Tables 10, 11 and 12 present the results for the XGBoost, MLP and LSTM models with the two types of reduction techniques, respectively. Different numbers of past hours (three, twelve and twenty-four) were tested to discover which one improves the prediction of the number of interventions that firefighters will face in the next hour. The best performance is marked in bold considering the accuracy metrics ACC0E, ACC1E and ACC2E as the most representative in real life. In this context, the best XGBoost model has an MAE of 1.6787, corresponding to twenty-four past hours with the MIR reduction method, while for MLP and LSTM the MAE is 1.7325 with three past hours and 1.6960 with twenty-four past hours, respectively, both with the MIR reduction method. Ta