UNIVERSIDADE ESTADUAL PAULISTA “JÚLIO DE MESQUITA FILHO” Instituto de Ciência e Tecnologia de Sorocaba ADSON NOGUEIRA ALVES Control of an unmanned aerial vehicle (UAV) using deep reinforcement learning (DRL) approach Sorocaba - SP 2021 ADSON NOGUEIRA ALVES Control of an unmanned aerial vehicle (UAV) using deep reinforcement learning (DRL) approach Text presented to the Graduate Program in Electrical Engineering (PGEE) of the Insti- tute of Science and Technology of Sorocaba as part of the requirements for obtaining the title of Master in Electrical Engineering. This study was financed in part by the Coor- denação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001. Supervisor: Prof. Dr. Alexandre da Silva Simões Sorocaba - SP 2021 A474c Alves, Adson Nogueira Control of an unmanned aerial vehicle (UAV) using deep reinforcement learning (DRL) approach / Adson Nogueira Alves. -- Sorocaba, 2021 87 p. Dissertação (mestrado) - Universidade Estadual Paulista (Unesp), Instituto de Ciência e Tecnologia, Sorocaba Orientador: Alexandre da Silva Simões 1. Inteligência artificial. 2. Robot vision. 3. Redes neurais (Computação). 4. Sistemas embarcados (Computadores). 5. Drone aircraft. I. Título. Sistema de geração automática de fichas catalográficas da Unesp. Biblioteca do Instituto de Ciência e Tecnologia, Sorocaba. Dados fornecidos pelo autor(a). Essa ficha não pode ser modificada. UNIVERSIDADE ESTADUAL PAULISTA Câmpus de Sorocaba CERTIFICADO DE APROVAÇÃO TÍTULO DA DISSERTAÇÃO: Control of an unmanned aerial vehicle (UAV) using deep reinforcement learning (DRL) approach AUTOR: ADSON NOGUEIRA ALVES ORIENTADOR: ALEXANDRE DA SILVA SIMÕES Aprovado como parte das exigências para obtenção do Título de Mestre em ENGENHARIA ELÉTRICA, área: Automação pela Comissão Examinadora: Prof. Dr. ALEXANDRE DA SILVA SIMÕES (Participaçao Virtual) Departamento de Engenharia de Controle e Automação / Instituto de Ciência e Tecnologia - UNESP - Câmpus de Sorocaba Prof. Dr. PAULO FERNANDO FERREIRA ROSA (Participaçao Virtual) Seção de Ensino de Engenharia de Computação / Instituto Militar de Engenharia -IME Profª. Drª. MARILZA ANTUNES DE LEMOS (Participaçao Virtual) Departamento de Engenharia de Controle e Automação / Instituto de Ciência e Tecnologia / UNESP / Sorocaba Sorocaba, 16 de julho de 2021 Instituto de Ciência e Tecnologia - Câmpus de Sorocaba - Três de Março, 511, 18087180, Sorocaba - São Paulo http://www.sorocaba.unesp.br/#!/pos-graduacao/--engenharia-eletrica-local/CNPJ: 48031918003573. My advisor and other researchers for sharing knowledge Acknowledgements All my family, friends, teachers and employees of the Institute of Science and Technology of Sorocaba, who directly or indirectly contributed to the accomplishment of this work. In particular, I offer my thanks: • To my parents Adelaide and Nelson for their support; • To Prof. Dr. Alexandre da Silva Simões and Prof. Dra. Esther Luna Colombini, for all their teaching, encouragement, confidence and guidance; • To my friends and colleagues at the lab who directly or indirectly helped me. • The Virtual University of the State of São Paulo (UNIVESP) for the opportunity of teaching professional experience. • This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001. "Only don’t achieve the goal, who dream too far. Only don’t achieve the goal, who intends to take a very long step. 
Only don’t achieve the goal, who beliaves that things are easy, all things are hard, all things must be fight and when you get something easy, wary." Senor Abravanel - (Silvio Santos) Resumo Veículos aéreos não tripulados (VANT) têm sido alvo de crescente atenção nos últimos anos principalmente devido a sua amplitude de aplicação em atividades complexas e onerosas, como no setor de vigilância, agricultura, entretenimento, entre outros. Todo esse interesse do mercado e acadêmico colocou em evidência novos desafios que a plataforma enfrentará. Entre esses está a complexidade de navegação em ambientes desconhecidos que têm a presença de múltiplos agentes com dinâmica de movimento desconhecida. Novas técnicas de aprendizado têm sido propostas para essas e outras tarefas nos últimos anos. Particularmente, algoritmos livres de modelo baseados no processo de exploração e aprendizado autônomo têm obtido destaque nesse domínio, como é o caso do Aprendizado por Reforço (RL), que busca obter comportamentos apropriados para o robô através de uma abordagem baseada em tentativa e erro e mapeando estados de entrada diretamente para comandos nos atuadores. O presente trabalho busca investigar a navegação de VANTs utilizando um método off-policy, o Soft Actor-Critic (SAC), no contexto do Aprendizado Profundo (DL). A abordagem proposta utiliza informações visuais do ambiente e também de multiplos sensores embarcados, bem como o Autoencoder (AE) para reduzir a dimensionalidade das informações visuais coletadas no ambiente. O trabalho foi desenvolvido no ambiente de simulação CoppeliaSim utilizando Pyrep. Nesse cenário, o trabalho investigou a representação dos estados da aeronave e sua navegação em ambientes sem e com obstáculos, fixos e móveis. Os resultados mostram que a politica aprendida foi capaz de realizar o controle de baixo nível do VANT em todos os cenários analisados. As políticas aprendidas demonstraram boa capacidade de generalização, mesmo em ambientes complexos. Palavras-chave: Inteligência artificial. Aprendizado de máquina. Aprendizado por reforço. Visão computacional. Redes neurais artificiais. Sistemas embarcados. Drones. Abstract Unmanned Aerial Vehicles (UAV) have received increasing attention in recent years mainly due to their breadth of application in complex and costly activities, such as surveillance, agriculture, and entertainment. All of this market and academic interest has highlighted new challenges that the platform will confront. Among these challenges is the complexity of navigation in unknown environments where there is the presence of multiple agents with unknown movement dynamics. New learning techniques have been proposed for these and other tasks in recent years. Particularly, model-free algorithms based on the process of exploration and autonomous learning have been highlighted in this domain, like the Reinforcement Learning (RL), that seeks appropriate behavior for the robot through a trial and error approach and mapping input states to commands in actuators directly. The present work aims to investigate the navigation of UAVs using an off-policy method, the Soft Actor-Critic (SAC), in the Deep Learning (DL) context. The proposed approach employs visual information from the environment and multiple embedded sensors and the Autoencoder (AE) method to reduce the dimensionality of the visual data collected in the environment. This work was developed using the CoppeliaSim simulator and Pyrep. 
In this scenario, we investigated the aircraft state representation and the resulting navigation in environments with or without obstacles, fixed and mobile. The results showed that the learned policy was able to perform the low-level control of the UAV in all analyzed scenarios. The learned policies demonstrated good generalization capabilities, even in complex environments. Keywords: Artificial intelligence. Machine Learning. Computer vision. Artificial neural networks. Embedded systems. Drones. List of Figures Figure 2.1 – A simple mathematical model for a neuron. The unit’s output activation is aj = g( ∑n i=0 ωi,jai), where ai is the output activation of unit i and ωi,j is the weight on the link from unit i to this unit. Source: [1]. . . . . . . . . . . . . 21 Figure 2.2 – (a) Threshold Function; (b) Sigmoid function; (c) Hyperbolic Tangent func- tion; (d) Rectifier Transfer Function. Source: [1] [2]. . . . . . . . . . . . . . 22 Figure 2.3 – (a) Single layer network; (b) Multilayer network (Multilayer Perceptron - MLP). Source: [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Figure 2.4 – Sparse autoencoder structure. Source: [3]. . . . . . . . . . . . . . . . . . . 25 Figure 2.5 – Basic structure of a CNN. Source: [4]. . . . . . . . . . . . . . . . . . . . . 26 Figure 2.6 – An agent interacting with the environment. Source: [5]. . . . . . . . . . . . 29 Figure 2.7 – A simple deterministic world. Source: [5]. . . . . . . . . . . . . . . . . . . 30 Figure 2.8 – Partially Observable Environment. Source: [6]. . . . . . . . . . . . . . . . . 31 Figure 2.9 – The Actor-Critic setup. Source: [7]. . . . . . . . . . . . . . . . . . . . . . . 32 Figure 2.10–A multimodal Q-function. Extracted from: [8] . . . . . . . . . . . . . . . . 35 Figure 4.1 – Diagram of the proposed framework using SAC and the Autoencoder. . . . . 42 Figure 4.2 – Interfaces to Coppelia Simulator [9]. . . . . . . . . . . . . . . . . . . . . . 44 Figure 4.3 – Coppelia Simulator Default UAV - AR Parrot [10]. . . . . . . . . . . . . . . 45 Figure 4.4 – Structure and dynamics of the quadcopter body - Font:[11]. . . . . . . . . . 46 Figure 4.5 – CoppeliaSim Robotics Simulator - Scene empty. . . . . . . . . . . . . . . . 47 Figure 4.6 – CoppeliaSim Robotics Simulator - Scene free. . . . . . . . . . . . . . . . . 48 Figure 4.7 – CoppeliaSim Robotics Simulator - Scene with fixed obstacles. . . . . . . . . 48 Figure 4.8 – CoppeliaSim Robotics Simulator - Scene with dynamic obstacles. . . . . . . 49 Figure 4.9 – AE Learning Curve - Assessment 1. . . . . . . . . . . . . . . . . . . . . . 54 Figure 4.10–AE Learning Curve - Assessment 2. . . . . . . . . . . . . . . . . . . . . . 54 Figure 4.11–AE Train - Assessment 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Figure 4.12–AE Train - Assessment 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Figure 4.13–AE Train - Assessment 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Figure 4.14–AE Test - Assessment 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Figure 4.15–AE Test - Assessment 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Figure 4.16–AE Test - Assessment 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Figure 5.1 – Average Reward - Epoch 4.250 - Empty scenario . . . . . . . . . . . . . . . 60 Figure 5.2 – SC0 - Path chosen by the UAV - Epoch 4.250 - Empty environment . . . . . 60 Figure 5.3 – SC0 - Final State Accuracy - [x, y, z] axes . . . . . . . . . . . . . . . . . . 
61 Figure 5.4 – SC0 - Angular Velocity - [φ̇, θ̇, ψ̇] - Roll, Pitch, Yaw . . . . . . . . . . . . . 62 Figure 5.5 – SC1 - Path chosen by the UAV - Epoch 7,250 - Free environment . . . . . . 63 Figure 5.6 – SC1 - Cartesian plane - Path chosen by the UAV - Epoch 7.250 - Free Envi- ronment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Figure 5.7 – SC1 - Final State Accuracy - [x, y, z] axes . . . . . . . . . . . . . . . . . . 64 Figure 5.8 – SC1 - Angular Velocity - [φ̇, θ̇, ψ̇] - Roll, Pitch, Yaw . . . . . . . . . . . . . 65 Figure 5.9 – SC2 - Learning Evolution - Epoch 9.500 - Fixed Obstacle environment . . . 66 Figure 5.10–Average Reward Evolution with the State Change . . . . . . . . . . . . . . 67 Figure 5.11–SC2 - Cartesian Plane - Learning Evolution - Epoch 9.500 - Fixed Obstacle environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Figure 5.12–SC2 - Final State Accuracy - [x, y, z] axes . . . . . . . . . . . . . . . . . . 69 Figure 5.13–SC2 - Angular Velocity - [φ̇, θ̇, ψ̇] - Roll, Pitch, Yaw . . . . . . . . . . . . . 70 Figure 5.14–SC3 - Learning Evolution - Epoch 13,500 - Dynamic environment . . . . . 72 Figure 5.15–SC3 - Cartesian Plane - Learning Evolution - Epoch 13,500 - Dynamic envi- ronment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 5.16–SC3 - Final State Accuracy - [x, y, z] axes . . . . . . . . . . . . . . . . . . 74 Figure 5.17–SC3 - Angular Velocity - [φ̇, θ̇, ψ̇] - Roll, Pitch, Yaw . . . . . . . . . . . . . 75 List of Tables Table 3.1 – Applications in UAVs and the evolution of the adopted control techniques. . . 40 Table 4.1 – Representation of states. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Table 4.2 – Parameters - SAC Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 53 Table 4.3 – Parameters - Autoencoder Algorithm. . . . . . . . . . . . . . . . . . . . . . 57 Table 5.1 – Sequence of enabled states. . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Table 5.2 – Learning Evolution - Fixed Obstacle Environment. . . . . . . . . . . . . . . 66 Table 5.3 – Learning Evolution - Fixed Dynamic Environment. . . . . . . . . . . . . . . 
71 List of Abbreviations and Acronyms AI Artificial Intelligence UAV Unmanned Aerial Vehicle RF Radio Frequency SAR Search and Rescue ML Machine Learning DL Deep Learning DRL Deep Reinforcement Learning ANN Artificial Neural Network MLP Multilayer Perceptron CNN Convolutional Neural Network DNN Deep Neural Network DBN Deep Belief Network RBM Restricted Boltzmann Machine CDBN Convolutional Deep Belief Network FCN Fully Convolutional Network MSE Mean Square Error RNN Recurrent Neural Network RL Reinforcement Learning MDP Markov Decision Process POMDP Partially Observable Markov Decision Process SE State Estimator NFQ Neural-fitted Q DQN Deep Q-Network NAF Normalized Advantage Function GSP Guided Search Policy TRPO Trust Region Policy Optimization GAE Generalized Advantage Estimation DPG Deterministic Policy Gradient DDPG Deep Deterministic Policy Gradient A3C Asynchronous Advantage Actor-Critic PID Proportional-Integral-Derivative IMC Internal Model Control SLC Successive Loop Closure RLS Recursive Least Squares SVSF Smooth Variable Structure Filter KF Kalman Filter AFC Adaptive Filter Controller SMC Sliding Mode Control FL Feedback Linearization RGB Red-Green-Blue PPO Proximal Policy Optimization MTRL Multi-Task Regression-Based Learning ESDF Euclidean Signed Distance Field SLAM Simultaneous Localization and Mapping TLD-KCF Tracking Learning Detection - Kernelized Correlation Filter ARC Aspect Ratio Change GAK-Means Genetic Algorithm Based K-Means FANET Flying Ad-Hoc Networks AFRL Adaptive Federated Reinforcement Learning CTANS Centralized Task Allocation Network System GPS Global Positioning System GNSS Global Navigation Satellite System PWM Pulse-Width Modulation SAC Soft Actor-Critic List of Symbols wi,j Weight associated with the input ai of the neuron i. α Learning rate constant. ai Action i si State i ri Reward / Punishment to transition i η Learning factor (decreased at time) π Policy γ Discount rate H Entropy φ Roll θ Pitch ψ Yaw φ̇ Angular Velocity - Roll θ̇ Angular Velocity - Pitch ψ̇ Angular Velocity - Yaw εt Distance between the target position and the UAV base at time step t ξ Vector difference Contents 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.1 Objectives and contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.1.1 General objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.1.2 Specific objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.2 Text Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2 Theoretical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.1 Artificial Neural Networks (ANNs) . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2 Deep Neural Networks (DNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.1 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.2 Deep Convolutional Networks (DCNs) . . . . . . . . . . . . . . . . . . 25 2.3 Reinforcement Learning (RL) . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.1 Observable States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.2 Partially Observable States . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4 Deep Reinforcement Learning (DRL) . . . . . . . . . . . . . . . . . . . . . . 32 2.4.1 Deep Q-network (DQN) . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.2 Policy search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
33 2.4.3 Soft Actor-Critic (SAC) . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.1 Proposed Approach: overview . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Coppelia Simulator and Pyrep . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.5 Drone Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.6 Agents/Models/Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.6.1 Drone Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.6.2 Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.6.3 Representation of states . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.6.4 Reward function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.6.5 Episode completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.6.6 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.6.7 Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.6.8 Algorithm Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.1 Approaches Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2 SC0 - Empty Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.3 SC1 - Free Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.4 SC2 - Fixed Obstacles Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.5 SC3 - Dynamic Obstacles Environment . . . . . . . . . . . . . . . . . . . . . 71 6 Conclusions and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 APPENDIX A Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 17 1 Introduction " We share information, we create and pass on knowledge. That’s the means by which humans are able to adjust to new situations, and it’s what differentiates humans from our earlier ancestors, and our earlier ancestors from primates ". [12] The research of new technologies capable of improving people’s quality of life is inherent to human beings. Our ability to think creatively and to imagine novel solutions needed to survive threats proved to be a major asset [12] to humans. Thus, the human brain’s complexity is a great asset to the species. Within an increasingly technological world emerged the natural interest in transferring a certain degree of intelligence to machines. The Turing Machine [13] is a typical example of this interest. In this sense, Artificial Intelligence (AI) emerged as a new field in science and engineering, having more notoriety after World War II, earning that name around 1956 [1]. Among the many possible ways to define AI, Raymond Kurzweil said that AI is: "The art of creating machines that perform functions that require intelligence when performed by people" [14]. In this scenario, we can define Machine Learning (ML) as a subgroup of these intelligent systems that can improve with experience [5]. 
Machine learning techniques are used in various applications, such as medical diagnostics, fraud detection, stock market analysis, speech and writing recognition, strategy games, and robotics [15]. The use of machine learning techniques in robots, being more specific in unmanned aerial vehicles (UAV), is the main interest of this research. The interest in aerial robots has grown significantly in recent years. This notoriety has been growing due to the UAV application’s breadth, both in the research area and in daily activities, such as the delivery of goods, public and private security, pest monitoring and control, maintenance, monitoring, entertainment and others. In general lines, recent researches and development have focused on vehicle design [16] [17] [18], navigation and control [19] [20], safety [21] [22], risk assessment [23], telecommunication networks [24] [25] [26], multi-vehicle coordination or maintenance [27] [28] and cargo transportation [29] [30]. Currently, global distribution networks – such as Walmart – are investing in research and development of package delivery systems [29]. According to the patent itself, the method includes loading a product in an unmanned aerial vehicle (UAV), directing the UAV to the delivery location, and communicating with the consumer through a portable device. The product will only be delivered after feeling that the consumer is already in the receiving position and thus can lower the product, thus avoiding interception by third parties. The company has other patents that complement the structure of this project, such as a delivery tower for UAVs to enable the vehicle to land [31] [32] [33]. Other works and research in the area address some models of technologies Chapter 1. Introduction 18 that can be used in this type of application, such as the use of laser-guided UAVs [34]. The system would include a navigation system and a sensor that could detect a laser transmission emitted from the surface of a specified location, detecting the frequency and pulse of the laser transmission to identify who is the destination. Amazon, another giant in the distribution of electronic products, has also invested in delivery systems that use UAVs. The company recently filed a patent application that involves techniques applied to the delivery of packages after being released in flight by a UAV [30]. The goal is that the package can be launched vertically by a moving UAV. The package would also be monitored during the descent by the UAV itself, using radio frequency (RF), making it possible to change the descent path if necessary. The patent does not detail this adjustment. Other emerging applications of UAVs involve its use in road networks to assist in emer- gency care caused by road accidents [35]. One of the main proposals is to use the UAV and an emergency ground vehicle to alert vehicles ahead that the ground emergency vehicle is on the way, thus facilitating the vehicle’s access to the accident site. Network security was recently addressed since UAV communication is often based on wireless networks, and messages carry private information about people. Today there is no infallible way to protect UAVs from cyber attacks. Recent works [36] propose an additional encrypted communication channel as a mechanism to prevent external attacks. The use of UAVs to provide communication – for applications in areas with restricted or no communication – is another research focus today. 
The organization of UAVs in particular topologies could assist in disaster areas and in regions that are far away from a communication infrastructure [37]. The use of aerial vehicles in urban areas could help overcome interference generated by tall buildings or other devices, since the topology of the UAVs can be dynamically rearranged and the network can adapt to guarantee the best signal efficiency.

The UAV market is worth over $127 billion [38] [39]. Civil infrastructure is the most significant area, reaching $45 billion. Approximately 100,000 new jobs involving UAV activities are expected in the next years [40]. Business Intelligence expects sales of UAVs to reach $12 billion in 2021 [41]. Other civil applications of UAVs are [38]: search and rescue (SAR), remote sensing, construction and infrastructure inspection, precision agriculture, delivery of goods, real-time monitoring of road traffic, surveillance, and providing wireless coverage. In general, the key challenges found in these cases can be summarized as: charging, collision avoidance and swarming, networking, and security.

Regarding the control techniques that may allow these aircraft to perform all these tasks soon, the use of Machine Learning (ML) techniques is a growing tendency. Some of the new approaches use Deep Reinforcement Learning (DRL) [42] or density-based spatial clustering algorithms [43] in UAV optimization. An approach addressing swarming and collision avoidance based on the Deep Deterministic Policy Gradient (DDPG) is shown in [44]. Other recent works [45] [46] address networking and security based on ML techniques.

In chapter 3 we discuss these and other works related to UAV control in detail. Still, we note that there has been a trend towards using Deep Learning (DL) and Deep Reinforcement Learning (DRL) techniques in recent years, motivating a deeper investigation of both.

1.1 Objectives and contributions

To investigate the control of a UAV using DRL, this work has general and specific objectives, described in the following sections.

1.1.1 General objective

Our main goal is to investigate the possibility of learning a model-free navigation policy for UAVs using the Soft Actor-Critic (SAC) algorithm and visual information from the environment together with multiple embedded sensors. We use an Autoencoder (AE) to reduce the dimensionality of the visual data collected in the environment, investigating how the aircraft state representation affects navigation in environments with or without obstacles, fixed and mobile.

1.1.2 Specific objectives

To allow this investigation, this work proposes:

1. To investigate the representation of the states of UAVs in the DRL context, particularly focusing on state representations that can simultaneously carry visual and other sensor information;

2. To investigate the aircraft navigability in environments with or without obstacles, fixed or in motion.

This work aims to contribute to the generation of new autonomous navigation techniques for aircraft with applicability in unknown environments.

1.2 Text Organization

This work is structured as follows: in chapter 2 the theoretical foundation of artificial intelligence is presented, focusing on machine learning and reinforcement-based techniques. In chapter 3 a review of approaches adopted to control UAVs is presented.
Due to the breadth of applications with different purposes, this review is not restricted to flight control but covers distinct application scenarios. In chapter 4 the materials and methods proposed for this work are presented, including software and hardware. Results are presented in chapter 5. Finally, conclusions and future works are presented in chapter 6.

2 Theoretical Background

Artificial Intelligence (AI) is:

" The study of mental faculties through the use of computational models. " [47]

" The study of the computations that make it possible to perceive, reason, and act. " [48]

" The study of the design of intelligent agents. " [49]

" ...concerned with intelligent behavior in artifacts. " [50]

Some of the well-known definitions of Artificial Intelligence (AI) [1] group its approaches into four categories: Thinking Humanly, Acting Humanly, Thinking Rationally and Acting Rationally.

In general terms, we can understand Machine Learning (ML) as a subgroup of artificial intelligence that improves performance with experience [51]. We can also understand machine learning as a computer program that optimizes a performance criterion using sample data or previous experience [6]. The traditional approach to developing an algorithm is based on a system that receives input data and generates output data; when the output does not correspond to what is expected, it is necessary to reprogram the algorithm and hope that the new program works. In Machine Learning (ML), the paradigm shifts to a learning algorithm: given a batch of input data, the system selects the relevant features and uses them to train itself. In other words, given new input data, we expect the algorithm to produce the desired output.

It is possible to classify learning according to distinct criteria [1] [52]:

• Input/output relation: unsupervised learning, supervised learning, semi-supervised learning, reinforcement learning;

• Data/model relation: inductive learning and deductive learning;

• Nature of the algorithms: evolutionary learning, deep learning, deep reinforcement learning and so on.

Briefly, in unsupervised learning there is no teacher: the goal is to identify relations among the data, and the main idea is clustering. In supervised learning there is the figure of a teacher, i.e., the correct label is assigned to each training example. The proposal of semi-supervised learning is to improve the performance of the algorithm through the use of both labeled and unlabeled data. In reinforcement learning, the agent learns from reinforcements received from the environment, which can be rewards or punishments; an agent can be understood as anything that can be viewed as perceiving its environment and acting upon it [1], a concept discussed further below. Inductive learning and deductive learning refer to whether the system obtains or refines knowledge through specific information or data, or simply through logic, respectively. It is important to highlight that, in inductive learning, new data can modify the knowledge, whereas deductive knowledge is kept. Evolutionary learning is applicable to heuristic problems, that is, to problems that would not be easily solved using a polynomial-time approach. In deep learning (DL), the idea is to learn feature levels of increasing abstraction with minimum human contribution [53].
Finally, deep reinforcement learning (DRL), according to [7], can be defined as the use of deep learning algorithms within RL. DRL and DL will be covered in greater depth in this work. Along this path, some algorithm structures that are important for understanding DRL are presented next.

2.1 Artificial Neural Networks (ANNs)

Artificial Neural Networks (ANNs) remain one of the most popular and effective learning algorithms. The inspiration for the approach comes from the brain and its skills, such as information processing, vision, speech recognition, and learning. Understanding how the brain performs such functions would allow us to develop algorithms capable of performing these tasks on a computer [6].

ANNs can be understood as a collection of connected individual units, the neurons [1]. The properties of the network are determined by its topology and by the properties of the neurons. The most typical neuron in ANNs is the perceptron. The mathematical model of this neuron is shown in figure 2.1.

Figure 2.1 – A simple mathematical model for a neuron. The unit's output activation is $a_j = g(\sum_{i=0}^{n} \omega_{i,j} a_i)$, where $a_i$ is the output activation of unit $i$ and $\omega_{i,j}$ is the weight on the link from unit $i$ to this unit. Source: [1].

The value $w_{i,j}$ is the weight associated with the input $a_i$ of neuron $j$, and the set of weights $W$ is the free parameter that the learning algorithm must properly tune. Each unit $j$ computes a weighted sum $in_j$ of its inputs, as shown in equation 2.1:

$in_j = \sum_{i=0}^{n} w_{i,j} a_i.$ (2.1)

The activation function $g$ is applied to this weighted sum to generate the neuron output, as shown in equation 2.2:

$a_j = g(in_j) = g\left(\sum_{i=0}^{n} w_{i,j} a_i\right).$ (2.2)

The activation (or transfer) function [54] is responsible for generating the final output value of the neuron. The perceptron typically uses a function similar to the threshold function, and the most usual choices are the logistic (sigmoid) function and the hyperbolic tangent (tanh) function, both differentiable. The rectifier transfer function is also adopted in some cases. These functions are shown in figure 2.2.

Figure 2.2 – (a) Threshold Function; (b) Sigmoid function; (c) Hyperbolic Tangent function; (d) Rectifier Transfer Function. Source: [1] [2].

The connections among the processing units in a network can be made in two distinct ways: the feedforward network or the recurrent network. In feedforward networks, connections flow in a single direction (from the network input to the network output). In contrast, in recurrent networks, outputs typically feed back into the network inputs. We will employ feedforward networks in this work. These networks are organized in layers, and each unit receives stimuli only from the units that immediately precede it.

In a single-layer neural network all inputs are connected directly to the outputs. This network can process linearly separable functions such as AND and OR, but cannot learn a function that is not linearly separable, such as XOR. We can overcome this limitation by adding a layer between the input and output layers, called the hidden layer. This kind of network, known as the multilayer perceptron (MLP), can be used as a tool for nonlinear regression [6]. If we can calculate the derivatives of the output expressions with respect to the weights, it is possible to use gradient-descent loss minimization to train the network. In figure 2.3 we can see a single-layer network and a neural network with one hidden layer.
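To make equations 2.1 and 2.2 concrete, the short sketch below (in Python, the language also used for the simulation tooling of this work) computes the output of a single unit for each of the activation functions in figure 2.2. The input and weight values are arbitrary and chosen only for illustration.

import numpy as np

# Activation (transfer) functions from figure 2.2
def threshold(x):
    return np.where(x >= 0.0, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):                      # rectifier transfer function
    return np.maximum(0.0, x)

# Inputs a_i and weights w_{i,j} of a single unit j (arbitrary values).
# a_0 = 1 plays the role of the bias input, as in figure 2.1.
a = np.array([1.0, 0.5, -0.3, 0.8])
w = np.array([0.1, 0.4, 0.7, -0.2])

in_j = np.dot(w, a)               # weighted sum, equation 2.1
for g in (threshold, sigmoid, tanh, relu):
    print(g.__name__, g(in_j))    # unit output a_j = g(in_j), equation 2.2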
The network is not limited to just one hidden layer. There may be more hidden layers with their respective neurons and weights, each computing over the values of the previous hidden layer and thus implementing more complex functions. However, with a single hidden layer that is large enough [1], it is possible to represent any continuous function of the inputs with arbitrary precision, and with two hidden layers even discontinuous functions. Some works have shown that when the hidden layer contains many hidden units, it may be wise to add hidden layers, preferring "long and narrow" networks to "short and fat" networks [6].

Figure 2.3 – (a) Single layer network; (b) Multilayer network (Multilayer Perceptron - MLP). Source: [1].

Regarding learning in multilayer networks, the output vector of an MLP can be expressed in the form $[a_i, a_j]$. Similarly, a target vector can be written as $[y_i, y_j]$. The error found at $a_i$ and $a_j$ depends on all the weights of the input layer, so an update of these weights depends on the errors at $a_i$ and $a_j$. For the squared loss function $L_2$ and a weight $w$ we have equation 2.3:

$\dfrac{\partial}{\partial w} Loss(w) = \dfrac{\partial}{\partial w} \left[ y - h_w(x) \right]^2 = \dfrac{\partial}{\partial w} \sum_k (y_k - a_k)^2$ (2.3)

It is not simple to compute the error at the hidden nodes of the network, since we only know the expected values at the output layer. Fortunately, we can propagate the error of the output layer back to the hidden layers. This process, known as backpropagation [55] [56], emerges directly from a derivation of the general error gradient. The backpropagation algorithm can be summarized as [1]:

• Compute the error values for the output units, using the observed error;

• Starting with the output layer, repeat the following for each layer in the network until the earliest hidden layer is reached: propagate the error values back to the previous layer and update the weights between the two layers.

A small numerical sketch of this procedure is given below.
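The sketch trains a one-hidden-layer network on the XOR function, mentioned earlier as a non-linearly-separable case, by gradient descent on the squared loss of equation 2.3. It is a didactic example with arbitrary layer sizes and learning rate, not the network architecture used in this work.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def add_bias(A):                                           # append constant input a_0 = 1
    return np.hstack([A, np.ones((A.shape[0], 1))])

W1 = rng.normal(size=(3, 4))                               # 2 inputs + bias -> 4 hidden units
W2 = rng.normal(size=(5, 1))                               # 4 hidden units + bias -> 1 output
alpha = 0.5                                                # learning rate

for epoch in range(10000):
    Xb = add_bias(X)
    h = sigmoid(Xb @ W1)                                   # forward pass: hidden activations
    hb = add_bias(h)
    y = sigmoid(hb @ W2)                                   # network output h_w(x)
    delta_out = (y - Y) * y * (1 - y)                      # error at the output layer
    delta_hid = (delta_out @ W2[:-1].T) * h * (1 - h)      # error propagated back to hidden layer
    W2 -= alpha * hb.T @ delta_out                         # gradient-descent weight updates
    W1 -= alpha * Xb.T @ delta_hid

print(np.round(y, 2))                                      # approaches [[0], [1], [1], [0]]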
2.2 Deep Neural Networks (DNNs)

Extending this concept, Deep Neural Networks (DNNs) are networks with more hidden layers and many neurons in each layer, in contrast to a shallow neural network consisting of only one hidden layer. DNNs can therefore learn more complicated functions, and their abstraction power increases as the number of hidden layers grows. Deep learning methods are attractive because the algorithm can discover by itself all that is necessary, assuming that we have a considerable amount of data and enough computational power. The idea in Deep Learning is to learn features of increasing abstraction levels with minimal human contribution. This process improves the system dynamics, allowing the automatic discovery of features during training and, therefore, allowing the network to increase its abstraction power and to learn more general descriptions [53]. Most machine learning methods are described as discriminative, generative, or hybrid models [57].

According to [58], deep learning can usually be carried out as follows:

1. Construct a network consisting of an input layer and a hidden layer with the necessary nodes;

2. Train the network;

3. Add another hidden layer on top of the previously learned network to generate a new network;

4. Retrain the network;

5. Keep adding more layers and, after every addition, retrain the network.

We present next a summary of mainstream deep machine learning approaches.

2.2.1 Autoencoder

An autoencoder [59] is a neural network with the same number of input and output units, where the number of hidden units is smaller than the number of inputs/outputs. Its training process forces the output data to be equal to the input data, leading the hidden units to represent the input data in a code with a reduced number of dimensions. In this way, the first layer acts as an encoder stage of the input data, and the output layer acts as a decoder stage, reconstructing the original signal from its encoded representation [6]. An MLP with a large number of neurons is usually adopted to implement autoencoders. However, supervised learning is not adopted in this case: it is replaced by unsupervised learning, since the training process does not require labeled data. In [58] the structure of the learning algorithm is developed as follows, for each input x:

1. Do a feedforward pass to compute the activations at all hidden layers and at the output layer;

2. Find the deviation between the calculated values and the inputs using an appropriate error function;

3. Backpropagate the error to update the weights;

4. Repeat the task until the output is satisfactory.

Autoencoder networks are typically adopted in compression and dimensionality reduction tasks and have been particularly useful in image compression. Figure 2.4 shows an autoencoder structure.

Figure 2.4 – Sparse autoencoder structure. Source: [3].
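To illustrate the encoder and decoder stages of figure 2.4, the sketch below defines a small fully connected autoencoder and trains it to reconstruct its own input. PyTorch and the layer sizes are assumptions made only for illustration; the AE architecture and training setup actually used in this work are described in chapter 4.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Same number of inputs and outputs; a narrower hidden code in between."""
    def __init__(self, n_inputs=64 * 64, n_code=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 256), nn.ReLU(),
            nn.Linear(256, n_code),                 # compressed representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_code, 256), nn.ReLU(),
            nn.Linear(256, n_inputs), nn.Sigmoid(), # reconstruct the input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Unsupervised training: the target is the input itself (random data here).
batch = torch.rand(16, 64 * 64)
for step in range(100):
    recon = model(batch)
    loss = loss_fn(recon, batch)        # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

code = model.encoder(batch)             # low-dimensional code used downstream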
2.2.2 Deep Convolutional Networks (DCNs)

Convolutional Neural Networks (CNNs), sometimes also called Deep Convolutional Networks (DCNs), were designed for two-dimensional data, such as images and videos, and were the first genuinely successful robust Deep Learning technique [60]. DCNs work by abstracting small pieces of information and combining them deeply into the network. One way to understand this is to imagine that the first layer is responsible for identifying edges of the image, thus forming identification models. The following layers try to combine this information into simpler forms, eventually creating models that account for variations in the object's position, lighting, scale, etc. Thus the final layers match the input image against all previous models, and the final prediction is like a weighted sum of all of them [58].

In [4], the author states that pattern recognition by machine involves four primary stages: acquisition, pre-processing, feature extraction, and classification. Feature extraction is usually the most difficult problem to solve, but CNNs offer an adequate alternative, using large sample databases, called training sets. The challenge is to extract features automatically from a portion of the database to allow generalization to other similar images. Figure 2.5 shows the basic structure that is fundamental to all CNNs. One stage of a CNN is composed of three volumes: input maps, feature maps and pooled feature maps. The fundamental operation performed at each stage of a CNN is convolution, which justifies its name.

Figure 2.5 – Basic structure of a CNN. Source: [4].

Generally, volume convolution is performed in CNNs, and there is no change in the volume of the convolution kernel (or filter). It is important to observe in figure 2.5 that the depth of each kernel volume is equal to the depth of the input volume. Thus the volume convolution is simply the sum of the individual 2-D convolutions. Given a kernel volume, the convolution between this kernel and a map of specific features is just the sum of the products of the kernel weights and the map elements that coincide with them. The convolutional volume is obtained from the sum-of-products operation between each respective 2-D kernel, since each sum of products is a scalar; K denotes the depth of the input volume. Equation 2.4 represents the result of the volume convolution at coordinates $(x, y)$, where $w_i$ and $v_i$ are the kernel weights and the values of the corresponding elements, respectively. In figure 2.5, equation 2.4 represents the result at point A. Point B is obtained by adding a scalar bias $b$ to equation 2.4, resulting in $z_{x,y}$. Point C is obtained by applying a nonlinear activation function.

$conv_{x,y} = \sum_i w_i v_i$ (2.4)

The complete feature map, with all activation values, is also referred to as an activation map; each feature map has one kernel volume and one bias associated with it. The objective is to learn the weights of each of the kernel volumes and the biases from training data.

According to [4], a pooled map is simply a feature map of lower resolution, obtained by replacing the values of every neighborhood with the average of the values in that neighborhood. The consequence is a significant data reduction. Still, the disadvantage is that the map size also decreases significantly every time pooling is performed, which becomes a problem when the number of layers is large. Two other pooling methods are max pooling and L2 pooling: the first replaces the neighborhood by its maximum value and the second by the square root of the sum of its squared values.

Still according to [4], CNNs are generally structured in two ways: as a fully convolutional network (FCN) or for image classification. The major application of FCNs is image segmentation, i.e., each pixel of an input image is labeled. The FCN can be connected "end to end", allowing the maps to shrink first due to convolution and then, using an identical network, the reverse process to be performed. This allows the output image to be the same size as the input image, but with the pixels labeled and grouped into regions [61]. Image classification is the widest use of CNNs. In this case, the output maps are fed into a fully connected network to classify the input within several predefined classes; the interface between the CNN and this classification network converts the 2-D arrays into vectors.

The propagation of a pattern vector towards the output of the neural network is called feedforward, while the training of the network is done by feedforward and backpropagation, which is responsible for adjusting the weights and biases throughout the process. Performance can then be measured using an error or cost function. The most commonly used is the mean square error (MSE) between the current and the desired output. The MSE is described by equation 2.5, where $a_j(L)$ is the activation value of the $j$th neuron in the output layer.

$E = \dfrac{1}{2} \sum_{j=1}^{n_L} (r_j - a_j(L))^2.$ (2.5)

The training aims to adjust the weights and biases whenever a classification error is found, thus minimizing the error at the output. This is done using gradient descent for both, equations 2.6 and 2.7, where $\alpha$ is the learning rate constant.

$w_{ij}(l) = w_{ij}(l) - \alpha \dfrac{\partial E}{\partial w_{ij}(l)}.$ (2.6)

$b_i(l) = b_i(l) - \alpha \dfrac{\partial E}{\partial b_i(l)}.$ (2.7)
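The sketch below spells out the sum-of-products of equation 2.4 for a small input volume, adds the bias and nonlinearity of points B and C, and applies 2x2 average pooling. The array sizes are arbitrary and chosen only for illustration.

import numpy as np

def conv_volume(volume, kernel, bias=0.0):
    """Valid volume convolution: each output value is the sum of products
    between the kernel weights and the coinciding input elements (eq. 2.4),
    summed over the full depth, plus a scalar bias (point B in fig. 2.5)."""
    D, H, W = volume.shape
    d, h, w = kernel.shape
    assert d == D, "kernel depth must equal input depth"
    out = np.zeros((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(kernel * volume[:, y:y + h, x:x + w]) + bias
    return out

def avg_pool(feature_map, size=2):
    """Pooled map: replace each neighborhood by its average value."""
    H, W = feature_map.shape
    return feature_map[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size).mean(axis=(1, 3))

relu = lambda z: np.maximum(0.0, z)          # point C: nonlinear activation

volume = np.random.rand(3, 8, 8)             # depth-3 input volume (e.g. an RGB patch)
kernel = np.random.rand(3, 3, 3)             # kernel volume with the same depth
feature_map = relu(conv_volume(volume, kernel, bias=0.1))
pooled = avg_pool(feature_map)               # 6x6 feature map -> 3x3 pooled map
print(feature_map.shape, pooled.shape)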
2.3 Reinforcement Learning (RL)

In some applications, the system output is obtained only after a sequence of actions. In these cases, the important thing is not the immediate result but the policy adopted to achieve the objective through a sequence of correct actions. As it is complex to evaluate the best action in an intermediate state of the system, the machine learning algorithm must learn to evaluate the actions taken, leading to the choice of the best sequence of actions towards the final objective. Reinforcement learning (RL) can be defined as:

" In reinforcement learning, the learner is a decision-making agent that takes actions in an environment and receives a reward (or penalty) for its actions in trying to solve a problem. After a set of trial-and-error runs, it should learn the best policy, which is the sequence of actions that maximize the total reward. " [6]

The agent needs to receive a reward when it reaches or gets closer to the goal and a punishment when it deviates from it, hence the term reinforcement. The reinforcement can be received during or at the end of the process, depending on the application, and an optimal policy maximizes the reward received [1].

2.3.1 Observable States

Figure 2.6 illustrates an agent that interacts with the environment by executing an action ($a_i$) that takes it to a new state ($s_i$), receiving an immediate reinforcement ($r_i$) (reward/punishment) for this transition. This is the setting of reinforcement learning. Unlike the previous methods, where we typically had a teacher, learning is now done with a critic: unlike the supervised case, the right action is not known, only how well we have been doing so far.

In a simple RL problem where there is only one state and a finite number of possible actions, the value of an action, $Q(a)$, is quickly known. If the reward is deterministic, we have $Q(a) = r_a$, so to maximize the value we choose $\max_a Q(a)$. On the other hand, if the reward is stochastic, we define $Q_t(a)$ from the probability distribution $p(r|a)$ at time $t$, and equation 2.8 defines an online update.

$Q_{t+1}(a) = Q_t(a) + \eta \left[ r_{t+1}(a) - Q_t(a) \right]$ (2.8)

Figure 2.6 – An agent interacting with the environment. Source: [5].

Where:

• $Q_{t+1}(a)$ is the expected value of action $a$ at time $(t+1)$;

• $Q_t(a)$ is the current prediction;

• $\eta$ is the learning factor (decreased over time);

• $r_{t+1}(a)$ is the reward received after taking action $a$ at time $(t+1)$.

The RL problem is modeled as a Markov Decision Process (MDP), where the rewards and the next state follow the probability distributions $p(r_{t+1}|s_t, a_t)$ and $P(s_{t+1}|s_t, a_t)$, depending only on the current state and action. The sequence of actions from the initial state to the terminal state is an episode or a trial. The policy defines the behavior of the agent, that is, the action taken in any state $s_t$: $a_t = \pi(s_t)$. The value of the policy, $V^\pi(s_t)$, represents the expected cumulative reward obtained by following the policy starting at state $s_t$. We can work with models of finite or infinite episodes. For finite-horizon models the value of policy $\pi$ is shown in equation 2.9, and for infinite-horizon models in equation 2.10, where $T$ is the horizon and $0 \leq \gamma < 1$ is the discount rate.

$V^\pi(s_t) = E\left[ \sum_{i=1}^{T} r_{t+i} \right]$ (2.9)

$V^\pi(s_t) = E\left[ \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i} \right]$ (2.10)

Bellman's equation [6], equation 2.11, works with the state-action value $Q(s_t, a_t)$, which denotes how good it is to perform $a_t$ in state $s_t$, instead of denoting how good it is for the agent to be in state $s_t$, as is the case with $V(s_t)$ seen previously. The policy $\pi$ takes the action $a^*_t$ that gives the highest value of $Q^*(s_t, a_t)$.
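A short numerical sketch of the online update of equation 2.8 and of the discounted return behind equation 2.10; the reward sequence and constants are arbitrary.

eta, gamma = 0.1, 0.9

# Equation 2.8: running estimate of Q(a) from stochastic rewards
Q_a, rewards = 0.0, [1.0, 0.0, 1.0, 1.0, 0.0]
for r in rewards:
    Q_a += eta * (r - Q_a)        # Q_{t+1}(a) = Q_t(a) + eta [r_{t+1}(a) - Q_t(a)]
print(Q_a)

# Discounted sum of the same reward sequence (the quantity inside eq. 2.10)
ret = sum(gamma ** (i - 1) * r for i, r in enumerate(rewards, start=1))
print(ret)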
According to [7], $Q^\pi$ is similar to $V^\pi$, except that the initial action $a_t$ is provided and $\pi$ is only followed from the succeeding state onward.

$Q^*(s_t, a_t) = E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1}|s_t, a_t) \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$ (2.11)

In model-based learning all parameters of the environment model are known, and there is no need for exploration, since the problem can be solved through dynamic programming. However, the most practical application of reinforcement learning is when we do not have the model (model-free learning). Temporal difference learning considers the value of the next state and the reward to update the current state value. A common exploration strategy is to randomly choose an action among the available options with probability ε (ε-greedy search). To keep exploring indefinitely while gradually shifting to exploitation, we start with a high ε value and gradually decrease it.

Figure 2.7 illustrates a simple deterministic world. Notice that each grid cell represents a state, the arrows represent possible actions and their reward values, and G represents the goal. In this scenario, equation 2.11 reduces to equation 2.12. In non-deterministic cases we use equation 2.11, where the same state and action can lead to different rewards and new states; thus, it is important to keep a running average. This is known as the Q-learning algorithm.

Figure 2.7 – A simple deterministic world. Source: [5].

$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ (2.12)

On-policy methods estimate the value of the policy used to select the agent's behavior. In off-policy methods, the behavior policy selects actions, whereas another policy, the estimation policy, is evaluated and improved. The on-policy version of Q-learning is the Sarsa algorithm.

In some applications it is not possible to store $Q(s, a)$ or $V(s_t)$ in a lookup table, due to a large number of states and actions, to situations where the discretization of the data results in error, or to the size of the search space. In these cases, according to [1], it is interesting to treat this as a regression problem, $Q(s, a|\theta)$, with $s$ and $a$ as inputs and parameterized by $\theta$, in order to learn the Q values.
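The sketch below applies the deterministic update of equation 2.12 with an ε-greedy strategy to a small grid world in the spirit of figure 2.7. The grid size, reward values and exploration schedule are illustrative assumptions, not the exact configuration of figure 2.7.

import numpy as np

rng = np.random.default_rng(0)
n_cols, n_rows = 3, 2                         # small deterministic grid world
n_states = n_cols * n_rows
GOAL = n_states - 1                           # goal state G
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # moves along the grid
gamma, eps = 0.9, 1.0

def step(s, a):
    """Deterministic transition: move if possible, reward only on reaching G."""
    x, y = s % n_cols, s // n_cols
    dx, dy = actions[a]
    nx = min(max(x + dx, 0), n_cols - 1)
    ny = min(max(y + dy, 0), n_rows - 1)
    s2 = ny * n_cols + nx
    return s2, (100.0 if s2 == GOAL else 0.0)

Q = np.zeros((n_states, len(actions)))
for episode in range(500):
    s = 0
    while s != GOAL:
        # epsilon-greedy: explore with probability eps, otherwise exploit
        a = rng.integers(len(actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        Q[s, a] = r + gamma * np.max(Q[s2])   # deterministic update, equation 2.12
        s = s2
    eps = max(0.05, eps * 0.99)               # start exploring, then exploit more

print(np.round(Q.max(axis=1), 1))             # learned state values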
2.3.2 Partially Observable States

In some applications, the agent does not know the state exactly, but it can receive indications that lead to predicting the most probable state. This can be done through sensors, cameras, and so on. Despite the similarity with the MDP, the difference is that after executing action $a_t$, the new state $s_{t+1}$ is not known; instead, an observation $o_{t+1}$ is obtained from a stochastic function $p(o_{t+1}|s_t, a_t)$, and the setting is called a partially observable MDP (POMDP). The value of an action is obtained by weighting the possible states by their probabilities and summing. The state uncertainty can lead to a loss of performance, which is measured by the cumulative reward. In this case, the use of Recurrent Neural Networks (RNNs) can be interesting for maintaining the state and not forgetting past observations. Actions can also be taken specifically to gather information and thus reduce uncertainty; this is known as the value of information. According to [6], the agent uses an internal belief state $b_t$ that takes its experiences into account; a state estimator updates $b_{t+1}$ based on observations, actions and the previous belief state. Figure 2.8 illustrates this: a state estimator (SE) keeps an internal belief state $b$, over which the policy $\pi$ is applied.

Figure 2.8 – Partially Observable Environment. Source: [6].

The value of a belief state-action pair is shown in equation 2.13.

$Q(b_t, a_t) = E[r_{t+1}] + \gamma \sum_{b_{t+1}} P(b_{t+1}|b_t, a_t) V(b_{t+1})$ (2.13)

Instead of bootstrapping value functions with dynamic programming methods, the Monte Carlo method estimates the return from the average of several policy executions and can be applied in non-Markovian environments. The best of both worlds combines TD learning and Monte Carlo policy evaluation. Another method shown in [7] is the Actor-Critic method, which combines a value function with an explicit representation of the policy. Figure 2.9 shows the actor-critic setup. The actor (policy) and the critic (value function) receive a state from the environment. The actor acts, and the critic, using the reward resulting from the previous interaction, uses the computed TD error to update both itself and the actor.

Figure 2.9 – The Actor-Critic setup. Source: [7].

2.4 Deep Reinforcement Learning (DRL)

Despite the practical applications of the RL technique, according to [7] it still lacked scalability and was limited to low-dimensional problems due to computational, sample, and memory complexities. With DL, these limitations can be overcome, and its use within RL defines the field of DRL.

2.4.1 Deep Q-network (DQN)

According to [7], DQN was the first RL algorithm that worked from raw visual inputs in several environments. It emerged from neural-fitted Q (NFQ), which combined a deep autoencoder to reduce the dimensionality of the inputs with a separate branch to predict Q-values, as shown in [62].
The dueling DQN benefits from a single baseline for the state (V π) and easier-to-learn relative values (Aπ). The combination of dueling DQN and experience replay is one of the state-of-the-art techniques in discrete action settings. Another modification of the DQN that made it possible to work over sets of continuous actions is the normalized advantage function (NAF) algorithm, being one of several state-of-the-art techniques in continuous control problems [65]. 2.4.2 Policy search Gradient-free or gradient-basedmethods are commonly used as policy search methods. Several successful DRL methods have chosen to use the evolutionary algorithms, according [7], which can be used to train large networks, becoming the first deep neural network to learn an RL task [66]. The interest in evolutionary methods for RL is justified because it can potentially be distributed on larger scales than techniques that depend on gradients. The backpropagation is the basis of DRL, allowing neural networks to learn stochastic policies, computing the loss gradient and weights of the network for a single input-output example. It can help, for example, to decide where to look in an image, which reduces the necessary computational resource. The use of RL to make stochastic decisions over inputs is known as hard Chapter 2. Theoretical Background 34 attention with many applications outside traditional RL domains. Searching for a network with multiple parameters can be extremely difficult in addition to suffering from multiple locations. To work around this problem, one way would be to use a guided search policy (GSP) which takes advantage of some action sequences from another controller. Thus, through supervised learning and considering the importance of the sample, it is possible to minimize cost and optimize the policy, using a region of trust to avoid that the policy update deviates too much from the current one. In this line of work we have the Trust Region Policy Optimization (TRPO) [67] applicable for high-dimension inputs. Combined with the generalized advantage estimation (GAE) [68] technique it can be very useful in continuous control. Application with DRL critical-actor methods proved effective in real robotic visual navigation tasks through the image pixel [69]. In this context, deterministic policy gradients (DPGs) extend the standard policy gradient theorems for stochastic policies to deterministic policies [7]. DPGs integrate only over the state space, requiring fewer samples in problems with large areas of action. Unlike the stochastic policy gradients that integrate over the spaces of state and action, again in that context, the deep DPG uses neural networks to operate at high dimensions. Another very popular and recent DRL technique is the asynchronous advantage actor-critic (A3C) that combines the advantage of the actor-critic, the asynchronously updated policy and value function networks trained in parallel over several processing threads. A structure to train several DQNs in parallel, obtaining better performance and reduced training time. Another interesting approach is when the agent learns from the demonstration, this is known as behavioral cloning. 2.4.3 Soft Actor-Critic (SAC) The Soft Actor-Critic (SAC) was introduced by[70]. According to the authors, it is an off-policy actor-critic DRL algorithm based on the maximum entropy reinforcement learning framework. 
The actor aims to maximize both the expected reward and the entropy of the policy; by combining off-policy updates with a stable stochastic actor-critic formulation, SAC outperforms prior on-policy and off-policy methods. The standard RL objective is to maximize the expected sum of rewards:
∑_t E_{(s_t, a_t) ∼ ρ_π} [r(s_t, a_t)] (2.14)
SAC considers a more general maximum entropy objective (see e.g. [71]), given by Equation 2.15, in which α determines the relative importance of the entropy term:
J(π) = ∑_{t=0}^{T} E_{(s_t, a_t) ∼ ρ_π} [r(s_t, a_t) + α H(π(·|s_t))] (2.15)
In [70] the authors show that soft policy iteration converges to the optimal policy within a set of policies that might correspond, for instance, to a set of parameterized densities, and that large continuous domains require a practical approximation to soft policy iteration. To do so, they use function approximators for both the Q-function and the policy. The soft value function is trained to minimize the squared residual error through calculations that are detailed in their work. To understand the skills acquired through maximum entropy in the reinforcement learning (RL) scenario, it is important to remember that RL employs a stochastic policy π to select actions and thus seeks the policy that maximizes the cumulative reward collected along an episode of length T, Equation 2.16:
π* = argmax_π E_π [ ∑_{t=0}^{T} r_t ] (2.16)
Conventional RL approaches thus use a unimodal policy distribution centered on the maximum Q-value and explore only its neighborhood, refining the learned policy toward the most promising states and ignoring the less likely ones. Imagine that, in Figure 2.10, the gray curve represents two high-level decisions that the agent must make; the red distribution represents the behavior of traditional RL approaches.
Figure 2.10 – A multimodal Q-function. Extracted from: [8]
Another high-level solution would be to ensure that the agent explores all promising states, prioritizing the most promising one. This idea can be formalized as in Equation 2.17, which defines the policy directly in terms of the exponentiated Q-values, represented by the green curve in Figure 2.10.
π(a|s) ∝ exp Q(s, a) (2.17)
It can be shown that the policy defined through this energy-based form is an optimal solution for the maximum-entropy RL objective, Equation 2.18, which simply augments the conventional RL objective with the entropy of the policy [72].
π*_MaxEnt = argmax_π E_π [ ∑_{t=0}^{T} r_t + H(π(·|s_t)) ] (2.18)
An organized description of the algorithm is given in [73], [74] and [75]; Algorithm 2.1 is the version adopted in this work.
Algorithm 2.1: SAC - Soft Actor-Critic
1 Initialize parameter vectors (networks) ψ, ψ̄, θ, φ
2 for each epoch do
3   for each environment step do
4     a_t ∼ π_φ(a_t|s_t)
5     s_{t+1} ∼ p(s_{t+1}|s_t, a_t)
6     D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
7   end
8   for each gradient step do
9     ψ ← ψ − λ_V ∇_ψ J_V(ψ)
10    θ_i ← θ_i − λ_Q ∇_{θ_i} J_Q(θ_i) for i ∈ {1, 2}
11    φ ← φ − λ_π ∇_φ J_π(φ)
12    ψ̄ ← τ ψ + (1 − τ) ψ̄
13  end
14 end
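To make the update structure of Algorithm 2.1 concrete, the sketch below expresses its losses in PyTorch. It is a minimal illustration only: the callables q1, q2, value and target_value, the policy.sample() interface and the fixed temperature alpha are assumptions made for the example, not the exact implementation used in this work, and termination handling is omitted.

import torch

def sac_losses(batch, policy, q1, q2, value, target_value, alpha=0.2, gamma=0.99):
    # batch of transitions (s_t, a_t, r_t, s_{t+1}) sampled from the replay buffer D
    s, a, r, s_next = batch
    # Soft Q target: r + gamma * V_target(s')
    with torch.no_grad():
        q_target = r + gamma * target_value(s_next)
    q_loss = ((q1(s, a) - q_target) ** 2).mean() + ((q2(s, a) - q_target) ** 2).mean()
    # Soft value target: E_{a~pi}[ min_i Q_i(s, a) - alpha * log pi(a|s) ]
    a_new, log_pi = policy.sample(s)  # reparameterized actions and log-probabilities
    min_q = torch.min(q1(s, a_new), q2(s, a_new))
    v_loss = ((value(s) - (min_q - alpha * log_pi).detach()) ** 2).mean()
    # Policy loss: E[ alpha * log pi(a|s) - min_i Q_i(s, a) ]
    pi_loss = (alpha * log_pi - min_q).mean()
    return q_loss, v_loss, pi_loss

def soft_update(value, target_value, tau=0.005):
    # Line 12 of Algorithm 2.1: psi_bar <- tau * psi + (1 - tau) * psi_bar
    with torch.no_grad():
        for p, p_bar in zip(value.parameters(), target_value.parameters()):
            p_bar.data.mul_(1.0 - tau).add_(tau * p.data)

Each gradient step then amounts to minimizing these three losses with their respective optimizers, followed by the soft update of the target value network.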
3 Related Work
The task of controlling a UAV usually refers to many different challenges (stability, trajectory following, path planning, obstacle avoidance, prediction, etc.) encountered in many different scenarios and for which many different techniques have been applied. In this way, approaches for controlling UAVs can be grouped in many different ways. This section presents a review of the most recent techniques grouped as follows: i) classical approaches; ii) intelligent approaches.
The classical approaches are usually closer to control theory and related techniques. In this context, a usual research focus is the stability control problem. Classical techniques such as PID and Internal Model Control (IMC) [76] [77] are very useful, but they depend on prior knowledge of the system model. Techniques such as Successive Loop Closure (SLC) can be applied together with PID to adjust the gains [78]. When considering the effect of wind on stability problems, the H2 optimal control theory has been applied [79], achieving satisfactory results. Other techniques explored were the Recursive Least Squares (RLS) and the Smooth Variable Structure Filter (SVSF) [80] [81], used to estimate UAV control dynamics variables and hardware failure detection variables and to prevent cyber attacks. The results achieved by [80] demonstrated better convergence of the estimation with RLS than with SVSF, although both proved to be effective. Other works [82] applied the Extended Kalman Filter (EKF) in an autonomous multi-rotor system flying in external and unknown environments, predicting the UAV trajectory based on empirical data measured with a certain degree of error. The EKF is a nonlinear version of the Kalman Filter (KF), a robust prediction and control technique. Other works also apply nonlinear control methods, generating a more dynamic control system [83] [19]. Some of them [83] focus on the application of the Adaptive Filter Controller (AFC) in modeling and controlling the stability of UAVs, using a Lyapunov function to satisfy the stability analysis. Another approach [19] adopts control strategies based on Sliding Mode Control (SMC) – a method that alters the dynamics of a nonlinear system by forcing it to slide along a cross-section of its normal behavior – and on Feedback Linearization (FL), which transforms a nonlinear system into an equivalent linear system. The results showed greater robustness to interference using FL and faster adjustment using SMC. All previous approaches can be classified as belonging to classic control, optimal control, and adaptive control. However, in the last years, techniques related to intelligent approaches that increase the level of autonomy of UAVs have arisen. Some works [20] adopt degrees of truth to land the UAV, an approach made possible by a mathematical model based on Fuzzy Logic, achieving satisfactory results. One of the most important trends of the last years among the intelligent approaches is the use of techniques related to machine learning (like artificial neural networks and reinforcement learning), which typically aim to improve performance in some task through training. To achieve autonomous navigation in a closed environment, [84] used a Deep Neural Network (DNN) to filter an RGB image provided by a camera attached to the aircraft to allow its navigation in the environment in a controlled manner. A technique that recently became widely used in machine learning approaches is Reinforcement Learning (RL) [85] [86] [87] [88] [89], in some cases used jointly with other techniques like Recurrent Neural Networks (RNN), CNNs and Fuzzy Logic. In [90], to improve UAV performance, the authors used the Deep Q-Network (DQN) with noise injection, applied and tested in a simulation environment. Other works [91] used the Proximal Policy Optimization (PPO) algorithm and stochastic policy gradients to make a quadrotor learn a reliable control policy.
This work shows the viability of using model-free reinforcement learning to train a low-level controller for a quadrotor without supervision [91]. According to some authors, PPO presents better sample efficiency when compared to other algorithms like Trust Region Policy Optimization (TRPO) [92], besides being much simpler to implement. In [93] the authors developed, according to them, the first open-source neural network-based flight controller firmware, basically a toolchain to train a neural network in simulation and compile it to run on embedded hardware. Despite the evident contribution, the main objective of the work, according to the authors, is to improve the altitude control of the UAV, traditionally done by a PID controller. New approaches have arisen in applications like combat and reconnaissance missions. Some works [94] adopted a strategy based on Deep Learning (DL) and Multi-Task Regression-Based Learning (MTRL) for navigation and exploration of forests, regardless of the presence of trails and GPS; the technique consists of two subnetworks with a convolutional layer each. Some works [95] focused on improving the UAV's decision autonomy on battlefields; they applied a Deep Belief Network (DBN) with Q-learning and a decision-making model based on Genetic Algorithms (GA), achieving satisfactory results. Still in the combat context, some works sought to identify who is controlling an opponent aircraft using surveillance images and a CNN architecture to learn human interactions with the relevant object (the possible controller) in the scene [96]. Other works focus on learning reactive maneuvers in one-on-one aerial combat between UAVs based on the Asynchronous Advantage Actor-Critic (A3C) algorithm and RL [21]. When navigating in unknown environments, an autonomous aircraft must have the ability to detect obstacles, thus avoiding collisions. Several methods became available in the literature in recent years. Some works [97] adopted a Deep Deterministic Policy Gradient (DDPG) approach, with a continuous action space, able to train the UAV to navigate through or over obstacles to reach a target. The DDPG was designed as an extension of the deep Q-network (DQN), combining the actor-critic approach with insights from DQN [98]. The reward function was designed to guide the UAV through the best course while penalizing any crash. In [99], the authors applied a gradient-based planning framework that is free of the Euclidean Signed Distance Field (ESDF). It significantly reduces the computational cost, since the collision term in the penalty function is formulated by comparing the collision trajectory with the collision-free guided path, leading to a robust and high-performance algorithm. In some works [100] [101], seeking to allow a UAV to perform autonomous operation in an indoor environment, the Simultaneous Localization and Mapping (SLAM) technique was used with a grid map and Monte Carlo estimation of the 2D position of the vehicle and of the map of the environment while moving; the Kalman Filter is used to track the vertical altitude and velocity. In [102] the Kalman Filter was also used, but now to estimate motion and speed in real-time. The proposal is that the UAV can navigate in an external foliage environment without using GNSS, relying only on a 2D laser range finder. According to the authors, the experiments demonstrated successful autonomous navigation in both indoor and outdoor environments.
In [103] the Reinforcement Learning approach is applied to avoid collisions and to investigate the optimal trajectory for the UAV based on the Traveling Salesman Problem. In [104] the authors adopted a Deep Reinforcement Learning approach, using an algorithm derived from the POMDP formulation and based on the actor-critic architecture, to allow autonomous navigation in complex environments. When considering the best trajectory, some approaches [105] use Q-learning to address the problem, and others [106] use the Dijkstra algorithm together with image processing and a greedy breadth-first search technique, both achieving good results for outdoor environments. Still considering UAV applications in external environments, some authors focus on target search in complex scenarios based on an optical-flow-based method, which uses the concept of apparent motion of objects caused by the relative motion between an observer and a scene [22]. This approach proved capable of estimating a rotorcraft's 3D position and velocity signals when compared to a reference. To enable a UAV to act in a complex disaster scenario, some authors [107] adopted a Deep Reinforcement Learning-based technique, inspired by the good results of this technique when applied to the classic Nokia Snake puzzle game. Other applications, such as tracking of moving targets [108], use the vision-based SLAM method, already mentioned in other applications in this work; the authors' goal is to perform tracking in both indoor and outdoor environments. Another interesting technique is the Tracking-Learning-Detection with Kernelized Correlation Filter (TLD-KCF), in which a conditional scale-adaptive algorithm is adopted [109]. Other Reinforcement Learning approaches [110] were considered together with computer vision techniques to improve the accuracy of UAV tracking considering Aspect Ratio Change (ARC); the results showed that they can significantly improve tracking performance at a low computational cost. Another important research focus is the joint and collaborative use of these aircraft. Among the possible applications, we can cite wireless internet connectivity, data transfer, and information sharing among UAVs. In most of the works, Reinforcement Learning techniques [111] [112], Deep Reinforcement Learning [26] [113], Deep Deterministic Policy Gradient [114] [24] [115] [28] and Deep Q-Network [116] [25] [117] are the most applied. Other techniques, such as Genetic Algorithm Based K-Means (GAK-means) with Q-Learning, were used [118] to allow dynamic movement of multiple UAVs; the results showed fast convergence with a low number of iterations and better results than other algorithms such as K-means and Iterative-GAK. Seeking to establish mutual attention between an outdoor UAV and a human, that is, a dynamic of mutual interaction between both, some works [119] adopted the Kalman Filter and computer vision techniques. Some authors [120] applied a DNN called TrailNet to keep the UAV at the trail center, using label smoothing and an entropy reward for autonomous navigation on a forest trail, while alerting users about the surrounding environment; the UAV achieved stable and robust navigation, validating the technique. In wireless networks, the UAV is typically vulnerable to interference that can affect its performance and security. In [121] the authors addressed this problem using an Adaptive Federated Reinforcement Learning (AFRL)-based technique, which proved to be 40% better than other methods used.
Summarizing this literature review, Table 3.1 presents the applications in UAVs and the evolution of the adopted control techniques. This analysis shows a clear trend towards using techniques related to DL and DRL in the last years, motivating a deeper investigation of these techniques.
Table 3.1 – Applications in UAVs and the evolution of the adopted control techniques. The applications (columns of the original table) are: Dynamics and Stability Control; Better trajectory and collision avoidance; Target location / tracking / recognition; Information Sharing and Connectivity. The techniques and references covered are: PID [76]; Dijkstra [106]; ROSGPS+CTANS [27]; TLD+KCF [109]; ESDF [99]; SVSF+RLS [80] [81]; Fuzzy Logic [20]; KF [102]; EKF [82] [101]; AFC [83]; H2 optimal control [79]; SMC [19]; EA/GA [118]; SLAM [100] [108]; FQL [87]; RL [85] [86] [88] [89] [103] [111] [112]; Q-Learning [78] [105] [122]; RL+VC [110]; CNN [93] [96]; DBN [95]; DNN [84]; DQN [117] [25] [116]; DRL [90] [104] [94] [107] [113] [26]; DDPG [28] [24] [114] [115]; PPO [91]; SAC [123]; A3C [21].
4 Materials and Methods
This chapter presents the approach proposed to achieve our goals, detailing the UAV dynamics, simulation environment, hardware, agent parameters, models, networks, and algorithm. The proposed experiments are also described.
4.1 Proposed Approach: overview
In this work, we propose to investigate DRL-based algorithms – particularly the SAC algorithm – to train a low-level controller for a quadrotor using a set of visual and non-visual sensors. In other words, we propose to investigate the use of visual information together with the multiple sensors embedded in the aircraft to create the state space for the DRL algorithm. A key question in this approach is: how can we model the system states (S) to allow accurate control of the UAV? We address this question in Section 4.6. The diagram of the SAC algorithm and the Autoencoder (AE) network is shown in Figure 4.1; the structure proposed by [75] was used, with the addition of the Autoencoder. At each epoch of training, the current policy interacts with the environment and the resulting transitions are stored in a replay buffer in the format (s_t, s_{t+1}, r_t, a_t). These samples are used to estimate a value for the state and a Q-value for the transition s_t → a_t → s_{t+1}, and the Q-value is used to steer the optimized policy toward actions that increase Q. The Autoencoder is used to reduce the dimensionality of the 4 images, from 64x64 pixels to 2x2 pixels per image.
Figure 4.1 – Diagram of the proposed framework using SAC and the Autoencoder.
4.2 Proposed Framework
The work was developed with the Coppelia Simulator [9] and the PyRep framework [124], which made it possible to increase the simulation performance by speeding up the process by about 20x compared to the remote API provided by the Coppelia Simulator. The default quadcopter model available in the Coppelia Simulator was used; the UAV mass and moments of inertia were adjusted to the same values used by [75], 0.10 kg and [5.217x10−3, 6.959x10−3, 1.126x10−2] kg·m², respectively.
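To give an idea of how the simulator is driven through PyRep, the sketch below shows the basic control loop of a PyRep session (the scene file name, the step budget and the commented-out policy call are illustrative placeholders, not the exact code of this work):

from pyrep import PyRep

pr = PyRep()
pr.launch('quadcopter_scene.ttt', headless=True)  # scene file name is illustrative
pr.start()                                        # start the physics simulation

for _ in range(250):                              # one episode of at most 250 time steps
    # read sensors, query the learned policy and set propeller commands here
    pr.step()                                     # advance the simulation by one step

pr.stop()                                         # stop the simulation
pr.shutdown()                                     # close CoppeliaSim

Because PyRep runs inside the simulator process instead of going through the remote API, each call to step() avoids inter-process communication overhead, which is the source of the speed-up mentioned above.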
4.3 Coppelia Simulator and Pyrep
The Coppelia Simulator has a wide variety of models in its libraries, mesh manipulation at runtime, and different physics engine options for the user [9], such as:
• Supported platforms: Linux, Windows, and macOS;
• Physics engines used for calculations: Bullet, ODE, Vortex and Newton;
• Outputs: videos, graphics and text files;
• Library: wide variety of robots (mobile and fixed), sensors, and actuators;
• Operation with meshes: allows mesh manipulation at runtime and imports meshes as element groups, providing flexibility in handling the imported model's materials, appearances, and textures;
• Programming: offers six different approaches.
The Coppelia Simulator, in general terms, is a simulation environment that allows testing prototypes and algorithms without the construction costs of a real robot. With its integrated development environment (IDE), it is possible to create scenes and control systems (robots, conveyors, cameras, sensors, and others) through several scripts in the same scene or through external interfaces. According to Coppelia Robotics, the simulator is an integrated development framework designed for a distributed control architecture. Within a scene (Figure 4.8), it is possible to assign to each object/model, independently, a control approach in the form of an embedded script, a plugin, ROS or a BlueZero node, the remote API, or a customized solution such as PyRep [124]. Figure 4.2 illustrates these communication modes. We used PyRep, a toolkit for robot learning research built on top of the Coppelia Simulator; this plugin sped the process up by approximately 20x compared to other communication modes such as the remote API, as seen in [75].
Figure 4.2 – Interfaces to Coppelia Simulator [9].
As explained previously, the default quadcopter model of the simulator was used (Figure 4.3), and all parameters, such as the mass, the moments of inertia, and the velocity-thrust function obtained from the experiments described in [125] and applied in [91] and [75], were maintained. The propeller thrust force Tr(pwm) is described by Equation 4.1:
Tr(pwm) = 1.5618x10−4 · pwm² + 1.0395x10−2 · pwm + 0.13894 (4.1)
4.4 Hardware
The experiments were performed on 2 (two) machines with the following specifications:
Machine 1:
• CPU: Intel Core i7-7700U, 3.60 GHz
• RAM: 16 GiB
• GPU: NVIDIA GeForce GTX 1080 (8 GB)
Machine 2:
• CPU: Intel Core i7-4510U, 2.00 GHz
• RAM: 8 GiB
• GPU: Intel Haswell-ULT Integrated Graphics Controller
Figure 4.3 – Coppelia Simulator Default UAV - AR Parrot [10].
The open-source deep learning library chosen was PyTorch [126], based on the Torch library and frequently used in computer vision.
4.5 Drone Dynamics
Since we will use a model-free DRL algorithm, the description of the full drone dynamics is out of the scope of this work; more details are available in [11], [127], [128]. We are interested in the position, velocity, and attitude angles of the aircraft, which are used as variables in our model. Consider the UAV body frame (B) at the center of the coordinate axes [XB, YB, ZB], with a weight vector of magnitude mg due to gravity, the torques produced by the UAV propellers represented by [T1, T2, T3, T4], and the angular velocities of the propellers given by the vector [R1, R2, R3, R4]. This model structure can be seen in Figure 4.4.
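Since each propeller command is mapped to thrust through Equation 4.1, the polynomial can be transcribed directly as a small helper function for reference (the example PWM value is only illustrative):

def propeller_thrust(pwm: float) -> float:
    # Velocity-thrust mapping of Equation 4.1 (PWM expected in the 0-100 range)
    return 1.5618e-4 * pwm ** 2 + 1.0395e-2 * pwm + 0.13894

# e.g. propeller_thrust(50.0) ~ 0.390 + 0.520 + 0.139 ~ 1.049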
Note that propellers 2 and 3 are on the right side of the X axis, while propellers 1 and 4 are on the left side. We emphasize this because it is important to ensure that propellers on the same side spin in opposite directions. Propellers that are diagonally opposite spin in the same direction, i.e., 1 and 3 in one direction, while 2 and 4 spin in the opposite direction. The learning algorithm is able to capture this behavior.
Figure 4.4 – Structure and dynamics of the quadcopter body. Source: [11].
For our approach, we consider that the global position of our UAV in the environment is given by [x, y, z], so the linear velocity is given by [ẋ, ẏ, ż]. Other important parameters are the Euler angles of the aircraft, φ, θ and ψ, about the x, y and z axes, respectively, which are also referred to as roll, pitch and yaw [φ, θ, ψ]. Consequently, our angular velocities are given by [φ̇, θ̇, ψ̇]. The rotation matrix is another important element, responsible for converting coordinates from the body frame to the world frame, as can be seen in Equation 4.2. All the computation and logic used are performed within the algorithm we developed.
4.6 Agents/Models/Networks
4.6.1 Drone Agent
The time horizon of the UAV was defined to last until it suffers a reset event, such as a collision, leaving the global limit, a distance from the target greater than 19.5 meters, or an episode longer than 250 time steps. The standard routine adopted was:
• A reset method, which applies a new initial state or a previous state and can restart the simulation;
• A shutdown method, which stops the episode when necessary;
• A global_limit method, responsible for returning whether the UAV is within the global limit;
• A step method, responsible for obtaining and applying new actions on the propellers, requesting the environment observation states, verifying whether the UAV reached the objective, weighing the chosen path, and receiving the value of the reward function, returning these values to the network.
4.6.2 Scenarios
The proposed scenes were built to explore the autonomy of the UAV in different environments. For this, it is important to observe the stability of the aircraft and to measure whether it can maintain a stable flight along the trajectory until it reaches the target base. All scenes have 7 (seven) landing/takeoff bases, [B1, B2, B3, B4, B5, B6, B7], and 4 (four) vertical rods in the corners that set the limits of the test platform, [corner1, corner2, corner3, corner4]. We will add pipelines and some people to the scene to create scenarios with fixed and mobile obstacles.
1. Empty environment - SC0. The first scene is the same one used by [75]; the reference used is the green target, a dummy object that serves as a geographic point in the environment and as the target position for the aircraft. The scene can be seen in Figure 4.5;
Figure 4.5 – CoppeliaSim Robotics Simulator - Empty scene.
2. Free environment - SC1. The second scene intends to investigate the robustness of the flight in free horizontal displacement. The main behaviors observed were flight stability, accuracy, the chosen trajectory, and whether the objective was achieved. The scene can be seen in Figure 4.6;
Figure 4.6 – CoppeliaSim Robotics Simulator - Free scene.
3. Environment with fixed obstacles - SC2. We will position obstacles (like coastal and land bases, pipes, and so on) in the aircraft's path.
With this, we aim to verify the decision autonomy needed to avoid collisions and maintain an efficient route. The scene can be seen in Figure 4.7;
Figure 4.7 – CoppeliaSim Robotics Simulator - Scene with fixed obstacles.
4. Environment with mobile obstacles - SC3. This is the hardest challenge for the aircraft. The objective of the UAV is the same as in the previous scenarios (to reach a particular destination); still, obstacles that keep moving – in this case, some people – will be inserted in the trajectory. The proposal is to evaluate the autonomy of the controller under dynamic conditions. The scene can be seen in Figure 4.8.
Figure 4.8 – CoppeliaSim Robotics Simulator - Scene with dynamic obstacles.
4.6.3 Representation of states
The representation of states was structured according to Table 4.1: the 22 states defined in [75] and [91] were maintained and 32 more were added, as highlighted by the Accumulated column of Table 4.1. The UAV_Position_Target state refers to the global position of the UAV base relative to the target position, represented by the coordinates [x, y, z]. The UAV_Linear_Velocity is defined as [ẋ, ẏ, ż]. The UAV_Rotation_Matrix is responsible for converting coordinates from the body frame to the world frame and vice versa; it is the product of the individual axis rotation matrices, as can be seen in Equation 4.2. We define the orientation in the world frame as [φ, θ, ψ] (roll, pitch, yaw), the Euler angles of the body axes; thus the UAV_Angular_Velocity is [φ̇, θ̇, ψ̇].
Rx(φ) = [ 1, 0, 0; 0, cos φ, −sin φ; 0, sin φ, cos φ ]
Ry(θ) = [ cos θ, 0, sin θ; 0, 1, 0; −sin θ, 0, cos θ ]
Rz(ψ) = [ cos ψ, −sin ψ, 0; sin ψ, cos ψ, 0; 0, 0, 1 ]
R3 = Rx(φ) Ry(θ) Rz(ψ) (4.2)
where the rows of each matrix are separated by semicolons. The UAV_Propellers_Action represents the actions chosen to stabilize and move the UAV. Distance sensors were added to the aircraft, with one on top of the UAV, one below it, and eight other sensors monitoring around the device structure, distributed equidistantly from each other, thus monitoring a wider area. The sensors were configured to detect any body or object up to a distance of three meters using a randomized-ray volume, in which 500 rays scan a cone-shaped volume at random. To expose these measurements, the UAV_Ultrasonic_Sensors state was added. Other important states are: UAV_Global_Limit, which indicates whether the UAV remains within the pre-defined flight region, limited by the corner objects of the scene; and UAV_Travelled_Path, which measures the path taken by the UAV before reaching the target position, suffering a collision, leaving the pre-defined limit, or reaching 250 time steps. The UAV is also equipped with two monocular cameras, one in front and one below it. The cameras capture images at each instant of time, with a dimension of 64 x 64 pixels. We propose to use these images to assist the aircraft navigation and to identify obstacles. However, to deal with the high dimensionality of these states, we use an autoencoder: the size of each image after the encoder is 2 x 2 pixels. To enable the UAV to recognize its displacement within the environment, we use two images per camera, referring to the last and current frames. Therefore, the states UAV_Last_Floor_Image, UAV_Last_Front_Image, UAV_Currently_Floor_Image and UAV_Currently_Front_Image were added for the captured images.
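As an illustration of this dimensionality reduction, the sketch below shows one possible encoder that halves the spatial resolution at each layer, matching the sizes listed later in Table 4.3 (64 → 32 → 16 → 8 → 4 → 2). The use of strided convolutions and single-channel feature maps is an assumption made for the example; the table specifies only the layer sizes and activations.

import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    # Reduces a 64x64 grayscale frame to a 2x2 code (4 values per image, as in Table 4.1)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
            nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 8 -> 4
            nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 4 -> 2
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame).flatten(start_dim=1)  # shape (batch, 4)

# e.g. ImageEncoder()(torch.rand(1, 1, 64, 64)).shape == torch.Size([1, 4])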
Since we are using an autoencoder, it is important to observe the loss rate obtained on these images, so UAV_Autoencoder_Loss_Rate was also considered a state to be observed. Finally, we also consider the UAV position relative to the environment an important state to observe, so UAV_Position_Env was added. In general, these were the states used.
Table 4.1 – Representation of states (observation states).
Item | State | Number of Elements | Accumulated
1 | UAV_Position_X_Y_Z | 3 | 3
2 | UAV_Rotation_Matrix | 9 | 12
3 | UAV_Angular_Velocity | 3 | 15
4 | UAV_Linear_Velocity | 3 | 18
5 | UAV_Propellers_Action | 4 | 22
6 | UAV_Ultrasonic_Sensors | 10 | 32
7 | UAV_Global_Limit | 1 | 33
8 | UAV_Travelled_Path | 1 | 34
9 | UAV_Last_Floor_Image | 4 | 38
10 | UAV_Last_Front_Image | 4 | 42
11 | UAV_Currently_Floor_Image | 4 | 46
12 | UAV_Currently_Front_Image | 4 | 50
13 | UAV_Autoencoder_Loss_Rate | 1 | 51
14 | UAV_Position_Env | 3 | 54
4.6.4 Reward function
The reward function is an important factor in the performance of the learned policy. However, its definition is not elementary, since abstracting all the UAV elements and their behavior in the environment can be complex. Several attempts were made, considering approaches such as:
• Dividing the reinforcement into groups related to the proximity between the UAV and the target position;
• Strong punishments for collisions and for leaving the global limit;
• Punishing high roll, pitch and yaw rates [φ̇, θ̇, ψ̇];
• Punishing long paths to the target position;
• Rewarding the UAV flight height.
After applying these approaches without success, the best result was still obtained with the function used by [91], defined by Equation 4.3, so this approach was maintained. It takes into account stability, robustness and precision. Thus, the reward function used in this work is defined by Equation 4.3:
r_t(s) = r_alive − 1.0 ||ε_t(s)|| − 0.05 ||φ̇|| − 0.05 ||θ̇|| − 0.1 ||ψ̇|| (4.3)
The term r_alive is a constant that ensures the UAV earns a reward while flying within the defined region; in this case, r_alive = 1.5. The term ε_t refers to the distance between the target position and the UAV base at time step t, given by Equation 4.4:
ε_t(s) = ||ξ_target(t) − ξ_uav(t)||
ε_t(s) = √((x_target(t) − x_uav(t))² + (y_target(t) − y_uav(t))² + (z_target(t) − z_uav(t))²) (4.4)
We added a cost on the absolute value of the angular velocities, applying a higher penalty to ψ̇ since it was the one most responsible for the vibration (ringing effect) of our aircraft. Note that, since r_alive = 1.5 and the time horizon is 250 steps, the maximum reward received can reach the value of 375, an important reference when discussing the results.
4.6.5 Episode completion
The agent ends an episode and restarts another under the conditions listed below:
1. There was a collision;
2. The distance from the target is greater than 19.5 meters;
3. The number of steps in an epoch exceeds 250 time steps;
4. The UAV exited the defined global space.
4.6.6 Initialization
To initialize the UAV state at each episode, we used the Discretized Uniform initialization proposed by [75].
I1: Initialization - Discretized Uniform. We defined a discrete uniform distribution over an array whose number of divisions can be parameterized. The dimensions of the scenario were considered to define the size of the distribution (num_discretization) and its limits (bound_of_distribution), as shown below (a short code sketch reproducing these arrays follows the list):
• For [x], num_discretization = 7 and bound_of_distribution = [-3.000, 5.850], defining ([-3, -1.52, -0.05, 1.42, 2.9, 4.37, 5.85]);
• For [y], num_discretization = 7 and bound_of_distribution = [-2.125, 6.875], defining ([-2.12, -0.62, 0.88, 2.38, 3.88, 5.38, 6.88]);
• For [z], num_discretization = 5 and bound_of_distribution = [1, 2.5], defining ([1, 1.38, 1.75, 2.12, 2.5]);
• For [φ, θ, ψ], num_discretization = 11 and bound_of_distribution = [-0.785, 0.785], defining ([-0.78, -0.63, -0.47, -0.31, -0.16, 0, 0.16, 0.31, 0.47, 0.63, 0.78]).
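The arrays referenced in the list above can be reproduced with evenly spaced grids; a minimal sketch is shown below (drawing one value per axis at each reset is assumed to follow the discrete uniform distribution described above):

import numpy as np

# Discretized Uniform initialization: evenly spaced candidate start values per axis
x_grid = np.linspace(-3.000, 5.850, 7)       # [-3., -1.525, -0.05, 1.425, 2.9, 4.375, 5.85]
y_grid = np.linspace(-2.125, 6.875, 7)       # [-2.125, -0.625, 0.875, 2.375, 3.875, 5.375, 6.875]
z_grid = np.linspace(1.0, 2.5, 5)            # [1., 1.375, 1.75, 2.125, 2.5]
angle_grid = np.linspace(-0.785, 0.785, 11)  # shared by roll, pitch and yaw

# At each episode reset, draw one value per axis from its grid
rng = np.random.default_rng()
start_pose = [rng.choice(g) for g in (x_grid, y_grid, z_grid,
                                      angle_grid, angle_grid, angle_grid)]

The values match the lists above up to rounding.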
4.6.7 Action Space
The action space is composed of the actions of each propeller, defined over a PWM range from 0 to 100. This action space is given by Ap = {a1, a2, a3, a4}, and each action is applied to its propeller through Equation 4.1.
4.6.8 Algorithm Parameters
In this section, the settings of the SAC algorithm and of the Autoencoder are presented.
Soft Actor-Critic (SAC)
The SAC algorithm settings follow the ones proposed by most open-source implementations, like [75]. However, some adaptations were necessary due to the significant increase in the number of sensors, the change of scenarios, the increase in observation states, and the overall complexity. The final hyper-parameters are listed in Table 4.2. As we have several start points in our application, a good approach is to increase the batch size, so that the task can be explored/evaluated from many configurations using the same trained policy π_t. These were the hyper-parameters that generated the best results so far.
Table 4.2 – Parameters - SAC Algorithm.
Parameter | Value
Batch size | 4,000
Buffer size | 5,000,000
Discount (γ) | 0.99
Learning rate α | 10−4
Num train loops per step | 1
Policy network | (64, tanh, 64, tanh)
Value and Soft-Q networks | (256, relu, 256, relu)
Autoencoder
We defined the autoencoder parameters from tests carried out directly in the scenes proposed in this work. The following parameters were considered: the learning rate, network size, loss rate, and batch size. To reduce the computational cost of the algorithm, only four images are recorded: two current images and two previous images, seen by the floor and front cameras. Figures 4.9 and 4.10 show the evolution of learning for some of these tests, in which the learning rate was defined as 0.001, the batch size as 4, and a maximum of 10,000 episodes was used.
Figure 4.9 – AE Learning Curve - Assessment 1.
Figure 4.10 – AE Learning Curve - Assessment 2.
Figures 4.11, 4.12 and 4.13 show the encoder (a) and decoder (b) outputs of the networks right after training, using random images already known by the network. With this, it was possible to achieve a decoder accuracy of 99.1%. To validate the learning, we used a new database with 2,000 images from the same environment, not necessarily known by the network; we then selected 5 random images and verified the accuracy of the encoder on these new images, which can be seen in Figures 4.14, 4.15 and 4.16. We achieved an accuracy between 98.4% and 99.1%.
Figure 4.11 – AE Train - Assessment 1.
Figure 4.12 – AE Train - Assessment 2.
Thus, after several experiments considering the algorithm's precision and efficiency, the parameters that best met the expectations are defined in Table 4.3. Since some images did not achieve the expected accuracy during testing, each new batch of images is trained until it reaches an accuracy of 99.6% or a maximum of 30 AE epochs.
Figure 4.13 – AE Train - Assessment 3.
Figure 4.14 – AE Test - Assessment 1.
Figure 4.15 – AE Test - Assessment 2.
Figure 4.16 – AE Test - Assessment 3.
Table 4.3 – Parameters - Autoencoder Algorithm.
Parameter | Value
Original image size | 64x64
Image (original / converted) | RGB / Grayscale
Batch size | 4
Learning rate α | 10−3
Code networks | (32x32, relu, 16x16, relu, 8x8, relu, 4x4, relu, 2x2, relu)
Decode networks | (2x2, relu, 4x4, relu, 8x8, relu, 16x16, relu, 32x32, relu)
Loss rate | 0.005
Max episodes | 30
5 Results
In this chapter, we present and discuss the results per scenario, assessing how learning was affected by each proposed model. We discuss the influence of the parameters, the resulting aircraft behavior, and the approaches used.
5.1 Approaches Overview
We tested several model configurations, parameters, reward strategies, observed states, and initialization strategies until the UAV reached a good behavior. Some of these attempts were:
1. In the first attempt, the aircraft should learn stability and displacement in the environment simultaneously, within the most challenging scenario (SC3 - Dynamic Obstacles). We used the same approach in the other scenarios, also without success.
2. Different terms in the reward function, as mentioned in Section 4.6.4.
3. Adding different states, like the target distance, the global target position, and the elapsed time steps.
4. Fixed initialization to a specific global position and orientation.
5. Fixed initialization to a specific global position but with a variation in orientation.
None of these approaches indicated learning progress; therefore, we will not detail them further. As a step-by-step approach proved to be more efficient for the learning process, we separated it into four steps. The scenario adopted in the first stage is SC0 - Empty Scenario, which had already been handled successfully in [91] and [75]. In this step, we train the algorithm to stabilize the UAV in the empty scenario. We consider that, by the end of this stage, the flight stability and accuracy have already reached an acceptable error rate, enabling a free displacement close to the ideal, which is verified in the SC1 - Free scenario. The expectation in the third stage is that the aircraft learns to avoid fixed obstacles in the SC2 - Fixed Obstacle scenario. Finally, in the last stage, the UAV is expected to learn to avoid dynamic obstacles; we use the SC3 - Dynamic Obstacle scenario for this. In order not to compromise the UAV learning, the states are only partially enabled, evolving according to the stage. This evolution can be seen in Table 5.1, where states marked with X are enabled in the corresponding scenario. More details are presented in the following sections.
Table 5.1 – Sequence of enabled states.
State | Unit | Empty Scenario | Free Scenario | Obstacle Scenario | Dynamic Scenario
UAV_Position_X_Y_Z | 3 | X | X | X | X
UAV_Rotation_Matrix | 9 | X | X | X | X
UAV_Angular_Velocity | 3 | X | X | X | X
UAV_Linear_Velocity | 3 | X | X | X | X
UAV_Propellers_Action | 4 | X | X | X | X
UAV_Ultrasonic_Sensors | 10 | – | X | X | X
UAV_Global_Limit | 1 | – | X | X | X
UAV_Travelled_Path | 1 | – | – | X | X
UAV_Last_Floor_Image | 4 | – | – | – | X
UAV_Last_Front_Image | 4 | – | – | – | X
UAV_Currently_Floor_Image | 4 | – | – | – | X
UAV_Currently_Front_Image | 4 | – | – | – | X
UAV_Autoencoder_Loss_Rate | 1 | – | – | – | X
UAV_Position_Env | 3 | – | X | X | X
5.2 SC0 - Empty Scenario
In Figure 5.1, it can be verified that the policy learned by the DRL was able to maximize the reward value, stabilizing close to 125,000 timesteps, which corresponds to 4,250 episodes and approximately 3 days of training.
To verify the stability and accuracy of the UAV flight, the learned policy was submitted to the same scenario, but starting from different global positions [x, y, z] and angular orientations [φ, θ, ψ]. The tests performed demonstrated an average reward per step of 1.17, reaching a total reward of 293.7. Figure 5.2 shows the behavior of the UAV in a random test, showing the trajectory performed. Note that, despite the trajectory not being perfect, the aircraft maintained a behavior close to the expected one, achieving good accuracy in x and y and a small difference in the z axis, as verified by the Final State Accuracy graph in Figure 5.3. The UAV quickly achieved a good dynamic movement, attesting to a satisfactory degree of robustness; still, it requires a little more precision due to the irregular amplitudes noted in the angular velocity curves, seen in Figure 5.4. In this first stage, with 4,250 epochs, the UAV reached a satisfactory degree of robustness. However, it is expected that with more training the DRL can achieve better accuracy, as achieved by [75]. This improvement can also be obtained through training in the next scenarios, e.g. the SC1 - Free Scenario, which is the next to be explored. Thus, the policy learned in this stage will be transferred to the next one and its new behavior will be checked.
Figure 5.1 – Average Reward - Epoch 4,250 - Empty scenario.
Figure 5.2 – SC0 - Path chosen by the UAV - Epoch 4,250 - Empty environment: (a) SC0 - Coppelia view; (b) SC0 - Cartesian plane.
5.3 SC1 - Free Scenario
In this scenario, the UAV must adapt the learned policy to the addition of the ultrasonic sensors to the states, as we insert new objects, previously unknown to the UAV, into the scene. The global limit is reduced on all axes, adapting to the arena's space.
Figure 5.3 – SC0 - Final State Accuracy - [x, y, z] axes: (a) SC0 - x axis; (b) SC0 - y axis; (c) SC0 - z axis.
In this scenario, the aircraft must adapt to the new environment, fine-tuning the learned policy through previously unknown input variations. Since sudden variations in the states can lead to inappropriate UAV behavior, including losing what has already been learned, we vary the states gradually, verifying whether the learned behavior keeps performing as expected. At this stage, the UAV trained over 3,000 episodes, totaling 7,250 elapsed episodes. The learning analysis follows the same methodology applied in the previous scenario. In Figure 5.5, we can see that the policy learned by the DRL enabled a dynamic behavior suitable for displacement within the free environment. The average rewards obtained in the tests were 0.706 per timestep and 176.52 per episode, which are good results considering the amount of additional training performed and the extra complexity of the environment. The view in the Cartesian plane of the path taken by the UAV can be analyzed in Figure 5.6. Although the path is not ideal, it reached 83.13% efficiency when comparing the distance covered with the shortest distance, which is certainly an encouraging value. The precision on the x and y axes was maintained, but the expected steady-state error reduction on the z axis did not occur, which can b