UNIVERSIDADE ESTADUAL PAULISTA "JÚLIO DE MESQUITA FILHO"
Instituto de Ciência e Tecnologia de Sorocaba

MARIA FERNANDA TEJADA BEGAZO

A LEARNING-BASED MODEL-FREE CONTROLLER FOR DECOUPLED HUMANOID ROBOT WALKING

Sorocaba - SP
2020

MARIA FERNANDA TEJADA BEGAZO

A LEARNING-BASED MODEL-FREE CONTROLLER FOR DECOUPLED HUMANOID ROBOT WALKING

MASTER'S THESIS

Text submitted to the Graduate Program in Electrical Engineering (PGEE) of the Institute of Science and Technology (ICT) of Sorocaba as part of the requirements to obtain the title of Master in Electrical Engineering.

Sorocaba - SP
2020

To my advisor, for all his patience, dedication and time in the most difficult moments, and for inspiring me to look outside the bubble of the common to innovation.

B416l Begazo, Maria Fernanda Tejada
A learning-based model-free controller for decoupled humanoid robot walking / Maria Fernanda Tejada Begazo. -- Sorocaba, 2020
94 p. : il., tabs.
Master's dissertation - Universidade Estadual Paulista (Unesp), Instituto de Ciência e Tecnologia, Sorocaba
Advisor: Alexandre da Silva Simões
1. Genetic algorithms. 2. Artificial intelligence. 3. Intelligent control systems. 4. Androids. I. Title.
Catalog record generated automatically by Unesp's cataloging system, Library of the Instituto de Ciência e Tecnologia, Sorocaba, from data provided by the author. This record may not be modified.

UNIVERSIDADE ESTADUAL PAULISTA
Sorocaba Campus

CERTIFICATE OF APPROVAL

DISSERTATION TITLE: "A learning-based model-free controller for decoupled humanoid robot walking"
AUTHOR: MARIA FERNANDA TEJADA BEGAZO
ADVISOR: ALEXANDRE DA SILVA SIMÕES

Approved as partial fulfillment of the requirements for the title of Master in ELECTRICAL ENGINEERING, area: Electronic Systems, by the Examining Committee:

Prof. Dr. ALEXANDRE DA SILVA SIMÕES (virtual participation), Department of Control and Automation Engineering / Institute of Science and Technology - UNESP - Sorocaba Campus
Full Professor REINALDO AUGUSTO DA COSTA BIANCHI (virtual participation), Centro Universitário Fundação Educacional Inaciana Pe. Saboia de Medeiros - FEI
Prof. Dr. MARCOS RICARDO OMENA DE ALBUQUERQUE MAXIMO (virtual participation), Computer Science Division, Department of Computing Methodologies / Instituto Tecnológico de Aeronáutica

Sorocaba, December 30, 2020.

Prof. Dr. Alexandre da Silva Simões
Advisor

Acknowledgements

I thank all the family, friends, teachers and employees of the Sorocaba Institute of Science and Technology, who contributed directly or indirectly to the realization of this work. In particular, I offer my thanks:
• To my parents Cristina and Armando and my brother Anthony for their love, support and encouragement;
• To my aunt Tania for her support and encouragement at all times;
• To Prof. Dr. Alexandre and Dra. Esther, for all the teaching, encouragement, trust and guidance;
• To my friends and colleagues in the laboratory who helped me directly or indirectly, especially Vitor, Nayari, Murillo and Rafael, for the help and work done together;
• To my friends Renata, Claudia, Kevin and Elizabeth for their support and encouragement.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001.
"The sun is for everyone, but the shade is for those who arrive first."
Geremias Ludu

Abstract

Robotics has advanced rapidly in recent decades. Initially focused on automotive industry production after the second industrial revolution, it rapidly expanded to exciting new domains that prompted robots to perform tasks with tools and in environments designed for humans. However, controlling the movements of such robots is becoming more and more challenging due to the growing complexity of their mechanical structure and dynamics. This difficulty has stimulated the investigation of novel methods for robot control that do not depend on the kinematic or dynamic models of the robots. The Marta humanoid robot was designed and built by the LaRoCS and GASI research groups for humanoid walking investigation and presents some novel concepts, like small feet – instead of large ones – and a 3-DOF spherical joint in the waist that allows the design of decoupled control strategies. Taking advantage of the mechanical architecture of this robot, this work presents a model-free controller for humanoid robot walking composed of i) a classical controller for the waist spherical joint that focuses on robot stability, and ii) a trajectory controller for the remaining joints based on a Truncated Fourier Series (TFS), with its parameters tuned by a Genetic Algorithm (GA), that focuses on the implementation of a coordinated robot walking pattern. Results have shown that the decoupled robot control strategy has the potential to confer additional robustness to bipedal robot walking when compared with traditional techniques.

Keywords: Genetic algorithms. Artificial intelligence. Intelligent control systems. Androids.

Resumo

A robótica avançou rapidamente nas últimas décadas. Inicialmente focada na produção industrial de automóveis depois da segunda revolução industrial, ela rapidamente se expandiu para novos e excitantes domínios que demandaram dos robôs a realização de tarefas com ferramentas e em ambientes projetados para seres humanos. No entanto, controlar os movimentos desses robôs está se tornando cada vez mais desafiador devido ao crescimento da complexidade de sua estrutura mecânica e dinâmica. Essa dificuldade tem estimulado a investigação de novos métodos para o controle de robôs que possam não depender do modelo cinemático ou dinâmico dos robôs. A robô humanóide Marta foi projetada e construída pelos grupos de pesquisa LaRoCS e GASI para a realização de investigações de caminhada de robôs humanóides, e apresenta alguns novos conceitos tais como os pés pequenos - ao invés de pés grandes - e uma junta esférica de 3 DOF na cintura que permite o projeto de estratégias de controle desacopladas. Aproveitando a arquitetura mecânica deste robô, este trabalho apresenta um controlador livre de modelo para a caminhada de robôs humanóides composto por i) um controlador clássico para a junta esférica da cintura que foca na estabilidade do robô e ii) um controlador de trajetória para as demais juntas baseado em uma Série Truncada de Fourier (TFS) com seus parâmetros ajustados por um algoritmo genético (GA) que foca na implementação de um padrão de caminhada coordenada do robô. Os resultados mostraram que a estratégia de controle desacoplada tem potencial para conferir robustez adicional à caminhada de robôs bípedes em relação a técnicas convencionais.

Palavras-chave: Algoritmos genéticos. Inteligência artificial. Sistemas inteligentes de controle. Andróides.
List of Figures

Figure 1 – Marta humanoid robot: a) Links; b) Joints (texts) and links (lines connecting joints).
Figure 2 – The world coordinate system $\Sigma_w$ and the local coordinate system of the arm $\Sigma_a$. Vectors and coordinate systems related to the initial (blue) and final (red) position after the rotation of the arm. Inspired by (KAJITA et al., 2014).
Figure 3 – Direct kinematics model of the Marta humanoid robot arm. The reference coordinate system ($\Sigma_M$) in the pelvis joint (in red), and the local coordinate system ($\Sigma_{21}$) in the elbow joint (in green).
Figure 4 – Calculating the inverse kinematics of the right leg.
Figure 5 – The definition, vertical force and horizontal force of the ZMP (KAJITA et al., 2014).
Figure 6 – Relationship between the CoM, the ZMP and the support polygon (KAJITA et al., 2014).
Figure 7 – Bi-dimensional inverted pendulum: the simplest model for a robot, adapted from (KAJITA et al., 2014).
Figure 8 – General framework for feedback control (DURIEZ; BRUNTON; NOACK, 2017).
Figure 9 – Open-loop control architecture (DURIEZ; BRUNTON; NOACK, 2017).
Figure 10 – Closed-loop control architecture (DURIEZ; BRUNTON; NOACK, 2017).
Figure 11 – PID control scheme (DURIEZ; BRUNTON; NOACK, 2017).
Figure 12 – Unit-step response of a physical system (OGATA, 2010).
Figure 13 – Examples of the frequency response (PINTO, 2014).
Figure 14 – Main features of the Genetic Algorithm.
Figure 15 – The agent-environment interaction in Reinforcement Learning (SUTTON; BARTO, 2018).
Figure 16 – Classification of the state of the art: most relevant aspects in the humanoid robot walking literature.
Figure 17 – Phases in the quasi-static walk (CUEVAS; ZALDÍVAR; ROJAS, 2004).
Figure 18 – Human walking cycle (FIGUEROA; MEGGIOLARO, 2016).
Figure 19 – Human walking angular trajectory of the hip and knee (YANG et al., 2007).
Figure 20 – Generic walking pattern elaborated from human walking features (YANG et al., 2007).
Figure 21 – Decoupled controller approach for the humanoid robot Marta: a) all of Marta's joints; b) the upper part determined by the spherical joint (waist controller); c) the arms and legs (TFS controller). The red dots represent the joints that will not be controlled.
Figure 22 – The pelvis pitch, knee pitch and pelvis roll trajectory (SHAFII; REIS; LAU, 2010).
Figure 23 – Steady-state genetic algorithm.
Figure 24 – Marta physical robot (left) and DOFs (right).
Figure 25 – Programming environment.
Figure 26 – GAlib class hierarchy (WALL, 1996).
Figure 27 – Class diagram of the developed code.
Figure 28 – Marta's virtual model in V-REP.
Figure 29 – Sequence diagram between the controller and the simulator.
Figure 30 – Angles of the waist joints after the tuning of the PID parameters: a) start position of the robot; b) angles of the spherical joint over time. Similar to the inverted pendulum, the robot reached stability with the PID controller after the tuning process.
Figure 31 – Body trajectories of the typical individuals selected by the Genetic Algorithm (GA) using distinct fitness functions: a) forward shift (FF1); b) forward shift and time (FF2); c) discount for diagonal shifts (FF3); d) stimulated diagonal (FF4).
Figure 32 – Average fitness value during the training process for distinct ankle control strategies.
Figure 33 – Walking patterns for distinct ankle control strategies: a) Shafii equations (ANK-01); b) IMUs placed at the robot's feet (ANK-02); c) IMUs placed at the robot's feet and pelvis (ANK-03).
Figure 34 – General evaluation of the decoupled controller strategy. Average values of the fitness function for each generation in the learning phase for EXP-01 to EXP-05.
Figure 35 – Movements learned by the Marta robot over time in EXP-01 (no DOFs in robot waist). Without an active control the robot is naturally unstable and was unable to learn a walking pattern.
Figure 36 – Movements learned by the Marta robot over time in EXP-02 (pitch DOF in waist spherical joint). The robot was able to learn a few steps.
Figure 37 – Movements learned by the Marta robot over time in EXP-03 (pitch and yaw DOFs in waist spherical joint). The robot was able to learn a few steps and has a more pronounced lateral displacement.
Figure 38 – Movements learned by the Marta robot over time in EXP-04 (pitch and roll DOFs in waist spherical joint). The robot was able to learn a few steps.
Figure 39 – Movements learned by the Marta robot over time in EXP-05 (pitch, yaw and roll DOFs in waist spherical joint). The robot was able to learn a more structured walking pattern before falling.
Figure 40 – Movements learned by the Marta robot over time in the final experiment. The robot was able to learn a more structured walking pattern before falling.

List of Tables

Table 1 – Ziegler-Nichols tuning rule based on step response.
Table 2 – Ziegler-Nichols tuning rule based on frequency response.
Table 3 – Review of quasi-static walking works.
Table 4 – Review of dynamic walking works.
Table 5 – Specifications of the dimensions of Marta (CHENATTI et al., 2018).
Table 6 – Limits for each genome parameter of the GA.
Table 7 – PID parameters selected for the experiments.

List of Abbreviations and Acronyms

3D-FPBIPM 3D force-pattern based inverted pendulum model
AD Anno Domini
AI Artificial Intelligence
BC Before Christ
CPU Central Processing Unit
CPG Central Pattern Generator
CoM Center of Mass
DOF Degrees Of Freedom
EA Evolutionary Algorithms
GASI Automation and Integrated Systems Group
GA Genetic Algorithm
GUI Graphical User Interface
IMU Inertial Measurement Unit
IPM Inverted Pendulum Model
LaRoCS Laboratory of Robotics and Cognitive Systems
LIPM Linear Inverted Pendulum Model
MDP Markov Decision Process
ML Machine Learning
NUC Next Unit of Computing
OS Operating System
PID Proportional-Integral-Derivative
RL Reinforcement Learning
TLIPM Triple Linear Inverted Pendulum Model
TFS Truncated Fourier Series
UNESP Universidade Estadual Paulista
UNICAMP Universidade Estadual de Campinas
VLIPM Virtual Linear Inverted Pendulum Model
V-REP Virtual Robot Experimentation Platform
ZMP Zero-Moment Point

List of Symbols

$\Sigma_w$ World coordinate system of the humanoid robot
$\Sigma_j$ Local coordinate system of the j-th joint
$p_i$ Position vector of the i-th joint in $\Sigma_w$
$p_i^j$ Position vector of the i-th joint viewed from the local coordinate system of the j-th joint ($\Sigma_j$)
$r$ Resultant vector of $p_i$ and $p_j$
$e_{ic}$ The c axis of the coordinate system $\Sigma_i$
$R_c(\theta)$ Rotation matrix of angle $\theta$ around axis c
$R_i$ Rotation matrix of the i-th joint in $\Sigma_w$
$R_{nm}$ The entry at position n, m of the rotation matrix
$T_j$ Homogeneous transformation matrix of the j-th joint in $\Sigma_w$
$T_j^i$ Homogeneous transformation matrix of the j-th joint viewed from the local coordinate system of the i-th joint
$c_*$ $\cos(*)$
$s_*$ $\sin(*)$
$b_j$ Vector between $\Sigma_w$ and $\Sigma_j$
$q_i$ Rotation angle of the i-th joint
$M$ Total mass of the robot
$C$ Robot's center of mass
$m_i$ Mass of the i-th mass point
$p_i$ Position of the i-th mass point
$P$ Momentum of the robot
$L$ Angular momentum of the robot
$\mathcal{L}^r$ Angular momentum about the reference point r
$\dot{L}$ Rate of change of the angular momentum about the origin
$f$ External force applied to the robot other than gravity
$f_{all}$ All external forces applied to the robot
$\tau$ External torque applied to the robot
$\tau_{all}$ All external torques applied to the robot
$g$ Gravitational acceleration vector
$f_{ij}^{int}$ Internal force applied to the mass of the i-th joint by the j-th one
$f_i^{ext}$ External force applied to the mass of the i-th joint
$v_i$ Velocity of the i-th joint
$\omega$ Angular velocity
$I$ Inertia tensor (coefficient matrix)
$l_c$ Length along the c axis
$w_r$ Reference signal of the control system
$w_d$ External disturbance of the control system
$w_n$ Sensor noise of the control system
$e(*)$ Error function
$K_p$ Proportional gain of the PID controller
$K_i$ Integral factor of the PID controller
$K_d$ Derivative factor of the PID controller
$T_i$ Integral time of the PID controller
$T_d$ Derivative time of the PID controller
$L_z$ Delay time of the Ziegler-Nichols step-response method
$T_z$ Time constant of the Ziegler-Nichols step-response method
$K_u$ Critical gain of the Ziegler-Nichols frequency-response method
$T_u$ Critical oscillation period of the Ziegler-Nichols frequency-response method
$p(i)$ Probability of the i-th individual of the Genetic Algorithm
$f(i)$ Fitness value of the i-th individual of the Genetic Algorithm
$N(0,\sigma)$ Gaussian distribution with zero mean and standard deviation $\sigma$
$\pi_t$ Policy at time t in Reinforcement Learning
$\pi_*$ Optimal policy in Reinforcement Learning
$S_t$ State at time t in Reinforcement Learning
$A_t$ Action at time t in Reinforcement Learning
$R_t$ Reward at time t in Reinforcement Learning
$V_\pi(*)$ Value function in Reinforcement Learning
$p(*)$ Probability distribution in Reinforcement Learning
$G_t$ Cumulative reward (return) at time t in Reinforcement Learning
$\alpha$ Learning rate of Reinforcement Learning
$\gamma$ Discount factor of Reinforcement Learning

Contents

List of Figures
List of Tables
Contents
1 INTRODUCTION
1.1 OBJECTIVE
1.1.1 MAIN OBJECTIVE
1.1.2 SPECIFIC OBJECTIVES
1.2 WORK ORGANIZATION
2 FUNDAMENTALS OF HUMANOID ROBOTS
2.1 HUMANOID ROBOTS
2.1.1 KINEMATICS
2.1.1.1 DIRECT KINEMATICS
2.1.1.2 INVERSE KINEMATICS
2.1.2 ZERO-MOMENT POINT (ZMP)
2.1.3 DYNAMICS
2.1.4 MOMENTUM
2.2 INVERTED PENDULUM
3 MACHINE LEARNING CONTROL
3.1 CONTROL THEORY
3.1.1 FEEDBACK CONTROL
3.1.2 PID CONTROL
3.1.2.1 ZIEGLER-NICHOLS TUNING
3.2 MACHINE LEARNING
3.2.1 OPTIMIZING WITH EVOLUTIONARY ALGORITHMS
3.2.2 PRINCIPLES OF THE EVOLUTION
3.2.3 GENETIC ALGORITHM
3.2.3.1 GENETIC REPRESENTATIONS
3.2.3.2 INITIAL POPULATION
3.2.3.3 FITNESS FUNCTION
3.2.3.4 SELECTION AND REPRODUCTION
3.2.3.5 GENETIC OPERATORS
3.2.3.6 SCALING
3.2.4 REINFORCEMENT LEARNING
3.2.4.1 MARKOV DECISION PROCESS
3.2.4.2 AGENT'S OBJECTIVE
3.2.4.3 POLICY AND VALUE FUNCTIONS
4 HUMANOID ROBOT WALKING APPROACHES
4.1 CONTROL ALGORITHMS
4.1.1 QUASI-STATIC WALKING
4.1.2 DYNAMIC WALKING
4.1.2.1 BASED ON MODEL
4.1.2.2 MODEL-FREE
5 PROPOSED APPROACH
5.1 DECOUPLED CONTROLLER APPROACH
5.1.1 TFS CONTROLLER
5.1.2 WAIST CONTROLLER
5.2 MODEL-FREE PARAMETERS LEARNING
6 MATERIALS AND METHODS
6.1 MARTA HUMANOID ROBOT
6.1.1 ACTUATORS
6.1.2 SENSORS
6.1.3 PROCESSOR
6.2 PROGRAMMING
6.3 ROBOTICS SIMULATOR
6.3.1 COMMUNICATION
6.4 EXPERIMENTAL PROCEDURE
6.4.1 GENETIC ALGORITHM
6.4.2 FITNESS FUNCTIONS
6.4.3 REFERENCE POSITION
6.4.4 DAMPING FACTOR
6.4.5 ANKLE CONTROL
7 RESULTS AND DISCUSSION
7.1 PRELIMINARY TESTS
7.1.1 WAIST PID TUNING
7.1.2 WAIST RL TUNING
7.1.3 INFLUENCE OF THE FITNESS FUNCTION
7.1.4 INFLUENCE OF THE ANKLE CONTROL STRATEGY
7.2 DECOUPLED CONTROLLER EVALUATION
7.3 DECOUPLED CONTROLLER WITH RL
8 CONCLUSIONS AND FUTURE WORKS
BIBLIOGRAPHY
1 INTRODUCTION

"I've never expected a miracle. I will get things done myself."
(Guts character) Berserk

For a long time, humans have dreamed of creating skillful and intelligent machines that could help them in tedious or dangerous activities. This dream began to come true at the end of the second industrial revolution, when machines were typically used in repetitive tasks that harmed human health, especially in automobile factories. These early machines were called manipulator arms, opening the way for the exploration of the field of robotics. Over the past few decades, robotics has spread to many areas, allowing machines to perform activities outside of industry, in medicine, customer service, the military, exploration, rescue, education, etc.

Even considering the dissemination of robotics nowadays, interacting with environments and tools designed for humans still represents a significant challenge for robots. In this context, humanoid robots – robots that resemble a human appearance – occupy a prominent place due to their ability to adapt more easily to these challenging scenarios. Since such robots must have the ability to move from one point to another as closely as possible to the way humans do, legged robots have attracted particular interest from researchers. This context justifies the interest in researching new controllers for humanoid robots that can expand their capabilities in the bipedal walking task.

The control of humanoid robot walking is usually divided into two approaches: i) the quasi-static walk and ii) the dynamic walk. The quasi-static walk focuses on a kinematic modeling of the robot that is typically based on manipulator arm theory. The main problem with this approach is the reduction in walking speed, since the robot must perform movements slow enough to make the robot's dynamics negligible. On the other hand, dynamic walk modeling focuses on obtaining the stability of the robot together with the dynamics of the environment. If the controller is based on the robot's dynamic model, it typically focuses on the internal and external forces that act on the humanoid during the movement. This technique comes up against the complexity of modeling the robot dynamics, which is not trivial. There are two approaches to dynamic walking: the model-based approach requires the robot's dynamic model in the environment, while the model-free approach does not require it, since it discovers the dynamic model empirically. This last approach allows us to work with various types of humanoids with minor modifications to the controller.
Recently, the working group – the Automation and Integrated Systems Group (GASI) at Unesp and the Laboratory of Robotics and Cognitive Systems (LaRoCS) at Unicamp – has designed and built, in physical and simulated environments, a humanoid robot called Marta with twenty-five DOFs designed for walking research, which presents at least three particular concepts when compared to most commercial standard robots: i) the small dimension of the feet, ii) the articulation of the toe, and iii) a 3-DOF spherical joint placed in the robot's waist that allows the design of decoupled control strategies. These constructive particularities bring new challenges and possibilities to humanoid dynamic walk control for maintaining the robot's balance during the walk. Taking advantage of the mechanical architecture of this robot, this work proposes a model-free controller for humanoid robot walking composed of i) a classical controller for the waist spherical joint that focuses on robot stability, and ii) a trajectory controller for the remaining joints based on a Truncated Fourier Series (TFS), with its parameters tuned by a Genetic Algorithm (GA), that focuses on the implementation of a coordinated robot walking pattern.

1.1 OBJECTIVE

1.1.1 MAIN OBJECTIVE

The main goal of this work is to investigate the possibility of controlling the gait of a bipedal robot – particularly the Marta humanoid robot – using a strategy based on the decoupling of the robot joints into two sets: i) waist and ii) remaining DOFs, with each set guided by a distinct controller. A first controller is designed for the 3-DOF spherical joint placed in the robot's waist, focusing on improving the humanoid's stability during the walk. Simultaneously, a second model-free controller generates a set of equations for controlling the remaining joints' trajectories based on the Truncated Fourier Series (TFS), with its parameters properly tuned through a machine learning process based on a Genetic Algorithm (GA).

1.1.2 SPECIFIC OBJECTIVES

The specific objectives of this work are:
• To implement the computational framework that allows working with the simulated version of the Marta humanoid robot;
• To implement a classical controller on the waist of the Marta robot and to evaluate the stability of the robot during walking;
• To implement the optimization of the TFS parameters with a Genetic Algorithm for controlling the DOFs of the Marta robot;
• To compare the walking pattern achieved with the implemented algorithm with others available in the literature.

1.2 WORK ORGANIZATION

This work has been organized as follows:
• Chapter 2 details the theoretical foundations of humanoid robots. The chapter focuses on the mathematical models available in the literature to describe a humanoid robot;
• Chapter 3 presents the machine learning control foundations. We explain control theory and Ziegler-Nichols tuning. Also, we detail machine learning from the perspectives of evolutionary algorithm theory and reinforcement learning;
• Chapter 4 presents a review of the state of the art on biped walking controllers.
Two types of controllers are examined: those based on models and those that are model-free;
• Chapter 5 describes the proposed approach in more detail;
• Chapter 6 examines the materials that we propose to use in this work: the Marta humanoid robot and the simulator that will be used in the experiments and for the evaluation of the algorithm;
• Chapter 7 presents the experiments performed, results and discussions;
• Chapter 8 presents the conclusions of this work and directions for future work.

2 FUNDAMENTALS OF HUMANOID ROBOTS

"Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth."
(Sherlock Holmes) Arthur Conan Doyle

2.1 HUMANOID ROBOTS

Humanoid robotics is the field of robotics that deals with robots that resemble the human body structure. The first mention of a physical mechanism that can be associated with the concept of a robot goes back to Greek mythology with the legend of Talos, a bronze giant built by Zeus to protect the island of Crete from invaders (1500 BC) (SMITH, 1849). This concept is also present in the statues of the Egyptian oracle machines that symbolized a deity able to answer questions, similar to a chat-bot with a physical body (HRASTINSKI et al., 2019). The first automated mechanical device, capable of measuring time by regulating the flow of liquids from one vessel to another and known as the clepsydra, existed in Babylon, Egypt, and Persia (TURNER, 1984). This mechanical device was improved with a water-and-levers system by Hero of Alexandria, who built various mechanisms such as birds that drank water, statues that served wine, and automatic doors, among others (10 AD - 70 AD). The most relevant machine he made was "The Automated Theater", which represented the Trojan War through a mechanical puppet theater.

The development of automata continued. In Asia, the Karakuri were wooden mechanisms that represented traditional myths and legends (YOKOTA, 2009). In the eighteenth century, Pierre Jaquet-Droz built three sophisticated automata: the musician, the draughtsman, and the writer (VOSKUHL, 2007). In 1920, the writer Karel Capek created the term robot in his science fiction work "Rossum's Universal Robots (R.U.R.)" (CAPEK, 2004). This term comes from the Czech word robota, which means forced labor. Furthermore, Capek's work is the starting point for the creation of stories with artificial human bodies.

In the twentieth century, the development of integrated circuits, digital computers, and miniaturized components allowed computer-controlled robots to be developed. Robotics established itself as the field of science related to the design and production of robots. In 1948, the Argonne National Laboratory developed the first robot that manipulated radioactive elements (PAUL, 1984), inaugurating the branch of manipulator robots. These robots became essential components in industry, especially in the automotive industry. More recently, robots have found new applications outside of industry (BARRIENTOS et al., 1997), in areas such as cleaning, entertainment, surveillance, search and rescue, underwater, space, and medical applications. For a long time, human beings have tried to replicate their natural experience in their environment by imitating these events with artificial mechanisms. The experiences related to their own bodies are some of the most challenging, considering that the functioning of the human body and mind is still a mystery in many ways.
Not surprisingly, this field has been an inspiration for artists, engineers, and scientists. In this perspective, we can understand humanoid robotics as the field concerned with the creation of machines that operate in the real world and exhibit human form and behavior (SICILIANO; KHATIB, 2019).

A robot with a body similar to the human body can more easily adapt to the human environment. In this way, we expect these robots to perform simple tasks like walking in a room without colliding with an object, for example. A task like this involves different challenges, such as i) creating a bipedal mechanism with several DOFs, ii) endowing this mechanism with a considerable sensory capacity to make its interaction with the environment possible, and iii) developing techniques to control the stability of the robot during the walk, among others. In modern robotics, the starting point for the design of such mechanisms is kinematics and dynamics. These concepts will be detailed in the following subsections.

2.1.1 KINEMATICS

Humanoid robots interact and move in 3D space, and some strategy is necessary to model this interaction. A humanoid robot is usually modeled as a collection of rigid parts (links) and joints (the connections between two or more links). Kinematics is the theory usually adopted to analyze the relationship between the position and attitude of a link and the joint angles of a mechanism (KAJITA et al., 2014). Figure 1 presents the Marta humanoid robot structure with its joint names (CHENATTI et al., 2018).

Figure 1 - Marta humanoid robot: a) Links; b) Joints (texts) and links (lines connecting joints).

The humanoid robot and its parts have a physical position in the world (absolute position), which can be defined by a fixed coordinate system (x, y, z) called the world coordinate system ($\Sigma_w$) (CRAIG, 2006). A local coordinate system ($\Sigma_a$) can be attached to any moving part, for example, to the left shoulder (Figure 2). The absolute position of the shoulder is defined by the vector $p_a$, and the vector $r$ gives the position of the wrist in relation to the shoulder. The absolute position of the wrist can then be expressed in the following way:

$p_h = p_a + r.$ (1)

The absolute wrist end position after the arm is lifted can be expressed by introducing another vector $r'$, which points to the lifted wrist from the shoulder.

Figure 2 - The world coordinate system $\Sigma_w$ and the local coordinate system of the arm $\Sigma_a$. Vectors and coordinate systems related to the initial (blue) and final (red) position after the rotation of the arm. Inspired by (KAJITA et al., 2014).

The hand lifts by the vector's rotation from $r$ to $r'$ while the shoulder keeps the same position $p_a$:

$p_h = p_a + r'.$ (2)
Unlike the world coordinate system, which is fixed to the ground, the local coordinate system $\Sigma_a$ moves together with the attached link. This coordinate system is defined by the vectors $e_{ax}$, $e_{ay}$ and $e_{az}$, which are parallel to those of $\Sigma_w$ in the initial position of the arm. When the arm rotates by an angle $\phi$ about the axis $e_{ax}$, the local coordinates also rotate. The new vectors can be obtained by the following equations:

$e_{ax} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad e_{ay} = \begin{bmatrix} 0 \\ \cos\phi \\ \sin\phi \end{bmatrix} \quad e_{az} = \begin{bmatrix} 0 \\ -\sin\phi \\ \cos\phi \end{bmatrix}.$ (3)

These vectors define the matrix $R_a = [e_{ax}, e_{ay}, e_{az}]$, which describes the relationship between $r$ and $r'$:

$r' = R_a r.$ (4)

In the example shown in Figure 2 we can observe:
• The position of the end of the wrist $p_h$ viewed from $\Sigma_w$;
• The position of the end of the wrist $p_h^a$ viewed from $\Sigma_a$, where the upper index of $p$ indicates the local coordinate system.

The relationship between $p_h$ and $p_h^a$ can be described from Equations (2) and (4) as:

$p_h = p_a + R_a p_h^a$ (5)

$\begin{bmatrix} p_h \\ 1 \end{bmatrix} = \begin{bmatrix} R_a & p_a \\ 0\ 0\ 0 & 1 \end{bmatrix} \begin{bmatrix} p_h^a \\ 1 \end{bmatrix}.$ (6)

The homogeneous transformation matrix is obtained from the rotation matrix and the translation vector, so in this system $T_a$ converts points described in the arm's local coordinates to world coordinates:

$\begin{bmatrix} p_h \\ 1 \end{bmatrix} = T_a \begin{bmatrix} p_h^a \\ 1 \end{bmatrix}.$ (7)

If we consider the rotations around the x, y and z axes, respectively called Roll ($\alpha$), Pitch ($\beta$) and Yaw ($\gamma$), the rotation matrices will be given by (SPONG; HUTCHINSON; VIDYASAGAR, 2004):

$R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}$ (8)

$R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}$ (9)

$R_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}.$ (10)

Now it is possible to introduce the Z-Y-X Euler angles notation, frequently used to model the attitude of ships, airplanes, and robots (BARRIENTOS et al., 1997). The rotation matrix will be given by:

$R_{rpy}(\alpha,\beta,\gamma) = R_z(\gamma)\, R_y(\beta)\, R_x(\alpha),$ (11)

$R_{rpy}(\alpha,\beta,\gamma) = \begin{bmatrix} c_\gamma c_\beta & -s_\gamma c_\alpha + c_\gamma s_\beta s_\alpha & s_\gamma s_\alpha + c_\gamma s_\beta c_\alpha \\ s_\gamma c_\beta & c_\gamma c_\alpha + s_\gamma s_\beta s_\alpha & -c_\gamma s_\alpha + s_\gamma s_\beta c_\alpha \\ -s_\beta & c_\beta s_\alpha & c_\beta c_\alpha \end{bmatrix},$ (12)

where $c_* \equiv \cos(*)$, $s_* \equiv \sin(*)$, and so on.

2.1.1.1 DIRECT KINEMATICS

Direct kinematics focuses on finding the relative position and orientation of a joint from a reference coordinate system, where the values of each joint and the geometric parameters of the robot are known (PAUL, 1984). In a few words, a robot is a kinematic chain formed by rigid objects joined together by joints; therefore, the direct kinematics problem is reduced to finding a homogeneous transformation matrix T associating the position and orientation of the requested joint (CRAIG, 2006). The kinematic chain can have different representations; the most used is Denavit-Hartenberg (DH). DH defines the i-th link by four transformations (BARRIENTOS et al., 1997):
• $\theta_i$: Rotation around the $z_{i-1}$-axis;
• $d_i$: Translation along the $z_{i-1}$-axis;
• $a_i$: Translation along the $x_i$-axis;
• $\alpha_i$: Rotation around the $x_i$-axis.

The relative position and orientation between the systems associated with two consecutive links of the robot is usually expressed by the matrix $T^{i-1}_i$, so the matrix $T^0_k$ is the result of the product of the matrices $T^{i-1}_i$ with $i \in [1,k]$, which represents the total or partial kinematic chain that forms the robot.
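The chain composition above can be summarized in a short numerical sketch. The snippet below is an illustration only, assuming the standard DH transform; the three-link parameters are hypothetical and do not correspond to Marta's real dimensions, which are given in Chapter 6.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform T_i^{i-1} built from the four DH parameters."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

# Hypothetical 3-joint planar chain: (theta, d, a, alpha) per link.
chain = [(np.deg2rad(30.0), 0.0, 0.10, 0.0),
         (np.deg2rad(-15.0), 0.0, 0.12, 0.0),
         (np.deg2rad(45.0), 0.0, 0.08, 0.0)]

# T_k^0 is the ordered product of the per-link transforms.
T = np.eye(4)
for params in chain:
    T = T @ dh_transform(*params)

R, p = T[:3, :3], T[:3, 3]   # orientation and position of the end joint
print("end-joint position:", p)
```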
To illustrate the previous definitions, the matrix $T^1_{21}$ of the elbow of Marta's left arm (Figure 3) is defined by:

$T^1_{21} = T^1_{19}\, T^{19}_{20}\, T^{20}_{21}$ (13)

$T^1_{21} = \begin{bmatrix} c_{\theta_{19}} & -c_{\alpha_{19}} s_{\theta_{19}} & s_{\alpha_{19}} s_{\theta_{19}} & a_{19} c_{\theta_{19}} \\ s_{\theta_{19}} & c_{\alpha_{19}} c_{\theta_{19}} & -s_{\alpha_{19}} c_{\theta_{19}} & a_{19} s_{\theta_{19}} \\ 0 & s_{\alpha_{19}} & c_{\alpha_{19}} & d_{19} \\ 0 & 0 & 0 & 1 \end{bmatrix} T^{19}_{20}\, T^{20}_{21}$ (14)

$T^1_{21} = \begin{bmatrix} R^1_{21} & p^1_{21} \\ 0\ 0\ 0 & 1 \end{bmatrix},$ (15)

where: $T^1_{19}$ is the homogeneous transformation matrix of the shoulder pitch with respect to the reference coordinate system $\Sigma_M$, $T^{19}_{20}$ is the transformation matrix of the shoulder roll with respect to the shoulder pitch coordinate system, and $T^{20}_{21}$ is the transformation matrix of the elbow pitch with respect to the shoulder roll coordinate system; $c_* \equiv \cos(*)$, $s_* \equiv \sin(*)$; $R^1_{21}$ and $p^1_{21}$ are the relative orientation and position of the elbow pitch.

Figure 3 - Direct kinematics model of the Marta humanoid robot arm. The reference coordinate system ($\Sigma_M$) in the pelvis joint (in red), and the local coordinate system ($\Sigma_{21}$) in the elbow joint (in green).

2.1.1.2 INVERSE KINEMATICS

Inverse kinematics consists of finding the values that the robot's joints $q = [q_1, q_2, \ldots, q_n]^T$ must adopt so that one end of the robot reaches the position $p^*$ and orientation $R^*$ of a determined spatial location. In the previous section, direct kinematics was a systematic procedure, as it is independent of the robot configuration. In contrast, obtaining the inverse kinematics equations is strongly dependent on the robot configuration (BARRIENTOS et al., 1997). Generic procedures have been developed based on the knowledge of the robot kinematics and numerical methods. Kajita et al. (2014) exemplify this method on the robot's leg, so we apply it to the Marta robot shown in Figure 4. The position and orientation of HipYaw ($p_4$, $R_4$) are known, and we want to find the joint angles that place RightAnkleRoll at a desired position and orientation ($p_{10}$, $R_{10}$). The inverse kinematics proceeds in the following steps. First, the position of RightLegYaw ($p_5$) is given by:

$p_5 = p_4 + R_4 \begin{bmatrix} 0 \\ D \\ 0 \end{bmatrix},$ (16)

where D is the distance between HipYaw and RightLegYaw. Next, we calculate the position of the crotch viewed from the RightAnkleRoll coordinate frame:

$r = R_{10}^T (p_5 - p_{10})$ (17)

$r \equiv [r_x\ r_y\ r_z]^T.$ (18)

From this we calculate the distance between RightAnkleRoll and HipYaw, which we define as:

$C = \sqrt{r_x^2 + r_y^2 + r_z^2}.$ (19)

In Figure 4b, using the cosine rule in the triangle formed by the segments A, B and C, we obtain the knee angle ($q_8$):

$q_8 = -\arccos\left(\frac{A^2 + B^2 - C^2}{2AB}\right) + \pi.$ (20)

If we define the angle at the lower end of the triangle as $\alpha$, from the sine rule we get:

$\frac{C}{\sin(\pi - q_8)} = \frac{A}{\sin\alpha}$ (21)

$\alpha = \arcsin\left(\frac{A \sin(\pi - q_8)}{C}\right).$ (22)

Now, focusing on the RightAnkleRoll local coordinates shown in Figure 4c, it is possible to calculate the ankle roll and pitch angles from the vector r:

$q_{10} = \mathrm{atan2}(r_y, r_z)$ (23)

$q_9 = -\mathrm{atan2}\left(r_x, \mathrm{sign}(r_z)\sqrt{r_y^2 + r_z^2}\right) - \alpha.$ (24)

Finally, we still need to compute the yaw, roll and pitch angles at the base of the leg, from the equations that define each joint.

Figure 4 - Calculating the inverse kinematics of the right leg: a) definition; b) knee angle; c) ankle angle.

$R_{10} = R_4 R_z(q_5) R_x(q_6) R_y(q_7) R_y(q_8 + q_9) R_x(q_{10})$ (25)

$R_z(q_5) R_x(q_6) R_y(q_7) = R_4^T R_{10} R_x(-q_{10}) R_y(-q_8 - q_9).$ (26)
When solving the previous expression, we have the following equation:

$\begin{bmatrix} c_5 c_7 - s_5 s_6 s_7 & -s_5 c_6 & c_5 s_7 + s_5 s_6 c_7 \\ s_5 c_7 + c_5 s_6 s_7 & c_5 c_6 & s_5 s_7 - c_5 s_6 c_7 \\ -c_6 s_7 & s_6 & c_6 c_7 \end{bmatrix} = \begin{bmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{bmatrix}.$ (27)

Finally, using the left-hand side of this equation, we obtain:

$q_5 = \mathrm{atan2}(-R_{12}, R_{22})$ (28)

$q_6 = \mathrm{atan2}(R_{32}, -R_{12} s_5 + R_{22} c_5)$ (29)

$q_7 = \mathrm{atan2}(-R_{31}, R_{33}).$ (30)

2.1.2 ZERO-MOMENT POINT (ZMP)

Most industrial robots are fixed to the ground, so their workspace is limited to the range that their joints can reach. Humanoid robots, however, have to move while maintaining an indispensable condition: keeping their feet in contact with the ground. The ZMP is usually used to accomplish this purpose.

The Zero-Moment Point (ZMP) is a concept introduced by Miomir Vukobratovic in 1968, which denotes the point at which the dynamic reaction force produced by the contact of the foot with the ground does not produce any moment in the horizontal direction (VUKOBRATOVIC et al., 2012). In other words, the ZMP is the point where the sum of all the horizontal forces, the inertia, and gravity is equal to zero. In 1972, Vukobratovic and Stepanenko wrote the first article on the control of bipedal robots using the ZMP (VUKOBRATOVIC; BOROVAC, 2004).

Figure 5 shows the distribution of force across the foot. The load has the same sign over the entire surface, so it can be reduced to a single resultant force R; the ZMP is the point where the force R acts.

Figure 5 - The definition, vertical force and horizontal force of the ZMP (KAJITA et al., 2014).

Another important concept related to the ZMP is the support polygon. In Figure 6, the support polygon is the area formed by all the contact points between the robot and the ground. It is relevant to examine Vukobratovic's definition of the ZMP: "the point that exists within the limit of the foot". The relationship between the Center of Mass (CoM), the ZMP, and the support polygon is illustrated in Figure 6. Figure 6a shows a support polygon formed by two feet in contact with the ground, where the projection of the CoM on the ground coincides with the ZMP, which means that the robot is in equilibrium. On the other hand, Figure 6b shows a single support polygon in which the CoM projection is outside the support polygon; consequently, its control is complicated.

Figure 6 - Relationship between the CoM, the ZMP and the support polygon (KAJITA et al., 2014): a) a standing human; b) a human in action.

2.1.3 DYNAMICS

In the previous subsection, we presented the CoM, ZMP, and support polygon concepts that allow the humanoid robot to keep its balance. Now we analyze the relationship between the reaction force of the ground and the robot's movement, following Kajita's deduction (KAJITA et al., 2014). A humanoid robot has a structure similar to a human, so we define the following ten physical parameters that allow the robot to balance, classified into four groups:
• Mass: total mass of the robot, $M$ [kg];
• Center of Mass: robot's center of mass, $C \equiv [x\ y\ z]^T$ [m];
• Momentum: measure of the robot's translational motion, $P \equiv [P_x\ P_y\ P_z]^T$ [Ns];
• Angular Momentum: measure of the robot's rotational motion about the origin, $L \equiv [L_x\ L_y\ L_z]^T$ [Nms].

The dynamics are governed by the physical laws relating these quantities; the principle is expressed by:

$\dot{C} = P/M$ (31)

$\dot{P} = f_{all}$ (32)

$\dot{L} = \tau_{all},$ (33)

where $f_{all}$ and $\tau_{all}$ denote the sum of all forces and torques applied to the robot from the outside.

Equation (31), concerning the translational motion, gives the relationship between the velocity of the center of mass and the momentum. Equation (32), also for the translational motion, shows how the momentum changes due to external forces. Finally, Equation (33) shows that the angular momentum changes according to the sum of the external moments.

Analyzing the forces applied from outside the robot, we have the gravitational force applied to each component of the robot, whose sum can be considered as a force Mg applied to the robot's center of mass C. On Earth, the vector of gravitational acceleration is given by $g = [0\ 0\ -9.8]^T$ [m/s²]. So the equations become:

$f_{all} = Mg + f$ (34)

$\dot{P} = Mg + f,$ (35)

where f denotes the force other than gravity, for example, the ground reaction force. When a robot stands still, the momentum change is zero, as the gravitational force balances out with the ground reaction force. Free fall occurs when the ground reaction force disappears; at that moment the robot accelerates rapidly downward due to gravity.

We now take into account the dynamics of the rotational motion, given by Equation (33). The moment generated by the gravitational force is given by:

$\tau_g = C \times Mg.$ (36)

If $\tau$ denotes the moment without the effect of gravity, the total moment applied to the robot is given by:

$\tau_{all} = C \times Mg + \tau.$ (37)

The equation of the rotational movement about the origin can then be expressed as follows:

$\dot{L} = C \times Mg + \tau.$ (38)

The moment $\tau$ considers only the ground reaction moment applied to the robot. For a robot to stay still, this moment must balance with the gravitational one. If the ground reaction moment is not balanced, the angular momentum increases rapidly, causing the robot to fall.

2.1.4 MOMENTUM

A humanoid robot can be seen as composed of N point masses, where $m_i$ is the mass of the i-th point (KAJITA et al., 2014):

$M = \sum_{i=1}^{N} m_i$ (39)

$C = \sum_{i=1}^{N} m_i p_i / M$ (40)

$\dot{C} = \sum_{i=1}^{N} m_i \dot{p}_i / M$ (41)

$P = \sum_{i=1}^{N} m_i \dot{p}_i.$ (42)

Equation (40) represents the position of the robot's center of mass, where $p_i$ is the position of the i-th mass point. Equation (42) gives the total momentum of the robot, where $m_i \dot{p}_i$ is the momentum of the i-th mass point. So, by combining Equations (41) and (42):

$\dot{C} = P/M.$ (43)

The dynamics of the robot's momentum is given by the equation of motion of the i-th point mass:

$m_i \ddot{p}_i = \sum_{j=1}^{N} f^{int}_{ij} + f^{ext}_i,$ (44)

where $f^{int}_{ij}$ and $f^{ext}_i$ denote the force applied to the i-th point mass by the j-th one and the force applied to the i-th point mass from outside the robot, respectively. However, due to the law of action and reaction, $f^{int}_{ij} = -f^{int}_{ji}$ ($i \neq j$). This, together with Equation (44), allows us to conclude that the robot's momentum depends only on the external forces:

$\sum_{i=1}^{N} m_i \ddot{p}_i = \sum_{i=1}^{N} f^{ext}_i$ (45)

$\dot{P} = f_{all}.$ (46)

On the other side, the angular momentum of the i-th point mass about the origin is defined by:

$L_i = p_i \times P_i.$ (47)
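These point-mass sums translate directly into code. The sketch below is a minimal illustration of Equations (39)-(43) and (47) with made-up masses, positions and velocities; it is not taken from the controller developed in this work.

```python
import numpy as np

# Hypothetical point masses [kg], positions [m] and velocities [m/s].
m = np.array([2.0, 1.5, 0.5])
p = np.array([[0.0, 0.0, 0.6], [0.0, 0.1, 0.3], [0.1, 0.0, 0.9]])
v = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.0]])

M = m.sum()                                   # Eq. (39): total mass
C = (m[:, None] * p).sum(axis=0) / M          # Eq. (40): center of mass
P = (m[:, None] * v).sum(axis=0)              # Eq. (42): total linear momentum
C_dot = P / M                                 # Eq. (43): CoM velocity
L = np.cross(p, m[:, None] * v).sum(axis=0)   # Eq. (47) summed over i: angular momentum about the origin

print(f"M = {M:.2f} kg, C = {C}, P = {P}, L = {L}")
```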
A robot also contains rigid bodies connected by joints. Consider a rigid body that floats in space and rotates without being affected by external forces; its rotational speed is expressed by the angular velocity vector $\omega$. Assuming that the origin of the reference coordinate system coincides with the center of mass, the velocity at a point of the rigid body is expressed by:

$v_i = v(p_i) = \omega \times p_i.$ (48)

In addition, the total angular momentum of a rigid body about the reference point r can be calculated by:

$\mathcal{L}^r = \sum_i p_i \times (m_i \omega \times p_i)$ (49)

$\mathcal{L}^r = \sum_{i=1}^{N_N} m_i p_i \times (-p_i \times \omega)$ (50)

$\mathcal{L}^r = \left(\sum_{i=1}^{N_N} m_i \hat{p}_i \hat{p}_i^T\right)\omega,$ (51)

where $N_N$ is the total number of point masses about the reference r, and $\hat{p}_i$ is the skew-symmetric matrix such that $\hat{p}_i v = p_i \times v$. The angular momentum above is expressed by the angular velocity vector multiplied by a coefficient matrix I, called the inertia tensor:

$L = \left(\sum_i m_i \hat{p}_i \hat{p}_i^T\right)\omega$ (52)

$L = I\omega$ (53)

$I \equiv \sum_i m_i \hat{p}_i \hat{p}_i^T,$ (54)

where L is the total angular momentum of the rigid body. We do not need to compute the inertia tensor of objects with uniform density and typical shapes, since it is tabulated. For example, for a cuboid whose edge lengths are $l_x$, $l_y$ and $l_z$ and whose mass is m:

$I = \begin{bmatrix} \frac{m}{12}(l_y^2 + l_z^2) & 0 & 0 \\ 0 & \frac{m}{12}(l_x^2 + l_z^2) & 0 \\ 0 & 0 & \frac{m}{12}(l_x^2 + l_y^2) \end{bmatrix}.$ (55)

2.2 INVERTED PENDULUM

The inverted pendulum is one of the most popular systems adopted in the control field. The purpose of the control is to stabilize a naturally unstable open-loop system (ROBERGE, 1960). We can analyze this system in a two-dimensional world, since the robot's movement is restricted to the sagittal plane defined by the axis of the walking direction and the vertical axis. For this reason, we consider that all of the robot's mass is concentrated at the Center of Mass (CoM) and that the robot has massless legs that touch the ground at individual rotary joints (Figure 7).

Figure 7 - Bi-dimensional inverted pendulum: the simplest model for a robot, adapted from (KAJITA et al., 2014).

The robot dynamics control is a challenging task, and the problem can be modeled based on the linear inverted pendulum. As shown in Figure 7, the fixed point is the ZMP; the horizontal component of the force f remains, while the vertical component is canceled by gravity. The horizontal component, which accelerates the CoM horizontally, is given by (KAJITA et al., 2014):

$M\ddot{x} = f\sin\theta$ (56)

$M\ddot{x} = \frac{Mg}{\cos\theta}\sin\theta$ (57)

$M\ddot{x} = Mg\tan\theta$ (58)

$M\ddot{x} = Mg\frac{x}{z},$ (59)

where x and z give the CoM position of the inverted pendulum. By rewriting the above equation, we obtain a differential equation for the horizontal dynamics of the CoM:

$x_{ZMP} = x - \frac{z}{g}\ddot{x}.$ (60)

Assuming that in our linear inverted pendulum model the height z is constant, we can solve the ordinary differential equation:

$x(t) = x(0)\cosh(t/T_c) + T_c\, \dot{x}(0)\sinh(t/T_c)$ (61)

$\dot{x}(t) = \frac{x(0)}{T_c}\sinh(t/T_c) + \dot{x}(0)\cosh(t/T_c)$ (62)

$T_c \equiv \sqrt{z/g},$ (63)

where $T_c$ is a time constant that depends on the height of the CoM and the acceleration of gravity. The initial position and velocity are given by $x(0)$ and $\dot{x}(0)$, which together are called the initial conditions.
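The closed-form solution of Equations (61)-(63) can be checked numerically. The following sketch is illustrative only: the CoM height and initial conditions are hypothetical values, not Marta's.

```python
import numpy as np

g = 9.8                      # gravitational acceleration [m/s^2]
z = 0.6                      # hypothetical constant CoM height [m]
Tc = np.sqrt(z / g)          # Eq. (63): time constant

x0, v0 = 0.02, 0.1           # hypothetical initial CoM position [m] and velocity [m/s]

def com_state(t):
    """Analytic LIPM solution, Eqs. (61) and (62)."""
    x = x0 * np.cosh(t / Tc) + Tc * v0 * np.sinh(t / Tc)
    v = (x0 / Tc) * np.sinh(t / Tc) + v0 * np.cosh(t / Tc)
    return x, v

for t in (0.0, 0.1, 0.2, 0.3):
    x, v = com_state(t)
    x_ddot = (g / z) * x                 # from Eq. (59) with the pivot at the origin
    x_zmp = x - (z / g) * x_ddot         # Eq. (60): the ZMP stays at the pivot (zero)
    print(f"t={t:.1f}s  x={x:.4f} m  xdot={v:.4f} m/s  x_zmp={x_zmp:.4f} m")
```

Note how the hyperbolic terms make the CoM diverge from the pivot over time, which is exactly the open-loop instability that the controller must counteract.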
In this chapter, we emphasized the kinematic and dynamic concepts of humanoid robots, which give us the theoretical foundations to understand the complexity of the controllers. In the next chapter, we will focus on the concepts and algorithms of controllers for humanoid robots.

3 MACHINE LEARNING CONTROL

"Keep your love of nature, for that is the true way to understand art more and more."
Vincent Van Gogh

Control can be defined as "to act, to implement decisions that guarantee that a device behaves as desired" (ANDREI, 2006). This concept is one of the fundamental concepts of robotics and was once imagined as a possible solution to slavery. For example, we can see it in the book "Politics" by Aristotle (384-322 BC) (BENNETT, 1979), where he wrote: "... if every instrument could accomplish its own work, obeying or anticipating the will of others ... if the shuttle weaved and the pick touched the lyre without a hand to guide them, chief workmen would not need servants, nor masters slaves". The text shows us the idea of having a slave machine that works automatically, a description similar to the origin of the word robot presented in Capek's work (CAPEK, 2004). Therefore, the concept of controlling an object is inherent to human history. Next, we detail the relevant investigations in control theory.

The first significant work was James Watt's centrifugal governor controller in the eighteenth century. It is a device that automatically controls the speed of steam engines; however, it showed an oscillatory behavior as the machines got bigger and more powerful (LEIGH, 2012). In 1868, James Clerk Maxwell developed a formal analysis of the dynamics of the centrifugal governor. He described and analyzed self-oscillation, that is, how delays in the system lead to overcompensation and unstable behavior (OGATA, 2010). In 1895, Hurwitz and his colleagues created the Routh-Hurwitz theorem, which allowed the verification of the stability of the hydroelectric power station in Davos, Switzerland (LEIGH, 2012). This theorem determines whether a linear dynamical system is stable without solving the system. The works of Minorsky, Hazen, and Nyquist laid the foundations of control theory. Minorsky (1922) worked on automatic controllers for piloting ships and demonstrated their stability with differential equations. In 1932, Nyquist developed a simple procedure for determining closed-loop system stability based on the open-loop response to steady-state sinusoidal inputs. Hazen (1934) introduced the term servomechanisms for position control systems (OGATA, 2010).

The investigators of the Servomechanisms Laboratory group observed the difficulty of analyzing high-order servomechanisms with the differential equations approach, and thus investigated frequency response methods (BENNETT, 1993). These methods allowed the design of closed-loop linear control systems. In the 1940s, Ziegler and Nichols created rules for tuning the PID control used in industrial systems (OGATA, 2010). From 1948 to 1950, Evans worked on his proposal of the root-locus method, which allows a rapid system evaluation through the closed-loop characteristics that vary with changes in gain (BENNETT, 1993). The frequency response methods and the root-locus method are the core of classical control theory, providing a stable system that satisfies a set of performance requirements.

In the 1960s, digital computers made time-domain analysis of complex systems feasible, giving rise to modern control theory, based on time-domain analysis and synthesis using state variables. This theory addressed the increasing complexity of modern plants and the stringent requirements of precision, weight, and cost (OGATA, 2010). Modern control theory simplifies the design of the control system based on the physical system model. However, the system stability is sensitive to the error between the model and the system. Therefore, the system may not be stable, which can be avoided by setting an error range and keeping the control system within that range. This method is called robust control (OGATA, 2010). Robust control designs the system model with a priori uncertainty and guarantees stability; however, the uncertainty is not updated during the system operation (SKOGESTAD; POSTLETHWAITE, 2007). The field of machine learning is the study of computer algorithms that improve automatically through experience (MITCHELL, 1997). Consequently, machine learning control originated, merging these fields. This control approach formulates the design of control systems as regression problems (BÄCK; SCHWEFEL, 1993), identifies and optimizes control parameters (LEE et al., 1997; ALFARO-CID et al., 2008), and determines the control law with continuous updating as the system performance changes (BARTO, 1994).

We have briefly detailed the relevant works on control theory and its evolution. In the following sections, we explain control theory and machine learning: the control theory section expounds basic concepts and PID control, while the machine learning section explains evolutionary algorithms and reinforcement learning.

3.1 CONTROL THEORY

Control theory defines control rules to achieve a goal in a certain period; its concepts are simple and independent of the application. Consequently, control theory is applied to everything from everyday to complex tasks (LEIGH, 2012). For example, controlling a robot to move involves calculating the robot's position, analyzing possible collisions, estimating the movement's trajectory, and executing the movement. To understand control systems, we must know the following terminology (OGATA, 2010):
• Plant: physical object that performs a particular operation, such as a mechanical device, an industrial machine, or a robot.
• Process: method of doing something; in this case, the operation that is the objective of the control.
• System: combination of components that act together to meet an objective. A system can be a physical or abstract phenomenon.
• Disturbance: a signal that negatively affects the output of the system.
• Feedback control: the process of creating a feedback loop to modify the behavior of a system through actions determined by system measurements.

3.1.1 FEEDBACK CONTROL

Figure 8 illustrates a feedback control system. The blue box and the yellow box represent the plant and the control law, respectively. The control law input is the sensor measurement (y), and the actuators (u) modify the plant behavior. The plant is subject to disturbances (w), and the goal is to optimize a cost function (J) (DURIEZ; BRUNTON; NOACK, 2017). There are two types of control loops: open-loop control and closed-loop control (feedback).

Figure 8 - General framework for feedback control (DURIEZ; BRUNTON; NOACK, 2017).
Consequently, machine learning control emerged from the merging of these two fields. This type of control focuses on designing control systems as regression problems (BÄCK; SCHWEFEL, 1993), on identifying and optimizing control parameters (LEE et al., 1997; ALFARO-CID et al., 2008), and on determining the control law with continuous updating as the system performance changes (BARTO, 1994).

We have briefly reviewed the relevant works on control theory and its evolution. The following sections cover control theory and machine learning: the control theory section expounds basic concepts and PID control, while the machine learning section explains evolutionary algorithms and reinforcement learning.

3.1 CONTROL THEORY

Control theory defines control rules to achieve a goal within a certain period; its concepts are simple and independent of the application. Consequently, control theory is applied to everything from everyday to complex tasks (LEIGH, 2012). For example, controlling a robot to move involves calculating the robot's position, analyzing possible collisions, estimating the movement's trajectory, and executing the movement.

To understand control systems, we must know the following terminology (OGATA, 2010):

• Plant: physical object that performs a particular operation, such as a mechanical device, an industrial machine, or a robot.
• Process: method of doing something; in this case, the operation that is the objective of the control.
• System: combination of components that act together to meet an objective. A system can be a physical or an abstract phenomenon.
• Disturbance: a signal that negatively affects the output of the system.
• Feedback control: a process of creating a feedback loop to modify the behavior of a system through actions determined by system measurements.

3.1.1 FEEDBACK CONTROL

Figure 8 illustrates a feedback control system. The blue box and the yellow box represent the plant and the control law, respectively. The control law input is the sensor measurement (y), and the actuators (u) modify the plant behavior. The plant is subject to disturbances (w), and the controller seeks to optimize a cost function (J) (DURIEZ; BRUNTON; NOACK, 2017). There are two types of control loops: open-loop control and closed-loop (feedback) control.

Figure 8 - General framework for feedback control (DURIEZ; BRUNTON; NOACK, 2017).

Figure 9 illustrates the open-loop control architecture. In this strategy, the actuation signal u is chosen based on knowledge of the system, so as to produce the desired output matching the commanded reference signal (wr) (DURIEZ; BRUNTON; NOACK, 2017). However, external disturbances (wd) and sensor noise (wn) degrade the controller's performance. Time-dependent systems use this architecture; for example, a washing machine controller is determined by fixed periods of soaking, washing, and rinsing.

Figure 9 - Open-loop control architecture (DURIEZ; BRUNTON; NOACK, 2017).

The advantages of open-loop control systems are simple construction, easy maintenance, and economic feasibility for outputs that are difficult to measure accurately. However, disturbances and calibration changes cause errors in the system output (OGATA, 2010). The closed-loop system, on the other hand, involves sensor feedback, which allows the stabilization of an unstable system by reducing the actuation error (OGATA, 2010).
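To make the contrast concrete, the sketch below simulates a simple first-order plant under a constant disturbance, first in open loop and then with proportional feedback. This is an illustrative toy example, not taken from this work; the plant model and all values are hypothetical:

```python
def simulate(kp, wr=1.0, wd=0.3, dt=0.01, steps=500):
    """First-order plant dy/dt = -y + u + wd.
    kp = 0 reduces to pure open loop (u = wr); kp > 0 adds feedback."""
    y = 0.0
    for _ in range(steps):
        e = wr - y               # error between reference and output
        u = wr + kp * e          # control law
        y += dt * (-y + u + wd)  # Euler integration of the plant
    return y

print(simulate(kp=0.0))   # open loop: settles near 1.3, offset by wd
print(simulate(kp=20.0))  # closed loop: settles near 1.01, disturbance rejected
```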
The error e is the difference between the reference signal wr and the feedback signal y (Figure 10). Closed-loop control can also compensate for external disturbances wd and attenuate noise wn (DURIEZ; BRUNTON; NOACK, 2017).

Figure 10 - Closed-loop control architecture (DURIEZ; BRUNTON; NOACK, 2017).

3.1.2 PID CONTROL

The Proportional-Integral-Derivative (PID) controller is the most widely used controller, solving approximately 90 to 95% of control problems. Its operation relies on the feedback signal to eliminate existing deviations (LEVINE, 1996). PID control is generally the sum of three terms (Figure 11). Therefore, this control law is described as:

u(t) = u_P(t) + u_I(t) + u_D(t),   (64)

where u_P is the proportional part, u_I the integral part, and u_D the derivative part.

Figure 11 - PID control scheme (DURIEZ; BRUNTON; NOACK, 2017).

The proportional part is a simple feedback (Equation (65)), proportional to its input, the error signal as a function of time e(t). The error is the difference between the reference signal (desired value) and the feedback signal (process variable). The factor K_p (proportional gain) does not modify the shape of the input signal; it only introduces a scaling factor that amplifies or reduces the value applied at the plant input (PINTO, 2014):

u_P(t) = K_p e(t).   (65)

Proportional control exhibits a steady-state error. A manually adjustable reset term can be added to the control signal to eliminate this error (LEVINE, 1996). The proportional controller of Equation (65) then becomes:

u_P(t) = K_p e(t) + u_b(t),   (66)

where u_b is the reset term.

The integral part is an automatic adjustment of the reset term. It filters the low-frequency content of the error signal and adds it to the proportional part. This part is given by:

u_I(t) = K_i \int_0^t e(t)\,dt   (67)

u_I(t) = \frac{K_p}{T_i} \int_0^t e(t)\,dt,   (68)

where K_i is the integral factor and T_i is the integral time.

The derivative part provides an anticipatory action, given by:

u_D(t) = K_d \frac{de(t)}{dt}   (69)

u_D(t) = K_p T_d \frac{de(t)}{dt},   (70)

where K_d is the derivative factor and T_d is the derivative time.

The parameters of the controller are selected through controller tuning; the Ziegler-Nichols method is explained in the next subsection.
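Before moving on, a minimal discrete-time sketch of the control law in Equations (64)-(70), using a running sum for the integral and a backward difference for the derivative. This is an illustrative implementation, not the controller used in this work; the gains and setpoint in the usage lines are hypothetical:

```python
class PID:
    """Discrete PID: u(t) = Kp*e + Ki*integral(e) + Kd*de/dt (Eq. 64)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, reference, measurement):
        error = reference - measurement                    # e(t) = wr(t) - y(t)
        self.integral += error * self.dt                   # integral part, Eq. (67)
        derivative = (error - self.prev_error) / self.dt   # derivative part, Eq. (69)
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)

# Usage: one control step for a joint-angle setpoint (values are arbitrary).
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.01)
u = pid.step(reference=1.0, measurement=0.8)
```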
3.1.2.1 ZIEGLER-NICHOLS TUNING

Ziegler and Nichols proposed rules to determine the proportional gain (Kp), the integral time (Ti), and the derivative time (Td). The method is based on the characteristics of the plant dynamics. There are two ways of performing the Ziegler-Nichols adjustment: the step response method and the frequency response method (OGATA, 2010).

The first method determines Kp, Ti, and Td according to Table 1. Figure 12 shows a unit-step input producing an S-shaped output curve. This curve is characterized by two constants: the delay time Lz and the time constant Tz, obtained from the tangent line at the inflection point (OGATA, 2010).

Table 1 - Ziegler-Nichols tuning rule based on step response.

Type of Controller | Kp        | Ti     | Td
P                  | Tz/Lz     | ∞      | 0
PI                 | 0.9 Tz/Lz | Lz/0.3 | 0
PID                | 1.2 Tz/Lz | 2 Lz   | 0.5 Lz

Figure 12 - Unit-step response of a physical system, with the tangent line at the inflection point defining Lz and Tz (OGATA, 2010).

In the second method, the proportional gain is increased until an oscillatory response with constant amplitude is obtained. At this point, the critical gain (Ku) and the critical oscillation period (Tu) are determined. Figure 13 shows an example of such a response. The Ku and Tu values are used in Table 2 to obtain the PID settings.

Figure 13 - Example of the frequency response method (PINTO, 2014).

Table 2 - Ziegler-Nichols tuning rule based on frequency response.

Type of Controller | Kp      | Ti     | Td
P                  | 0.5 Ku  | ∞      | 0
PI                 | 0.45 Ku | Tu/1.2 | 0
PID                | 0.6 Ku  | 0.5 Tu | 0.125 Tu
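The frequency-response rules in Table 2 translate directly into code. A small sketch, with hypothetical Ku and Tu values; the conversion to parallel-form gains follows Ki = Kp/Ti and Kd = Kp*Td from Equations (68) and (70):

```python
def ziegler_nichols_pid(ku, tu):
    """PID settings from the critical gain Ku and period Tu (Table 2)."""
    kp = 0.6 * ku
    ti = 0.5 * tu
    td = 0.125 * tu
    # Convert to parallel-form gains: Ki = Kp / Ti, Kd = Kp * Td.
    return kp, kp / ti, kp * td

# Suppose the plant oscillates steadily at Ku = 10 with period Tu = 0.8 s.
kp, ki, kd = ziegler_nichols_pid(ku=10.0, tu=0.8)
print(kp, ki, kd)  # 6.0, 15.0, 0.6
```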
3.2 MACHINE LEARNING

The field of Artificial Intelligence (AI) emerged at a workshop at Dartmouth College in 1956, making use of concepts from many other science and engineering disciplines. This field is typically concerned with problems like playing games, proving mathematical theorems, writing poetry, driving a car in the street, and diagnosing diseases. Historically, robotics has walked side-by-side with AI, focusing on building robots that can perform complex tasks in dynamic, unpredictable, and unknown environments.

One of the branches of AI that has experienced significant growth in the last decades is Machine Learning (ML). This field studies algorithms and techniques for automating solutions to complex problems that are hard to program using conventional programming methods (REBALA; RAVI; CHURIWALA, 2019). ML is based on the premise that systems improve from experience; that is, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E (MITCHELL, 1997).

ML agglutinates different families of algorithms, divided according to the nature of the input/output, the data/model, or the algorithms' theoretical basis. Considering the relationship between input and output, there are four types of learning: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

In supervised learning, the program receives a labeled data set, from which it has to learn the fundamental characteristics of each class to create a general model, so that the next time new data is provided it can respond with the correct label. For example, the machine receives thousands of images classified as mammals, birds, fish, and reptiles; it learns the key characteristics that determine the set to which each image belongs. If an image of a new animal is presented to the machine, it should determine the set to which the animal belongs.

Unsupervised learning, on the other hand, receives an unlabeled data set, so this family of ML algorithms usually focuses on identifying similarity trends. For example, a toy store has customer information and needs to know the profile of customers buying teddy bears. The machine analyzes the information and determines that male customers between 18 and 25 years old are the ones who buy them; in reality, the machine determined a purchase pattern over all customers in the store, from which this answer follows.

Semi-supervised learning lies between supervised and unsupervised learning. The machine receives a large data set, where only part of the data is labeled. Typically, an algorithm of this family uses clustering techniques (unsupervised learning) to determine groups within the data set and then determines the label of each group from the labeled data it contains (REBALA; RAVI; CHURIWALA, 2019). The most obvious benefit is the reduced effort and time spent labeling data. For example, a company has 1 billion images of good and bad parts, and the machine must determine which parts are bad; only some of these images would have to be labeled before running the algorithm for the machine to learn.

Reinforcement learning addresses how an agent learns to make decisions based on perception and action in its environment. This process does not need a training set and is based on trial and error. For example, a robot needs to go in a straight line from one point to another in a room. The robot starts to move in this environment; if it falls, the environment gives it a penalty, otherwise it receives a reward.

Another important family of ML algorithms is the one related to optimization. Optimization is the process of selecting the hypothesis or model such that an objective function – which describes the quality of the solution – is optimized. The optimal solution is the one in which the variables maximize or minimize the objective function. Evolutionary Algorithms (EA) are typical representatives of this class of algorithms.

The next sections describe in more detail two families of ML algorithms that are particularly relevant to this work: i) Evolutionary Algorithms and ii) Reinforcement Learning.

3.2.1 OPTIMIZING WITH EVOLUTIONARY ALGORITHMS

In the last decades, there has been growing interest in engineering problem-solving systems based on evolution principles, since the natural evolution process allows biological systems to be sophisticated, robust, and adaptable (MITCHELL, 1998), (FLOREANO; MATTIUSSI, 2008). Holland and Rechenberg introduced this field of study. They created optimization algorithms based on a population of possible solutions which, through the selection of individuals by fitness together with some genetic operators, allows an optimal solution to be obtained (RECHENBERG, 1978), (HOLLAND et al., 1992).

The basic principles of evolutionary systems derive from the principles of life described by Darwin in 1859, which state the following (ROTHLAUF, 2006):

• There is a population of individuals with different properties and abilities.
• Nature creates new individuals with properties similar to those of existing individuals.
• Promising individuals are selected more frequently for reproduction by natural selection.

3.2.2 PRINCIPLES OF EVOLUTION

Natural evolution theory is based on four pillars: population, diversity, inheritance, and selection. Two or more individuals define a population; an individual is a form of life made up of cells carrying genomes. A genome is a complete set of chromosomes, which are the carriers of hereditary biological characteristics. Chromosomes are made up of genes, which are responsible for both the structure and the processes of organisms. An allele is any variation of a gene, generally arising through mutation; it is responsible for heritable variation (NEAPOLITAN; JIANG, 2018).

In summary, evolutionary systems define an individual through its genes, so that no individual is genetically identical to another (diversity).
Individuals can reproduce, so that individual characteristics are passed on to their offspring (inheritance). Darwinian theory shows that the fittest individuals have the highest probability of reproducing, due to natural selection. New generations adapt to the changing environment, and new behavioral characteristics, morphologies, functionalities, and natural niches allow greater adaptation to changing environments. Furthermore, natural selection ensures that negative traits are eliminated in future generations and only beneficial traits are propagated (FLOREANO; MATTIUSSI, 2008).

The basic principles of evolutionary theory are sufficiently explanatory and relatively simple to be condensed into a model and applied in several areas. Several formal models have been developed to address specific problems, mainly in the field of optimization (FLOREANO; MATTIUSSI, 2008).

3.2.3 GENETIC ALGORITHM

In 1975, John Holland, in the book "Adaptation in Natural and Artificial Systems", presented the Genetic Algorithm (GA) as an abstraction of biological evolution. Holland relied on the formal study of the phenomenon of adaptation in nature and then developed mechanisms of natural adaptation that could be imported into computer systems (MITCHELL, 1998). A GA operates on a population of m individuals, each of which can be represented by a string of bits. This population undergoes "natural selection", determined by an evaluation function that selects the most suitable individuals for reproduction; together with operators inspired by genetics – mutation, combination (crossover), and inversion – these produce the next generation of individuals. Figure 14 shows an example of the GA pipeline.

The fundamental part of a GA is the selection of individuals from the population for reproduction. Each individual has an associated fitness, which determines how good that individual is; the fitness function f(*) defines the aptitude of the individual. For example, suppose we must obtain the parameters of each joint for a humanoid robot to remain standing. We have as input the position of the robot, and our fitness function is determined by the error between the initial and current positions of the robot. A central difficulty of GAs lies in choosing an appropriate fitness function.

The selection of the individuals that reproduce is based on the fitness function, and there are several strategies for this selection. The first idea is to choose those with the highest fitness; however, this strategy carries a certain risk of loss of diversity and of convergence to a sub-optimal solution. Frequently, a proportional selection is applied, where the probability p(i) of the i-th individual reproducing is given by the ratio between its fitness value f(i) and the sum of the fitness values of all individuals in the population (MITCHELL, 1998):

p(i) = \frac{f(i)}{\sum_{i=1}^{m} f(i)}.   (71)

The expected number of individuals in the next generation (m) is usually the same as in the previous population. Biological mutations inspired the creation of genetic operators, of which there is a wide variety; the choice of operators depends on the individual's structure.

Figure 14 - Main features of the Genetic Algorithm: the population of m bit-string individuals is evaluated by f(*), individuals are selected, genetic operators are applied, and a new generation is produced.
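As an illustrative sketch of this pipeline (not the GA configuration used in this work; the representation, operators, and parameters here are arbitrary choices), a minimal generational GA over bit strings could look as follows:

```python
import random

def genetic_algorithm(fitness, n_genes, m=50, generations=100, p_mut=0.01):
    """Minimal generational GA over bit strings (Figure 14 pipeline)."""
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(m)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        new_pop = []
        for _ in range(m):
            # Proportional (roulette-wheel) selection, Eq. (71).
            p1, p2 = random.choices(pop, weights=scores, k=2)
            cut = random.randrange(1, n_genes)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            for g in range(n_genes):                     # bit-flip mutation
                if random.random() < p_mut:
                    child[g] = 1 - child[g]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Toy usage: maximize the number of ones in the string.
best = genetic_algorithm(fitness=sum, n_genes=20)
```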
The crossover operator, the most commonly used, allows exchanging sub-parts of two individuals, mimicking biological recombination between two organisms. The mutation operator randomly changes a part of an individual. Finally, the inversion operator reverses the order of the individual's genes.

3.2.3.1 GENETIC REPRESENTATIONS

A suitable genetic representation must be devised for the genetic operators used in the Genetic Algorithm, allowing a high probability of generating increasingly better individuals and covering the space of optimal solutions to the problem. There are various representations; the most used ones are detailed below (ROTHLAUF, 2006).

The discrete representation describes an individual using a sequence of l discrete values extracted from an alphabet of cardinality k. One of the most used is the binary alphabet {0, 1}, with cardinality k = 2. In some cases the binary representation can be interpreted directly; in most cases, however, a transformation must be applied before it can be interpreted (FLOREANO; MATTIUSSI, 2008).

Another representation is the floating-point representation, based on a set of n real numbers. This representation is suitable for solutions that require high-precision parameter optimization; the solution space of the problem must be taken into account (MITCHELL, 1998).

The tree representation is suitable for describing hierarchical structures with branching points and conditions, and is used to describe computer circuits, construction procedures, and the planning of experiments or compilers. This representation maps directly onto a tree (ROTHLAUF, 2006).

3.2.3.2 INITIAL POPULATION

The initial population must be large and diverse enough. The population size depends on the properties of the search space and on the computational cost of evaluating all individuals over several generations. The literature often reports populations ranging from a hundred to a few thousand individuals. If an individual's evaluation requires real-world experimentation, as is the case with the evolution of a robot, the population size is usually under one hundred individuals (MITCHELL, 1998).

In the binary and real-valued representations, a random sequence is generated; however, the real values must lie within a predefined range. Another way to initialize the population is to seed it with copies, mutated in one or more alleles, of individuals known to be good or promising. This strategy carries the risk that the population is not diverse enough and that evolution stalls at a sub-optimal solution (FLOREANO; MATTIUSSI, 2008).

3.2.3.3 FITNESS FUNCTION

The fitness function assigns a numerical score to each individual in the population and indirectly assesses the quality of the learning process. There are two critical aspects in designing a fitness function: the choice and the combination of the fitness components. It is possible to optimize multiple objectives of a problem within a single fitness function; combining multiple objectives relies on the properties of the search space and on finding the relationship between the objectives (REEVES; ROWE, 2002). The evaluation of fitness is the slowest part of a genetic algorithm, and the solution's quality depends on how exhaustive the individual's evaluation has been (RECHENBERG, 1978).
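For the standing-robot example mentioned earlier, one fitness component could penalize the drift of the robot's position. A hypothetical sketch; the names, weights, and fall penalty are illustrative, not the fitness function used in this work:

```python
import numpy as np

def standing_fitness(initial_pos, current_pos, fell=False):
    """Fitness for a humanoid that should remain standing: the smaller the
    drift from the initial position, the higher the score; falling is
    penalized harshly."""
    drift = np.linalg.norm(np.asarray(current_pos) - np.asarray(initial_pos))
    penalty = 100.0 if fell else 0.0
    return -(drift + penalty)  # the GA maximizes, so negate the error

print(standing_fitness([0.0, 0.0, 0.6], [0.02, -0.01, 0.59]))
```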
Subjective fitness arises when observers rate individual performance through visual inspection. It is often used in the artistic field, in architectural structures, and in music, where it is difficult to formalize aesthetic qualities into objective fitness functions.

3.2.3.4 SELECTION AND REPRODUCTION

Selection should assign a larger number of descendants to the best individuals in the population. A high selection pressure means that only a small percentage of individuals will be selected for reproduction. This approach results in a rapid increase in fitness; however, it risks the diversity of the population. Therefore, a balance must be maintained between selection pressure and diversity-generating factors (REEVES; ROWE, 2002).

Proportional selection is defined by the probability of an individual making a copy of its genome, given by the ratio between its fitness value and the sum of the fitness values of the entire population. The selection of individuals can be visualized as a roulette wheel where the slot size depends on the probability of each individual. This selection method does not work well when all the individuals have almost similar fitness values, or when one individual obtains a very high probability compared to the others. In the first case, all individuals have an almost equal probability of producing offspring, and evolution is equivalent to a random search. In the second case, almost all the offspring are copies of the fittest individual, whose genes dominate the population and cause premature convergence (FLOREANO; MATTIUSSI, 2008).

There are two solutions to these problems. The first is to scale the fitness values before selection to emphasize or reduce differences; this procedure requires additional parameters. Alternatively, it is possible to rank all the individuals from best to worst and assign reproductive probabilities proportional to each individual's rank; then, no matter how small the difference between two individuals, the better of the two always has a greater chance of producing offspring (MITCHELL, 1998).

Another type of selection is truncated selection, based on ranks. This method takes the first n individuals from the list sorted by the fitness function and copies these individuals m times until the population is complete. Note that n cannot be too small, since that would cause premature convergence. This method is used in populations that do not need an exhaustive evaluation (FLOREANO; MATTIUSSI, 2008).

Tournament selection consists of organizing a tournament among a small subset of individuals to generate each offspring: k individuals are randomly selected from the population, where k is known as the tournament size, and the individual with the best fitness among the k generates an offspring. The individuals that were not chosen return to the population to participate again in other tournaments. Tournament selection allows selection pressure and genetic diversity to be maintained (MITCHELL, 1997).
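A short sketch of tournament selection as just described (illustrative only; the tournament size k and the population structure are arbitrary choices):

```python
import random

def tournament_select(population, fitness, k=3):
    """Pick k random individuals and return the fittest; the losers stay in
    the population and can compete again in later tournaments."""
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

# Toy usage: individuals are bit strings, fitness counts the ones.
pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
parent = tournament_select(pop, fitness=sum, k=3)
```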
In generational replacement, the generated offspring replace the entire previous population. However, in a complex search space, the fitness assessment can be very noisy, or genetic mutations may strongly affect an individual, so that a better individual is lost. For this case there is the elitism strategy, which consists of keeping the n best individuals from the previous population. It is also possible to relax full generational replacement by inserting only a few descendants into the population, which is called gradual generational change (FLOREANO; MATTIUSSI, 2008).

3.2.3.5 GENETIC OPERATORS

Genetic operators capture the effects of biological mutations. This subsection describes the operators most commonly used with the genetic representations seen above. The main task of genetic operators is to introduce diversity into the population and to explore solutions.

The crossover operator is the most used; it allows the offspring to inherit the parents' characteristics through the recombination of the genomes of two selected individuals. The newly created offspring are randomly paired for gene crossing. This operator exploits the synergistic effect of combining the solutions found by the parents. However, when genetic recombination acts like a large random mutation, it can have a deleterious effect on the individual's fitness (MITCHELL, 1998).

One-point crossover can be applied to discrete and real-valued representations. It consists of randomly selecting a crossing point in each of the two strings and exchanging the genetic material between the individuals around this point. Multi-point crossover consists of randomly selecting n crossing points in the two strings and exchanging genetic material between these points (REEVES; ROWE, 2002). For real-valued representations, uniform and arithmetic crossovers are also options: the uniform crossover exchanges the genetic content at n positions chosen at random, whereas the arithmetic crossover creates a new gene by taking the average of n randomly chosen positions from the two gene strings. For tree-based representations, the crossover randomly selects one node in each parent and swaps the two corresponding sub-trees (FLOREANO; MATTIUSSI, 2008).

Mutations, on the other hand, are small random modifications that allow exploring variations of existing solutions. Mutations are useful for escaping local minima and for making further progress in a highly converged population where genetic recombination has little effect. However, the number and size of mutations must be relatively low to avoid the loss of previously discovered solutions (BACK, 1996).

In binary representations, the mutation consists of flipping the selected bit values. In real-valued representations, a selected position is modified by adding a random value drawn from a Gaussian distribution N(0, σ), where 0 is the mean and σ is the variance, producing many small mutations and few large ones. In tree-based representations, a selected node swaps its content with another node from the same subset; if the selected node is a terminal, it is replaced by another element chosen at random from the terminal set (FLOREANO; MATTIUSSI, 2008).
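A sketch of the two most common operators for real-valued genomes, following the description above (one-point crossover and Gaussian mutation; sigma, p_mut, and the genome values are illustrative choices):

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Swap the tails of two genomes around a random crossing point."""
    cut = random.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def gaussian_mutation(genome, sigma=0.1, p_mut=0.05):
    """Add zero-mean Gaussian noise (standard deviation sigma) to each gene
    with probability p_mut: mostly small perturbations, occasionally larger."""
    return [g + random.gauss(0.0, sigma) if random.random() < p_mut else g
            for g in genome]

a, b = [0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0]
child1, child2 = one_point_crossover(a, b)
child1 = gaussian_mutation(child1)
```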
3.2.3.6 SCALING

Regulating which individuals reproduce is critical in GAs with small populations. At the beginning, we may have a few extraordinary individuals in a population of otherwise mediocre individuals. The underlying problem is that the early strongest candidates may lie on or around a local peak of the fitness landscape that is difficult to escape, while the weakest candidates could be closer to the global maximum. Aggressively pruning the population compromises genetic diversity, reducing the explored search space and the search speed. Therefore, we may want to scale the fitness so that the selection pressure remains the same throughout the run (GOLDBERG, 1989).

Linear scaling is a simple procedure in which a linear relationship is defined between the raw fitness f and the scaled fitness f' (f' = a f + b). The coefficients a and b can be chosen in several ways; in all cases, however, it is desired that the scaled average fitness f'_avg be equal to the raw average fitness f_avg, to ensure that each average member of the population contributes one expected offspring to the next generation. Linear scaling misbehaves in the presence of negative fitness values.

Sigma truncation scaling can overcome the presence of poor individuals. This method uses information from the population variance to pre-process the objective function values f before scaling. Sigma truncation scaling is given by:

f' = f - (f_{avg} - c \cdot f_{dev}),   (72)

where f_avg is the mean of the objective function values in the population, the constant c is chosen as a multiple of the population standard deviation f_dev, and negative results are arbitrarily set to zero.

3.2.4 REINFORCEMENT LEARNING

In Reinforcement Learning (RL), the agent learns by interacting with the environment so as to maximize its gain. Imagine a pet learning to bark for food: if we feed our pet every time it barks (action), the pet is likely to repeat this action to get a response from us. The reward determines the agent's behavior. In this example, our pet (the agent) learns an action to achieve a goal without any previous teaching. Reinforcement Learning thus rests on the behavior of an agent that seeks to act in its environment so as to obtain rewards from it.

Formally, the RL agent interacts with its environment, producing a sequence of actions a1, a2, ..., over time. These actions affect the agent's environment, resulting in a reward or punishment rt at each time step t. The algorithm's goal is to learn to act in a way that maximizes some measure of utility in the future (NEAPOLITAN; JIANG, 2018).

Reinforcement Learning comprises the following elements (SUTTON; BARTO, 2018):

• Policy (πt): defines the agent's behavior at time t. The policy maps the perceived states of the environment to the actions the agent must take in those states.
• Reward (Rt): defines the objective of an RL problem, that is, which events are good and bad for the agent. The agent's goal is to maximize the total reward it receives in the long run. The reward signal itself is unalterable, but it can be used as a basis for modifying the policy.
• Value function (Vπ(*)): determines what is better in the long term, whereas the reward indicates what is better in the immediate sense. It is, in general, the total amount of reward that an agent can expect to accumulate in the future, starting from a given state.
• Environment model: allows inferences about how the environment will behave and is used to plan for future situations before they happen. This element may not exist, as there are model-free methods that learn from trial and error.

One of RL's challenges is the trade-off between exploration and exploitation. To obtain a greater reward, an agent must prefer the actions that it has tried in the past and found rewarding (exploitation). However, it may then fall into an over-exploitation problem, focusing only on a particular group of actions and converging to a local solution. On the other hand, if the algorithm explores too much, it may never actually learn the desired solution to the problem. This dilemma is called exploration-exploitation and has been studied intensively; the two tendencies must be balanced, which is very difficult to accomplish (SUTTON; BARTO, 2018).
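One common, simple way to balance the two tendencies is ε-greedy action selection: exploit the best-known action most of the time, but explore a random one with small probability ε. This strategy is not discussed further in this work; the sketch below is purely illustrative, with hypothetical value estimates:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit
    the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(action_values))      # explore
    return max(action_values, key=action_values.get)   # exploit

q = {"left": 0.2, "forward": 0.9, "right": 0.1}  # estimated action values
action = epsilon_greedy(q, epsilon=0.1)          # usually "forward"
```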
In addition, a strand of RL has boosted research in this area in the last decades: deep reinforcement learning algorithms. These methods employ deep neural networks and large-scale computing power to improve the capacity to store the experiences of RL agents. This new capability has allowed solving problems in various fields, such as board games and video games. Another field that makes strong use of it is robotics, so that robots learn complex tasks in dynamic environments (REBALA; RAVI; CHURIWALA, 2019).

The difference between Evolutionary Algorithms and Reinforcement Learning lies in the fact that RL algorithms can remember and use what they have learned over time, whereas evolutionary systems make decisions in each generation based only on information from the current generation.

3.2.4.1 MARKOV DECISION PROCESS

The Markov Decision Process (MDP) is a mathematical model that formalizes sequential decision-making, where actions influence not only the immediate reward but also subsequent situations, or states, and therefore future rewards. Thus, MDPs involve delayed reward and the need to trade off immediate and delayed rewards (SZEPESVÁRI, 2010).

Figure 15 shows a robot responsible for making decisions, called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. The agent continuously selects actions, and the environment responds to these actions and presents new situations to the agent. The environment also generates rewards: numerical values that the agent aims to maximize over time through its choice of actions. This process is mathematically defined as a Markov Decision Process.

Figure 15 - The agent-environment interaction in Reinforcement Learning (SUTTON; BARTO, 2018).

Specifically, the agent and the environment interact in a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the agent receives a representation of the state of the environment, S_t \in S, and selects an action, A_t \in A(s). At the next step, partly as a consequence of the action, the agent receives a reward R_{t+1} and finds itself in a new state S_{t+1}. The sequence is thus S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, ... (SUTTON; BARTO, 2018) (DONG et al., 2020).

A finite MDP occurs when the state, action, and reward sets have a finite number of elements. In this case, the random variables R_t and S_t have a well-defined discrete distribution that depends on the preceding state and action. Therefore, given s' \in S and r \in R, there is a probability that these values occur at time t, given particular values of the preceding state and action:

p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\},   (73)

\sum_{s' \in S} \sum_{r \in R} p(s', r \mid s, a) = 1, \quad \text{for all } s \in S, a \in A(s),   (74)

where s', s \in S, r \in R, and a \in A(s). The function p defines the MDP dynamics; it specifies a probability distribution for each choice of s and a (Eq. 74).

From p, we can derive the state transition probability, denoted by:

p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\},   (75)

p(s' \mid s, a) = \sum_{r \in R} p(s', r \mid s, a).   (76)
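For small finite MDPs, the dynamics function p and its marginals (Equations (73)-(76), plus the expected reward defined next in Equations (77)-(78)) can be represented explicitly. A sketch with a hypothetical two-state MDP; all state, action, and probability values are illustrative:

```python
# Dynamics p(s', r | s, a) as a dict: (s, a) -> {(s', r): probability}.
p = {
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "go"):   {("s1", 0.0): 1.0},
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s1", "stay"): {("s1", 0.0): 1.0},
}

def transition_prob(p, s, a, s_next):
    """State transition probability, Eq. (76): marginalize over rewards."""
    return sum(prob for (sp, _), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(p, s, a):
    """Expected reward of the state-action pair, Eq. (78)."""
    return sum(r * prob for (_, r), prob in p[(s, a)].items())

print(transition_prob(p, "s0", "go", "s1"))  # 0.8
print(expected_reward(p, "s0", "go"))        # 0.8
```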
In addition, the expected reward of a state-action pair can be calculated as:

r(s, a) \doteq E\{R_t \mid S_{t-1} = s, A_{t-1} = a\},   (77)

r(s, a) = \sum_{r \in R} r \sum_{s' \in S} p(s', r \mid s, a).   (78)

An MDP is an abstraction of the problem of goal-directed learning from interaction, independent of the details of the sensors, the memory, and the low-level control. In conclusion, RL problems can be reduced to three signals exchanged between the agent and its environment: a signal to represent the agent's actions, a signal to represent the basis of the agent's decisions (states), and a signal to define the agent's objective (rewards) (SUTTON; BARTO, 2018).

3.2.4.2 AGENT'S OBJECTIVE

The agent's objective is to maximize the cumulative reward it receives in the long term. The sequence of rewards received after time t is denoted R_{t+1}, R_{t+2}, R_{t+3}, .... The agent can only maximize the future cumulative reward for the rest of the episode, since it cannot go back in time to change past rewards. The cumulative reward that the agent receives is defined as G_t (REBALA; RAVI; CHURIWALA, 2019):

G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T,   (79)

where T is the final time step. This approach makes sense for applications where the agent-environment interaction naturally breaks into sub-sequences (episodes), for example board games, which end with winning or losing and then return to the initial state of the board.

On the other hand, some agent-environment interactions do not divide into identifiable episodes but continue without limit. In this case, the maximized reward easily tends to infinity. The concept added here is the discount: the agent tries to select actions that maximize the sum of the discounted rewards it receives in the future, such that (SUTTON; BARTO, 2018):

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots   (80)

G_t \doteq R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \ldots)   (81)

G_t \doteq R_{t+1} + \gamma G_{t+1},   (82)

where γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate. If γ is less than 1, the weight of future rewards decreases exponentially; future rewards therefore contribute a smaller share of their numerical value to the cumulative reward, which avoids the problem of infinite rewards.

3.2.4.3 POLICY AND VALUE FUNCTIONS

Reinforcement Learning algorithms need to estimate value functions, that is, functions of states that estimate "how good" it is for the agent to be in a given state, and "how good" it is to perform a given action in that state. The notion of "how good" is defined in terms of the future rewards that can be expected (the expected return). Value functions are defined with respect to particular ways of acting, called policies (SEWAK, 2019).

A policy is a mapping from states to selection probabilities of the possible actions. If the agent follows the policy π at time t, then π(a | s) is the probability that A_t = a if S_t = s. Reinforcement learning methods specify how the agent's policy changes as a result of its experience. For any policy π and any state s, the following consistency condition holds between the value of s and the value of its possible successor states (MITCHELL, 1997) (SUTTON; BARTO, 2018):

V_\pi(s) \doteq E_\pi[G_t \mid S_t = s]   (83)

= E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]   (84)

= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma E_\pi[G_{t+1} \mid S_{t+1} = s'] \right]   (85)

= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_\pi(s') \right], \quad \text{for all } s \in S,   (86)

where it is implicit that a \in A(s), that the next state s' \in S, and that the reward r \in R.
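Equation (86) leads directly to iterative policy evaluation: sweep over the states, replacing each value by the right-hand side until the values stop changing. A sketch under the same hypothetical dynamics representation used above; the policy, γ, and convergence threshold are illustrative choices:

```python
def policy_evaluation(p, states, actions, policy, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman expectation equation, Eq. (86),
    sweeping the states until the value function changes less than theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(policy[s][a] * sum(prob * (r + gamma * V[sp])
                                           for (sp, r), prob in p[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Usage with the hypothetical dynamics dict and a uniform random policy.
p = {("s0", "go"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
     ("s1", "go"): {("s1", 0.0): 1.0},
     ("s0", "stay"): {("s0", 0.0): 1.0},
     ("s1", "stay"): {("s1", 0.0): 1.0}}
policy = {s: {"go": 0.5, "stay": 0.5} for s in ("s0", "s1")}
V = policy_evaluation(p, ("s0", "s1"), ("go", "stay"), policy)
```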
Solving a Reinforcement Learning task means finding a policy that achieves a large amount of reward. In a finite MDP, an optimal policy can be precisely defined, because the value function defines a partial order on policies: a policy π is better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states. We denote all optimal policies by π*; they share the same state-value function, called the optimal state-value function v*, defined as (SUTTON; BARTO, 2018):

v_*(s) \doteq \max_\pi V_\pi(s), \quad \text{for all } s \in S.   (87)

Optimal policies also share the same optimal action-value function, denoted q* and defined by:

q_*(s, a) \doteq \max_\pi q_\pi(s, a), \quad \text{for all } s \in S, a \in A(s),   (88)

q_*(s, a) = E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a].   (89)

The different Reinforcement Learning algorithms solve the problems of learning value functions, improving policies, or doing both simultaneously, as in the actor-critic methods.

We have detailed the theoretical foundations of humanoid robots and, in this chapter, the different control strategies. In the next chapter, we summarize the literature on the walking of humanoid robots.

4 HUMANOID ROBOT WALKING APPROACHES

"To know more is to be freer"
César Vallejo

Walking can be defined as "to move or go somewhere by placing one foot in front of the other on the ground, but without running" (OXFORD, 2015). If we adopt this definition, we must consider that the robot must always have at least one foot in contact with the ground during the walking process. However, how can we make a machine perform such movements?

Bipedal locomotion is approached in the literature from various distinct aspects, including walking patterns, control algorithms, types of controllers, control strategies, optimization techniques, and so on. Some of these aspects are discussed in this chapter. Figure 16 presents some of the most relevant aspects found in the literature, considering the state of the art.

Figure 16 - Relevant aspects of bipedal locomotion: type of controller (torque, position, velocity); controller strategy (lower part, whole body, decoupled); control algorithm (quasi-static, dynamic; model-based, model-free); walk pattern (angular trajectory, neuromuscular); optimization techniques (evolutionary techniques, reinforcement learning).