UNIVERSIDADE ESTADUAL PAULISTA "JÚLIO DE MESQUITA FILHO"
Instituto de Ciência e Tecnologia de Sorocaba

MARIA FERNANDA TEJADA BEGAZO

A LEARNING-BASED MODEL-FREE CONTROLLER FOR DECOUPLED HUMANOID ROBOT WALKING

Sorocaba - SP
2020

MARIA FERNANDA TEJADA BEGAZO

A LEARNING-BASED MODEL-FREE CONTROLLER FOR DECOUPLED HUMANOID ROBOT WALKING

MASTER'S THESIS

Text submitted to the Graduate Program in Electrical Engineering (PGEE) of the Institute of Science and Technology (ICT) of Sorocaba as part of the requirements to obtain the title of Master in Electrical Engineering.

Sorocaba - SP
2020

To my advisor, for all his patience, dedication and time in the most difficult moments, and for inspiring me to look outside the bubble of the common to innovation.

B416l Begazo, Maria Fernanda Tejada
A learning-based model-free controller for decoupled humanoid robot walking / Maria Fernanda Tejada Begazo. -- Sorocaba, 2020
94 p. : il., tabs.
Master's dissertation - Universidade Estadual Paulista (Unesp), Instituto de Ciência e Tecnologia, Sorocaba
Advisor: Alexandre da Silva Simões
1. Genetic algorithms. 2. Artificial intelligence. 3. Intelligent control systems. 4. Androids. I. Title.
Catalog record generated automatically by Unesp's cataloging system, Library of the Instituto de Ciência e Tecnologia, Sorocaba, from data provided by the author. This record may not be modified.

UNIVERSIDADE ESTADUAL PAULISTA
Sorocaba Campus

CERTIFICATE OF APPROVAL

DISSERTATION TITLE: "A learning-based model-free controller for decoupled humanoid robot walking"
AUTHOR: MARIA FERNANDA TEJADA BEGAZO
ADVISOR: ALEXANDRE DA SILVA SIMÕES

Approved as partial fulfillment of the requirements for the title of Master in ELECTRICAL ENGINEERING, area: Electronic Systems, by the Examining Committee:

Prof. Dr. ALEXANDRE DA SILVA SIMÕES (virtual participation), Department of Control and Automation Engineering / Institute of Science and Technology - UNESP - Sorocaba Campus
Full Professor REINALDO AUGUSTO DA COSTA BIANCHI (virtual participation), Centro Universitário Fundação Educacional Inaciana Pe. Saboia de Medeiros - FEI
Prof. Dr. MARCOS RICARDO OMENA DE ALBUQUERQUE MAXIMO (virtual participation), Computer Science Division, Department of Computing Methodologies / Instituto Tecnológico de Aeronáutica

Sorocaba, December 30, 2020.

Prof. Dr. Alexandre da Silva Simões
Advisor

Acknowledgements

I thank all the family, friends, teachers and employees of the Sorocaba Institute of Science and Technology, who contributed directly or indirectly to the realization of this work. In particular, I offer my thanks:
• To my parents Cristina and Armando and my brother Anthony for their love, support and encouragement;
• To my aunt Tania for her support and encouragement at all times;
• To Prof. Dr. Alexandre and Dra. Esther, for all the teaching, encouragement, trust and guidance;
• To my friends and colleagues in the laboratory who helped me directly or indirectly, especially Vitor, Nayari, Murillo and Rafael, for the help and work done together;
• To my friends Renata, Claudia, Kevin and Elizabeth for their support and encouragement.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001.
"The sun is for everyone, but the shade is for those who arrive first."
Geremias Ludu

Abstract

Robotics has advanced rapidly in recent decades. Initially focused on automotive industry production after the second industrial revolution, it rapidly expanded to exciting new domains that prompted robots to perform tasks with tools and in environments designed for humans. However, controlling the movements of such robots is becoming more and more challenging due to the growing complexity of their mechanical structure and dynamics. This difficulty has stimulated the investigation of novel methods for robot control that do not depend on the kinematic or dynamic models of the robots. The Marta humanoid robot was designed and built by the LaRoCS and GASI research groups for humanoid walking investigation and presents some novel concepts, like small feet – instead of large ones – and a 3-DOF spherical joint in the waist that allows the design of decoupled control strategies. Taking advantage of the mechanical architecture of this robot, this work presents a model-free controller for humanoid robot walking composed of i) a classical controller for the waist spherical joint that focuses on robot stability, and ii) a trajectory controller for the remaining joints based on a Truncated Fourier Series (TFS), with its parameters tuned by a Genetic Algorithm (GA), that focuses on the implementation of a coordinated robot walking pattern. Results have shown that the decoupled robot control strategy has the potential to confer additional robustness to bipedal robot walking when compared with traditional techniques.

Keywords: Genetic algorithms. Artificial intelligence. Intelligent control systems. Androids.

Resumo

A robótica avançou rapidamente nas últimas décadas. Inicialmente focada na produção industrial de automóveis depois da segunda revolução industrial, ela rapidamente se expandiu para novos e excitantes domínios que demandaram dos robôs a realização de tarefas com ferramentas e em ambientes projetados para seres humanos. No entanto, controlar os movimentos desses robôs está se tornando cada vez mais desafiador devido ao crescimento da complexidade de sua estrutura mecânica e dinâmica. Essa dificuldade tem estimulado a investigação de novos métodos para o controle de robôs que possam não depender do modelo cinemático ou dinâmico dos robôs. A robô humanóide Marta foi projetada e construída pelos grupos de pesquisa LaRoCS e GASI para a realização de investigações de caminhada de robôs humanóides, e apresenta alguns novos conceitos tais como os pés pequenos - ao invés de pés grandes - e uma junta esférica de 3 DOF na cintura que permite o projeto de estratégias de controle desacopladas. Aproveitando a arquitetura mecânica deste robô, este trabalho apresenta um controlador livre de modelo para a caminhada de robôs humanóides composto por i) um controlador clássico para a junta esférica da cintura que foca na estabilidade do robô e ii) um controlador de trajetória para as demais juntas baseado em uma Série Truncada de Fourier (TFS) com seus parâmetros ajustados por um algoritmo genético (GA) que foca na implementação de um padrão de caminhada coordenada do robô. Os resultados mostraram que a estratégia de controle desacoplada tem potencial para conferir robustez adicional à caminhada de robôs bípedes em relação a técnicas convencionais.

Palavras-chave: Algoritmos genéticos. Inteligência artificial. Sistemas inteligentes de controle. Andróides.
List of Figures

Figure 1 – Marta humanoid robot: a) Links; b) Joints (texts) and links (lines connecting joints).
Figure 2 – The world coordinate system $\Sigma_w$ and the local coordinate system of the arm $\Sigma_a$. Vectors and coordinate systems related to the initial (blue) and final (red) position after the rotation of the arm. Inspired by (KAJITA et al., 2014).
Figure 3 – Direct kinematics model of the Marta humanoid robot arm. The reference coordinate system ($\Sigma_M$) in the pelvis joint (in red), and the local coordinate system ($\Sigma_{21}$) in the elbow joint (in green).
Figure 4 – Calculating the inverse kinematics of the right leg.
Figure 5 – The definition, vertical force and horizontal force of the ZMP (KAJITA et al., 2014).
Figure 6 – Relationship between the CoM, the ZMP and the support polygon (KAJITA et al., 2014).
Figure 7 – Bi-dimensional inverted pendulum: the simplest model for a robot, adapted from (KAJITA et al., 2014).
Figure 8 – General framework for feedback control (DURIEZ; BRUNTON; NOACK, 2017).
Figure 9 – Open-loop control architecture (DURIEZ; BRUNTON; NOACK, 2017).
Figure 10 – Closed-loop control architecture (DURIEZ; BRUNTON; NOACK, 2017).
Figure 11 – PID control scheme (DURIEZ; BRUNTON; NOACK, 2017).
Figure 12 – Unit-step response of a physical system (OGATA, 2010).
Figure 13 – Examples of the frequency response (PINTO, 2014).
Figure 14 – Main features of the Genetic Algorithm.
Figure 15 – The agent-environment interaction in Reinforcement Learning (SUTTON; BARTO, 2018).
Figure 16 – Classification of the state of the art: most relevant aspects in the humanoid robot walking literature.
Figure 17 – Phases in the quasi-static walk (CUEVAS; ZALDÍVAR; ROJAS, 2004).
Figure 18 – Human walking cycle (FIGUEROA; MEGGIOLARO, 2016).
Figure 19 – Human walking angular trajectory of the hip and knee (YANG et al., 2007).
Figure 20 – Generic walking pattern elaborated from human walking features (YANG et al., 2007).
Figure 21 – Decoupled controller approach for the humanoid robot Marta: a) all of Marta's joints; b) the upper part determined by the spherical joint (waist controller); c) the arms and legs (TFS controller). The red dots represent the joints that will not be controlled.
Figure 22 – The pelvis pitch, knee pitch and pelvis roll trajectory (SHAFII; REIS; LAU, 2010).
Figure 23 – Steady-state genetic algorithm.
Figure 24 – Marta physical robot (left) and DOFs (right).
Figure 25 – Programming environment.
Figure 26 – GAlib class hierarchy (WALL, 1996).
Figure 27 – Class diagram of the developed code.
Figure 28 – Marta's virtual model in V-REP.
Figure 29 – Sequence diagram between the controller and the simulator.
Figure 30 – Angles of the waist joints after the tuning of the PID parameters: a) start position of the robot; b) angles of the spherical joint over time. Similar to the inverted pendulum, the robot reached stability with the PID controller after the tuning process.
Figure 31 – Body trajectories of the typical individuals selected by the Genetic Algorithm (GA) using distinct fitness functions: a) forward shift (FF1); b) forward shift and time (FF2); c) discount for diagonal shifts (FF3); d) stimulated diagonal (FF4).
Figure 32 – Average fitness value during the training process for distinct ankle control strategies.
Figure 33 – Walking patterns for distinct ankle control strategies: a) Shafii equations (ANK-01); b) IMUs placed at the robot's feet (ANK-02); c) IMUs placed at the robot's feet and pelvis (ANK-03).
Figure 34 – General evaluation of the decoupled controller strategy. Average values of the fitness function for each generation in the learning phase for EXP-01 to EXP-05.
Figure 35 – Movements learned by the Marta robot over time in EXP-01 (no DOFs in robot waist). Without an active control the robot is naturally unstable and was unable to learn a walking pattern.
Figure 36 – Movements learned by the Marta robot over time in EXP-02 (pitch DOF in waist spherical joint). The robot was able to learn a few steps.
Figure 37 – Movements learned by the Marta robot over time in EXP-03 (pitch and yaw DOFs in waist spherical joint). The robot was able to learn a few steps and has a more pronounced lateral displacement.
Figure 38 – Movements learned by the Marta robot over time in EXP-04 (pitch and roll DOFs in waist spherical joint). The robot was able to learn a few steps.
Figure 39 – Movements learned by the Marta robot over time in EXP-05 (pitch, yaw and roll DOFs in waist spherical joint). The robot was able to learn a more structured walking pattern before falling.
Figure 40 – Movements learned by the Marta robot over time in the final experiment. The robot was able to learn a more structured walking pattern before falling.

List of Tables

Table 1 – Ziegler-Nichols tuning rule based on step response.
Table 2 – Ziegler-Nichols tuning rule based on frequency response.
Table 3 – Review of quasi-static walking works.
Table 4 – Review of dynamic walking works.
Table 5 – Specifications of the dimensions of Marta (CHENATTI et al., 2018).
Table 6 – Limits for each genome parameter of the GA.
Table 7 – PID parameters selected for the experiments.

List of Abbreviations and Acronyms

3D-FPBIPM 3D force-pattern based inverted pendulum model
AD Anno Domini
AI Artificial Intelligence
BC Before Christ
CPU Central Processing Unit
CPG Central Pattern Generator
CoM Center of Mass
DOF Degrees Of Freedom
EA Evolutionary Algorithms
GASI Automation and Integrated Systems Group
GA Genetic Algorithm
GUI Graphical User Interface
IMU Inertial Measurement Unit
IPM Inverted Pendulum Model
LaRoCS Laboratory of Robotics and Cognitive Systems
LIPM Linear Inverted Pendulum Model
MDP Markov Decision Process
ML Machine Learning
NUC Next Unit of Computing
OS Operating System
PID Proportional-Integral-Derivative
RL Reinforcement Learning
TLIPM Triple Linear Inverted Pendulum Model
TFS Truncated Fourier Series
UNESP Universidade Estadual Paulista
UNICAMP Universidade Estadual de Campinas
VLIPM Virtual Linear Inverted Pendulum Model
V-REP Virtual Robot Experimentation Platform
ZMP Zero-Moment Point

List of Symbols

$\Sigma_w$ World coordinate system of the humanoid robot
$\Sigma_j$ Local coordinate system of the j-th joint
$p_i$ Position vector of the i-th joint in $\Sigma_w$
$p_i^j$ Position vector of the i-th joint viewed from the local coordinate system of the j-th joint ($\Sigma_j$)
$r$ Resultant vector of $p_i$ and $p_j$
$e_{ic}$ The c axis of the coordinate system $\Sigma_i$
$R_c(\theta)$ Rotation matrix of angle $\theta$ around axis c
$R_i$ Rotation matrix of the i-th joint in $\Sigma_w$
$R_{nm}$ The entry at position n, m of the rotation matrix
$T_j$ Homogeneous transformation matrix of the j-th joint in $\Sigma_w$
$T_j^i$ Homogeneous transformation matrix of the j-th joint viewed from the local coordinate system of the i-th joint
$c_*$ $\cos(*)$
$s_*$ $\sin(*)$
$b_j$ Vector between $\Sigma_w$ and $\Sigma_j$
$q_i$ Rotation angle of the i-th joint
$M$ Total mass of the robot
$C$ Robot's center of mass
$m_i$ Mass of the i-th mass point
$p_i$ Position of the i-th mass point
$P$ Momentum of the robot
$L$ Angular momentum of the robot
$\mathcal{L}^r$ Angular momentum about the reference point r
$\dot{L}$ Rate of change of the angular momentum about the origin
$f$ External force applied to the robot other than gravity
$f_{all}$ All external forces applied to the robot
$\tau$ External torque applied to the robot
$\tau_{all}$ All external torques applied to the robot
$g$ Gravitational acceleration vector
$f_{ij}^{int}$ Internal force applied to the mass of the i-th joint by the j-th one
$f_i^{ext}$ External force applied to the mass of the i-th joint
$v_i$ Velocity of the i-th joint
$\omega$ Angular velocity
$I$ Inertia tensor (coefficient matrix)
$l_c$ Length along the c axis
$w_r$ Reference signal of the control system
$w_d$ External disturbance of the control system
$w_n$ Sensor noise of the control system
$e(*)$ Error function
$K_p$ Proportional gain of the PID controller
$K_i$ Integral factor of the PID controller
$K_d$ Derivative factor of the PID controller
$T_i$ Integral time of the PID controller
$T_d$ Derivative time of the PID controller
$L_z$ Delay time of the Ziegler-Nichols step-response method
$T_z$ Time constant of the Ziegler-Nichols step-response method
$K_u$ Critical gain of the Ziegler-Nichols frequency-response method
$T_u$ Critical oscillation period of the Ziegler-Nichols frequency-response method
$p(i)$ Probability of the i-th individual of the Genetic Algorithm
$f(i)$ Fitness value of the i-th individual of the Genetic Algorithm
$N(0,\sigma)$ Gaussian distribution with zero mean and standard deviation $\sigma$
$\pi_t$ Policy at time t in Reinforcement Learning
$\pi_*$ Optimal policy in Reinforcement Learning
$S_t$ State at time t in Reinforcement Learning
$A_t$ Action at time t in Reinforcement Learning
$R_t$ Reward at time t in Reinforcement Learning
$V_\pi(*)$ Value function in Reinforcement Learning
$p(*)$ Probability distribution in Reinforcement Learning
$G_t$ Cumulative reward (return) at time t in Reinforcement Learning
$\alpha$ Learning rate of Reinforcement Learning
$\gamma$ Discount factor of Reinforcement Learning

Contents

List of Figures
List of Tables
Contents
1 INTRODUCTION
1.1 OBJECTIVE
1.1.1 MAIN OBJECTIVE
1.1.2 SPECIFIC OBJECTIVES
1.2 WORK ORGANIZATION
2 FUNDAMENTALS OF HUMANOID ROBOTS
2.1 HUMANOID ROBOTS
2.1.1 KINEMATICS
2.1.1.1 DIRECT KINEMATICS
2.1.1.2 INVERSE KINEMATICS
2.1.2 ZERO-MOMENT POINT (ZMP)
2.1.3 DYNAMICS
2.1.4 MOMENTUM
2.2 INVERTED PENDULUM
3 MACHINE LEARNING CONTROL
3.1 CONTROL THEORY
3.1.1 FEEDBACK CONTROL
3.1.2 PID CONTROL
3.1.2.1 ZIEGLER-NICHOLS TUNING
3.2 MACHINE LEARNING
3.2.1 OPTIMIZING WITH EVOLUTIONARY ALGORITHMS
3.2.2 PRINCIPLES OF THE EVOLUTION
3.2.3 GENETIC ALGORITHM
3.2.3.1 GENETIC REPRESENTATIONS
3.2.3.2 INITIAL POPULATION
3.2.3.3 FITNESS FUNCTION
3.2.3.4 SELECTION AND REPRODUCTION
3.2.3.5 GENETIC OPERATORS
3.2.3.6 SCALING
3.2.4 REINFORCEMENT LEARNING
3.2.4.1 MARKOV DECISION PROCESS
3.2.4.2 AGENT'S OBJECTIVE
3.2.4.3 POLICY AND VALUE FUNCTIONS
4 HUMANOID ROBOT WALKING APPROACHES
4.1 CONTROL ALGORITHMS
4.1.1 QUASI-STATIC WALKING
4.1.2 DYNAMIC WALKING
4.1.2.1 BASED ON MODEL
4.1.2.2 MODEL-FREE
5 PROPOSED APPROACH
5.1 DECOUPLED CONTROLLER APPROACH
5.1.1 TFS CONTROLLER
5.1.2 WAIST CONTROLLER
5.2 MODEL-FREE PARAMETERS LEARNING
6 MATERIALS AND METHODS
6.1 MARTA HUMANOID ROBOT
6.1.1 ACTUATORS
6.1.2 SENSORS
6.1.3 PROCESSOR
6.2 PROGRAMMING
6.3 ROBOTICS SIMULATOR
6.3.1 COMMUNICATION
6.4 EXPERIMENTAL PROCEDURE
6.4.1 GENETIC ALGORITHM
6.4.2 FITNESS FUNCTIONS
6.4.3 REFERENCE POSITION
6.4.4 DAMPING FACTOR
6.4.5 ANKLE CONTROL
7 RESULTS AND DISCUSSION
7.1 PRELIMINARY TESTS
7.1.1 WAIST PID TUNING
7.1.2 WAIST RL TUNING
7.1.3 INFLUENCE OF THE FITNESS FUNCTION
7.1.4 INFLUENCE OF THE ANKLE CONTROL STRATEGY
7.2 DECOUPLED CONTROLLER EVALUATION
7.3 DECOUPLED CONTROLLER WITH RL
8 CONCLUSIONS AND FUTURE WORKS
BIBLIOGRAPHY
1 INTRODUCTION

"I've never expected a miracle. I will get things done myself."
(Guts character) Berserk

For a long time, humans have dreamed of creating skillful and intelligent machines that could help them in tedious or dangerous activities. This dream began to come true at the end of the second industrial revolution, when machines were typically used in repetitive tasks that harmed human health, especially in automobile factories. These early machines were called manipulator arms, opening the way for the exploration of the field of robotics. Over the past few decades, robotics has spread to many areas, allowing machines to perform activities outside of industry, in medicine, customer service, the military, exploration, rescue, education, etc.

Even considering the dissemination of robotics nowadays, interacting with environments and tools designed for humans still represents a significant challenge for robots. In this context, humanoid robots – robots that resemble a human appearance – occupy a prominent place due to their ability to adapt more easily to these challenging scenarios. Since such robots must have the ability to move from one point to another as closely as possible to the way humans do, legged robots have attracted particular interest from researchers. This context justifies the interest in researching new controllers for humanoid robots that can expand their capabilities in the bipedal walking task.

The control of humanoid robot walking is usually divided into two approaches: i) the quasi-static walk and ii) the dynamic walk. The quasi-static walk focuses on a kinematic modeling of the robot that is typically based on manipulator arm theory. The main problem with this approach is the reduction in walking speed, since the robot must perform movements slow enough to make the robot's dynamics negligible. On the other hand, dynamic walk modeling focuses on obtaining the stability of the robot together with the dynamics of the environment. If the controller is based on the robot's dynamic model, it typically focuses on the internal and external forces that act on the humanoid during the movement. This technique comes up against the complexity of modeling the robot dynamics, which is not trivial. There are two approaches to dynamic walking: the model-based approach requires the robot's dynamic model in the environment, while the model-free approach does not require it, since it discovers the dynamic model empirically. This last approach allows us to work with various types of humanoids with minor modifications to the controller.
Recently, the working group – the Automation and Integrated Systems Group (GASI) at Unesp and the Laboratory of Robotics and Cognitive Systems (LaRoCS) at Unicamp – has designed and built, in physical and simulated environments, a humanoid robot called Marta with twenty-five DOFs designed for walking research, which presents at least three particular concepts when compared to most commercial standard robots: i) the small dimension of the feet, ii) the articulation of the toe, and iii) a 3-DOF spherical joint placed in the robot's waist that allows the design of decoupled control strategies. These constructive particularities bring new challenges and possibilities to humanoid dynamic walk control for maintaining the robot's balance during the walk. Taking advantage of the mechanical architecture of this robot, this work proposes a model-free controller for humanoid robot walking composed of i) a classical controller for the waist spherical joint that focuses on robot stability, and ii) a trajectory controller for the remaining joints based on a Truncated Fourier Series (TFS), with its parameters tuned by a Genetic Algorithm (GA), that focuses on the implementation of a coordinated robot walking pattern.

1.1 OBJECTIVE

1.1.1 MAIN OBJECTIVE

The main goal of this work is to investigate the possibility of controlling the gait of a bipedal robot – particularly the Marta humanoid robot – using a strategy based on the decoupling of the robot joints into two sets: i) waist and ii) remaining DOFs, with each set guided by a distinct controller. A first controller is designed for the 3-DOF spherical joint placed in the robot's waist, focusing on improving the humanoid's stability during the walk. Simultaneously, a second model-free controller generates a set of equations for controlling the remaining joints' trajectories based on the Truncated Fourier Series (TFS), with its parameters properly tuned through a machine learning process based on a Genetic Algorithm (GA).

1.1.2 SPECIFIC OBJECTIVES

The specific objectives of this work are:
• To implement the computational framework that allows working with the simulated version of the Marta humanoid robot;
• To implement a classical controller on the waist of the Marta robot and to evaluate the stability of the robot during walking;
• To implement the optimization of the TFS parameters with a Genetic Algorithm for controlling the DOFs of the Marta robot;
• To compare the walking pattern achieved with the implemented algorithm with others available in the literature.

1.2 WORK ORGANIZATION

This work has been organized as follows:
• Chapter 2 details the theoretical foundations of humanoid robots. The chapter focuses on the mathematical models available in the literature to describe a humanoid robot;
• Chapter 3 presents the machine learning control foundations. We explain control theory and Ziegler-Nichols tuning. Also, we detail machine learning from the perspectives of evolutionary algorithm theory and reinforcement learning;
• Chapter 4 presents a review of the state of the art on biped walking controllers.
Two types of controllers are examined: those based on models and those that are model-free;
• Chapter 5 describes the proposed approach in more detail;
• Chapter 6 examines the materials that we propose to use in this work: the Marta humanoid robot and the simulator that will be used in the experiments and for the evaluation of the algorithm;
• Chapter 7 presents the experiments performed, results and discussions;
• Chapter 8 presents the conclusions of this work and directions for future work.

2 FUNDAMENTALS OF HUMANOID ROBOTS

"Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth."
(Sherlock Holmes) Arthur Conan Doyle

2.1 HUMANOID ROBOTS

Humanoid robotics is the field of robotics that deals with robots that resemble the human body structure. The first mention of a physical mechanism that can be associated with the concept of a robot goes back to Greek mythology with the legend of Talos, a bronze giant built by Zeus to protect the island of Crete from invaders (1500 BC) (SMITH, 1849). This concept is also present in the statues of the Egyptian oracle machines that symbolized a deity able to answer questions, similar to a chat-bot with a physical body (HRASTINSKI et al., 2019). The first automated mechanical device, capable of measuring time by regulating the flow of liquids from one vessel to another and known as the clepsydra, existed in Babylon, Egypt, and Persia (TURNER, 1984). This mechanical device was improved with a water-and-levers system by Hero of Alexandria, who built various mechanisms such as birds that drank water, statues that served wine, and automatic doors, among others (10 AD - 70 AD). The most relevant machine he made was "The Automated Theater", which represented the Trojan War through a mechanical puppet theater.

The development of automata continued. In Asia, the Karakuri were wooden mechanisms that represented traditional myths and legends (YOKOTA, 2009). In the eighteenth century, Pierre Jaquet-Droz built three sophisticated automata: the musician, the draughtsman, and the writer (VOSKUHL, 2007). In 1920, the writer Karel Capek created the term robot in his science fiction work "Rossum's Universal Robots (R.U.R.)" (CAPEK, 2004). This term comes from the Czech word robota, which means forced labor. Furthermore, Capek's work is the starting point for the creation of stories with artificial human bodies.

In the twentieth century, the development of integrated circuits, digital computers, and miniaturized components allowed computer-controlled robots to be developed. Robotics established itself as the field of science related to the design and production of robots. In 1948, the Argonne National Laboratory developed the first robot that manipulated radioactive elements (PAUL, 1984), inaugurating the branch of manipulator robots. These robots became essential components in industry, especially in the automotive industry. More recently, robots have found new applications outside of industry (BARRIENTOS et al., 1997), in areas such as cleaning, entertainment, surveillance, search and rescue, underwater, space, and medical applications. For a long time, human beings have tried to replicate their natural experience in their environment by imitating these events with artificial mechanisms. The experiences related to their own bodies are some of the most challenging, considering that the functioning of the human body and mind is still a mystery in many ways.
Not surprisingly, this field has been an inspiration for artists, engineers, and scientists. In this perspective, we can understand humanoid robotics as the field concerned with the creation of machines that operate in the real world and exhibit human form and behavior (SICILIANO; KHATIB, 2019).

A robot with a body similar to the human body can more easily adapt to the human environment. In this way, we expect these robots to perform simple tasks like walking in a room without colliding with an object, for example. A task like this involves different challenges, such as i) creating a bipedal mechanism with several DOFs, ii) endowing this mechanism with a considerable sensory capacity to make its interaction with the environment possible, and iii) developing techniques to control the stability of the robot during the walk, among others. In modern robotics, the starting point for the design of such mechanisms is kinematics and dynamics. These concepts will be detailed in the following subsections.

2.1.1 KINEMATICS

Humanoid robots interact and move in 3D space, and some strategy is necessary to model this interaction. A humanoid robot is usually modeled as a collection of rigid parts (links) and joints (the connections between two or more links). Kinematics is the theory usually adopted to analyze the relationship between the position and attitude of a link and the joint angles of a mechanism (KAJITA et al., 2014). Figure 1 presents the Marta humanoid robot structure with its joint names (CHENATTI et al., 2018).

Figure 1 - Marta humanoid robot: a) Links; b) Joints (texts) and links (lines connecting joints).

The humanoid robot and its parts have a physical position in the world (absolute position), which can be defined by a fixed coordinate system (x, y, z) called the world coordinate system ($\Sigma_w$) (CRAIG, 2006). A local coordinate system ($\Sigma_a$) can be attached to any moving part, for example, to the left shoulder (Figure 2). The absolute position of the shoulder is defined by the vector $p_a$, and the vector $r$ gives the position of the wrist in relation to the shoulder. The absolute position of the wrist can then be expressed in the following way:

$p_h = p_a + r.$ (1)

The absolute wrist end position after the arm is lifted can be expressed by introducing another vector $r'$, which points to the lifted wrist from the shoulder.

Figure 2 - The world coordinate system $\Sigma_w$ and the local coordinate system of the arm $\Sigma_a$. Vectors and coordinate systems related to the initial (blue) and final (red) position after the rotation of the arm. Inspired by (KAJITA et al., 2014).

The hand lifts by the vector's rotation from $r$ to $r'$ while the shoulder keeps the same position $p_a$:

$p_h = p_a + r'.$ (2)
Unlike the world coordinate system, which is fixed to the ground, the local coordinate system $\Sigma_a$ moves together with the attached link. This coordinate system is defined by the vectors $e_{ax}$, $e_{ay}$ and $e_{az}$, which are parallel to those of $\Sigma_w$ in the initial position of the arm. When the arm rotates by an angle $\phi$ about the axis $e_{ax}$, the local coordinates also rotate. The new vectors can be obtained by the following equations:

$e_{ax} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad e_{ay} = \begin{bmatrix} 0 \\ \cos\phi \\ \sin\phi \end{bmatrix} \quad e_{az} = \begin{bmatrix} 0 \\ -\sin\phi \\ \cos\phi \end{bmatrix}.$ (3)

These vectors define the matrix $R_a = [e_{ax}, e_{ay}, e_{az}]$, which describes the relationship between $r$ and $r'$:

$r' = R_a r.$ (4)

In the example shown in Figure 2 we can observe:
• The position of the end of the wrist $p_h$ viewed from $\Sigma_w$;
• The position of the end of the wrist $p_h^a$ viewed from $\Sigma_a$, where the upper index of $p$ indicates the local coordinate system.

The relationship between $p_h$ and $p_h^a$ can be described from Equations (2) and (4) as:

$p_h = p_a + R_a p_h^a$ (5)

$\begin{bmatrix} p_h \\ 1 \end{bmatrix} = \begin{bmatrix} R_a & p_a \\ 0\ 0\ 0 & 1 \end{bmatrix} \begin{bmatrix} p_h^a \\ 1 \end{bmatrix}.$ (6)

The homogeneous transformation matrix is obtained from the rotation matrix and the translation vector, so in this system $T_a$ converts points described in the arm's local coordinates to world coordinates:

$\begin{bmatrix} p_h \\ 1 \end{bmatrix} = T_a \begin{bmatrix} p_h^a \\ 1 \end{bmatrix}.$ (7)

If we consider the rotations around the x, y and z axes, respectively called Roll ($\alpha$), Pitch ($\beta$) and Yaw ($\gamma$), the rotation matrices will be given by (SPONG; HUTCHINSON; VIDYASAGAR, 2004):

$R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}$ (8)

$R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}$ (9)

$R_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}.$ (10)

Now it is possible to introduce the Z-Y-X Euler angles notation, frequently used to model the attitude of ships, airplanes, and robots (BARRIENTOS et al., 1997). The rotation matrix will be given by:

$R_{rpy}(\alpha,\beta,\gamma) = R_z(\gamma)\, R_y(\beta)\, R_x(\alpha),$ (11)

$R_{rpy}(\alpha,\beta,\gamma) = \begin{bmatrix} c_\gamma c_\beta & -s_\gamma c_\alpha + c_\gamma s_\beta s_\alpha & s_\gamma s_\alpha + c_\gamma s_\beta c_\alpha \\ s_\gamma c_\beta & c_\gamma c_\alpha + s_\gamma s_\beta s_\alpha & -c_\gamma s_\alpha + s_\gamma s_\beta c_\alpha \\ -s_\beta & c_\beta s_\alpha & c_\beta c_\alpha \end{bmatrix},$ (12)

where $c_* \equiv \cos(*)$, $s_* \equiv \sin(*)$, and so on.

2.1.1.1 DIRECT KINEMATICS

Direct kinematics focuses on finding the relative position and orientation of a joint from a reference coordinate system, where the values of each joint and the geometric parameters of the robot are known (PAUL, 1984). In a few words, a robot is a kinematic chain formed by rigid objects joined together by joints; therefore, the direct kinematics problem is reduced to finding a homogeneous transformation matrix T associating the position and orientation of the requested joint (CRAIG, 2006). The kinematic chain can have different representations; the most used is Denavit-Hartenberg (DH). DH defines the i-th link by four transformations (BARRIENTOS et al., 1997):
• $\theta_i$: Rotation around the $z_{i-1}$-axis;
• $d_i$: Translation along the $z_{i-1}$-axis;
• $a_i$: Translation along the $x_i$-axis;
• $\alpha_i$: Rotation around the $x_i$-axis.

The relative position and orientation between the systems associated with two consecutive links of the robot is usually expressed by the matrix $T^{i-1}_i$, so the matrix $T^0_k$ is the result of the product of the matrices $T^{i-1}_i$ with $i \in [1,k]$, which represents the total or partial kinematic chain that forms the robot.
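The chain composition above can be summarized in a short numerical sketch. The snippet below is an illustration only, assuming the standard DH transform; the three-link parameters are hypothetical and do not correspond to Marta's real dimensions, which are given in Chapter 6.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform T_i^{i-1} built from the four DH parameters."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

# Hypothetical 3-joint planar chain: (theta, d, a, alpha) per link.
chain = [(np.deg2rad(30.0), 0.0, 0.10, 0.0),
         (np.deg2rad(-15.0), 0.0, 0.12, 0.0),
         (np.deg2rad(45.0), 0.0, 0.08, 0.0)]

# T_k^0 is the ordered product of the per-link transforms.
T = np.eye(4)
for params in chain:
    T = T @ dh_transform(*params)

R, p = T[:3, :3], T[:3, 3]   # orientation and position of the end joint
print("end-joint position:", p)
```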
To illustrate the previous definitions, the matrix $T^1_{21}$ of the elbow of Marta's left arm (Figure 3) is defined by:

$T^1_{21} = T^1_{19}\, T^{19}_{20}\, T^{20}_{21}$ (13)

$T^1_{21} = \begin{bmatrix} c_{\theta_{19}} & -c_{\alpha_{19}} s_{\theta_{19}} & s_{\alpha_{19}} s_{\theta_{19}} & a_{19} c_{\theta_{19}} \\ s_{\theta_{19}} & c_{\alpha_{19}} c_{\theta_{19}} & -s_{\alpha_{19}} c_{\theta_{19}} & a_{19} s_{\theta_{19}} \\ 0 & s_{\alpha_{19}} & c_{\alpha_{19}} & d_{19} \\ 0 & 0 & 0 & 1 \end{bmatrix} T^{19}_{20}\, T^{20}_{21}$ (14)

$T^1_{21} = \begin{bmatrix} R^1_{21} & p^1_{21} \\ 0\ 0\ 0 & 1 \end{bmatrix},$ (15)

where: $T^1_{19}$ is the homogeneous transformation matrix of the shoulder pitch with respect to the reference coordinate system $\Sigma_M$, $T^{19}_{20}$ is the transformation matrix of the shoulder roll with respect to the shoulder pitch coordinate system, and $T^{20}_{21}$ is the transformation matrix of the elbow pitch with respect to the shoulder roll coordinate system; $c_* \equiv \cos(*)$, $s_* \equiv \sin(*)$; $R^1_{21}$ and $p^1_{21}$ are the relative orientation and position of the elbow pitch.

Figure 3 - Direct kinematics model of the Marta humanoid robot arm. The reference coordinate system ($\Sigma_M$) in the pelvis joint (in red), and the local coordinate system ($\Sigma_{21}$) in the elbow joint (in green).

2.1.1.2 INVERSE KINEMATICS

Inverse kinematics consists of finding the values that the robot's joints $q = [q_1, q_2, \ldots, q_n]^T$ must adopt so that one end of the robot reaches the position $p^*$ and orientation $R^*$ of a determined spatial location. In the previous section, direct kinematics was a systematic procedure, as it is independent of the robot configuration. In contrast, obtaining the inverse kinematics equations is strongly dependent on the robot configuration (BARRIENTOS et al., 1997). Generic procedures have been developed based on the knowledge of the robot kinematics and numerical methods. Kajita et al. (2014) exemplify this method on the robot's leg, so we apply it to the Marta robot shown in Figure 4. The position and orientation of HipYaw ($p_4$, $R_4$) are known, and we want to find the joint angles that place RightAnkleRoll at a desired position and orientation ($p_{10}$, $R_{10}$). The inverse kinematics proceeds in the following steps. First, the position of RightLegYaw ($p_5$) is given by:

$p_5 = p_4 + R_4 \begin{bmatrix} 0 \\ D \\ 0 \end{bmatrix},$ (16)

where D is the distance between HipYaw and RightLegYaw. Next, we calculate the position of the crotch viewed from the RightAnkleRoll coordinate frame:

$r = R_{10}^T (p_5 - p_{10})$ (17)

$r \equiv [r_x\ r_y\ r_z]^T.$ (18)

From this we calculate the distance between RightAnkleRoll and HipYaw, which we define as:

$C = \sqrt{r_x^2 + r_y^2 + r_z^2}.$ (19)

In Figure 4b, using the cosine rule in the triangle formed by the segments A, B and C, we obtain the knee angle ($q_8$):

$q_8 = -\arccos\left(\frac{A^2 + B^2 - C^2}{2AB}\right) + \pi.$ (20)

If we define the angle at the lower end of the triangle as $\alpha$, from the sine rule we get:

$\frac{C}{\sin(\pi - q_8)} = \frac{A}{\sin\alpha}$ (21)

$\alpha = \arcsin\left(\frac{A \sin(\pi - q_8)}{C}\right).$ (22)

Now, focusing on the RightAnkleRoll local coordinates shown in Figure 4c, it is possible to calculate the ankle roll and pitch angles from the vector r:

$q_{10} = \mathrm{atan2}(r_y, r_z)$ (23)

$q_9 = -\mathrm{atan2}\left(r_x, \mathrm{sign}(r_z)\sqrt{r_y^2 + r_z^2}\right) - \alpha.$ (24)

Finally, we still need to compute the yaw, roll and pitch angles at the base of the leg, from the equations that define each joint.

Figure 4 - Calculating the inverse kinematics of the right leg: a) definition; b) knee angle; c) ankle angle.

$R_{10} = R_4 R_z(q_5) R_x(q_6) R_y(q_7) R_y(q_8 + q_9) R_x(q_{10})$ (25)

$R_z(q_5) R_x(q_6) R_y(q_7) = R_4^T R_{10} R_x(-q_{10}) R_y(-q_8 - q_9).$ (26)
When solving the previous expression, we have the following equation:

$\begin{bmatrix} c_5 c_7 - s_5 s_6 s_7 & -s_5 c_6 & c_5 s_7 + s_5 s_6 c_7 \\ s_5 c_7 + c_5 s_6 s_7 & c_5 c_6 & s_5 s_7 - c_5 s_6 c_7 \\ -c_6 s_7 & s_6 & c_6 c_7 \end{bmatrix} = \begin{bmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{bmatrix}.$ (27)

Finally, using the left-hand side of this equation, we obtain:

$q_5 = \mathrm{atan2}(-R_{12}, R_{22})$ (28)

$q_6 = \mathrm{atan2}(R_{32}, -R_{12} s_5 + R_{22} c_5)$ (29)

$q_7 = \mathrm{atan2}(-R_{31}, R_{33}).$ (30)

2.1.2 ZERO-MOMENT POINT (ZMP)

Most industrial robots are fixed to the ground, so their workspace is limited to the range that their joints can reach. Humanoid robots, however, have to move while maintaining an indispensable condition: keeping their feet in contact with the ground. The ZMP is usually used to accomplish this purpose.

The Zero-Moment Point (ZMP) is a concept introduced by Miomir Vukobratovic in 1968, which denotes the point at which the dynamic reaction force produced by the contact of the foot with the ground does not produce any moment in the horizontal direction (VUKOBRATOVIC et al., 2012). In other words, the ZMP is the point where the sum of all the horizontal forces, the inertia, and gravity is equal to zero. In 1972, Vukobratovic and Stepanenko wrote the first article on the control of bipedal robots using the ZMP (VUKOBRATOVIC; BOROVAC, 2004).

Figure 5 shows the distribution of force across the foot. The load has the same sign over the entire surface, so it can be reduced to a single resultant force R; the ZMP is the point where the force R acts.

Figure 5 - The definition, vertical force and horizontal force of the ZMP (KAJITA et al., 2014).

Another important concept related to the ZMP is the support polygon. In Figure 6, the support polygon is the area formed by all the contact points between the robot and the ground. It is relevant to examine Vukobratovic's definition of the ZMP: "the point that exists within the limit of the foot". The relationship between the Center of Mass (CoM), the ZMP, and the support polygon is illustrated in Figure 6. Figure 6a shows a support polygon formed by two feet in contact with the ground, where the projection of the CoM on the ground coincides with the ZMP, which means that the robot is in equilibrium. On the other hand, Figure 6b shows a single support polygon in which the CoM projection is outside the support polygon; consequently, its control is complicated.

Figure 6 - Relationship between the CoM, the ZMP and the support polygon (KAJITA et al., 2014): a) a standing human; b) a human in action.

2.1.3 DYNAMICS

In the previous subsection, we presented the CoM, ZMP, and support polygon concepts that allow the humanoid robot to keep its balance. Now we analyze the relationship between the reaction force of the ground and the robot's movement, following Kajita's deduction (KAJITA et al., 2014). A humanoid robot has a structure similar to a human, so we define the following ten physical parameters that allow the robot to balance, classified into four groups:
• Mass: total mass of the robot, $M$ [kg];
• Center of Mass: robot's center of mass, $C \equiv [x\ y\ z]^T$ [m];
• Momentum: measure of the robot's translational motion, $P \equiv [P_x\ P_y\ P_z]^T$ [Ns];
• Angular Momentum: measure of the robot's rotational motion about the origin, $L \equiv [L_x\ L_y\ L_z]^T$ [Nms].

The dynamics are governed by the physical laws relating these quantities; the principle is expressed by:

$\dot{C} = P/M$ (31)

$\dot{P} = f_{all}$ (32)

$\dot{L} = \tau_{all},$ (33)

where $f_{all}$ and $\tau_{all}$ denote the sum of all forces and torques applied to the robot from the outside.

Equation (31), concerning the translational motion, gives the relationship between the velocity of the center of mass and the momentum. Equation (32), also for the translational motion, shows how the momentum changes due to external forces. Finally, Equation (33) shows that the angular momentum changes according to the sum of the external moments.

Analyzing the forces applied from outside the robot, we have the gravitational force applied to each component of the robot, whose sum can be considered as a force Mg applied to the robot's center of mass C. On Earth, the vector of gravitational acceleration is given by $g = [0\ 0\ -9.8]^T$ [m/s²]. So the equations become:

$f_{all} = Mg + f$ (34)

$\dot{P} = Mg + f,$ (35)

where f denotes the force other than gravity, for example, the ground reaction force. When a robot stands still, the momentum change is zero, as the gravitational force balances out with the ground reaction force. Free fall occurs when the ground reaction force disappears; at that moment the robot accelerates rapidly downward due to gravity.

We now take into account the dynamics of the rotational motion, given by Equation (33). The moment generated by the gravitational force is given by:

$\tau_g = C \times Mg.$ (36)

If $\tau$ denotes the moment without the effect of gravity, the total moment applied to the robot is given by:

$\tau_{all} = C \times Mg + \tau.$ (37)

The equation of the rotational movement about the origin can then be expressed as follows:

$\dot{L} = C \times Mg + \tau.$ (38)

The moment $\tau$ considers only the ground reaction moment applied to the robot. For a robot to stay still, this moment must balance with the gravitational one. If the ground reaction moment is not balanced, the angular momentum increases rapidly, causing the robot to fall.

2.1.4 MOMENTUM

A humanoid robot can be seen as composed of N point masses, where $m_i$ is the mass of the i-th point (KAJITA et al., 2014):

$M = \sum_{i=1}^{N} m_i$ (39)

$C = \sum_{i=1}^{N} m_i p_i / M$ (40)

$\dot{C} = \sum_{i=1}^{N} m_i \dot{p}_i / M$ (41)

$P = \sum_{i=1}^{N} m_i \dot{p}_i.$ (42)

Equation (40) represents the position of the robot's center of mass, where $p_i$ is the position of the i-th mass point. Equation (42) gives the total momentum of the robot, where $m_i \dot{p}_i$ is the momentum of the i-th mass point. So, by combining Equations (41) and (42):

$\dot{C} = P/M.$ (43)

The dynamics of the robot's momentum is given by the equation of motion of the i-th point mass:

$m_i \ddot{p}_i = \sum_{j=1}^{N} f^{int}_{ij} + f^{ext}_i,$ (44)

where $f^{int}_{ij}$ and $f^{ext}_i$ denote the force applied to the i-th point mass by the j-th one and the force applied to the i-th point mass from outside the robot, respectively. However, due to the law of action and reaction, $f^{int}_{ij} = -f^{int}_{ji}$ ($i \neq j$). This, together with Equation (44), allows us to conclude that the robot's momentum depends only on the external forces:

$\sum_{i=1}^{N} m_i \ddot{p}_i = \sum_{i=1}^{N} f^{ext}_i$ (45)

$\dot{P} = f_{all}.$ (46)

On the other side, the angular momentum of the i-th point mass about the origin is defined by:

$L_i = p_i \times P_i.$ (47)
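These point-mass sums translate directly into code. The sketch below is a minimal illustration of Equations (39)-(43) and (47) with made-up masses, positions and velocities; it is not taken from the controller developed in this work.

```python
import numpy as np

# Hypothetical point masses [kg], positions [m] and velocities [m/s].
m = np.array([2.0, 1.5, 0.5])
p = np.array([[0.0, 0.0, 0.6], [0.0, 0.1, 0.3], [0.1, 0.0, 0.9]])
v = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.0]])

M = m.sum()                                   # Eq. (39): total mass
C = (m[:, None] * p).sum(axis=0) / M          # Eq. (40): center of mass
P = (m[:, None] * v).sum(axis=0)              # Eq. (42): total linear momentum
C_dot = P / M                                 # Eq. (43): CoM velocity
L = np.cross(p, m[:, None] * v).sum(axis=0)   # Eq. (47) summed over i: angular momentum about the origin

print(f"M = {M:.2f} kg, C = {C}, P = {P}, L = {L}")
```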
A robot also contains rigid bodies connected by joints. Consider a rigid body that floats in space and rotates without being affected by external forces; its rotational speed is expressed by the angular velocity vector $\omega$. Assuming that the origin of the reference coordinate system coincides with the center of mass, the velocity at a point of the rigid body is expressed by:

$v_i = v(p_i) = \omega \times p_i.$ (48)

In addition, the total angular momentum of a rigid body about the reference point r can be calculated by:

$\mathcal{L}^r = \sum_i p_i \times (m_i \omega \times p_i)$ (49)

$\mathcal{L}^r = \sum_{i=1}^{N_N} m_i p_i \times (-p_i \times \omega)$ (50)

$\mathcal{L}^r = \left(\sum_{i=1}^{N_N} m_i \hat{p}_i \hat{p}_i^T\right)\omega,$ (51)

where $N_N$ is the total number of point masses about the reference r, and $\hat{p}_i$ is the skew-symmetric matrix such that $\hat{p}_i v = p_i \times v$. The angular momentum above is expressed by the angular velocity vector multiplied by a coefficient matrix I, called the inertia tensor:

$L = \left(\sum_i m_i \hat{p}_i \hat{p}_i^T\right)\omega$ (52)

$L = I\omega$ (53)

$I \equiv \sum_i m_i \hat{p}_i \hat{p}_i^T,$ (54)

where L is the total angular momentum of the rigid body. We do not need to compute the inertia tensor of objects with uniform density and typical shapes, since it is tabulated. For example, for a cuboid whose edge lengths are $l_x$, $l_y$ and $l_z$ and whose mass is m:

$I = \begin{bmatrix} \frac{m}{12}(l_y^2 + l_z^2) & 0 & 0 \\ 0 & \frac{m}{12}(l_x^2 + l_z^2) & 0 \\ 0 & 0 & \frac{m}{12}(l_x^2 + l_y^2) \end{bmatrix}.$ (55)

2.2 INVERTED PENDULUM

The inverted pendulum is one of the most popular systems adopted in the control field. The purpose of the control is to stabilize a naturally unstable open-loop system (ROBERGE, 1960). We can analyze this system in a two-dimensional world, since the robot's movement is restricted to the sagittal plane defined by the axis of the walking direction and the vertical axis. For this reason, we consider that all of the robot's mass is concentrated at the Center of Mass (CoM) and that the robot has massless legs that touch the ground at individual rotary joints (Figure 7).

Figure 7 - Bi-dimensional inverted pendulum: the simplest model for a robot, adapted from (KAJITA et al., 2014).

The robot dynamics control is a challenging task, and the problem can be modeled based on the linear inverted pendulum. As shown in Figure 7, the fixed point is the ZMP; the horizontal component of the force f remains, while the vertical component is canceled by gravity. The horizontal component, which accelerates the CoM horizontally, is given by (KAJITA et al., 2014):

$M\ddot{x} = f\sin\theta$ (56)

$M\ddot{x} = \frac{Mg}{\cos\theta}\sin\theta$ (57)

$M\ddot{x} = Mg\tan\theta$ (58)

$M\ddot{x} = Mg\frac{x}{z},$ (59)

where x and z give the CoM position of the inverted pendulum. By rewriting the above equation, we obtain a differential equation for the horizontal dynamics of the CoM:

$x_{ZMP} = x - \frac{z}{g}\ddot{x}.$ (60)

Assuming that in our linear inverted pendulum model the height z is constant, we can solve the ordinary differential equation:

$x(t) = x(0)\cosh(t/T_c) + T_c\, \dot{x}(0)\sinh(t/T_c)$ (61)

$\dot{x}(t) = \frac{x(0)}{T_c}\sinh(t/T_c) + \dot{x}(0)\cosh(t/T_c)$ (62)

$T_c \equiv \sqrt{z/g},$ (63)

where $T_c$ is a time constant that depends on the height of the CoM and the acceleration of gravity. The initial position and velocity are given by $x(0)$ and $\dot{x}(0)$, which together are called the initial conditions.
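The closed-form solution of Equations (61)-(63) can be checked numerically. The following sketch is illustrative only: the CoM height and initial conditions are hypothetical values, not Marta's.

```python
import numpy as np

g = 9.8                      # gravitational acceleration [m/s^2]
z = 0.6                      # hypothetical constant CoM height [m]
Tc = np.sqrt(z / g)          # Eq. (63): time constant

x0, v0 = 0.02, 0.1           # hypothetical initial CoM position [m] and velocity [m/s]

def com_state(t):
    """Analytic LIPM solution, Eqs. (61) and (62)."""
    x = x0 * np.cosh(t / Tc) + Tc * v0 * np.sinh(t / Tc)
    v = (x0 / Tc) * np.sinh(t / Tc) + v0 * np.cosh(t / Tc)
    return x, v

for t in (0.0, 0.1, 0.2, 0.3):
    x, v = com_state(t)
    x_ddot = (g / z) * x                 # from Eq. (59) with the pivot at the origin
    x_zmp = x - (z / g) * x_ddot         # Eq. (60): the ZMP stays at the pivot (zero)
    print(f"t={t:.1f}s  x={x:.4f} m  xdot={v:.4f} m/s  x_zmp={x_zmp:.4f} m")
```

Note how the hyperbolic terms make the CoM diverge from the pivot over time, which is exactly the open-loop instability that the controller must counteract.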
In this chapter, we emphasized the kinematic and dynamic concepts of humanoid robots, which give us the theoretical foundations to understand the complexity of the controllers. In the next chapter, we will focus on the concepts and algorithms of controllers for humanoid robots.

3 MACHINE LEARNING CONTROL

"Keep your love of nature, for that is the true way to understand art more and more."
Vincent Van Gogh

Control can be defined as "to act, to implement decisions that guarantee that a device behaves as desired" (ANDREI, 2006). This concept is one of the fundamental concepts of robotics and was once imagined as a possible solution to slavery. For example, we can see it in the book "Politics" by Aristotle (384-322 BC) (BENNETT, 1979), where he wrote: "... if every instrument could accomplish its own work, obeying or anticipating the will of others ... if the shuttle weaved and the pick touched the lyre without a hand to guide them, chief workmen would not need servants, nor masters slaves". The text shows us the idea of having a slave machine that works automatically, a description similar to the origin of the word robot presented in Capek's work (CAPEK, 2004). Therefore, the concept of controlling an object is inherent to human history. Next, we detail the relevant investigations in control theory.

The first significant work was James Watt's centrifugal governor controller in the eighteenth century. It is a device that automatically controls the speed of steam engines; however, it showed an oscillatory behavior as the machines got bigger and more powerful (LEIGH, 2012). In 1868, James Clerk Maxwell developed a formal analysis of the dynamics of the centrifugal governor. He described and analyzed self-oscillation, that is, how delays in the system lead to overcompensation and unstable behavior (OGATA, 2010). In 1895, Hurwitz and his colleagues created the Routh-Hurwitz theorem, which allowed the verification of the stability of the hydroelectric power station in Davos, Switzerland (LEIGH, 2012). This theorem determines whether a linear dynamical system is stable without solving the system. The works of Minorsky, Hazen, and Nyquist laid the foundations of control theory. Minorsky (1922) worked on automatic controllers for piloting ships and demonstrated their stability with differential equations. In 1932, Nyquist developed a simple procedure for determining closed-loop system stability based on the open-loop response to steady-state sinusoidal inputs. Hazen (1934) introduced the term servomechanisms for position control systems (OGATA, 2010).

The investigators of the Servomechanisms Laboratory group observed the difficulty of analyzing high-order servomechanisms with the differential equations approach, and thus investigated frequency response methods (BENNETT, 1993). These methods allowed the design of closed-loop linear control systems. In the 1940s, Ziegler and Nichols created rules for tuning the PID control used in industrial systems (OGATA, 2010). From 1948 to 1950, Evans worked on his proposal of the root-locus method, which allows a rapid system evaluation through the closed-loop characteristics that vary with changes in gain (BENNETT, 1993). The frequency response methods and the root-locus method are the core of classical control theory, providing a stable system that satisfies a set of performance requirements.

In the 1960s, digital computers made time-domain analysis of complex systems feasible, giving rise to modern control theory, based on time-domain analysis and synthesis using state variables. This theory addressed the increasing complexity of modern plants and the stringent requirements of precision, weight, and cost (OGATA, 2010). Modern control theory simplifies the design of the control system based on the physical system model. However, the system stability is sensitive to the error between the model and the system. Therefore, the system may not be stable, which can be avoided by setting an error range and keeping the control system within that range. This method is called robust control (OGATA, 2010). Robust control designs the system model with a priori uncertainty and guarantees stability; however, the uncertainty is not updated during the system operation (SKOGESTAD; POSTLETHWAITE, 2007). The field of machine learning is the study of computer algorithms that improve automatically through experience (MITCHELL, 1997). Consequently, machine learning control originated, merging these fields. This control approach formulates the design of control systems as regression problems (BÄCK; SCHWEFEL, 1993), identifies and optimizes control parameters (LEE et al., 1997; ALFARO-CID et al., 2008), and determines the control law with continuous updating as the system performance changes (BARTO, 1994).

We have briefly detailed the relevant works on control theory and its evolution. In the following sections, we explain control theory and machine learning: the control theory section expounds basic concepts and PID control, while the machine learning section explains evolutionary algorithms and reinforcement learning.

3.1 CONTROL THEORY

Control theory defines control rules to achieve a goal in a certain period; its concepts are simple and independent of the application. Consequently, control theory is applied to everything from everyday to complex tasks (LEIGH, 2012). For example, controlling a robot to move involves calculating the robot's position, analyzing possible collisions, estimating the movement's trajectory, and executing the movement. To understand control systems, we must know the following terminology (OGATA, 2010):
• Plant: physical object that performs a particular operation, such as a mechanical device, an industrial machine, or a robot.
• Process: method of doing something; in this case, the operation that is the objective of the control.
• System: combination of components that act together to meet an objective. A system can be a physical or abstract phenomenon.
• Disturbance: a signal that negatively affects the output of the system.
• Feedback control: the process of creating a feedback loop to modify the behavior of a system through actions determined by system measurements.

3.1.1 FEEDBACK CONTROL

Figure 8 illustrates a feedback control system. The blue box and the yellow box represent the plant and the control law, respectively. The control law input is the sensor measurement (y), and the actuators (u) modify the plant behavior. The plant is subject to disturbances (w), and the goal is to optimize a cost function (J) (DURIEZ; BRUNTON; NOACK, 2017). There are two types of control loops: open-loop control and closed-loop control (feedback).

Figure 8 - General framework for feedback control (DURIEZ; BRUNTON; NOACK, 2017).
Consequently, machine learning control emerged from the merging of these two fields. This type of control focuses on designing control systems as regression problems (BÄCK; SCHWEFEL, 1993), on identifying and optimizing control parameters (LEE et al., 1997; ALFARO-CID et al., 2008), and on determining the control law with continuous updating as the system performance changes (BARTO, 1994).

We have briefly reviewed the relevant works on control theory and its evolution. The following sections cover control theory and machine learning: the control theory section expounds basic concepts and PID control, while the machine learning section explains evolutionary algorithms and reinforcement learning.

3.1 CONTROL THEORY

Control theory defines control rules to achieve a goal within a certain period; its concepts are simple and independent of the application. Consequently, control theory is applied to everything from everyday to complex tasks (LEIGH, 2012). For example, controlling a robot to move involves calculating the robot's position, analyzing possible collisions, estimating the movement's trajectory, and executing the movement.

To understand control systems, we must know the following terminology (OGATA, 2010):

• Plant: physical object that performs a particular operation, such as a mechanical device, an industrial machine, or a robot.
• Process: method of doing something; in this case, the operation that is the objective of the control.
• System: combination of components that act together to meet an objective. A system can be a physical or an abstract phenomenon.
• Disturbance: a signal that negatively affects the output of the system.
• Feedback control: a process of creating a feedback loop to modify the behavior of a system through actions determined by system measurements.

3.1.1 FEEDBACK CONTROL

Figure 8 illustrates a feedback control system. The blue box and the yellow box represent the plant and the control law, respectively. The control law input is the sensor measurement (y), and the actuators (u) modify the plant behavior. The plant is subject to disturbances (w), and the controller seeks to optimize a cost function (J) (DURIEZ; BRUNTON; NOACK, 2017). There are two types of control loops: open-loop control and closed-loop (feedback) control.

Figure 8 - General framework for feedback control (DURIEZ; BRUNTON; NOACK, 2017).

Figure 9 illustrates the open-loop control architecture. In this strategy, the actuation signal u is chosen based on knowledge of the system, so as to produce the desired output matching the commanded reference signal (wr) (DURIEZ; BRUNTON; NOACK, 2017). However, external disturbances (wd) and sensor noise (wn) degrade the controller's performance. Time-dependent systems use this architecture; for example, a washing machine controller is determined by fixed periods of soaking, washing, and rinsing.

Figure 9 - Open-loop control architecture (DURIEZ; BRUNTON; NOACK, 2017).

The advantages of open-loop control systems are simple construction, easy maintenance, and economic feasibility for outputs that are difficult to measure accurately. However, disturbances and calibration changes cause errors in the system output (OGATA, 2010). The closed-loop system, on the other hand, involves sensor feedback, which allows the stabilization of an unstable system by reducing the actuation error (OGATA, 2010).
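To make the contrast concrete, the sketch below simulates a simple first-order plant under a constant disturbance, first in open loop and then with proportional feedback. This is an illustrative toy example, not taken from this work; the plant model and all values are hypothetical:

```python
def simulate(kp, wr=1.0, wd=0.3, dt=0.01, steps=500):
    """First-order plant dy/dt = -y + u + wd.
    kp = 0 reduces to pure open loop (u = wr); kp > 0 adds feedback."""
    y = 0.0
    for _ in range(steps):
        e = wr - y               # error between reference and output
        u = wr + kp * e          # control law
        y += dt * (-y + u + wd)  # Euler integration of the plant
    return y

print(simulate(kp=0.0))   # open loop: settles near 1.3, offset by wd
print(simulate(kp=20.0))  # closed loop: settles near 1.01, disturbance rejected
```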
The error e is the difference between the reference signal wr and the feedback signal y (Figure 10). Closed-loop control can also compensate for external disturbances wd and attenuate noise wn (DURIEZ; BRUNTON; NOACK, 2017).

Figure 10 - Closed-loop control architecture (DURIEZ; BRUNTON; NOACK, 2017).

3.1.2 PID CONTROL

The Proportional-Integral-Derivative (PID) controller is the most widely used controller, solving approximately 90 to 95% of control problems. Its operation relies on the feedback signal to eliminate existing deviations (LEVINE, 1996). PID control is generally the sum of three terms (Figure 11). Therefore, this control law is described as:

u(t) = u_P(t) + u_I(t) + u_D(t),   (64)

where u_P is the proportional part, u_I the integral part, and u_D the derivative part.

Figure 11 - PID control scheme (DURIEZ; BRUNTON; NOACK, 2017).

The proportional part is a simple feedback (Equation (65)), proportional to its input, the error signal as a function of time e(t). The error is the difference between the reference signal (desired value) and the feedback signal (process variable). The factor K_p (proportional gain) does not modify the shape of the input signal; it only introduces a scaling factor that amplifies or reduces the value applied at the plant input (PINTO, 2014):

u_P(t) = K_p e(t).   (65)

Proportional control exhibits a steady-state error. A manually adjustable reset term can be added to the control signal to eliminate this error (LEVINE, 1996). The proportional controller of Equation (65) then becomes:

u_P(t) = K_p e(t) + u_b(t),   (66)

where u_b is the reset term.

The integral part is an automatic adjustment of the reset term. It filters the low-frequency content of the error signal and adds it to the proportional part. This part is given by:

u_I(t) = K_i \int_0^t e(t)\,dt   (67)

u_I(t) = \frac{K_p}{T_i} \int_0^t e(t)\,dt,   (68)

where K_i is the integral factor and T_i is the integral time.

The derivative part provides an anticipatory action, given by:

u_D(t) = K_d \frac{de(t)}{dt}   (69)

u_D(t) = K_p T_d \frac{de(t)}{dt},   (70)

where K_d is the derivative factor and T_d is the derivative time.

The parameters of the controller are selected through controller tuning; the Ziegler-Nichols method is explained in the next subsection.
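Before moving on, a minimal discrete-time sketch of the control law in Equations (64)-(70), using a running sum for the integral and a backward difference for the derivative. This is an illustrative implementation, not the controller used in this work; the gains and setpoint in the usage lines are hypothetical:

```python
class PID:
    """Discrete PID: u(t) = Kp*e + Ki*integral(e) + Kd*de/dt (Eq. 64)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, reference, measurement):
        error = reference - measurement                    # e(t) = wr(t) - y(t)
        self.integral += error * self.dt                   # integral part, Eq. (67)
        derivative = (error - self.prev_error) / self.dt   # derivative part, Eq. (69)
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)

# Usage: one control step for a joint-angle setpoint (values are arbitrary).
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.01)
u = pid.step(reference=1.0, measurement=0.8)
```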
3.1.2.1 ZIEGLER-NICHOLS TUNING

Ziegler and Nichols proposed rules to determine the proportional gain (Kp), the integral time (Ti), and the derivative time (Td). The method is based on the characteristics of the plant dynamics. There are two ways of performing the Ziegler-Nichols adjustment: the step response method and the frequency response method (OGATA, 2010).

The first method determines Kp, Ti, and Td according to Table 1. Figure 12 shows a unit-step input producing an S-shaped output curve. This curve is characterized by two constants: the delay time Lz and the time constant Tz, obtained from the tangent line at the inflection point (OGATA, 2010).

Table 1 - Ziegler-Nichols tuning rule based on step response.

Type of Controller | Kp        | Ti     | Td
P                  | Tz/Lz     | ∞      | 0
PI                 | 0.9 Tz/Lz | Lz/0.3 | 0
PID                | 1.2 Tz/Lz | 2 Lz   | 0.5 Lz

Figure 12 - Unit-step response of a physical system, with the tangent line at the inflection point defining Lz and Tz (OGATA, 2010).

In the second method, the proportional gain is increased until an oscillatory response with constant amplitude is obtained. At this point, the critical gain (Ku) and the critical oscillation period (Tu) are determined. Figure 13 shows an example of such a response. The Ku and Tu values are used in Table 2 to obtain the PID settings.

Figure 13 - Example of the frequency response method (PINTO, 2014).

Table 2 - Ziegler-Nichols tuning rule based on frequency response.

Type of Controller | Kp      | Ti     | Td
P                  | 0.5 Ku  | ∞      | 0
PI                 | 0.45 Ku | Tu/1.2 | 0
PID                | 0.6 Ku  | 0.5 Tu | 0.125 Tu
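The frequency-response rules in Table 2 translate directly into code. A small sketch, with hypothetical Ku and Tu values; the conversion to parallel-form gains follows Ki = Kp/Ti and Kd = Kp*Td from Equations (68) and (70):

```python
def ziegler_nichols_pid(ku, tu):
    """PID settings from the critical gain Ku and period Tu (Table 2)."""
    kp = 0.6 * ku
    ti = 0.5 * tu
    td = 0.125 * tu
    # Convert to parallel-form gains: Ki = Kp / Ti, Kd = Kp * Td.
    return kp, kp / ti, kp * td

# Suppose the plant oscillates steadily at Ku = 10 with period Tu = 0.8 s.
kp, ki, kd = ziegler_nichols_pid(ku=10.0, tu=0.8)
print(kp, ki, kd)  # 6.0, 15.0, 0.6
```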
3.2 MACHINE LEARNING

The field of Artificial Intelligence (AI) emerged at a workshop at Dartmouth College in 1956, making use of concepts from many other science and engineering disciplines. This field is typically concerned with problems like playing games, proving mathematical theorems, writing poetry, driving a car in the street, and diagnosing diseases. Historically, robotics has walked side-by-side with AI, focusing on building robots that can perform complex tasks in dynamic, unpredictable, and unknown environments.

One of the branches of AI that has experienced significant growth in the last decades is Machine Learning (ML). This field studies algorithms and techniques for automating solutions to complex problems that are hard to program using conventional programming methods (REBALA; RAVI; CHURIWALA, 2019). ML is based on the premise that systems improve from experience; that is, a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E (MITCHELL, 1997).

ML agglutinates different families of algorithms, divided according to the nature of the input/output, the data/model, or the algorithms' theoretical basis. Considering the relationship between input and output, there are four types of learning: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

In supervised learning, the program receives a labeled data set, from which it has to learn the fundamental characteristics of each class to create a general model, so that the next time new data is provided it can respond with the correct label. For example, the machine receives thousands of images classified as mammals, birds, fish, and reptiles; it learns the key characteristics that determine the set to which each image belongs. If an image of a new animal is presented to the machine, it should determine the set to which the animal belongs.

Unsupervised learning, on the other hand, receives an unlabeled data set, so this family of ML algorithms usually focuses on identifying similarity trends. For example, a toy store has customer information and needs to know the profile of customers buying teddy bears. The machine analyzes the information and determines that male customers between 18 and 25 years old are the ones who buy them; in reality, the machine determined a purchase pattern over all customers in the store, from which this answer follows.

Semi-supervised learning lies between supervised and unsupervised learning. The machine receives a large data set, where only part of the data is labeled. Typically, an algorithm of this family uses clustering techniques (unsupervised learning) to determine groups within the data set and then determines the label of each group from the labeled data it contains (REBALA; RAVI; CHURIWALA, 2019). The most obvious benefit is the reduced effort and time spent labeling data. For example, a company has 1 billion images of good and bad parts, and the machine must determine which parts are bad; only some of these images would have to be labeled before running the algorithm for the machine to learn.

Reinforcement learning addresses how an agent learns to make decisions based on perception and action in its environment. This process does not need a training set and is based on trial and error. For example, a robot needs to go in a straight line from one point to another in a room. The robot starts to move in this environment; if it falls, the environment gives it a penalty, otherwise it receives a reward.

Another important family of ML algorithms is the one related to optimization. Optimization is the process of selecting the hypothesis or model such that an objective function – which describes the quality of the solution – is optimized. The optimal solution is the one in which the variables maximize or minimize the objective function. Evolutionary Algorithms (EA) are typical representatives of this class of algorithms.

The next sections describe in more detail two families of ML algorithms that are particularly relevant to this work: i) Evolutionary Algorithms and ii) Reinforcement Learning.

3.2.1 OPTIMIZING WITH EVOLUTIONARY ALGORITHMS

In the last decades, there has been growing interest in engineering problem-solving systems based on evolution principles, since the natural evolution process allows biological systems to be sophisticated, robust, and adaptable (MITCHELL, 1998), (FLOREANO; MATTIUSSI, 2008). Holland and Rechenberg introduced this field of study. They created optimization algorithms based on a population of possible solutions which, through the selection of individuals by fitness together with some genetic operators, allows an optimal solution to be obtained (RECHENBERG, 1978), (HOLLAND et al., 1992).

The basic principles of evolutionary systems derive from the principles of life described by Darwin in 1859, which state the following (ROTHLAUF, 2006):

• There is a population of individuals with different properties and abilities.
• Nature creates new individuals with properties similar to those of existing individuals.
• Promising individuals are selected more frequently for reproduction by natural selection.

3.2.2 PRINCIPLES OF EVOLUTION

Natural evolution theory is based on four pillars: population, diversity, inheritance, and selection. Two or more individuals define a population; an individual is a form of life made up of cells carrying genomes. A genome is a complete set of chromosomes, which are the carriers of hereditary biological characteristics. Chromosomes are made up of genes, which are responsible for both the structure and the processes of organisms. An allele is any variation of a gene, generally arising through mutation; it is responsible for heritable variation (NEAPOLITAN; JIANG, 2018).

In summary, evolutionary systems define an individual through its genes, so that no individual is genetically identical to another (diversity).
Individuals can reproduce, so that individual characteristics are passed on to their offspring (inheritance). Darwinian theory shows that the fittest individuals have the highest probability of reproducing, due to natural selection. New generations adapt to the changing environment, and new behavioral characteristics, morphologies, functionalities, and natural niches allow greater adaptation to changing environments. Furthermore, natural selection ensures that negative traits are eliminated in future generations and only beneficial traits are propagated (FLOREANO; MATTIUSSI, 2008).

The basic principles of evolutionary theory are sufficiently explanatory and relatively simple to be condensed into a model and applied in several areas. Several formal models have been developed to address specific problems, mainly in the field of optimization (FLOREANO; MATTIUSSI, 2008).

3.2.3 GENETIC ALGORITHM

In 1975, John Holland, in the book "Adaptation in Natural and Artificial Systems", presented the Genetic Algorithm (GA) as an abstraction of biological evolution. Holland relied on the formal study of the phenomenon of adaptation in nature and then developed mechanisms of natural adaptation that could be imported into computer systems (MITCHELL, 1998). A GA operates on a population of m individuals, each of which can be represented by a string of bits. This population undergoes "natural selection", determined by an evaluation function that selects the most suitable individuals for reproduction; together with operators inspired by genetics – mutation, combination (crossover), and inversion – these produce the next generation of individuals. Figure 14 shows an example of the GA pipeline.

The fundamental part of a GA is the selection of individuals from the population for reproduction. Each individual has an associated fitness, which determines how good that individual is; the fitness function f(*) defines the aptitude of the individual. For example, suppose we must obtain the parameters of each joint for a humanoid robot to remain standing. We have as input the position of the robot, and our fitness function is determined by the error between the initial and current positions of the robot. A central difficulty of GAs lies in choosing an appropriate fitness function.

The selection of the individuals that reproduce is based on the fitness function, and there are several strategies for this selection. The first idea is to choose those with the highest fitness; however, this strategy carries a certain risk of loss of diversity and of convergence to a sub-optimal solution. Frequently, a proportional selection is applied, where the probability p(i) of the i-th individual reproducing is given by the ratio between its fitness value f(i) and the sum of the fitness values of all individuals in the population (MITCHELL, 1998):

p(i) = \frac{f(i)}{\sum_{i=1}^{m} f(i)}.   (71)

The expected number of individuals in the next generation (m) is usually the same as in the previous population. Biological mutations inspired the creation of genetic operators, of which there is a wide variety; the choice of operators depends on the individual's structure.

Figure 14 - Main features of the Genetic Algorithm: the population of m bit-string individuals is evaluated by f(*), individuals are selected, genetic operators are applied, and a new generation is produced.
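As an illustrative sketch of this pipeline (not the GA configuration used in this work; the representation, operators, and parameters here are arbitrary choices), a minimal generational GA over bit strings could look as follows:

```python
import random

def genetic_algorithm(fitness, n_genes, m=50, generations=100, p_mut=0.01):
    """Minimal generational GA over bit strings (Figure 14 pipeline)."""
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(m)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        new_pop = []
        for _ in range(m):
            # Proportional (roulette-wheel) selection, Eq. (71).
            p1, p2 = random.choices(pop, weights=scores, k=2)
            cut = random.randrange(1, n_genes)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            for g in range(n_genes):                     # bit-flip mutation
                if random.random() < p_mut:
                    child[g] = 1 - child[g]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Toy usage: maximize the number of ones in the string.
best = genetic_algorithm(fitness=sum, n_genes=20)
```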
The crossover operator, the most commonly used, allows exchanging sub-parts of two individuals, mimicking biological recombination between two organisms. The mutation operator randomly changes a part of an individual. Finally, the inversion operator reverses the order of the individual's genes.

3.2.3.1 GENETIC REPRESENTATIONS

A suitable genetic representation must be devised for the genetic operators used in the Genetic Algorithm, allowing a high probability of generating increasingly better individuals and covering the space of optimal solutions to the problem. There are various representations; the most used ones are detailed below (ROTHLAUF, 2006).

The discrete representation describes an individual using a sequence of l discrete values extracted from an alphabet of cardinality k. One of the most used is the binary alphabet {0, 1}, with cardinality k = 2. In some cases the binary representation can be interpreted directly; in most cases, however, a transformation must be applied before it can be interpreted (FLOREANO; MATTIUSSI, 2008).

Another representation is the floating-point representation, based on a set of n real numbers. This representation is suitable for solutions that require high-precision parameter optimization; the solution space of the problem must be taken into account (MITCHELL, 1998).

The tree representation is suitable for describing hierarchical structures with branching points and conditions, and is used to describe computer circuits, construction procedures, and the planning of experiments or compilers. This representation maps directly onto a tree (ROTHLAUF, 2006).

3.2.3.2 INITIAL POPULATION

The initial population must be large and diverse enough. The population size depends on the properties of the search space and on the computational cost of evaluating all individuals over several generations. The literature often reports populations ranging from a hundred to a few thousand individuals. If an individual's evaluation requires real-world experimentation, as is the case with the evolution of a robot, the population size is usually under one hundred individuals (MITCHELL, 1998).

In the binary and real-valued representations, a random sequence is generated; however, the real values must lie within a predefined range. Another way to initialize the population is to seed it with copies, mutated in one or more alleles, of individuals known to be good or promising. This strategy carries the risk that the population is not diverse enough and that evolution stalls at a sub-optimal solution (FLOREANO; MATTIUSSI, 2008).

3.2.3.3 FITNESS FUNCTION

The fitness function assigns a numerical score to each individual in the population and indirectly assesses the quality of the learning process. There are two critical aspects in designing a fitness function: the choice and the combination of the fitness components. It is possible to optimize multiple objectives of a problem within a single fitness function; combining multiple objectives relies on the properties of the search space and on finding the relationship between the objectives (REEVES; ROWE, 2002). The evaluation of fitness is the slowest part of a genetic algorithm, and the solution's quality depends on how exhaustive the individual's evaluation has been (RECHENBERG, 1978).
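For the standing-robot example mentioned earlier, one fitness component could penalize the drift of the robot's position. A hypothetical sketch; the names, weights, and fall penalty are illustrative, not the fitness function used in this work:

```python
import numpy as np

def standing_fitness(initial_pos, current_pos, fell=False):
    """Fitness for a humanoid that should remain standing: the smaller the
    drift from the initial position, the higher the score; falling is
    penalized harshly."""
    drift = np.linalg.norm(np.asarray(current_pos) - np.asarray(initial_pos))
    penalty = 100.0 if fell else 0.0
    return -(drift + penalty)  # the GA maximizes, so negate the error

print(standing_fitness([0.0, 0.0, 0.6], [0.02, -0.01, 0.59]))
```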
Subjective fitness arises when observers rate individual performance through visual inspection. It is often used in the artistic field, in architectural structures, and in music, where it is difficult to formalize aesthetic qualities into objective fitness functions.

3.2.3.4 SELECTION AND REPRODUCTION

Selection should assign a larger number of descendants to the best individuals in the population. A high selection pressure means that only a small percentage of individuals will be selected for reproduction. This approach results in a rapid increase in fitness; however, it risks the diversity of the population. Therefore, a balance must be maintained between selection pressure and diversity-generating factors (REEVES; ROWE, 2002).

Proportional selection is defined by the probability of an individual making a copy of its genome, given by the ratio between its fitness value and the sum of the fitness values of the entire population. The selection of individuals can be visualized as a roulette wheel where the slot size depends on the probability of each individual. This selection method does not work well when all the individuals have almost similar fitness values, or when one individual obtains a very high probability compared to the others. In the first case, all individuals have an almost equal probability of producing offspring, and evolution is equivalent to a random search. In the second case, almost all the offspring are copies of the fittest individual, whose genes dominate the population and cause premature convergence (FLOREANO; MATTIUSSI, 2008).

There are two solutions to these problems. The first is to scale the fitness values before selection to emphasize or reduce differences; this procedure requires additional parameters. Alternatively, it is possible to rank all the individuals from best to worst and assign reproductive probabilities proportional to each individual's rank; then, no matter how small the difference between two individuals, the better of the two always has a greater chance of producing offspring (MITCHELL, 1998).

Another type of selection is truncated selection, based on ranks. This method takes the first n individuals from the list sorted by the fitness function and copies these individuals m times until the population is complete. Note that n cannot be too small, since that would cause premature convergence. This method is used in populations that do not need an exhaustive evaluation (FLOREANO; MATTIUSSI, 2008).

Tournament selection consists of organizing a tournament among a small subset of individuals to generate each offspring: k individuals are randomly selected from the population, where k is known as the tournament size, and the individual with the best fitness among the k generates an offspring. The individuals that were not chosen return to the population to participate again in other tournaments. Tournament selection allows selection pressure and genetic diversity to be maintained (MITCHELL, 1997).
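A short sketch of tournament selection as just described (illustrative only; the tournament size k and the population structure are arbitrary choices):

```python
import random

def tournament_select(population, fitness, k=3):
    """Pick k random individuals and return the fittest; the losers stay in
    the population and can compete again in later tournaments."""
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

# Toy usage: individuals are bit strings, fitness counts the ones.
pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
parent = tournament_select(pop, fitness=sum, k=3)
```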
In generational replacement, the generated offspring replace the entire previous population. However, in a complex search space, the fitness assessment can be very noisy, or genetic mutations may strongly affect an individual, so that a better individual is lost. For this case there is the elitism strategy, which consists of keeping the n best individuals from the previous population. It is also possible to relax full generational replacement by inserting only a few descendants into the population, which is called gradual generational change (FLOREANO; MATTIUSSI, 2008).

3.2.3.5 GENETIC OPERATORS

Genetic operators capture the effects of biological mutations. This subsection describes the operators most commonly used with the genetic representations seen above. The main task of genetic operators is to introduce diversity into the population and to explore solutions.

The crossover operator is the most used; it allows the offspring to inherit the parents' characteristics through the recombination of the genomes of two selected individuals. The newly created offspring are randomly paired for gene crossing. This operator exploits the synergistic effect of combining the solutions found by the parents. However, when genetic recombination acts like a large random mutation, it can have a deleterious effect on the individual's fitness (MITCHELL, 1998).

One-point crossover can be applied to discrete and real-valued representations. It consists of randomly selecting a crossing point in each of the two strings and exchanging the genetic material between the individuals around this point. Multi-point crossover consists of randomly selecting n crossing points in the two strings and exchanging genetic material between these points (REEVES; ROWE, 2002). For real-valued representations, uniform and arithmetic crossovers are also options: the uniform crossover exchanges the genetic content at n positions chosen at random, whereas the arithmetic crossover creates a new gene by taking the average of n randomly chosen positions from the two gene strings. For tree-based representations, the crossover randomly selects one node in each parent and swaps the two corresponding sub-trees (FLOREANO; MATTIUSSI, 2008).

Mutations, on the other hand, are small random modifications that allow exploring variations of existing solutions. Mutations are useful for escaping local minima and for making further progress in a highly converged population where genetic recombination has little effect. However, the number and size of mutations must be relatively low to avoid the loss of previously discovered solutions (BACK, 1996).

In binary representations, the mutation consists of flipping the selected bit values. In real-valued representations, a selected position is modified by adding a random value drawn from a Gaussian distribution N(0, σ), where 0 is the mean and σ is the variance, producing many small mutations and few large ones. In tree-based representations, a selected node swaps its content with another node from the same subset; if the selected node is a terminal, it is replaced by another element chosen at random from the terminal set (FLOREANO; MATTIUSSI, 2008).
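A sketch of the two most common operators for real-valued genomes, following the description above (one-point crossover and Gaussian mutation; sigma, p_mut, and the genome values are illustrative choices):

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Swap the tails of two genomes around a random crossing point."""
    cut = random.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def gaussian_mutation(genome, sigma=0.1, p_mut=0.05):
    """Add zero-mean Gaussian noise (standard deviation sigma) to each gene
    with probability p_mut: mostly small perturbations, occasionally larger."""
    return [g + random.gauss(0.0, sigma) if random.random() < p_mut else g
            for g in genome]

a, b = [0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0]
child1, child2 = one_point_crossover(a, b)
child1 = gaussian_mutation(child1)
```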
3.2.3.6 SCALING

Regulating which individuals reproduce is critical in GAs with small populations. At the beginning, we may have a few extraordinary individuals in a population of otherwise mediocre individuals. The underlying problem is that the early strongest candidates may lie on or around a local peak of the fitness landscape that is difficult to escape, while the weakest candidates could be closer to the global maximum. Aggressively pruning the population compromises genetic diversity, reducing the explored search space and the search speed. Therefore, we may want to scale the fitness so that the selection pressure remains the same throughout the run (GOLDBERG, 1989).

Linear scaling is a simple procedure in which a linear relationship is defined between the raw fitness f and the scaled fitness f' (f' = a f + b). The coefficients a and b can be chosen in several ways; in all cases, however, it is desired that the scaled average fitness f'_avg be equal to the raw average fitness f_avg, to ensure that each average member of the population contributes one expected offspring to the next generation. Linear scaling misbehaves in the presence of negative fitness values.

Sigma truncation scaling can overcome the presence of poor individuals. This method uses information from the population variance to pre-process the objective function values f before scaling. Sigma truncation scaling is given by:

f' = f - (f_{avg} - c \cdot f_{dev}),   (72)

where f_avg is the mean of the objective function values in the population, the constant c is chosen as a multiple of the population standard deviation f_dev, and negative results are arbitrarily set to zero.

3.2.4 REINFORCEMENT LEARNING

In Reinforcement Learning (RL), the agent learns by interacting with the environment so as to maximize its gain. Imagine a pet learning to bark for food: if we feed our pet every time it barks (action), the pet is likely to repeat this action to get a response from us. The reward determines the agent's behavior. In this example, our pet (the agent) learns an action to achieve a goal without any previous teaching. Reinforcement Learning thus rests on the behavior of an agent that seeks to act in its environment so as to obtain rewards from it.

Formally, the RL agent interacts with its environment, producing a sequence of actions a1, a2, ..., over time. These actions affect the agent's environment, resulting in a reward or punishment rt at each time step t. The algorithm's goal is to learn to act in a way that maximizes some measure of utility in the future (NEAPOLITAN; JIANG, 2018).

Reinforcement Learning comprises the following elements (SUTTON; BARTO, 2018):

• Policy (πt): defines the agent's behavior at time t. The policy maps the perceived states of the environment to the actions the agent must take in those states.
• Reward (Rt): defines the objective of an RL problem, that is, which events are good and bad for the agent. The agent's goal is to maximize the total reward it receives in the long run. The reward signal itself is unalterable, but it can be used as a basis for modifying the policy.
• Value function (Vπ(*)): determines what is better in the long term, whereas the reward indicates what is better in the immediate sense. It is, in general, the total amount of reward that an agent can expect to accumulate in the future, starting from a given state.
• Environment model: allows inferences about how the environment will behave and is used to plan for future situations before they happen. This element may not exist, as there are model-free methods that learn from trial and error.

One of RL's challenges is the trade-off between exploration and exploitation. To obtain a greater reward, an agent must prefer the actions that it has tried in the past and found rewarding (exploitation). However, it may then fall into an over-exploitation problem, focusing only on a particular group of actions and converging to a local solution. On the other hand, if the algorithm explores too much, it may never actually learn the desired solution to the problem. This dilemma is called exploration-exploitation and has been studied intensively; the two tendencies must be balanced, which is very difficult to accomplish (SUTTON; BARTO, 2018).
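One common, simple way to balance the two tendencies is ε-greedy action selection: exploit the best-known action most of the time, but explore a random one with small probability ε. This strategy is not discussed further in this work; the sketch below is purely illustrative, with hypothetical value estimates:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit
    the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(action_values))      # explore
    return max(action_values, key=action_values.get)   # exploit

q = {"left": 0.2, "forward": 0.9, "right": 0.1}  # estimated action values
action = epsilon_greedy(q, epsilon=0.1)          # usually "forward"
```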
In addition, a strand of RL has boosted research in this area in the last decades: deep reinforcement learning algorithms. These methods employ deep neural networks and large-scale computing power to improve the capacity to store the experiences of RL agents. This new capability has allowed solving problems in various fields, such as board games and video games. Another field that makes strong use of it is robotics, so that robots learn complex tasks in dynamic environments (REBALA; RAVI; CHURIWALA, 2019).

The difference between Evolutionary Algorithms and Reinforcement Learning lies in the fact that RL algorithms can remember and use what they have learned over time, whereas evolutionary systems make decisions in each generation based only on information from the current generation.

3.2.4.1 MARKOV DECISION PROCESS

The Markov Decision Process (MDP) is a mathematical model that formalizes sequential decision-making, where actions influence not only the immediate reward but also subsequent situations, or states, and therefore future rewards. Thus, MDPs involve delayed reward and the need to trade off immediate and delayed rewards (SZEPESVÁRI, 2010).

Figure 15 shows a robot responsible for making decisions, called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. The agent continuously selects actions, and the environment responds to these actions and presents new situations to the agent. The environment also generates rewards: numerical values that the agent aims to maximize over time through its choice of actions. This process is mathematically defined as a Markov Decision Process.

Figure 15 - The agent-environment interaction in Reinforcement Learning (SUTTON; BARTO, 2018).

Specifically, the agent and the environment interact in a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the agent receives a representation of the state of the environment, S_t \in S, and selects an action, A_t \in A(s). At the next step, partly as a consequence of the action, the agent receives a reward R_{t+1} and finds itself in a new state S_{t+1}. The sequence is thus S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, ... (SUTTON; BARTO, 2018) (DONG et al., 2020).

A finite MDP occurs when the state, action, and reward sets have a finite number of elements. In this case, the random variables R_t and S_t have a well-defined discrete distribution that depends on the preceding state and action. Therefore, given s' \in S and r \in R, there is a probability that these values occur at time t, given particular values of the preceding state and action:

p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\},   (73)

\sum_{s' \in S} \sum_{r \in R} p(s', r \mid s, a) = 1, \quad \text{for all } s \in S, a \in A(s),   (74)

where s', s \in S, r \in R, and a \in A(s). The function p defines the MDP dynamics; it specifies a probability distribution for each choice of s and a (Eq. 74).

From p, we can derive the state transition probability, denoted by:

p(s' \mid s, a) \doteq \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\},   (75)

p(s' \mid s, a) = \sum_{r \in R} p(s', r \mid s, a).   (76)
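For small finite MDPs, the dynamics function p and its marginals (Equations (73)-(76), plus the expected reward defined next in Equations (77)-(78)) can be represented explicitly. A sketch with a hypothetical two-state MDP; all state, action, and probability values are illustrative:

```python
# Dynamics p(s', r | s, a) as a dict: (s, a) -> {(s', r): probability}.
p = {
    ("s0", "go"):   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "go"):   {("s1", 0.0): 1.0},
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s1", "stay"): {("s1", 0.0): 1.0},
}

def transition_prob(p, s, a, s_next):
    """State transition probability, Eq. (76): marginalize over rewards."""
    return sum(prob for (sp, _), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(p, s, a):
    """Expected reward of the state-action pair, Eq. (78)."""
    return sum(r * prob for (_, r), prob in p[(s, a)].items())

print(transition_prob(p, "s0", "go", "s1"))  # 0.8
print(expected_reward(p, "s0", "go"))        # 0.8
```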
In addition, the expected reward of a state-action pair can be calculated as:

r(s, a) \doteq E\{R_t \mid S_{t-1} = s, A_{t-1} = a\},   (77)

r(s, a) = \sum_{r \in R} r \sum_{s' \in S} p(s', r \mid s, a).   (78)

An MDP is an abstraction of the problem of goal-directed learning from interaction, independent of the details of the sensors, the memory, and the low-level control. In conclusion, RL problems can be reduced to three signals exchanged between the agent and its environment: a signal to represent the agent's actions, a signal to represent the basis of the agent's decisions (states), and a signal to define the agent's objective (rewards) (SUTTON; BARTO, 2018).

3.2.4.2 AGENT'S OBJECTIVE

The agent's objective is to maximize the cumulative reward it receives in the long term. The sequence of rewards received after time t is denoted R_{t+1}, R_{t+2}, R_{t+3}, .... The agent can only maximize the future cumulative reward for the rest of the episode, since it cannot go back in time to change past rewards. The cumulative reward that the agent receives is defined as G_t (REBALA; RAVI; CHURIWALA, 2019):

G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T,   (79)

where T is the final time step. This approach makes sense for applications where the agent-environment interaction naturally breaks into sub-sequences (episodes), for example board games, which end with winning or losing and then return to the initial state of the board.

On the other hand, some agent-environment interactions do not divide into identifiable episodes but continue without limit. In this case, the maximized reward easily tends to infinity. The concept added here is the discount: the agent tries to select actions that maximize the sum of the discounted rewards it receives in the future, such that (SUTTON; BARTO, 2018):

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots   (80)

G_t \doteq R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \ldots)   (81)

G_t \doteq R_{t+1} + \gamma G_{t+1},   (82)

where γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate. If γ is less than 1, the weight of future rewards decreases exponentially; future rewards therefore contribute a smaller share of their numerical value to the cumulative reward, which avoids the problem of infinite rewards.

3.2.4.3 POLICY AND VALUE FUNCTIONS

Reinforcement Learning algorithms need to estimate value functions, that is, functions of states that estimate "how good" it is for the agent to be in a given state, and "how good" it is to perform a given action in that state. The notion of "how good" is defined in terms of the future rewards that can be expected (the expected return). Value functions are defined with respect to particular ways of acting, called policies (SEWAK, 2019).

A policy is a mapping from states to selection probabilities of the possible actions. If the agent follows the policy π at time t, then π(a | s) is the probability that A_t = a if S_t = s. Reinforcement learning methods specify how the agent's policy changes as a result of its experience. For any policy π and any state s, the following consistency condition holds between the value of s and the value of its possible successor states (MITCHELL, 1997) (SUTTON; BARTO, 2018):

V_\pi(s) \doteq E_\pi[G_t \mid S_t = s]   (83)

= E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]   (84)

= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left[ r + \gamma E_\pi[G_{t+1} \mid S_{t+1} = s'] \right]   (85)

= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_\pi(s') \right], \quad \text{for all } s \in S,   (86)

where it is implicit that a \in A(s), that the next state s' \in S, and that the reward r \in R.
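Equation (86) leads directly to iterative policy evaluation: sweep over the states, replacing each value by the right-hand side until the values stop changing. A sketch under the same hypothetical dynamics representation used above; the policy, γ, and convergence threshold are illustrative choices:

```python
def policy_evaluation(p, states, actions, policy, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman expectation equation, Eq. (86),
    sweeping the states until the value function changes less than theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(policy[s][a] * sum(prob * (r + gamma * V[sp])
                                           for (sp, r), prob in p[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Usage with the hypothetical dynamics dict and a uniform random policy.
p = {("s0", "go"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
     ("s1", "go"): {("s1", 0.0): 1.0},
     ("s0", "stay"): {("s0", 0.0): 1.0},
     ("s1", "stay"): {("s1", 0.0): 1.0}}
policy = {s: {"go": 0.5, "stay": 0.5} for s in ("s0", "s1")}
V = policy_evaluation(p, ("s0", "s1"), ("go", "stay"), policy)
```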
Solving a Reinforcement Learning task means finding a policy that achieves a large amount of reward. In a finite MDP, an optimal policy can be precisely defined, because the value function defines a partial order on policies: a policy π is better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states. We denote all optimal policies by π*; they share the same state-value function, called the optimal state-value function v*, defined as (SUTTON; BARTO, 2018):

v_*(s) \doteq \max_\pi V_\pi(s), \quad \text{for all } s \in S.   (87)

Optimal policies also share the same optimal action-value function, denoted q* and defined by:

q_*(s, a) \doteq \max_\pi q_\pi(s, a), \quad \text{for all } s \in S, a \in A(s),   (88)

q_*(s, a) = E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a].   (89)

The different Reinforcement Learning algorithms solve the problems of learning value functions, improving policies, or doing both simultaneously, as in the actor-critic methods.

We have detailed the theoretical foundations of humanoid robots and, in this chapter, the different control strategies. In the next chapter, we summarize the literature on the walking of humanoid robots.

4 HUMANOID ROBOT WALKING APPROACHES

"To know more is to be freer"
César Vallejo

Walking can be defined as "to move or go somewhere by placing one foot in front of the other on the ground, but without running" (OXFORD, 2015). If we adopt this definition, we must consider that the robot must always have at least one foot in contact with the ground during the walking process. However, how can we make a machine perform such movements?

Bipedal locomotion is approached in the literature from various distinct aspects, including walking patterns, control algorithms, types of controllers, control strategies, optimization techniques, and so on. Some of these aspects are discussed in this chapter. Figure 16 presents some of the most relevant aspects found in the literature, considering the state of the art.

Figure 16 - Relevant aspects of bipedal locomotion: type of controller (torque, position, velocity); controller strategy (lower part, whole body, decoupled); control algorithm (quasi-static, dynamic; model-based, model-free); walk pattern (angular trajectory, neuromuscular); optimization techniques (evolutionary techniques, reinforcement learning).