UNIVERSIDADE ESTADUAL PAULISTA “JÚLIO DE MESQUITA FILHO”

João Renato Ribeiro Manesco

3D Human Pose Estimation Based on Monocular RGB Images and Domain Adaptation

Bauru
2023

João Renato Ribeiro Manesco

3D Human Pose Estimation Based on Monocular RGB Images and Domain Adaptation

Dissertação apresentada como parte dos requisitos para obtenção do título de Mestre em Ciência da Computação, junto ao Programa de Pós-Graduação em Ciência da Computação, da Universidade Estadual Paulista “Júlio de Mesquita Filho”.

Área de concentração: Computação Aplicada
Financiadora: FAPESP - Proc. 2021/02028-6 e 2022/07055-4
Orientador: Prof. Associado Aparecido Nilceu Marana
Co-Orientador: Prof. Titular Stefano Berretti

Bauru
2023

Manesco, João Renato Ribeiro.
3D Human Pose Estimation Based on Monocular RGB Images and Domain Adaptation / João Renato Ribeiro Manesco. – Bauru, 2023
74 f. : il., tabs.
Supervisor: Prof. Associado Aparecido Nilceu Marana
Dissertação (mestrado) - Universidade Estadual Paulista “Júlio de Mesquita Filho”, Faculdade de Ciências, Bauru, 2023
1. Estimação de Pose Humana 3D. 2. Adaptação de Domínio. 3. Deep Learning. I. Marana, Aparecido Nilceu. II. Universidade Estadual Paulista “Júlio de Mesquita Filho”, Faculdade de Ciências. III. 3D Human Pose Estimation Based on Monocular RGB Images and Domain Adaptation.
CDU – 518.72:76

João Renato Ribeiro Manesco

3D Human Pose Estimation Based on Monocular RGB Images and Domain Adaptation

Dissertação apresentada como parte dos requisitos para obtenção do título de Mestre em Ciência da Computação, junto ao Programa de Pós-Graduação em Ciência da Computação, da Universidade Estadual Paulista “Júlio de Mesquita Filho”.

Área de concentração: Computação Aplicada
Financiadora: FAPESP - Proc. 2021/02028-6 e 2022/07055-4

Comissão Examinadora

Prof. Associado Aparecido Nilceu Marana
UNESP - Câmpus de Bauru - SP
Orientador

Prof. Titular Hélio Pedrini
UNICAMP - Campinas - SP

Prof. Associado João Paulo Papa
UNESP - Câmpus de Bauru - SP

Bauru, 30 de agosto de 2023

I dedicate this dissertation to my cherished family, friends, and mentors, whose unwavering support has guided me to this milestone.

Acknowledgements

I extend my heartfelt gratitude first and foremost to my family, whose enduring support and assistance have enabled me to conquer this dream.

Furthermore, my sincere appreciation goes to my professors and mentors who have played a pivotal role in my educational journey, particularly Prof. Aparecido Nilceu Marana, my advisor, for his invaluable guidance, support, and kindness throughout this period. I’d also like to thank my colleagues at MICC, as well as my supervisor during my period abroad, Prof. Stefano Berretti, for all the help during my time in Italy.

I wish to express my thanks to my friends for their companionship and unwavering presence during both challenging and joyful moments.

Lastly, my gratitude extends to FAPESP for the sponsorship through Proc. 2021/02028-6 and 2022/07055-4.

Progress is made by trial and failure; the failures are generally a hundred times more numerous than the successes, yet they are usually left unchronicled.
William Ramsay

Resumo

Estimação de poses humanas em imagens monoculares é um importante e desafiador problema de Visão Computacional, cujo objetivo é obter a forma do corpo de um indivíduo baseando-se em uma única imagem. Atualmente, métodos que empregam técnicas de deep learning destacam-se na tarefa de estimação de poses humanas 2D.
Poses 2D podem ser utilizadas em um conjunto diverso e amplo de aplicações, de grande relevância para a sociedade. Entretanto, a utilização de poses 3D pode trazer resultados ainda mais precisos e robustos. Como rótulos referentes a poses 3D são difíceis de serem adquiridos e suas aquisições podem ser realizadas apenas em locais restritos, métodos totalmente convolucionais apresentaram desempenho insatisfatório para a tarefa. Uma estratégia para solucionar este problema consiste em utilizar estimadores de poses 2D, que já se encontram mais consolidados, para estimar poses 3D em duas etapas, a partir de poses 2D. Devido a restrições na aquisição das bases de dados, a melhora de performance desta estratégia só pode ser observada em ambientes controlados, desta forma, técnicas de adaptação de domínio podem ser aplicadas com o objetivo de melhorar a capacidade de generalização dos métodos por meio da inserção de novos ângulos de câmera e ações, advindos de domínios sintéticos. Neste trabalho, propomos um novo método, chamado de Domain Unified Approach (DUA), que visa resolver os problemas causados pela má representação de pose em cenários com domínios distintos, por meio da adição de três novos módulos ao estimador de poses: conversor de pose, estimador de incerteza e classificador de domínio. Treinado com um conjunto enorme de dados sintéticos (SURREAL) e aplicado a um conjunto de dados obtido de um cenário do mundo real (Human3.6M), nosso método DUA levou a uma redução de 44,1 mm no erro médio por posição de junta no espaço 3D, um resultado bastante competitivo com os resultados do estado da arte.

Palavras-chave: Estimação de Poses Humanas 3D, Poses 2D, Adaptação de Domínio.

Abstract

Human pose estimation in monocular images is an important and challenging problem in Computer Vision. Currently, methods that employ deep learning techniques excel in the task of 2D human pose estimation. 2D poses can be used in a diverse and broad set of applications, of great relevance to society. However, the use of 3D poses can bring even more accurate and robust results. Since labels referring to 3D poses are difficult to acquire and can only be obtained in restricted scenarios, fully convolutional methods tend to perform poorly on the task. One strategy to solve this problem is to use 2D pose estimators, already well established in the literature, to estimate 3D poses in two steps using 2D pose inputs. Due to database acquisition constraints, the performance improvement of this strategy can only be observed in controlled environments, therefore domain adaptation techniques can be used to increase the generalization capability of the system by inserting new actions and camera angles from synthetic domains. In this work, we propose a novel method called Domain Unified Approach (DUA), aimed at solving pose misalignment problems on a cross-dataset scenario, through a combination of three modules on top of the pose estimator: pose converter, uncertainty estimator, and domain classifier. Trained on a huge synthetic dataset (SURREAL) and applied to a dataset taken from a real-world scenario (Human3.6M), our DUA method led to a 44.1 mm reduction in mean error per joint position in 3D space, a result quite competitive with state-of-the-art results.

Keywords: 3D Human Pose Estimation, 2D Poses, Domain Adaptation.

List of Figures

Figure 1 – Main models used to represent 2D human poses.
Figure 2 – 2D human poses estimated in an image and represented through the COCO model (LIN et al., 2014).
Figure 3 – Examples of challenges that can be found on the human pose estimation task in real environments (domains).
Figure 4 – Different models used to represent the human pose.
Figure 5 – Usual pipeline of a skeleton-based action recognition method.
Figure 6 – Upper body pose information of a patient positioned in a hospital bed. This information can be used to monitor proper posture and patient activity in the hospital.
Figure 7 – Augmented reality application, in which a pose is applied to a 3D model of a character visualized in a photo.
Figure 8 – Hourglass module. Each one of the blue boxes is a residual module presented below the network.
Figure 9 – Taxonomy proposed in this dissertation to categorize the different two-step 3D pose estimation approaches found in the literature.
Figure 10 – Illustration of the operation of a generic matching technique.
Figure 11 – General scheme of operation of regression methods.
Figure 12 – Residual neural network architecture proposed to solve the two-step 3D pose estimation problem.
Figure 13 – SemGCN network architecture (a), accompanied by the representation of the semantic graph of a pose used as a basis for the development of the task (b).
Figure 14 – General scheme of a self-supervised method based on the error obtained between the 3D pose projection and the input 2D pose.
Figure 15 – Overview of refinement methods, illustrating a situation where both pre-processing and post-processing techniques are applied.
Figure 16 – Example of a domain shift problem. The first graph (left) shows source domain data with a trained classifier highlighted in blue. The second graph (middle) shows the target domain with a trained classifier highlighted in red. In the third graph (right), data from both domains are shown after domain adaptation, with a classifier trained on the common domain.
Figure 17 – The Deep Adaptation Network architecture.
Figure 18 – The Domain Adversarial Neural Network architecture.
Figure 19 – Proposed Domain Unified approach for 3D human pose estimation. The method is composed of three main modules on top of the 3D pose estimator: the unified pose representation module, the uncertainty estimation module, and the domain discriminator. The dashed lines on the pose converter represent frozen weights.
Figure 20 – Five distinct pose representations used by common 3D Human Pose datasets found in the literature.
Figure 21 – Overlapped joints of the Human3.6M dataset coming from two distinct pose representations, SMPL (red) and the original H3.6M format (black).
This makes explicit the difference in the pose representations being used by common 3D human pose datasets in the literature.
Figure 22 – Pose conversion method used to find a unified pose representation.
Figure 23 – Uncertainty-based 3D human pose estimation method devised using Martinez et al. (2017) as backbone.
Figure 24 – Example images found in the SURREAL dataset.
Figure 25 – Example images found in the Human3.6M dataset.
Figure 26 – Overlapping of the different pose representations available: Human3.6M (red), SMPL ground-truth (blue), converted pose (black). Item (a) shows a Human3.6M (red dots) pose superimposed on the original SMPL pose (blue dots). Item (b) shows the resulting pose after conversion (black dots) superimposed on the original SMPL pose (blue dots). Item (c) shows the converted pose (black dots) superimposed on the original Human3.6M pose (red dots).
Figure 27 – Qualitative results obtained from our proposed approach on the Human3.6M dataset.
Figure 28 – Poses in which the method achieved the worst performance per scenario.

List of Tables

Table 1 – MPJPE and P-MPJPE measures, by groups of actions, of the SMPL model without pose conversion (right) and with pose conversion (left), when overlapped to the original Human3.6M pose. Error values in millimeters (mm) - the lower the better.
Table 2 – MPJPE and P-MPJPE measures, per joint, of the SMPL model without pose conversion (right) and with pose conversion (left), when overlapped to the original Human3.6M pose. Error values in millimeters (mm) - the lower the better.
Table 3 – MPJPE measures obtained from the 3D Human Pose considering all of the established scenarios with distinct 2D Pose sources: Stacked Hourglass (HG) (denoted by ∗), Cascaded Pyramid Networks (CPN) (denoted by †), and Ground Truth (GT) (denoted by ‡). Experiments were performed using a linear backbone (indicated by ⋆) and a graph-based backbone (indicated by §). The best results are presented in bold. Error values in millimeters (mm) - the lower the better.
Table 4 – Results from the MPJPE metric (mm - the lower the better) obtained from different domain adaptation scenarios.
Table 5 – Quantitative results obtained on the H3.6M → SURREAL evaluation. Table results and layout are obtained from experiments conducted by Zhang et al. (2021) and Kundu et al. (2022); bold indicates the best result.
Table 6 – Explicit ablation results of our method evaluated on the Human3.6M → SURREAL evaluation setting.
List of abbreviations and acronyms

AMASS – Archive of Motion Capture As Surface Shapes
CPN – Cascaded Pyramid Networks
CVAE – Conditional Variational Auto Encoder
DA – Domain Adaptation
DAN – Deep Adaptation Networks
DANN – Domain Adversarial Neural Networks
DDC – Deep Domain Confusion
DUA – Domain Unified Approach for 3D Human Pose Estimation
GCNs – Graph Convolutional Networks
GRL – Gradient Reversal Layer
GT – Ground Truth
H3.6M – Human3.6M
HG – Stacked Hourglass
HOG – Histograms of Oriented Gradients
LCNs – Localized Neural Networks
MK-MMD – Multi-Kernel Maximum Mean Discrepancy
MoSh – Motion and Shape capture
MPJPE – Mean Per-Joint Position Error
OOD – Out-of-Distribution
P-MPJPE – Procrustes-Aligned Mean Per-Joint Position Error
RegDA – Regressive Domain Adaptation for Unsupervised Keypoint Detection
RSD – Representation Subspace Distance
SemGCN – Semantic Graph Convolutional Networks
SMPL – Skinned Multi-Person Linear Model

Contents

1 INTRODUCTION
1.1 Problem
1.2 Objective
1.2.1 Specific Objectives
1.3 Hypothesis
1.4 Contributions
1.5 Document Organization
2 HUMAN POSE
2.1 Human Pose Definition and Representation
2.2 Applications of Human Pose Estimation
2.2.1 Action Recognition
2.2.2 Healthcare
2.2.3 Augmented and Virtual Reality
2.3 Pose Estimation Approaches
2.3.1 Single Person Approach
2.3.1.1 Regression-based Methods
2.3.1.2 Detection-based Methods
2.3.2 Multiple People Approach
2.3.3 3D Pose Estimation
2.4 Proposed Taxonomy for 3D Human Pose Estimation Methods
2.4.1 Matching Techniques
2.4.2 Regression Techniques
2.4.2.1 Image-based Regression Techniques
2.4.2.1.1 Graph Neural Networks
2.4.2.2 Video-based Regression Techniques
2.4.2.3 Self-Supervised Methods
2.4.3 Refinement-based techniques
3 DOMAIN ADAPTATION
3.1 Motivation and Definition
3.2 Deep Domain Adaptation
3.2.1 Deep Adaptation Network
3.2.2 Domain Adversarial Neural Network
3.3 Regressive Domain Adaptation
4 RELATED WORK
4.1 3D Human Pose Estimation in Cross-Domain Scenarios
5 PROPOSED METHOD
5.1 DUA - Domain Unified Approach
5.2 Unified Pose Representation
5.3 Pose Uncertainty
6 EXPERIMENTAL SETTINGS
6.1 Hardware Configuration
6.2 Hyperparameter Values
6.3 Preprocessing
6.4 Evaluation Protocol
6.5 Datasets
6.6 Metrics
7 RESULTS AND DISCUSSION
7.1 Unified Pose Representation
7.2 Pose Uncertainty
7.3 Domain Unified Approach
8 CONCLUSION AND FUTURE WORK
8.1 Contributions
8.2 Future Work
8.3 Published Articles
BIBLIOGRAPHY

1 Introduction

This chapter aims to introduce the task of 3D Human Pose Estimation, addressing current challenges and issues found in the existing literature, in order to properly describe the research that has been done and our contributions to the state of the art.

Human pose estimation is an essential and challenging computer vision problem. Its objective is to estimate the human body shape (pose) based on a single image, usually monocular. This shape can be inferred by the detection of joints in a skeleton, which are connected in such a way that each connection represents a part of the human body (ZHENG et al., 2020).

Currently, methods based on deep learning techniques that follow a bottom-up approach excel in the task of 2D pose estimation for images full of people, presenting good accuracy in real-time scenarios. Among these methods, we can cite OpenPose (CAO et al., 2019) and PifPaf (KREISS et al., 2019). For the 2D pose representation, a few models have been proposed, among which we can find MPI (INSAFUTDINOV et al., 2016), COCO (LIN et al., 2014), and BODY 25 (CAO et al., 2019). Figure 1 shows the main models used to represent 2D poses and Figure 2 shows examples of 2D human poses estimated in an image and represented using the COCO model (LIN et al., 2014).

Figure 1 – Main models used to represent 2D human poses. (a) MPI. (b) COCO. (c) BODY 25. Source: Reprinted from Silva and Marana (2020), Copyright 2020, with permission from Elsevier.
Figure 2 – 2D human poses estimated in an image and represented through the COCO model (LIN et al., 2014). Source: Cao et al. (2019), © 2019 IEEE.

2D poses can be employed in a diverse and vast set of applications of major relevance to society, among which we can mention: crowd control, action recognition, person identification, medical aid for therapies and sports analysis, human-computer interaction, augmented and virtual reality, and pedestrian location for autonomous cars (CHEN et al., 2020).

However, the usage of 3D poses can bring even more accurate and robust results, as seen in the 2D pose-based methods proposed by Silva and Marana (2020) and Jangua and Marana (2020), which, despite showing good results for action recognition and gait recognition, respectively, rely on proper camera positioning to achieve success. If these methods used 3D poses in their pipelines, this restriction would be minimized and their performance would be even better.

There are a few ways to approach the 3D human pose estimation problem, for example by using depth sensors, infrared sensors, radio sensors, or even multiple camera perspectives with pose triangulation. However, these solutions end up being costly to implement or restricted to highly controlled environments (CHEN et al., 2020). Besides those restrictions, with the growing number of digital cameras shipped in mobile devices, such as smartphones and webcams, the need to approach the 3D human pose estimation problem using monocular RGB images emerges.

The usage of a single RGB camera introduces several challenges to the problem of human pose estimation, such as the occurrence of occlusions and the lack of full-body images of some individuals. Furthermore, variations in clothing, body type, and camera angle can have a negative impact on the performance of the methods (BARTOL et al., 2020). In Figure 3, it is possible to notice some of the challenges found in the problem of human pose estimation from images captured in real and non-controlled environments (domains). These challenges become even more significant when methods based on a single RGB image are used.

Figure 3 – Examples of challenges that can be found on the human pose estimation task in real environments (domains). Source: Chen et al. (2020), Copyright 2020, with permission from Elsevier.

The referred challenges are further aggravated when the objective of the analysis is 3D poses, since the majority of datasets used for training on this task are obtained in controlled environments, through the usage of motion capture systems (DOERSCH; ZISSERMAN, 2019), which decreases the amount of relevant data available, making it very difficult to apply the methods in real environments. In these cases, there is a significant disparity between the domain used for training and the real domain in which the methods are going to be applied.

The development of new methods and the improvement of techniques in the area of Computer Graphics and Computer Systems have enabled the emergence of animations and games with photorealistic environments and characters, as is the case of the game Grand Theft Auto V, from the company Rockstar North (https://www.rockstargames.com/br/games/V). This kind of development enables the creation of synthetic datasets through legal modifications in the game source code, such as the Joint Track Auto (JTA) dataset (FABBRI et al., 2018), allowing large sets of accurately labeled images to be obtained in a wide variety of environments, with different camera angles and different times of the day.
The usage of such diverse synthetic databases can contribute significantly to the improvement of 3D human pose estimation methods. Despite the feasible contribution of synthetic data in the learning process of the task, its introduction can lead to a problem called domain shift, which occurs when the probability distributions of training and evaluation data are different (KOUW; LOOG, 2018). In this case, distinctions in actions, environments, and camera angles between synthetic and real domains can directly impact the efficacy of the method; thus, the usage of domain adaptation techniques as a way to mitigate the domain shift problem between datasets can be of great importance. Domain adaptation can be defined as the ability to apply an algorithm trained in one or more source domains to a different but related target domain (CSURKA, 2017; WANG; DENG, 2018; WEN et al., 2018).

1.1 Problem

According to Martinez et al. (2017), state-of-the-art 3D two-step pose estimation techniques tend to perform better than their end-to-end counterparts. Even so, recent methods have reported high levels of overfitting regarding camera angles in frequently used databases, thereby impacting the performance of real applications (WEI et al., 2019; VÉGES et al., 2019; XU et al., 2020; XU et al., 2021; CHEN et al., 2021). Furthermore, evaluation protocols aimed at addressing these issues are rarely used in the literature, such that the most frequently used workaround is data augmentation across different camera angles.

The presence of overfitting in this type of scenario can be attributed to the fact that 3D pose annotations need to be captured in restricted and controlled environments, which implies a limitation in the set of actions and scenarios represented by the methods. An alternative, and arguably more reliable, solution to this problem would be to include synthetic datasets during the training step, which expand the number of available poses, scenes, and camera angles, providing a way to deal with the overfitting problem caused by the limitations of a single database. As emphasized in Chapter 4, a few works seek to follow this strategy to assist the learning process.

This type of solution, however, ends up requiring the manipulation of data from different domains (real and synthetic), which can be a challenging task due to possible differences in the distributions of both databases, leading to a problem called domain shift (CSURKA, 2017). Although domain adaptation methods are capable of handling the problem of domain shift in several types of situations, a major drawback can be noticed: the concept of domain adaptation, as well as the methods that arise from this concept, were all intended to work in a classification scenario (JIANG et al., 2021).

Therefore, in this work, we seek to address the problem of incorporating data from different domains by combining domain adaptation with techniques that address the representation problems caused by the usage of different sensors in the 3D pose acquisition phase.

1.2 Objective

The primary objective of this research is to improve the existing methods for estimating 3D poses in monocular RGB images. To accomplish this, two properly labeled datasets composed of 3D human poses representing two distinct domains, real-world and synthetic, are used in conjunction with domain adaptation techniques in order to facilitate effective knowledge transfer between the two distinct domains.
Estimating 3D poses in monocular RGB images is a significant challenge, as it requires inferring depth information from a single image. Various approaches have been proposed to address this issue, but in this study, our focus will be on methodologies that utilize 2D poses as a foundation for 3D regression, as detailed in Section 2.3.3.

We found out, during the evaluation of the proposed methods, that treating the problem in a cross-domain scenario introduces many problems alongside the domain shift, such as the misrepresentation of poses between domains, as well as the error propagation along the edge joints. To deal with these issues, we proposed a domain unified approach for 3D human pose estimation, aimed at dealing with all the aforementioned problems in a cohesive way.

1.2.1 Specific Objectives

Besides the primary objective, this work has the following specific goals:

• Review and analyze the state-of-the-art techniques for estimating 3D poses in monocular RGB images;
• Assemble and fetch real-world and synthetic datasets to work in this cross-domain scenario;
• Investigate and implement domain adaptation techniques to enable the effective transfer of knowledge from synthetic to real domains;
• Propose a way to deal with the pose misrepresentation of joints between domains;
• Investigate the presence of domain shift in the aforementioned problem;
• Evaluate the performance of the proposed method on benchmark datasets, comparing it to existing methods, and quantifying the improvements achieved;
• Conduct experiments to assess the model efficacy and generalizability across diverse scenarios;
• Investigate the limitations of the proposed method and identify potential areas for further improvement and research.

1.3 Hypothesis

Currently, several challenges can be found in the field of 3D human pose estimation, such as pose misrepresentation, the propagation of errors on extreme kinematic joints, and the emergence of domain shift when integrating data from diverse sources. Therefore, the core hypothesis of this work is that a unified approach, combining solutions related to pose conversion, uncertainty estimation, and domain adaptation, has the potential to achieve more robust pose estimation results, even in out-of-distribution scenarios.

1.4 Contributions

To date, few studies have investigated the direct use of domain adaptation methods to assist in the estimation of 3D poses from 2D poses in a monocular environment; moreover, working with multiple datasets introduces so many complications that current solutions are unable to address all the nuances of cross-domain 3D human pose estimation. The following issues were identified as exacerbating the domain discrepancy in the task:

• The utilization of diverse body capture sensors across distinct datasets, which leads to distinct pose representations and, consequently, misalignment between joints;
• Distinct domains frequently exhibit misalignment in their camera and action distributions, which can impact the accuracy and robustness of 3D human pose estimation;
• The propagation of errors on the edge kinematic groups, namely the arms and legs, resulting in a substantial increase in the overall error.
In order to address the aforementioned problems, and subsequently mitigate the domain discrepancy, this study brings three main contributions to the pose estimation procedure:

• The introduction of a pose conversion technique aimed at achieving a Unified Pose Representation to overcome the observed differences between capture sensors;
• An enhancement and evaluation of the pose estimation training pipeline through the development of a novel uncertainty-based method;
• The creation of a domain adaptation model based on adversarial networks for 3D human pose estimation.

1.5 Document Organization

Besides this introductory chapter, this document also features the following chapters:

Chapter 2: This chapter provides a literature review on 3D Human Pose Estimation. Additionally, a taxonomy is proposed to facilitate the systematic discussion and definition of current works.

Chapter 3: Provides a concise literature review on domain adaptation, laying the foundation to connect domain adaptation and 3D human pose estimation, and providing a formal definition of the domain-adaptive 3D human pose estimation problem.

Chapter 4: Introduces relevant works that directly address our research problem, specifically focusing on cross-domain evaluation in 3D human pose estimation and issues related to pose misrepresentation.

Chapter 5: Presents the Domain-Unified Approach (DUA) method proposed to solve the 3D human pose estimation task in a cross-domain scenario.

Chapter 6: Introduces the evaluation protocol, as well as the databases and metrics used for evaluation.

Chapter 7: This chapter is dedicated to presenting the experiments and their corresponding results related to the proposed method.

Chapter 8: Concludes the work and introduces directions for future works.

2 Human Pose

This chapter provides an introduction to the human pose research problem, including the definition of human pose, the ways human poses have been represented, and some applications of human pose. Additionally, it presents a literature review on 3D Human Pose Estimation methods and the taxonomy proposed, in this dissertation, to facilitate the systematic discussion and definition of current works.

2.1 Human Pose Definition and Representation

According to Stamou et al. (2005), a human pose can be described as an articulated body, that is, an object composed of a set of rigid parts connected through joints, which allow the execution of translational and rotational movements in six degrees of freedom. Therefore, human pose estimation is a problem that aims to find a particular pose P in a space Π that contains all possible articulated poses. In the context of RGB images, the objective is to extract a set of features from the image that represent each joint of the human body.

There are several ways to represent a human pose, as shown in Figure 4, each one with a specific purpose. The most commonly used model in tasks of 2D and 3D pose estimation is the kinematic model, as it is a flexible and intuitive model to use (ZUFFI et al., 2012); however, this model lacks shape and texture information. Planar models are used when the human body silhouette and its respective deformations are relevant, enabling the acquisition of shape data. Finally, volumetric models are generally used when the reconstruction of three-dimensional body models is desired (ZHENG et al., 2020).
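As a concrete illustration of the kinematic model described above, a pose can be stored as an array of joint coordinates together with a parent table encoding the skeleton hierarchy, so that each (joint, parent) pair corresponds to one rigid part. The sketch below is illustrative only: the 17-joint layout, joint names, and parent indices are hypothetical and do not follow the exact convention of any specific dataset.

```python
import numpy as np

# Hypothetical 17-joint kinematic skeleton: a (J, 3) coordinate array plus
# a parent table; each (joint, parent) pair is one rigid part of the body.
JOINT_NAMES = [
    "pelvis", "r_hip", "r_knee", "r_ankle", "l_hip", "l_knee", "l_ankle",
    "spine", "thorax", "neck", "head",
    "l_shoulder", "l_elbow", "l_wrist", "r_shoulder", "r_elbow", "r_wrist",
]
# PARENTS[i] is the index of the joint that joint i articulates from;
# the root (pelvis) has no parent, marked with -1.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def bone_lengths(pose3d: np.ndarray) -> np.ndarray:
    """Length of every rigid segment of a (J, 3) pose, skipping the root."""
    return np.array([
        np.linalg.norm(pose3d[j] - pose3d[PARENTS[j]])
        for j in range(len(PARENTS)) if PARENTS[j] >= 0
    ])
```

Representations of this kind are what the estimation methods discussed in the remainder of this chapter consume and produce.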
Figure 4 – Different models used to represent the human pose. (a) Kinematic. (b) Planar. (c) Volumetric. Source: Chen et al. (2020), Copyright 2020, with permission from Elsevier.

2.2 Applications of Human Pose Estimation

Human pose estimation is a fundamental and traditional computer vision problem, with the potential to serve as a crucial foundation for solving many challenges in diverse domains. In this section, we aim to illustrate the practical applications of human poses across a wide spectrum of fields. By showcasing the versatility and relevance of human pose estimation, we intend to provide an understanding of its importance in various domains.

2.2.1 Action Recognition

Pose information has traditionally been used as input data for training action recognition models, due to the movement information it carries, as well as for being a low-dimensional piece of data. Earlier works used 3D skeleton information to map 3D information into Lie algebra vector spaces in order to be able to classify actions through this new representation (VEMULAPALLI et al., 2014). By using 3D poses, Devanne et al. (2014) represent actions as shape trajectories in a Riemannian manifold, so that kNN can be performed on this manifold for action classification. A general pipeline of action recognition can be seen in Figure 5.

Figure 5 – Usual pipeline of a skeleton-based action recognition method. Source: Sarker et al. (2021), reproduced with permission from Springer Nature.

Kim and Reiter (2017) employ skeleton-based action recognition through a Temporal Convolutional Neural Network. Angelini et al. (2018), on the other hand, perform action recognition using an LSTM on 2D poses obtained via OpenPose. Another usage of skeleton-based action recognition is given by Belluzzo and Marana (2022), who project the skeleton joints on a black background and use the resulting images as a basis for action recognition. Silva and Marana (2020) map the bone segments of the pose to points in the parameter space and encode this set of points in a bag-of-poses classifier, used to perform action recognition.

Some authors try to employ traditional human pose estimation techniques to perform feature extraction for the action recognition task, such as Duan et al. (2022), in which the authors represent the skeletons as 3D volume heatmaps, instead of graphs, and perform convolutions upon these heatmaps. As the extracted poses can be represented through a graph between the joints, the usage of pose information also enables the development of models based on graph neural networks as a way to model the neighboring relationship between joints. One work that exploits this is the one proposed by Shi et al. (2022), in which the authors address the problem using a pose-based graph convolutional network to encode the body part features.

Action recognition by itself has many applications; however, the reliance of certain action recognition methods on 3D poses, and the decrease in efficacy observed in 2D-based methods when changes in orientation occur, increase the necessity of a 3D pose estimator capable of working in a variety of scenarios.

2.2.2 Healthcare

The usage of human poses in healthcare can provide quantitative and qualitative information regarding human motion in certain areas of public health evaluation and treatment. An example of that is the work proposed by Lu et al.
(2020), in which several 3D body skeletons are obtained and tracked in order to investigate the motor severity of Parkinson's Disease based on the patients' gait. Not only can movement be captured: pose estimation can also help in finding reliable posture information for patients in clinical hospital environments, by monitoring patient activity in the hospital bed (CHEN et al., 2018). An example of this case can be seen in Figure 6, where the upper body pose of a patient was extracted for further evaluation.

Another application of human poses in this setting is provided by Gu et al. (2019), where a system based on computer vision is proposed as a way to perform physical therapy at home. This system, called ExerciseCheck, makes it possible for the person performing the exercise to compare their own movement with the desired movement recorded in the software.

Lastly, some works employ human poses in fall detection monitoring of elderly people, in order to provide immediate assistance. Chen et al. (2020) do that by analyzing information on the external (bounding) rectangle of the predicted pose. Meanwhile, Alaoui et al. (2021) map the skeletons to a Riemannian manifold of semi-definite matrices in order to establish a dissimilarity measure between fallen skeletons, so that an SVM can classify the fall.

Figure 6 – Upper body pose information of a patient positioned in a hospital bed. This information can be used to monitor proper posture and patient activity in the hospital. Source: Chen et al. (2018), © 2018 IEEE.

2.2.3 Augmented and Virtual Reality

Human pose information can be used as a way to enhance the immersion of the interaction between real and digital objects in augmented and virtual reality scenarios. Stübl et al. (2023), for example, create an industrial augmented reality application in the domain of furniture production in order to assist factory workers with quality inspection.

Weng et al. (2019) enable the animation of a character obtained from a single photo, right on the palm of your hand, by performing pose estimation and motion transfer on the animated character, in such a way that, by introducing a pose and a rigid body in the character, it is able to move in an augmented reality scenario, as illustrated in Figure 7, where a 3D model of the character in the painting starts to run in the direction of the person.

Figure 7 – Augmented reality application, in which a pose is applied to a 3D model of a character visualized in a photo. Source: Weng et al. (2019), © 2019 IEEE.

Lastly, another augmented reality application was proposed by Zhang et al. (2021a), in which, by retrieving information from professional tennis matches, a set of video sprites containing the players and ball trajectories is generated. The correct player actions are generated based on a pose retrieval system, in which the most adequate pose for a certain scenario is used to select the sprites.

2.3 Pose Estimation Approaches

Early approaches tried to solve the pose estimation problem by employing traditional image processing techniques, such as the Yang and Ramanan (2013) approach, which uses histograms of oriented gradients (HOG) to create a set of features that identify each body part. However, this kind of approach proved to be inadequate to solve the problem accurately in real environments.
The development of deep learning techniques and the emergence of convolutional neural networks have pushed the area of pose detection based on monocular images forward, so that new methods started to show excellent results and surpass the traditional techniques (DANG et al., 2019).

There are various ways to approach the pose estimation problem through deep neural networks. Chen et al. (2020) classify those approaches according to the variations present when modeling the problem; these variations can appear in the form of the use or not of a pose model to obtain new poses, the organization of the neural network, or the number of people involved in the scene.

Regarding the usage of poses, methods can be classified as generative or discriminative. Generative methods use the joint information on the labels to find a new set of poses and, during training, use this information to find viable joint positions in images. Discriminative methods, on the other hand, aim to learn a function that maps a person to a given pose space without knowledge of the pose models and, based on that function, select a pose or a set of poses from a dictionary to represent the novel pose.

The neural network architecture is also a criterion for classifying pose estimation techniques, which are divided into single-stage (methods that use a single end-to-end architecture) and multiple-stage (methods that split the task between multiple networks). Estimation of 3D poses with multiple individuals is an example of a multi-stage method, where up to three networks can be defined: one to detect the individuals, one to find their 2D poses, and finally, one to project the 2D poses into the 3D space.

In general, it is necessary to distinguish the approaches into two categories: those that deal with a single person in the image and those that deal with multiple people.

2.3.1 Single Person Approach

In this case, the problem refers to the situation where there is only one person in the scene, in a well-defined region, simplifying the task. If there are more people in the scene, it is necessary to perform a preprocessing step to isolate only one person in the image. In this setting, there are two types of methods commonly employed to solve the problem: methods based on regression and methods based on detection.

2.3.1.1 Regression-based Methods

Regression-based methods seek to find all the joints of the human body in an image through end-to-end networks. Toshev and Szegedy (2014) were the first to approach this problem using an AlexNet. A problem found in this approach was that using only raw information about the joint positions was insufficient, since it did not take into account the information in the neighbourhood of the joint; therefore, the supervision was converted to use heat maps representing the probable neighbourhoods of the joints. The problem of using only heat maps is that, depending on their resolution, the information regarding joints may end up being inaccurate in the decision process. To deal with this problem, Nibali et al. (2018) propose a numerical transform to calculate joint coordinates using heat maps obtained from a neural network.

2.3.1.2 Detection-based Methods

Detection-based methods aim to perform the detection of the body segments separately, so that the body parts found are later organized to represent the human body. These kinds of methods are usually more susceptible to variations in background complexity and occlusion.
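Both the heat map supervision mentioned above (and the joint-centered Gaussian channels described next) and the differentiable numerical transforms that recover coordinates from heat maps can be illustrated with a short sketch. This is a minimal numpy illustration of the general idea, in the spirit of soft-argmax layers; the resolution, sigma, sharpness beta, and joint position are arbitrary choices, not values taken from the cited works.

```python
import numpy as np

def joint_heatmap(h, w, cx, cy, sigma=2.0):
    """Ground-truth channel for one joint: a Gaussian centered at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def soft_argmax(heatmap, beta=30.0):
    """Expected (x, y) under a sharpened softmax of the heat map: a
    differentiable substitute for the hard argmax."""
    p = np.exp(beta * (heatmap - heatmap.max()))
    p /= p.sum()                                 # softmax over all pixels
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    return (p * xs).sum(), (p * ys).sum()        # expected coordinates

hm = joint_heatmap(64, 64, cx=40.2, cy=21.7)
print(soft_argmax(hm))  # close to (40.2, 21.7)
```

Because the expectation is differentiable, a network supervised through such a transform can be trained end-to-end on coordinates while still predicting heat maps internally.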
In order to improve the training process of convolutional neural networks, Jain et al. (2015) propose the usage of a heat map channel related to a joint-centered Gaussian distribution; this way, each joint has its own heat map. Since then, most detection methods have worked with heat maps. Furthermore, Luvizon et al. (2019) propose a soft-argmax function with the goal of converting detection-based networks into regression-based networks, making an end-to-end network suitable for use in detection environments.

Some of the detection-based approaches still use traditional architectures, like GoogLeNet (RAFI et al., 2016). Another architecture that has become prominent recently, exhibiting good results in the pose estimation task, is the stacked hourglass architecture, proposed by Newell et al. (2016). This kind of architecture works by employing a set of cascaded pooling and upsampling layers in order to extract information at various scale ranges, thus obtaining more accurate information about the orientation and position of each joint. An example of an hourglass model can be seen in Figure 8.

Figure 8 – Hourglass module. Each one of the blue boxes is a residual module presented below the network. Source: Misra et al. (2020), licensed under CC BY 4.0.

2.3.2 Multiple People Approach

The presence of multiple people in an image can significantly increase the complexity of the problem, requiring either preprocessing to isolate people in the image or architectures that are able to identify individuals in images; even then, problems involving interaction between individuals can still occur.

One approach to the detection of subjects in images is called the top-down approach. In this case, high-level abstractions are used first to detect the individuals and, from that point, pose estimation is applied to each one of the subjects. Another approach is called bottom-up, in which, instead of detecting the subjects directly, the objective is to find the joints and limbs of all the subjects and then group them by clustering the joints. This kind of approach can have major problems when there are people interacting very close together in an image.

2.3.3 3D Pose Estimation

Although there are commercial solutions that address the 3D pose estimation problem, these solutions work mostly in restricted environments, as is the case of those based on the Kinect, which has a depth sensor, or of those that use markers for body detection (CHEN et al., 2020), which ends up being quite restrictive. Therefore, there is a need to propose more flexible solutions to 3D pose estimation, which can be used in uncontrolled scenarios, preferably using a low-cost, easily accessible monocular RGB camera.

With the emergence of deep learning and convolutional neural networks, the performance of the methods began to improve. An initial solution involving end-to-end networks for 3D pose regression was proposed; however, this type of solution is difficult to employ in scenarios different from those used in training, as the databases for 3D pose estimation need to be captured in controlled environments, decreasing the diversity of the data and impacting the generalization ability of the networks (LI; CHAN, 2015).
With the development and popularization of 2D pose estimation techniques achieving surprising results, such as OpenPose (CAO et al., 2019), PifPaf (KREISS et al., 2019), and Stacked Hourglass (NEWELL et al., 2016), approaches that seek to take advantage of the maturity of 2D pose estimators for 3D pose estimation have been gaining popularity. This is done by performing a two-step pose estimation process: a first step responsible for obtaining valid 2D poses, and a second step in which a 3D pose is retrieved from the 2D pose. The idea is that 2D pose estimation techniques, whose labels are easier to obtain in a diverse range of situations, can help to ease the labor of obtaining 3D data and can provide accurate enough information for the 3D pose lifting (MARTINEZ et al., 2017).

This is, however, a difficult problem, since depth information, which is already scarce in images, is lost with the 2D pose, and the problem itself is non-invertible, as more than one 3D pose may be represented by the same 2D projection.

Despite being a difficult problem, two-step 3D human pose estimation has been successfully accomplished in the recent literature. Moreno-Noguer (2017) was one of the first to achieve that, by creating Euclidean distance matrices representing the relationship between different joints in a spatial context, making it possible to achieve results comparable to those of end-to-end networks using only information found in 2D poses. Meanwhile, Martinez et al. (2017) go further and show that, through proper preprocessing of the poses, a simple residual architecture is able to estimate 3D poses with higher accuracy than end-to-end methods using only a 2D pose as input.

2.4 Proposed Taxonomy for 3D Human Pose Estimation Methods

Aiming at a better understanding of 3D pose estimation techniques based on 2D poses, we proposed, through an analysis of the literature, a taxonomy to categorize the different methods according to key characteristics shared among them. This taxonomy is presented in Figure 9. In general, three major types of 3D pose estimation techniques based on 2D poses have been identified in our analysis by clustering similar techniques: pose refinement-based techniques, pose regression techniques, and matching techniques. After analyzing each method's key features, we can formulate proper definitions for them, described in the following paragraphs.

Figure 9 – Taxonomy proposed in this dissertation to categorize the different two-step 3D pose estimation approaches found in the literature. Source: Manesco and Marana (2022). Reproduced with permission from Springer Nature.

Refinement techniques, as established by Wang et al. (2019a), are defined by a pre-processing or post-processing step on the pose dataset, in order to find a common space or perspective for the poses, or by a refining step applied to the 3D poses already predicted, based on joint error, a frame sequence, or bone consistency, improving the accuracy of an already established 3D pose estimator.
Matching techniques, such as the one proposed by Chen and Ramanan (2017), are characterized by the creation of a dictionary with poses from the training set, in such a way that the 2D pose input is matched with the most similar pose found in the dictionary, and the corresponding 3D pose is chosen as a pose candidate. Matching techniques can operate in a partial context, which divides the pose dictionaries by kinematic groups, or in a complete context, where the whole body is used as input. In addition, they can work in a discriminative context, whose objective is only to obtain the desired pose, or in a generative context, which aims to create an artificially generated dictionary to assist the matching.

Lastly, regression-based techniques, such as Martinez et al. (2017), consist of using mechanisms, such as neural networks, to recover a 3D pose from the input 2D pose, in a regression learning context. Regression techniques can operate in an image-based analysis, i.e., a single frame is considered for the evaluation, or in a video-based analysis, where a sequence of frames is observed. Finally, regression-based techniques are divided according to both task and supervision criteria. Regarding the task criterion, regression-based techniques resemble matching techniques, in that there is a discriminative context, whose goal is to regress the 3D pose, and a generative context, which can operate through data augmentation mechanisms or in adversarial generative scenarios aimed at generating new poses and increasing the generalization capacity of the system. As for the supervision criterion, we can classify the techniques as fully supervised, where 3D ground truth data is used during the regression step, or self-supervised, where only the 2D pose information is considered, making use of other types of information as the label, such as the projection of the predicted 3D pose onto a 2D camera.

2.4.1 Matching Techniques

Matching techniques work on top of a dictionary created from the training dataset. One of the pioneering works in this approach was that of Chen and Ramanan (2017), which proposes a discriminative method consisting of creating a dictionary of 2D projections from the training set. The method operates by searching for the 2D projection with the highest probability of representing the input 2D pose and relating it to the target 3D pose. Figure 10 illustrates the operation of a generic matching technique.

Figure 10 – Illustration of the operation of a generic matching technique. Source: The Author.

An alternative approach works in a generic generative scheme that can accommodate various types of matching techniques. The approach consists of augmenting the dataset based on a set of anatomical constraints, upon which traditional matching techniques can be employed (JAHANGIRI; YUILLE, 2017). Localized approaches, such as the one proposed by Yang et al. (2019), work by dividing the pose matching into upper and lower kinematic groups, which are then used to populate the 2D pose dictionary through several camera perspectives, by combining different groups. Other locality-based methods further subdivide the local kinematic groups, aiming to divide the task into different local-matching problems (ZHOU et al., 2020). A different approach works in an unsupervised scenario, where a sparse representation as a linear combination of pose bases is learned before the matching occurs (JIANG et al., 2019).
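A minimal sketch of the discriminative matching idea described in this subsection: pick the dictionary entry whose stored 2D projection is closest to the query and return its paired 3D pose. The array shapes and the plain mean Euclidean matching score are illustrative assumptions, not the exact formulation of any cited method.

```python
import numpy as np

def match_pose(query2d, dict2d, dict3d):
    """Nearest-neighbour matching: query2d (J, 2) against a dictionary of
    N stored 2D projections dict2d (N, J, 2) paired with 3D poses dict3d
    (N, J, 3); returns the 3D pose of the closest 2D entry."""
    # Mean per-joint Euclidean distance to every dictionary projection.
    dist = np.linalg.norm(dict2d - query2d[None], axis=-1).mean(axis=-1)
    return dict3d[np.argmin(dist)]

# Toy usage with random poses standing in for a real dictionary.
rng = np.random.default_rng(0)
dict2d = rng.normal(size=(100, 17, 2))
dict3d = rng.normal(size=(100, 17, 3))
candidate = match_pose(dict2d[42] + 0.01, dict2d, dict3d)  # recovers entry 42
```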
2.4.2 Regression Techniques

Regression techniques, on the other hand, seek to estimate three-dimensional coordinates through mechanisms such as neural networks, using a 2D pose as input. There are several approaches to regression methods, in which temporal information or 3D labels may or may not be used during learning. Figure 11 illustrates the general processing scheme of regression methods.

Figure 11 – General scheme of operation of regression methods. Source: The Author.

2.4.2.1 Image-based Regression Techniques

Image-based methods aim to perform human pose estimation based on a single frame or 2D input pose. Moreno-Noguer (2017) was one of the pioneers in this area, through Euclidean distance matrices used as input to neural networks. The method was distinguished by its performance, comparable to the end-to-end methods observed at the time.

Another relevant work is the one proposed by Martinez et al. (2017), in which the authors create a baseline for pose estimation consisting of a simple residual neural network combined with suitable processing of the poses, achieving powerful results and outperforming the end-to-end approaches of the time. This processing, considered standard for several subsequent methods, consists in multiplying the 3D poses by the inverse of the extrinsic camera parameter matrix, aiming at representing the pose in a canonical space. Figure 12 shows the residual neural network architecture proposed to solve the two-step 3D pose estimation problem.

Figure 12 – Residual neural network architecture proposed to solve the two-step 3D pose estimation problem. Source: Adapted from Martinez et al. (2017), © 2017 IEEE.

Other methods are also inspired by this architecture, such as Véges et al. (2019), which uses it in a siamese neural network scheme aiming to find rotation-equivalent representations, i.e., where the rotation projection matrix has the same values, such that the poses are represented in a universal space before being estimated. Pavlakos et al. (2018) treat the lack of depth information through an ordinal relationship, modeled through a neural network. Another line of methods aims to diversify the set of predicted poses through the usage of a Gaussian mixture model, used to predict more than one valid pose. Xu et al. (2021) utilize pose grammar and data augmentation to deal with the problem. Another fresh perspective on the problem deals with a bone representation, instead of 2D joints, to estimate the 3D poses (WEI et al., 2021). Li and Lee (2019), on the other hand, employ a Gaussian mixture model, aiming to find the parameters of the distribution of the model in M Gaussian kernels; this is done through a deep neural network and allows the generation of multiple pose hypotheses that satisfy the problem, enabling the choice of the best pose among those available.

2.4.2.1.1 Graph Neural Networks

Due to the recent success of graph networks in tasks that employ poses, such as action recognition, several methods attempting to employ graph networks in the task of 3D pose estimation have been proposed. Simple approaches to the problem include replacing the linear layers in the model in Figure 12 with graph convolutional layers (BANIK et al., 2021).
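To make the two ideas just discussed concrete, the sketch below shows a residual block in the spirit of the Figure 12 baseline, together with a plain graph convolution over the skeleton adjacency of the kind that can replace its linear layers. This is an illustrative PyTorch sketch under assumed sizes (1024 hidden units, 0.5 dropout), not the reference implementation of any cited work.

```python
import torch
import torch.nn as nn

class LinearResidualBlock(nn.Module):
    """Residual block in the spirit of the Figure 12 baseline: two
    Linear -> BatchNorm -> ReLU -> Dropout stages plus a skip connection."""

    def __init__(self, width: int = 1024, p_drop: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(width, width), nn.BatchNorm1d(width),
            nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(width, width), nn.BatchNorm1d(width),
            nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, x):          # x: (batch, width)
        return x + self.net(x)

class SkeletonGraphConv(nn.Module):
    """A plain graph convolution X' = A_hat X W over the skeleton
    adjacency: the kind of layer that can replace the Linear layers above,
    so that features are mixed only along skeleton edges."""

    def __init__(self, adjacency: torch.Tensor, c_in: int, c_out: int):
        super().__init__()
        a = adjacency + torch.eye(adjacency.size(0))   # add self-loops
        self.register_buffer("a_hat", a / a.sum(dim=1, keepdim=True))
        self.linear = nn.Linear(c_in, c_out)

    def forward(self, x):          # x: (batch, joints, c_in)
        return torch.matmul(self.a_hat, self.linear(x))
```

In a full lifting network of this style, an input linear layer maps the 2J input coordinates to the hidden width, a stack of such blocks follows, and an output linear layer regresses the 3J coordinates.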
Early uses of this approach involved employing SemGCNs, a kind of graph neural network that works with a semantic representational model for 2D poses, built from a set of operations called semantic convolutions. These semantic convolutions work in such a way that each node of the graph, which already defines the problem semantically, has its own convolution matrix, addressing a shortcoming of traditional graph neural networks in this task, namely that they share the same convolution filter among all nodes. Aspects of the SemGCN network are shown in Figure 13. Ci et al. (2020), in contrast, propose Locally Connected Networks (LCNs) to solve the problem, combining concepts of Graph Convolutional Networks (GCNs), which are limited in the representation of their convolution filters, with concepts of fully connected networks, which do not directly exploit the connections between vertices during learning. To accomplish that, each node in an LCN has its own weight matrix with its own set of filters. SemGCNs have been explored in several other scenarios, such as density mixture models used to generate multiple pose hypotheses (ZOU et al., 2021), or even in a generative-adversarial context, in which one SemGCN acts as a generator of 3D poses, whereas another SemGCN acts as a discriminator to differentiate real poses from artificially generated ones (XIA; XIAO, 2020).

Figure 13 – SemGCN network architecture (a), accompanied by the representation of the semantic graph of a pose used as a basis for the development of the task (b). Source: Adapted from Zhao et al. (2019), © 2019 IEEE.

An interesting use of the SemGCN has been proposed by Sun et al. (2020), who employ it in a generative context, through a stereo pose generator module that obtains new viewpoints for an input pose. In this way, the problem is transformed into a multi-view 3D pose acquisition problem. Some graph-based methods started expanding on the idea of the SemGCN through the usage of attention and transformers. Yin et al. (2023), for example, employ an attention mechanism aiming to extract global joint features from the input skeleton without neglecting local and neighboring information. Another such work is the one proposed by Zhao et al. (2022), which introduces the GraFormer architecture, in which graph convolutional layers work together with an attention block capable of learning implicit relationships through a dynamic adjacency matrix.

2.4.2.2 Video-based Regression Techniques

In contrast to image-based techniques, video-based techniques employ a collection of frames in order to learn. One of the first approaches working with videos employed LSTMs to perform pose prediction (HOSSAIN; LITTLE, 2018). This type of approach has continued to be explored, such as in the method of He et al. (2019), which employs a BiLSTM after splitting the poses into kinematic groups, with a regression head for each group. Based on the idea of dividing poses into local groups, Zeng et al. (2020) split a single pose into local poses, focusing on the arms, legs, and torso, and proposed the use of an SRNet, composed of recurrent neural networks, to add global context information to local connections.
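As a rough illustration of the recurrent formulation shared by these video-based methods, the following sketch regresses a 3D pose for every frame of a 2D pose sequence with a BiLSTM; the joint count, layer sizes, and number of layers are illustrative choices, not values taken from the cited works.

import torch
import torch.nn as nn

class SequencePoseRegressor(nn.Module):
    """Regress a 3D pose for each frame of a 2D pose sequence."""
    def __init__(self, n_joints=16, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_joints * 2, hidden_size=hidden,
                            num_layers=2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_joints * 3)

    def forward(self, poses_2d):          # (batch, frames, n_joints * 2)
        feats, _ = self.lstm(poses_2d)    # temporal context around each frame
        return self.head(feats)           # (batch, frames, n_joints * 3)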
Other approaches seek to use temporal convolutional networks for learning, either by creating memory banks (DENG et al., 2020) or by analyzing bone consistency, such as the method of Chen et al. (2021), which explores anatomical constraints through a network used to predict bone size from random frames, in conjunction with an analysis of bone direction consistency on consecutive frames. Taking advantage of the trend of graph methods in image-based 3D pose estimation, Zhang et al. (2021b) propose a dynamic spatial graph convolutional method, whose goal is to generalize the spatial relationship between vertices over time. Observing that the local analysis of traditional graph techniques always takes into account the same vertices, even if there is no interaction between them, the authors propose a dynamic graph convolution, which updates the joint neighborhood according to the Euclidean distance between vertices in a given frame.

Liu et al. (2021), in turn, use a set of attention mechanisms combined with temporal dilated convolutions, trained with the full frame sequence in order to estimate 3D poses in a temporal context. Meanwhile, Li et al. (2022) employ a pose-based transformer to learn a proper spatiotemporal representation of different poses, generating multiple initial pose hypotheses and learning the temporal relationship between them in order to deal with the non-invertible nature of the 2D-to-3D mapping. The concept is also examined in the work of Zhao et al. (2023), which utilizes a spatial encoder to detect the correlation between joints in a single frame, a temporal encoder to acquire spatiotemporal information about the joints, and a Discrete Cosine Transform to represent the overall movement of the pose through a low-frequency representation of the skeleton. To handle incomplete input sequences, Einfalt et al. (2023) replace the missing poses with position-aware upsampling tokens, which are then transformed into 3D pose estimates through self-attention over the entire sequence.

2.4.2.3 Self-Supervised Methods

Self-supervised methods work by using the information present in the estimated 3D pose itself as a label, without the need for direct information from the 3D ground truth. Figure 14 illustrates the general behavior of self-supervised 3D human pose estimation techniques. Dabral et al. (2018) work with this type of method by ensuring that the estimated 3D pose satisfies a set of anatomical constraints, such as the validity of joint angles or the symmetry of twin limbs. Following a similar line, Xie et al. (2019) start from geometric constraints based on symmetry, in a context of graph networks, to perform the method supervision. Wang et al. (2019b) utilize a mechanism for generating 3D poses without the need for 3D labels: the output of an LSTM-based 3D pose estimator module serves as input to a 3D pose refiner module which, in its intermediate layers, compares the projection of the estimated 3D pose with the 2D pose input in order to define the error function.

Figure 14 – General scheme of a self-supervised method based on the error obtained between the 3D pose projection and the input 2D pose. Source: The Author.
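A minimal sketch of the supervision signal of Figure 14 follows, assuming a weak (orthographic) projection model; the scale parameter and the mean reduction are simplifying assumptions of this sketch rather than details of any cited method.

import torch

def reprojection_loss(pred_3d, input_2d, scale=1.0):
    """Self-supervised signal: project the predicted 3D pose back to 2D
    and compare it with the 2D pose that served as input (the "label").

    pred_3d  : (batch, joints, 3) predicted 3D pose
    input_2d : (batch, joints, 2) input 2D pose
    scale    : weak-perspective scale (scalar or broadcastable tensor)
    """
    proj_2d = scale * pred_3d[..., :2]    # orthographic projection drops depth
    return torch.mean(torch.norm(proj_2d - input_2d, dim=-1))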
Another work, proposed by Klug et al. (2020), aims to define theoretical limits for self-supervised approaches, through a minimum threshold error for pose estimation techniques that use weak projections. The threshold is based on the error propagated by the distortions caused by the projection; thus, even if a technique can estimate a 3D pose almost perfectly, the distortion caused by the projection of the 3D pose onto the camera, used for error propagation, is taken into account, and a minimum threshold can be defined for each of the analyzed databases.

2.4.3 Refinement-based Techniques

Refinement-based techniques work on improving the pose representation, so that other pose estimation techniques can achieve better results, as illustrated in Figure 15. Wang et al. (2019a), for example, carry out the post-processing of the poses by learning a distance matrix, upon which k pose bases are retrieved and used to reposition the 3D pose in a new space. Guo et al. (2019), on the other hand, perform the refinement by grouping joints into three categories (easy, medium, and hard), according to the error magnitude at the joints. In this way, each type of joint undergoes another regression step in a neural network, adjusting the poses in groups without letting the group with the largest error influence the other groups.

Regarding the pre-processing of the poses, one approach fixes the poses by maintaining the consistency of the center of gravity along the frames (XU; WU, 2020). Liang et al. (2020) employ a neural network on the 2D poses, in such a way that all of the 2D poses are represented in the same perspective, mitigating overfitting to camera angles. Following a similar idea, Wei et al. (2019) employ a hierarchical correction network in a generative adversarial context to find a new perspective for the 2D poses.

Figure 15 – Overview of refinement methods, illustrating a situation where both pre-processing and post-processing techniques are applied. Source: The Author.

Lastly, Xu et al. (2020) create a video processing technique through a kinematic analysis, based on bone lengths and angles, that projects 2D poses to a common perspective and, at the end of the prediction, refines the predicted poses with a low 2D confidence score based on their trajectory.

3 Domain Adaptation

This chapter aims to present the motivation and the definition of domain adaptation, and to provide a short literature review on the subject, with the intent of orienting the reader within this subject matter and drawing the necessary connections between domain adaptation and 3D human pose estimation.

3.1 Motivation and Definition

When images from the training dataset and the test dataset show differences between their data distributions, a problem called domain shift arises. This problem can have a negative impact on the accuracy of classifiers, leading images to be misclassified (KOUW; LOOG, 2018). One way to deal with this problem is to use domain adaptation techniques (PATEL et al., 2015). Some authors define domain adaptation as a sub-area of transfer learning, whose goal is to use data from a domain other than the one used for training, in order to improve the accuracy of the classifier when applied to an alternative dataset (CSURKA, 2017). Recent studies in the literature, instead, define domain adaptation as a sub-area of domain generalization.
In both approaches, the common goal is to address the domain shift problem in unsupervised target distributions. The difference is that domain adaptation techniques typically focus on addressing the domain shift within a well-defined target domain, leveraging accessible data to assist in the distribution learning process, whereas domain generalization covers a broader scope, emphasizing generalization to out-of-distribution (OOD) unseen domains based solely on the available source data (ZHOU et al., 2022).

Figure 16 shows the importance of applying domain adaptation techniques when dealing with the domain shift problem. In the middle graph, when comparing the source domain classifier (highlighted in blue) with the target domain classifier (highlighted in red), one can observe the negative impact of using a different distribution to evaluate a classifier trained on a distinct domain. The right graph shows a scenario in which domain adaptation techniques were applied and a common domain was found between both datasets, solving the domain shift problem.

Figure 16 – Example of a domain shift problem. The first graph (left) shows source domain data with a trained classifier highlighted in blue. The second graph (middle) shows the target domain with a trained classifier highlighted in red. In the third graph (right), data from both domains are shown after domain adaptation, with a classifier trained on the common domain. Source: Reprinted from Chai et al. (2016), Copyright 2016, with permission from Elsevier.

The formal definitions of the concepts that compose the basis of domain adaptation theory, according to Pan and Yang (2010), are presented hereafter.

Definition 1 (Domain). A domain $\mathcal{D}$ is composed of a feature space $\mathcal{F}$ with $d$ dimensions and a marginal probability function $P(x)$, which means that, given a data point $x$, we have $\mathcal{D} = \{\mathcal{F}, P(x)\}$, with $x \in \mathcal{F}$.

Definition 2 (Task). Given a domain $\mathcal{D}$ composed of the set of training data $\{x, y\}$, a task $\mathcal{T}$ consists of a set of labels $\mathcal{Y}$ and an objective predictive function $f(x)$ that can be learned from the training data, which means that $\mathcal{T} = \{\mathcal{Y}, f(x)\}$, with $y \in \mathcal{Y}$ and $f(x) = P(y|x)$.

Definition 3 (Domain Adaptation). Given a source domain $\mathcal{D}_S = \{X_S, Y_S\}$ and a target domain $\mathcal{D}_T = \{X_T, Y_T\}$, assuming that $\mathcal{D}_S \neq \mathcal{D}_T$ regarding their marginal probabilities, $P(X_S) \neq P(X_T)$, and two tasks $\mathcal{T}_S \approx \mathcal{T}_T$ with conditional distributions $P(Y_S|X_S) \approx P(Y_T|X_T)$, the goal of domain adaptation is to improve the prediction $f_T(\cdot)$ in the target domain $\mathcal{D}_T$ using the source domain $\mathcal{D}_S$ data.

3.2 Deep Domain Adaptation

The subject of domain adaptation has also benefited from recent developments in the field of deep learning. Initially, deep neural networks were used only as feature extractors for the later application of traditional domain adaptation techniques; however, the development of the area enabled the establishment of architectures and training protocols focused on domain adaptation in the deep learning scenario (CSURKA, 2017). These architectures come in different forms, with methods based on the discrepancy between domains, methods that use adversarial training, methods based on autoencoders, and methods that take advantage of the spatial relationship between the data. Two architectures that are fairly simple to apply and show excellent results are the Deep Adaptation Network (DAN) architecture (LONG et al., 2018) and the Domain Adversarial Neural Network (DANN) architecture (GANIN et al., 2017).
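Both architectures are detailed in the next subsections. As a primer, the sketch below computes a biased empirical estimate of the squared MMD under a single Gaussian kernel, the basic statistic that DAN extends to its multi-kernel (MK-MMD) form by averaging several such kernels; the bandwidth value is an arbitrary placeholder.

import torch

def rbf_mmd2(xs, xt, sigma=1.0):
    """Biased estimate of the squared MMD between two feature batches.

    xs : (n, d) source-domain features; xt : (m, d) target-domain features.
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2                 # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))    # Gaussian RBF kernel
    return (kernel(xs, xs).mean() + kernel(xt, xt).mean()
            - 2 * kernel(xs, xt).mean())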
3.2.1 Deep Adaptation Network

The Deep Adaptation Network (LONG et al., 2018) architecture is based on the principle that the last layers of a convolutional neural network carry more task-specific features. Thus, the first layers have their weights frozen, while the last fully connected layers of a pre-trained network are fine-tuned in such a way that the Multi-Kernel Maximum Mean Discrepancy (MK-MMD), a metric used to calculate the distance between distributions, computed on the last layers, is minimized as part of a domain error function during fine-tuning. Figure 17 shows an example of a DAN network.

Figure 17 – The Deep Adaptation Network architecture. Source: Long et al. (2018), © 2018 IEEE.

The MK-MMD distance between two probability distributions $p$ and $q$ is defined as the distance between the mean embeddings of the distributions in a reproducing kernel Hilbert space $\mathcal{H}_k$ induced by a kernel $k$, from which the following expression is obtained:

$$d_k^2(p, q) \triangleq \lVert \mathbb{E}_p[\phi_k(x)] - \mathbb{E}_q[\phi_k(x)] \rVert_{\mathcal{H}_k}^2. \quad (1)$$

Thus, the error is minimized by:

$$\min_{\Theta} \; \frac{1}{n} \sum_{i=1}^{n} J(\theta(x_i), y_i) + \lambda \sum_{l=4}^{5} d_k^2(\mathcal{D}_S^l, \mathcal{D}_T^l), \quad (2)$$

where $J$ is the error function applied to the neural network output $\theta(x_i)$ for an input data object $x_i$, in comparison to a true label $y_i$, and $d_k^2(\mathcal{D}_S^l, \mathcal{D}_T^l)$ denotes the MK-MMD between the source domain ($S$) and the target domain ($T$) at layer $l$.

3.2.2 Domain Adversarial Neural Network

The Domain Adversarial Neural Network architecture (GANIN et al., 2017) follows an adversarial training protocol built around a domain classifier, which is used to ensure that the feature extraction module finds a uniform representation of the features in a common domain. The method combines a feature generator $G_f$ with a domain discriminator $G_d$ and a label predictor $G_y$. The main idea of the method is to maximize the confusion of the domain classifier in a way that does not incur performance loss on the label predictor. Figure 18 shows how the DANN architecture works. Initially, the method obtains the feature representation $G_f(X)$ of a data input $X$. These features serve two distinct purposes: predicting class labels $G_y(G_f(X))$ and domain labels $G_d(G_f(X))$.

Figure 18 – The Domain Adversarial Neural Network architecture. Source: Ganin et al. (2017), reproduced with permission from Springer Nature.

After properly obtaining the desired labels, the method proceeds to maximize the domain classifier confusion through the usage of a Gradient Reversal Layer (GRL). The idea is that the gradient of the domain classifier is reversed with respect to the feature extractor during backpropagation, which is achieved by multiplying its gradients by a negative scalar $-\Lambda$. By reversing the gradient and maximizing domain confusion, the training process compels the feature extractor to learn domain-invariant features. This ensures that the feature distributions over both domains are as indistinguishable as possible when passed to the domain classifier. The overall loss of the method is given by Equation 3, where $\mathcal{L}_y$ represents the loss of the label predictor, $\mathcal{L}_d$ the loss of the domain classifier, and $y$ the ground-truth label information:

$$\mathcal{L}_{DANN} = \mathcal{L}_y(G_y(G_f(X)), y) - \Lambda\, \mathcal{L}_d(G_d(G_f(X))). \quad (3)$$
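In code, the GRL amounts to a small autograd function that is the identity in the forward pass and multiplies gradients by a negative scalar in the backward pass. The sketch below is a generic PyTorch rendering of the mechanism; the names Gf, Gy, and Gd in the usage comment mirror Figure 18, and the way the losses are wired is an assumption, with the $-\Lambda$ factor absorbed by the reversal.

import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam backwards."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient for lam itself

def grl(x, lam=1.0):
    return GradientReversal.apply(x, lam)

# Sketch of one DANN-style training step:
#   feats = Gf(x)
#   loss = Ly(Gy(feats), y) + Ld(Gd(grl(feats, lam)), domain_label)
#   loss.backward()   # Gd descends on Ld while Gf ascends on it via the GRL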
3.3 Regressive Domain Adaptation

Although domain adaptation methods are able to deal with the domain shift problem in several types of situations, a fundamental issue can be observed: the concept of domain adaptation, as well as the methods that arise from it, was envisaged to operate in a classification scenario. Jiang et al. (2021) show that there are few methods focused on addressing the particularities of regression and that, in certain scenarios, traditional domain adaptation techniques are not able to adequately adapt to the task. This occurs as a consequence of the behavior of the boundary that separates the predictions in the different tasks. In a classification scenario, the boundary between classes is usually well defined in both domains, and, when applying domain adaptation techniques, the margins of the boundary that defines the classes in the source domain can be enlarged, increasing the generalization capacity of the classifier for the new domain. In regression problems, on the other hand, because the problem lies in a continuous space, the decision margins are not as clear as in classification problems. This problem can be aggravated in keypoint detection, whose output also involves a high-dimensional discrete space.

Because of this particularity of domain adaptation methods in regression tasks, such as the estimation of 3D poses from 2D poses, some techniques specific to the regression task have been proposed or adapted for the application of domain adaptation in this kind of scenario. Although regression-based domain adaptation methods have been developed to deal with problems that traditional domain adaptation methods could not solve, one problem still persists: current methods aimed at solving this problem work only on very specific regression problems and do not deal well with other types of tasks. Some regression-based techniques were considered during the development of this work, among them Representation Subspace Distance (RSD) (CHEN et al., 2021) and Regressive Domain Adaptation for Unsupervised Keypoint Detection (RegDA) (JIANG et al., 2021); however, none of them turned out to be suitable for the problem of 3D pose estimation based on 2D poses. The RSD method encounters a numerical problem during the optimization of this specific problem, while the RegDA technique requires the use of ground-false data, which is hard to obtain efficiently in this particular case. Therefore, our challenge is to propose a method capable of operating with regression to solve the following problem:

Problem 1 (Domain Adaptive 3D Human Pose Estimation). Given a source domain $\mathcal{D}_S$ composed of a set of poses $X, Y$ and an unsupervised target domain $\mathcal{D}_T$ consisting of a set of pose annotations $X$, with distinct marginal probabilities $\mathcal{D}_S \neq \mathcal{D}_T$, the goal is to find a feature map $\theta$ and a pose estimator head $P$ such that the conditional probabilities $P(\theta(X_S)|X_S) \approx P(\theta(X_T)|X_T)$, without negatively affecting the efficacy of the pose regressor head $P$.

4 Related Work

In the previous chapters, we presented a literature review and discussed earlier works related to human pose estimation and domain adaptation in a more general sense. In this chapter, we present works that are more closely related to the focus of this dissertation, since they either deal with pose representation issues or are used in a cross-domain evaluation scenario. We also discuss some works that inspired us while solving the problem.

4.1 3D Human Pose Estimation in Cross-Domain Scenarios

The idea of applying domain adaptation to 3D human pose estimation has been discussed before. Zhang et al. (2019), for example, proposed a method in which a synthetic depth-based dataset is used for domain adaptation during the learning step.
However, the evaluation of 3D human pose estimation in a cross-domain scenario is still not discussed by them. Recent works started to notice the discrepancy in performance between data obtained from distinct distributions. To deal with this issue, several approaches have been proposed, such as that of Dabral et al. (2018), in which artificially generated synthetic pose datasets were used to increase the amount of data available during training. Other authors also followed this data augmentation paradigm by using generative adversarial frameworks (YANG et al., 2018) or a Conditional Variational Auto-Encoder (CVAE), aiming to generate poses from another dataset distribution (JIANG et al., 2021).

The expansion of the training set through data augmentation is further discussed by recent works aimed at operating directly in cross-dataset scenarios, in which the discrepancy in performance is even more noticeable. One such work introduced augmentation by the adjustment of distinct geometric factors through a joint optimization algorithm trained online (GONG et al., 2021). Gholami et al. (2022) addressed the domain gap caused by cross-dataset evaluation through the weakly-supervised generation of synthetic 3D motions; in this way, the target distribution could be represented only by looking at the 2D poses, with the method working both as a pose estimation technique and as a synthetic pose generator. A distinct approach that also employs synthetically generated poses focused on alleviating the domain shift jointly in feature space and pose space, using semantic awareness and skeletal pose adaptation.

The idea of directly using domain adaptation techniques to approach this problem has been discussed in previous works. One such work (GUAN et al., 2021) utilized the Skinned Multi-Person Linear (SMPL) model and proposed a method called Bilevel Online Adaptation to reconstruct mesh and pose, through a multi-objective optimization problem using temporal constraints to deal with the domain discrepancy. Chai et al. (2023), on the other hand, observed that most of the distribution discrepancy in cross-dataset evaluation stems from camera parameters and from the diversity of local structures during training. Thus, they employed domain adaptation by combining a global position alignment mechanism, aiming to eliminate viewpoint inconsistency, with a local pose augmentation used to enhance the diversity of the available poses.

The approach proposed by Kundu et al. (2022) introduced the usage of uncertainty mechanisms for self-supervised 3D human pose estimation, operating in such a way that, by minimizing the uncertainty for the unsupervised real dataset alongside a supervised synthetic dataset, it is possible to perform cross-dataset pose adaptation. Zhang et al. (2021), in turn, proposed a method for learning causal representations in order to generate out-of-distribution features that can properly generalize to unseen domains; to assess the efficacy of their method, they compare it to previously established domain adaptation techniques, such as DDC (TZENG et al., 2014), DAN (LONG et al., 2018), and DANN (GANIN et al., 2017).

Some works tried to solve the problem of pose misrepresentation, which is also found in the literature regarding cross-dataset evaluation. The work of Rapczyński et al. (2021) aimed to solve this issue through virtual camera augmentation and a joint-harmonization mechanism, supported by scale normalization.
The harmonization mechanism was used to ensure that joints were represented in the same position across all datasets, while the scale normalization ensures that all the limbs of the subjects have the same proportions. The approach of Sárándi et al. (2023), instead, involved using an autoencoder to learn a set of latent keypoints that can properly represent all of the distinct datasets in the same embedding.

5 Proposed Method

This chapter introduces the Domain-Unified Approach, a novel solution we propose for addressing the challenges discussed in the previous chapters. Even with various methods being proposed to tackle specific aspects of the problem of 3D pose estimation from monocular RGB images, a comprehensive approach for addressing cross-dataset human pose estimation remains lacking. Considering the limitations of existing methods, which often focus on specific aspects of the domain discrepancy problem, in this work we introduce a novel method, called Domain-Unified Approach (DUA), which tackles this issue from a unified perspective, combining domain adaptation techniques with a universal pose representation and a specialized training technique to mitigate error propagation at the extremities of the kinematic chain.

5.1 DUA - Domain Unified Approach

In order to address Problem 1, discussed in Section 3.3, we propose DUA (Domain Unified Approach), a method capable of accurately inferring poses from source and target domains with minimal error, by combining a pose conversion unit, presented in Section 5.2, and the uncertainty loss mechanism, described in Section 5.3, with a domain adaptation module. The general idea of DUA is to align the representations of 3D human poses from both domains using a domain discriminator that is jointly optimized with the entire deep learning system. Based on the taxonomy proposed in Section 2.4, this method can be categorized as a hybrid approach that combines image-based regression and pre-processing refinement.

Figure 19 depicts the DUA method, which is structured around three main modules, all operating on top of a backbone pose estimator. Initially, the pose estimator serves as a feature extractor, from which one can obtain poses from a dedicated pose head P. The extracted features are fused with the pose predictions to generate an uncertainty estimate, aiding the training process. Furthermore, the predicted target-domain poses undergo transformation into a unified pose representation, harmonizing the joint distribution with the source domain. Lastly, a domain discriminator is employed, tasked with distinguishing between source and target poses, in order to establish a consistent feature representation within a common domain.

Figure 19 – Proposed Domain Unified Approach for 3D human pose estimation. The method is composed of three main modules on top of the 3D pose estimator: the unified pose representation module, the uncertainty estimation module, and the domain discriminator. The dashed lines on the pose converter represent frozen weights. Source: The Author.

The DUA method has an architecture inspired by the DANN architecture. To find the desired pose, given a pose estimator $\Pi$, the following pose loss is used:

$$\mathcal{L}_{pose}(x) = \beta\,(y - \Pi(x))^2 + (1 - \beta)\,\lVert y - \Pi(x) \rVert, \quad (4)$$

where $0 \leq \beta \leq 1$ is a hyperparameter that controls the importance of each part of the pose loss.
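A direct reading of Equation 4 in code is given below; the mean reduction over joints and coordinates is an assumption of this sketch, and the default value of beta follows the value reported later in Section 6.2.

import torch

def pose_loss(pred, target, beta=0.4):
    """Combined squared/absolute pose loss of Equation 4."""
    sq = torch.mean((target - pred) ** 2)     # squared-error term
    ab = torch.mean(torch.abs(target - pred)) # absolute-error term
    return beta * sq + (1 - beta) * ab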
The pose estimator is engaged in a minimax game: it aims to minimize $\mathcal{L}_{pose}$ while simultaneously maximizing the confusion of the domain classifier, in order to find the optimal representation from the pose feature extractor $\theta$. This is achieved using a domain classifier $G$, trained with the loss:

$$\mathcal{L}_d(x) = G(\theta(x)) \log(G(\theta(x))) + (1 - G(\theta(x))) \log(1 - G(\theta(x))). \quad (5)$$

In the DUA method, the unified pose representation obtained by the converter is pre-trained, and its weights remain frozen during training. The other components of the method, on the other hand, are trained in an online fashion. The overall training loss is given by:

$$\mathcal{L} = \lambda \mathcal{L}_d + \gamma \mathcal{L}_{unc} + \mathcal{L}_{pose}, \quad (6)$$

where $0 < \lambda < 1$ and $0 < \gamma < 1$ are regularization parameters.

5.2 Unified Pose Representation

The incompatibility between pose representations is a commonly observed problem in 3D human pose estimation when dealing with different datasets, and previous works have already discussed this issue in the literature. One such work aims to learn unified representations by utilizing different data sources concurrently (SÁRÁNDI et al., 2023). The problem arises from the existence of various body capture sensors and of different 3D pose representations in the literature, leading each dataset to have its own representation. Figure 20 shows five distinct 3D pose representations found in the literature.

Figure 20 – Five distinct pose representations used by common 3D human pose datasets found in the literature. Source: Sárándi et al. (2023), © 2023 IEEE.

This problem was previously addressed in the task of body shape estimation by the creation of the Archive of Motion Capture As Surface Shapes (AMASS) (MAHMOOD et al., 2019), a large and varied database of human motion that unifies 15 different optical marker-based datasets through the lens of the SMPL (Skinned Multi-Person Linear) representation. This is done through the usage of the Motion and Shape capture (MoSh) technique, aimed at estimating SMPL body shape parameters given the 3D pose data (LOPER et al., 2014). An example of the moshed SMPL representation of a 3D pose from the Human3.6M dataset is shown in Figure 21, in which the SMPL representation (red) is juxtaposed with the H3.6M representation (black) to show their differences, the biggest being at the hips and the head.

Figure 21 – Overlapped joints of the Human3.6M dataset coming from two distinct pose representations, SMPL (red) and the original H3.6M format (black). This makes explicit the difference in the pose representations used by common 3D human pose datasets in the literature. Source: The Author.

Therefore, to address this problem, our approach trains a pose converter to transform 3D human poses into a single, unified pose representation, using data obtained from both the SMPL and the original Human3.6M representations. This conversion problem has already been discussed in the literature (RAPCZYŃSKI et al., 2021); however, previous approaches tried to find a harmonization and normalization technique through handcrafted features, which works well in some cases but does not preserve the body proportions after normalization. Thus, we developed a pose converter that dynamically learns how to convert from one pose representation to another. The idea of our converter network is to dynamically find an array, based on the network weights and the 3D pose input, such that adding this array to a pose in representation A yields the corresponding pose in an arbitrary format B.
In mathematical terms, the mapping function $\Phi : A \mapsto B$ takes a set of joints $X^A$, represented in pose format A, and calculates weights to map $X^A$ to a representation $X^B$ in pose format B. Instead of directly mapping A to B, the task of converting between representation spaces of the same semantic skeleton graph involves finding trajectory vectors that describe the new joint positions in the new pose space. To simplify this process, we work directly with the joint trajectory vectors by introducing a mapping function $\varphi : A \mapsto (B - A)$, in such a way that:

$$X^B = \varphi(X^A) + X^A. \quad (7)$$

The weights of the mapping function $\varphi$ are obtained through a single-layer residual neural network trained with gradient descent. To train this network, a loss function combining the mean squared error and the mean absolute error between the converted pose $\hat{X}^B = \varphi(X^A) + X^A$ and the target pose $X^B$ is employed:

$$\mathcal{L}_{conv} = \alpha\,(X^B - \hat{X}^B)^2 + (1 - \alpha)\,\lVert X^B - \hat{X}^B \rVert, \quad (8)$$

where $0 \leq \alpha \leq 1$ is a hyperparameter used to impose the importance of each loss term. Figure 22 illustrates the learning process of the proposed method.

Figure 22 – Pose conversion method used to find a unified pose representation. Source: The Author.
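A minimal sketch of the converter described by Equations 7 and 8 follows. In line with the text, phi is a single linear layer applied residually; the flattened (joints x 3) pose layout and the joint count are assumptions of this sketch.

import torch
import torch.nn as nn

class PoseConverter(nn.Module):
    """Residual mapping of Equation 7: X_B = phi(X_A) + X_A."""
    def __init__(self, n_joints=17):
        super().__init__()
        self.phi = nn.Linear(n_joints * 3, n_joints * 3)

    def forward(self, x_a):              # x_a: (batch, n_joints * 3), format A
        return x_a + self.phi(x_a)       # add the learned trajectory vectors

def converter_loss(pred_b, target_b, alpha=0.5):
    """Equation 8: blend of MSE and MAE between converted and target poses."""
    mse = torch.mean((target_b - pred_b) ** 2)
    mae = torch.mean(torch.abs(target_b - pred_b))
    return alpha * mse + (1 - alpha) * mae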
5.3 Pose Uncertainty

The problem of 3D human pose estimation presents a challenge in the form of error propagation within the most extreme kinematic groups, compounded by the ill-posed nature of monocular estimation caused by self-occlusion under varying camera perspectives. In order to mitigate this problem, an approach has been devised to quantify and reduce the uncertainties arising from such scenarios. Uncertainty in Bayesian networks has been defined in two forms: epistemic uncertainty captures the model's ignorance despite sufficient training data with well-defined data distributions, while aleatoric uncertainty aims to model unexplained uncertainties within the current training data (KENDALL et al., 2018). Previous works have explored uncertainty modeling through Bayesian networks for 3D human pose estimation using different approaches (LI et al., 2023; KUNDU et al., 2022). In this work, we propose a method based on a naive definition of uncertainty.

To quantify uncertainty, our method utilizes the features extracted from the pose estimator to predict the probability of a joint being incorrect. A random variable U is generated by mapping the normalized Euclidean distance of the joint difference, where joints with small distances are mapped near zero and those with significant distances are mapped near one. This mapping allows for an improved assessment and quantification of the uncertainty associated with individual joints. Figure 23 illustrates our approach, using the method proposed by Martinez et al. (2017) as the backbone.

Figure 23 – Uncertainty-based 3D human pose estimation method devised using Martinez et al. (2017) as backbone. Source: The Author.

Our method consists of J heads, each representing one joint of the pose representation. After passing through a sigmoid activation function, each head provides the probability of a specific joint being incorrect. This probability is learned through supervised training by comparing the output to the normalized Euclidean distance, with the uncertainty error calculated as the L1 distance between the array composed of the heads and the normalized distance. Given the outputs of a pose feature extractor $\theta$ for an input $x$, inserted into each of the $J$ heads $H_j$, and using a pose estimator head $P$ to obtain the output 3D pose, we define $U(x)$ as the desired pose uncertainty, obtained by concatenating the output of each head $H_j$. In other words, using the concatenation operation denoted by $\Vert$ and the sigmoid function $\sigma$, we have:

$$U(x) = \big\Vert_{j=1}^{J}\, \sigma(H_j(\theta(x))). \quad (9)$$

By combining the pose feature extractor and the pose estimator head through the function $\Pi(x) = P(\theta(x))$, the uncertainty $U(x)$ of a given pose $x$ can be learned. This is achieved by computing the normalized Euclidean distance of each joint in the predicted pose $\Pi(x)$ to the ground-truth 3D pose $y$, and comparing it to the output uncertainty using the L1 distance. The loss function for training is defined as:

$$\mathcal{L}_{unc}(x) = \left\lVert \frac{\sqrt{(\Pi(x) - y)^2}}{\lVert \Pi(x) \rVert_2\, \lVert y \rVert_2} - U(x) \right\rVert_1. \quad (10)$$
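One possible rendering of the uncertainty module of Equations 9 and 10 is sketched below; the feature dimension, the joint count, and the exact form of the per-joint distance normalization are assumptions of this sketch rather than specified design details.

import torch
import torch.nn as nn

class UncertaintyHeads(nn.Module):
    """One sigmoid head per joint, predicting the probability that the
    corresponding joint is wrong (Equation 9)."""
    def __init__(self, n_joints=17, feat_dim=1024):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(n_joints))

    def forward(self, feats):                     # feats = theta(x): (batch, feat_dim)
        u = [torch.sigmoid(h(feats)) for h in self.heads]
        return torch.cat(u, dim=-1)               # U(x): (batch, n_joints)

def uncertainty_loss(pred_pose, gt_pose, u):
    """Equation 10: L1 gap between the predicted uncertainty and the
    normalized per-joint distance between predicted and true 3D poses.

    pred_pose, gt_pose : (batch, joints, 3); u : (batch, joints)
    """
    dist = (pred_pose - gt_pose).norm(dim=-1)     # per-joint Euclidean error
    denom = pred_pose.flatten(1).norm(dim=1) * gt_pose.flatten(1).norm(dim=1)
    dist = dist / (denom.unsqueeze(-1) + 1e-8)    # normalize by the pose norms
    return torch.mean(torch.abs(dist - u))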
6 Experimental Settings

This chapter presents the experimental settings used to conduct the experiments, including the hardware configuration, the hyperparameter values, and the training and evaluation protocols. It also presents the metrics and datasets used to assess the DUA method, proposed in this dissertation for 3D human pose estimation based on monocular RGB images and domain adaptation.

6.1 Hardware Configuration

All experiments were conducted on a computer with two Intel Xeon E5620 CPUs, 48 GB of RAM, and an NVIDIA Titan Xp GPU with 12 GB of VRAM.

6.2 Hyperparameter Values

During training, a batch size of 2048 was employed, with a learning rate of 1e-3 paired with the Adam optimizer. Regarding the hyperparameters, α = 0.5 was employed in the pose conversion scenario; for the pose estimator, λ = 0.01, γ = 0.1, and β = 0.4 were chosen via empirical evaluation. The pose converter was pre-trained and its weights were kept frozen in the DUA method.

6.3 Preprocessing

Prior to training, a preprocessing step was applied to center all the poses with their hip joint at the origin of the coordinate system. Additionally, the 2D joint coordinates were normalized to fit within a [-1, 1] coordinate system by scaling the image input space. The 3D joint coordinates followed the standard protocol established by Martinez et al. (2017), being transformed into the camera coordinate system. This transformation centers the pose around the camera's optical center and is achieved by applying the inverse rotation and translation of the camera position to the pose joints, using the extrinsic camera matrices from the training data.

6.4 Evaluation Protocol

Cross-dataset evaluation in 3D human pose estimation poses a significant challenge due to the inherent misalignment of target distributions, especially when synthetic data is involved. The scarcity of literature addressing this specific scenario has led only a few authors to explore evaluation protocols for assessing cross-domain generalization when synthetic data is involved (KUNDU et al., 2022; ZHANG et al., 2021). In this work, our focus lies on evaluating the performance of synthetic-to-real cross-domain pose estimation. Building upon previous works, we adopt a widely used general domain adaptation evaluation protocol to assess the effectiveness of domain generalization. Specifically, we employ unsupervised training on the target dataset training split, while utilizing the supervised source data for training. The evaluation is conducted using both synthetic and real datasets as source data. For the purpose of comparison, we adopt the unified pose representation of the Human3.6M model as our baseline, with the pose converter trained to transform the SMPL pose representation into the Human3.6M representation. By employing this unified pose representation, we aim to facilitate meaningful comparisons with existing approaches.

6.5 Datasets

We employed two datasets in our cross-domain experiments: SURREAL (VAROL et al., 2017) and Human3.6M (IONESCU et al., 2014). In particular, the SURREAL dataset was used to represent the synthetic image domain, while the Human3.6M dataset was used to represent the real people image domain. SURREAL is a large-scale dataset containing more than 6 million photorealistic synthetic frames rendered against real-environment backgrounds, with large variations in texture, body shape, camera positioning, and pose actions. The dataset contains information about depth maps, body parts, optical flow, and 2D and 3D joints. Figure 24 shows examples of images, environments, and actions found in the SURREAL dataset.

Figure 24 – Example images found in the SURREAL dataset. Source: The Author.

The Human3.6M dataset is composed of images of real people obtained with a marker-based motion capture system, containing scenes of 11 professional actors recorded in a controlled environment. The dataset has about 3.6 million annotations of 3D poses, considering four different camera angles, and offers three evaluation protocols with different data for training and testing. Figure 25 shows some examples of images present in the Human3.6M dataset, showcasing the environment, some of the actions, and some of the actors.

Figure 25 – Example images found in the Human3.6M dataset. Source: The Author.

6.6 Metrics

The methods proposed and developed in this work were evaluated with the standard metrics applied in the literature to the 3D human pose estimation problem. In order to provide a comparison with previous works dealing with the same task, the following metrics were used:

MPJPE: the Mean Per-Joint Position Error represents the mean error, in millimeters, between the estimated points and the real points after root joint alignment;

P-MPJPE: the Procrustes-Aligned Mean Per-Joint Position Error represents the mean error, in millimeters, between the estimated points and the real points after Procrustes alignment.

7 Results and Discussion

In this chapter, we present the results obtained by the DUA (Domain Unified Approach) method in the experiments conducted according to the settings detailed in Chapter 6. DUA consists of three essential modules: pose conversion, pose uncertainty, and the domain-unified approach, as established in Section 5.1. In order to provide a comprehensive analysis, we show the results of each module in its respective section; by evaluating the performance of each module individually, we gain insights into its effectiveness and contribution toward accurate and robust pose estimation.

7.1 Unified Pose Representation

The pose conversion method underwent training for 100 epochs, and the converted poses were evaluated using the MPJPE (Protocol 1) and P-MPJPE (Protocol 2) metrics on the Human3.6M dataset. We conducted evaluations in two