UNIVERSIDADE ESTADUAL PAULISTA "JÚLIO DE MESQUITA FILHO"
Instituto de Geociências e Ciências Exatas
Câmpus de Rio Claro
Programa de Pós-Graduação em Ciência da Computação

Lucas Pascotti Valem

Contextual Similarity Learning for Image Retrieval and Classification: Applications in Person Re-Identification

Doctoral dissertation presented to the Instituto de Geociências e Ciências Exatas, Câmpus de Rio Claro, Universidade Estadual Paulista "Júlio de Mesquita Filho", in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Advisor: Prof. Dr. Daniel Carlos Guimarães Pedronette

Rio Claro - SP
2024

Examination Committee
• Prof. Dr. Daniel Carlos Guimarães Pedronette (Advisor), Instituto de Geociências e Ciências Exatas (IGCE), Universidade Estadual Paulista - UNESP
• Profa. Dra. Agma Juci Machado Traina, Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo - USP
• Prof. Dr. Hélio Pedrini, Instituto de Computação (IC), Universidade Estadual de Campinas - UNICAMP
• Prof. Dr. João Paulo Papa, Faculdade de Ciências (FC), Universidade Estadual Paulista - UNESP
• Prof. Dr. Wallace Correa de Oliveira Casaca, Instituto de Biociências, Letras e Ciências Exatas (IBILCE), Universidade Estadual Paulista - UNESP

Result: Approved.
Rio Claro (SP), June 28, 2024.

Acknowledgements

First and foremost, I thank God for the gift of life and health. My parents and family, for their love and unconditional support. My advisor, for the guidance, support, and trust. All the professors, both national and international, who contributed to this research. The university and all the faculty members of the graduate program. Friends and colleagues, for the encouragement and support. The São Paulo Research Foundation (FAPESP), Fulbright, and Petrobras for the financial support.
Abstract

The exponential growth of image collections has driven a significant increase in the use of machine learning and image retrieval applications across various scenarios. Despite relevant advances, many methods still rely heavily on large volumes of labeled data for training, which poses an important obstacle, since producing labeled data is generally expensive and time-consuming. To address this challenge, numerous techniques have been developed recently. A critical aspect of these approaches is effectively defining image similarity, which remains a central challenge in retrieval and machine learning applications, such as classification. The core of this issue is intrinsically linked to how information is represented and to the methods used to compare these representations. A major limitation is that most of them still rely on pairwise measures, ignoring other meaningful information present in the neighborhood that could be used to further improve the results. This work focuses on improving the effectiveness of content-based image retrieval and classification tasks using contextual similarity, moving beyond traditional pairwise measures to exploit relationships among elements.
Contextual similarity learning is employed to capture underlying relationships among elements, using techniques such as rank-based models, contextual measures, graphs, and hypergraphs to model contextual information effectively. This dissertation proposes seven novel methods, applied to both general-purpose and person re-identification (Re-ID) scenarios, each addressing different contributions. Three main tasks were considered: query performance prediction, image retrieval, and image classification. A wide experimental evaluation was conducted, totaling 17 datasets and more than 50 visual image descriptors. When compared with state-of-the-art and recent baselines, the proposed methods achieve results that are comparable to or surpass those of existing approaches in most cases.

Keywords: Contextual Similarity Information; Image Retrieval; Image Classification; Query Performance Prediction; Person Re-ID; Representation Learning.

List of Figures

Figure 1.1 – Overview of goals and contributions and how contextual similarity is exploited.
Figure 1.2 – Dissertation structure: organization, main concepts, proposed approaches, and publications.
Figure 2.1 – Typical architecture of a CBIR system. Figure adapted from [316].
Figure 2.2 – Overview of unsupervised similarity learning workflow for image retrieval.
Figure 2.3 – Overview of unsupervised similarity learning applied for rank-aggregation in image retrieval.
Figure 2.4 – General diagram of a Re-ID system. Figure adapted from [25].
Figure 2.5 – Categorization of graph learning approaches. Figure adapted from [369].
Figure 2.6 – Hypergraph illustration.
Figure 2.7 – Incidence matrix H example.
Figure 4.1 – Example of precision × recall curve.
Figure 5.1 – Illustration of a confusion matrix of probabilities between classes.
Figure 5.2 – Diagram illustrating the main stages of the DRNE.
Figure 5.3 – Illustration that exemplifies the calculation of a contextual image constructed from a hypothetical ranked list.
Figure 5.4 – Examples of images generated for synthetic ranked lists with different degrees of effectiveness.
Figure 5.5 – Proposed CNN model for effectiveness prediction.
Figure 5.6 – Diagram of the proposed approach (RQPPF) for self-supervised query performance prediction.
Figure 5.7 – Losses along training epochs for train and validation sets.
Figure 5.8 – Correlation of MAP and effectiveness estimation measures on DukeMTMC.
Figure 5.9 – Two examples of ranked lists (good and bad queries) for Duke dataset and OSNET-AIN descriptor.
Figure 5.10 – Impact of parameters on Pearson correlation between MAP and our approach on MPEG-7 dataset.
Figure 5.11 – Proposed approach against MAP on MPEG-7 dataset (all descriptors). Pearson Correlation = 0.8977.
Figure 5.12 – Two examples of RQPPF results on ranked lists of Market dataset (CNN-HACNN descriptor).
Figure 6.1 – Illustrative example of original Jaccard index limitation.
Figure 6.2 – Query on Holidays with results for RBO and JacMax.
Figure 6.3 – Visual example of fusion result on Market dataset.
Figure 7.1 – Overview of the HRSF proposed approach.
Figure 7.2 – Evaluation of the impact of parameter k on MAP and R1 for Market1501 dataset.
Figure 7.3 – Evaluation of the HQPP measure compared to the MAP on DukeMTMC dataset.
Figure 7.4 – Average MAP of top pairs on CUHK03 dataset.
Figure 7.5 – Average MAP of top pairs on the Market dataset.
Figure 7.6 – Average MAP of top pairs on Duke dataset.
Figure 7.7 – Selected Combination (among top-5) on Market considering MAP.
Figure 7.8 – Selected Combination (among top-5) on Duke considering MAP.
Figure 7.9 – Distance distribution for two query images on DukeMTMC dataset.
Figure 7.10 – Examples to illustrate the impact of HRSF selection and fusion on the CUHK03 dataset.
Figure 7.11 – Examples to illustrate the impact of HRSF selection and fusion on the DukeMTMC dataset.
Figure 8.1 – Overall organization of Rank Flow Embedding (RFE).
Figure 8.2 – Impact of parameter α in function σ (Equation 8.4) as the rank position varies.
Figure 8.3 – Impact of parameters α and T (number of iterations) on MAP for two datasets.
Figure 8.4 – Ablation study for RFE on 6 datasets considering two descriptors each.
Figure 8.5 – Feature space illustrations for RFE embeddings computed by t-SNE on Flowers dataset with CNN-ResNet descriptor.
Figure 8.6 – Examples of ranked lists before and after RFE was applied on 3 different datasets.
Figure 9.1 – Workflow of our proposed Manifold-GCN framework for image classification. The steps of the approach are numbered.
Figure 9.2 – Impact of manifold learning approaches on F-measure results considering GCN-SGC on different datasets and features.
Figure 9.3 – t-SNE visualizations showing improved feature space using manifold learning and reciprocal graph on the Flowers dataset.
Figure 10.1 – MiniImageNet images used as references for the bidimensional space plots.
Figure 10.2 – Bidimensional space for similar and dissimilar images on the MiniImageNet dataset.
Figure 10.3 – Workflow of the steps of the proposed approach.
Figure 10.4 – Accuracy (%) on the test set for different batch sizes.
Figure 10.5 – Accuracy (%) on the test set across epochs comparing SupCon to CCL.
Figure 10.6 – t-SNE visualization for 9 classes comparing the features of the original method to CCL on the Food101 dataset with 20% of training data.
Figure 11.1 – RFE and JaccardMax relative gains (%) over MAP of descriptors.
Figure 11.2 – Relative gains (%) obtained by CCL in comparison to SupCon for different train/test splits.
Figure 11.3 – Published and submitted collaborations and their connection to terms and concepts related to this dissertation.

List of Tables

Table 2.1 – Examples of traditional distance measures.
Table 2.2 – Summary of concepts and terminologies discussed for Re-ID.
Table 3.1 – State-of-the-art methods in Re-ID with results of MAP (%) and R1 (%).
Table 4.1 – General-purpose datasets used in the experimental evaluation.
Table 4.2 – Descriptors used for general-purpose datasets.
Table 4.3 – Re-ID datasets used in the experimental evaluation.
Table 4.4 – Values of MAP and R1 for each Re-ID descriptor on each dataset.
Table 4.5 – Datasets used to evaluate each of the proposed methods, categorized by task and type of supervision.
Table 5.1 – Pearson correlation between MAP and effectiveness estimation measures on Flowers dataset.
Table 5.2 – Pearson correlation between estimation measures for all descriptors of MPEG-7 dataset.
Table 5.3 – Pearson correlation between MAP and effectiveness estimation measures on datasets considering train with k = 20.
Table 5.4 – Pearson correlation between our proposed RQPPF and MAP considering different regression models and measures.
Table 5.5 – Relative gains obtained by RQPPF using the Authority estimation measure for modeling the features.
Table 5.6 – Relative gains obtained by RQPPF using the Reciprocal estimation measure for modeling the features.
Table 5.7 – Comparing RQPPF and DRNE to baselines. Pearson correlation between MAP and effectiveness estimations is reported.
Table 5.8 – Pearson correlation between MAP and combinations of methods on Re-ID datasets.
Table 6.1 – Re-ranking results considering MAP (%).
Table 6.2 – Rank-aggregation results for different measures.
Table 6.3 – State-of-the-art on Holidays dataset (MAP).
Table 6.4 – State-of-the-art on UKBench dataset (N-S Score).
Table 6.5 – Comparison with person Re-ID baselines.
Table 7.1 – Table of symbols used in the definition of HRSF [331].
Table 7.2 – The best selected combination of each size (among the top-5) is reported on each dataset.
Table 7.3 – Proposed approach compared to early and late fusion baselines.
Table 7.4 – State-of-the-art comparison considering MAP (%) and R-01 (%).
Table 7.5 – State-of-the-art methods ranked by their results.
Table 8.1 – Retrieval results of RFE on general-purpose image datasets (Flowers, Corel5k, and ALOI) considering MAP (%).
Table 8.2 – Retrieval results of RFE on the Holidays dataset considering MAP (%).
Table 8.3 – Retrieval results of RFE on the UKBench dataset for both NS-Score and MAP.
Table 8.4 – Retrieval results of RFE on 3 Re-ID datasets (CUHK03, Market, and Duke) considering both R1 and MAP.
Table 8.5 – Semi-supervised classification (accuracy) on Flowers dataset using RFE embeddings for different input features.
Table 8.6 – Semi-supervised classification (accuracy) on Corel5k dataset using RFE embeddings for different input features.
Table 8.7 – Evaluation of RFE on unseen queries considering MAP (%).
Table 8.8 – State-of-the-art (SOTA) comparison with other variants of diffusion processes on the ORL (R@15) and the MPEG-7 (R@40) datasets.
Table 8.9 – State-of-the-art comparison on Flowers, Corel5k, and ALOI datasets (MAP %).
Table 8.10 – State-of-the-art comparison on Holidays dataset (MAP).
Table 8.11 – State-of-the-art comparison on UKBench dataset (NS-Score).
Table 8.12 – State-of-the-art (SOTA) comparison on person Re-ID datasets considering MAP (%) and R-01 (%).
Table 8.13 – Accuracy comparison (%) for baselines on Flowers and Corel5k datasets. The RFE is compared with semi-supervised classification baselines.
Table 9.1 – Impact of manifold learning approaches and Reciprocal Graph on the classification accuracy of 5 different GCN models on Flowers dataset.
Table 9.2 – Impact of manifold learning approaches and Reciprocal Graph on the classification accuracy of 5 different GCN models on Corel5k dataset.
Table 9.3 – Impact of manifold learning approaches and Reciprocal Graph on the classification accuracy of 5 different GCN models on CUB200 dataset.
Table 9.4 – Results (%) for GCN-SGC on CUHK03 dataset.
Table 9.5 – Results (%) on Market1501 dataset.
Table 9.6 – Results (%) for GCN-SGC on DukeMTMC dataset.
Table 9.7 – Manifold-GCN compared to baseline approaches on Flowers, Corel5k, and CUB200 datasets.
Table 9.8 – Execution time (in seconds) for manifold learning methods and GCN approaches for both training and testing.
Table 10.1 – Neural network architecture and default hyperparameters utilized in the evaluation.
Table 10.2 – Impact of batch size on accuracy (%) on Food101 dataset, considering a split of 20% for training.
Table 10.3 – Impact of parameter k (neighborhood size) on accuracy (%). Results highlighted in gray deviate less than 0.20 from the best value in bold.
Table 10.4 – Accuracies (%) achieved for 100 epochs of training, comparing the proposed CCL with other contrastive losses on three datasets.
Table 10.5 – Accuracies (%) achieved on the Food101 dataset when comparing the proposed CCL against SupCon [143], for different training epochs.
Table 11.1 – Relative gains of DRNE and RQPPF when compared to Authority and Reciprocal Density.
Table 11.2 – Comparison between the proposed approaches on person Re-ID considering MAP (%) and R-01 (%).
Table 11.3 – Proposed approaches on person Re-ID ranked according to their effectiveness (R1 and MAP).
Table 11.4 – Proposed approaches compared to state-of-the-art on person Re-ID considering MAP (%) and R-01 (%).
Table 11.5 – State-of-the-art (SOTA) comparison on Holidays dataset (MAP).
Table 11.6 – State-of-the-art (SOTA) comparison on UKBench dataset (NS-Score).
Table 11.7 – Accuracy comparison (%) for baselines on Flowers and Corel5k datasets. The RFE and Manifold-GCN are compared with classification baselines.
Table 11.8 – Research questions addressed by each of the proposed approaches.
Table 11.9 – Future work related to each of the proposed approaches.

List of Abbreviations and Acronyms

ACC – Color Autocorrelogram Descriptor
ACF – Aggregated Channel Features Detector
AF – Attention Features
AIN – Adaptive Instance Normalization
AIR – Articulation-Invariant Representation Descriptor
ALOI – Amsterdam Library of Object Images
ANML – Adaptive Neighborhood Metric Learning
AP – Average Precision
APPNP – Approximate Personalized Propagation of Neural Predictions
ARMA – Auto-Regressive Moving Average Filter Convolution
ARN – Adaptation and Re-Identification Network
ASC – Aspect Shape Context Descriptor
ATNET – Adaptive Transfer Network
BAS – Beam Angle Statistics Descriptor
BFS – Breadth-First Search
BFSTREE – Breadth-First Search Tree
BIC – Border/Interior Pixel Classification Descriptor
BOVW – Bag of Visual Words
BOW – Bag of Words
CAMEL – Cross-view Asymmetric Metric LEarning
CAP – Camera-aware Proxies
CBIR – Content-Based Image Retrieval
CC – Connected Components
CCL – Contextual Contrastive Loss
CCOM – Color Co-Occurrence Matrix Descriptor
CEDD – Color and Edge Directivity Descriptor
CFD – Contour Features Descriptor
CG – Correlation Graph
CIFAR – Canadian Institute For Advanced Research
CLD – Color Layout Descriptor
CMC – Cumulative Matching Characteristics
CNN – Convolutional Neural Networks
COMO – Compact Composite Moment-Based Descriptor
CPRR – Cartesian Product of Ranking References
CPU – Central Processing Unit
CSGLP – Camera Style Generation and Label Propagation
CSRT – Discriminative Correlation Filter with Channel and Spatial Reliability
CUB200 – Caltech-UCSD Birds Dataset
CUHK – Dataset from the Chinese University of Hong Kong
DAAM – Domain Adaptive Attention Model
DCNN – Deep Convolutional Neural Networks
DIDAL – Discriminative Identity-Feature Exploring and Differential Aware Learning
DPM – Deformable Parts Model
DPNET – Dual Path Network
DRNE – Deep Rank Noise Estimator
DUKEMTMC – Duke Multi-Tracking Multi-Camera Dataset
EANET – Enhancing Alignment Network
ECN – Exemplar Memory Convolutional Network
EHD – Edge Histogram Descriptor
ELF – Ensemble of Localized Features
EMTL – Enhanced Multi-Dataset Transfer Learning
FBRESNET – Facebook Residual Neural Network
FCTH – Fuzzy Color and Texture Histogram
FN – False Negatives
FOH – Fuzzy Opponent Histogram
FP – False Positives
GAN – Generative Adversarial Network
GAT – Graph Attention Networks
GB – Gigabytes
GBICOV – Covariance Descriptor Based on Bio-inspired Features
GCN – Graph Convolutional Network
GCN-APPNP – Approximate Personalized Propagation of Neural Predictions GCN
GCN-ARMA – Auto-Regressive Moving Average Filter Convolution GCN
GCN-GAT – Graph Attention Networks GCN
GCN-SGC – Simple Graph Convolution GCN
GDP – Graph Diffusion Process
GIST – Global Image Descriptor for Low-dimensional Features
GNN – Graph Neural Network
GNN-KNN-LDS – KNN variation of GNN-LDS
GNN-LDS – Learning Discrete Structures for Graph Neural Networks
GOG – Gaussian Of Gaussians Descriptor
GPU – Graphics Processing Unit
GRAD-NET – Graph Diffusion Network
GRID – UnderGround Re-IDentification (GRID) Dataset
GS – Graph Sampling
GSP – Graph Signal Processing
GSSL – Graph-based Semi-Supervised Learning
HACNN – Harmonious Attention Network
HCT – Hierarchical Clustering with Hard-batch Triplet Loss
HHL – Hetero and Homogeneously Learning
HLBP – Histogram of Local Binary Patterns
HQPP – Hypergraph Query Performance Prediction
HRSF – Hypergraph Rank Selection and Fusion
HSV – Color Space: Hue, Saturation, Value
IBN – Instance-Batch Normalization
ICE – Inter-Instance Contrastive Encoding
ICS – Intra-Camera Supervise
ID – Identifier
IDSC – Inner Distance Shape Context
IICS – Intra-inter Camera Similarity
IR – Information Retrieval
ISSDA – Iterative Self-Supervised Domain Adaptation
JCD – Joint Composite Descriptor
JVCT – Joint Generative and Contrastive Learning
KCF – Kernelized Correlation Filters
KISSME – Keep-it-simple-and-straightforward Distance Learning
KNN – K Nearest Neighbors
LAS – Local Activity Spectrum
LBP – Local Binary Patterns
LCDP – Locally Constrained Diffusion Process
LDA – Linear Discriminant Analysis
LDFV – Local Descriptors Encoded by Fisher Vector
LDS-GNN – Learning Discrete Structures for Graph Neural Networks
LGBM – Light Gradient Boosting Machine
LHRR – Log-based Hypergraph of Ranking Reference
LMNN – Large Margin Nearest Neighbor Learning
LOMO – Local Maximal Occurrence Descriptor
LS – Label Spreading
LSTM – Long Short-Term Memory
MAM – Memory Access Method
MAP – Mean Average Precision
MAR – MultilAbel Reference Learning
MATE – Multi-Task Multi-Label
MCFS – Multi-cluster Feature Selection
MCRN – Multi-Centroid Representation Network
MGCE-HCL – Multi-Granularity Clustering Ensemble-based Hybrid Contrastive Learning
MGH – Metadata Guided Hypergraph
ML – Machine Learning
MLFN – Multi-Level Factorisation Network
MMCL – Memory-based Multi-label Classification Loss
MOSSE – Minimum Output Sum of Squared Error
MPEG – Moving Picture Experts Group
MR – Manifold Ranking
MSE – Mean Squared Error
MSMT – Multi-Scene Multi-Time Re-ID Dataset
NASNET – Neural Architecture Search Network
NDFS – Non-negative Discriminative Feature Selection
NET – Network
NMF – Non-negative Matrix Factorization
NNCLR – Nearest-Neighbor Contrastive Learning of Visual Representations
NP-HARD – Nondeterministic Polynomial-time Hard
NS – Abbreviation of NS-Score
NS-SCORE – Score for UKBench Dataset
O2CAP – Offline-Online Associated Camera-Aware Proxies
OLDFP – Object Level Deep Feature Pooling
OOD – Out-of-Domain
OPF – Optimum-Path Forest
ORL – Our Database of Faces
OSNET – Omni-Scale Feature Learning Neural Network
OSNET-AIN – OSNET with Adaptive Instance Normalization
OSNET-IBN – OSNET with Instance-Batch Normalization
PAF – Part Association Field
PAUL – Patch-Based Unsupervised Learning Framework
PCA – Principal Component Analysis
PHOG – Pyramidal Histogram of Oriented Gradients
PIF – Part Intensity Field
PK-SAMPLER – Random Sampling Method in Re-ID
QPP – Query Performance Prediction
RAM – Random Access Memory
RBF – Radial Basis Function Kernel
RBO – Rank-Biased Overlap
RDNN – Residual Dense Neural Network
RDP – Regularized Diffusion Process
RDPAC – Rank Diffusion Process with Assured Convergence
RE-ID – Person Re-Identification
RESNET – Residual Neural Network
RFE – Rank Flow Embedding
RGB – Red, Green, Blue
RL-SIM – Ranked Lists Similarity Approach
RLCC – Refining Pseudo Labels with Clustering Consensus
RQPPF – Regression for Query Performance Prediction Framework
SCC – Strongly Connected Components
SCD – Scalable Color Descriptor
SCH – Simple Color Histogram
SD – Self-diffusion for Image Segmentation and Clustering
SDC – Scale-invariant Feature Transform Dense Color
SENET – Squeeze-and-Excitation Network
SGC – Simple Graph Convolution
SGD – Stochastic Gradient Descent
SIFT – Scale-Invariant Feature Transform
SORT – Simple Online and Realtime Tracking
SOTA – State-of-the-art
SP – Spatial Pyramid
SPACC – Spatial Pyramid Color Autocorrelogram Descriptor
SPCEDD – Spatial Pyramid Color and Edge Directivity Descriptor
SPEC – Spectral Regression
SPFCTH – Spatial Pyramid Fuzzy Color and Texture Histogram
SPGAN – Similarity Preserving Generative Adversarial Network
SPJCD – Spatial Pyramid Joint Composite Descriptor
SPLBP – Spatial Pyramid Local Binary Patterns
SS – Segment Saliences
SSL – Softened Similarity Learning Approach
STF – Swin-Transformer
SURF – Speeded-Up Robust Features
SVD – Singular Value Decomposition
SVM – Support Vector Machines
SVR – Support Vector Regression
SWIN-TF – Swin-Transformers
T-SNE – t-Distributed Stochastic Neighbor Embedding
TAUDL – Tracklet Association Unsupervised Deep Learning
TN – True Negatives
TP – True Positives
TPG – Tensor Product Graph
UDA – Unsupervised Data Augmentation
UDLF – Unsupervised Distance Learning Framework
UGAF-RSF – Unsupervised Genetic Algorithm Framework for Rank Selection and Fusion
UKBENCH – University of Kentucky Dataset
USRF – Unsupervised Selective Rank Fusion Method
UTAL – Unsupervised Tracklet Association Learning
VAL-PAT – Framework for Transferable Representations of Pedestrians
VGGNET – Visual Geometry Group Network
VIT – Vision Transformer
VOC – Vocabulary Tree
VRAM – Video Random Access Memory
WHOS – Weighted Histograms of Overlapping Stripes
WSEF – Weakly Supervised Experiments Framework
YOLO – You Only Look Once Object Detection Network

List of Symbols

A(i) – Set of all elements in a batch, except the image of index i.
C – Number of virtual classes in the synthetic scenario.
Cn – Set of combinations where each combination is of size n.
E – Set of edges of a graph.
Eh – Set of hyperedges of a hypergraph.
G – A graph.
H – A hypergraph model or the number of feature maps (or hidden units) in the hidden layer of a GCN.
I – The set of indices for all augmented samples in a batch.
L – Size of ranked lists.
M – Confusion matrix of probabilities between classes.
Mc – Confusion matrix of probabilities between elements of the same class.
Mf – Sparse matrix used by RFE to accumulate normalized scores from different rankers.
N – The size of the collection C, i.e., dataset size.
NNk(i) – The set of k nearest neighbors of image i.
NNYk(i) – A subset of NNk(i) containing only images from the same class of image i.
Nb – Number of image pairs in a training batch.
P(i) – The set of indices of all positive samples in the batch distinct from image i.
Ri – Ranker of index i.
S – Selection set of all possible combinations of rankers.
Sp – Selection set of pairs of rankers.
T, t – Number of iterations.
V – Set of graph vertices.
VL – Set of labeled nodes in the graph.
VU – Unlabeled subset of the node set.
α – Constant for the normalization equation of RFE.
β – Weight or relevance of correlation in the selection measure.
zi – The embedding of the data sample i generated by the metric learning model.
◦ – Hadamard (element-wise) product.
δ – A function that computes the distance between two feature vectors.
ϵ – A function that extracts a feature vector from an image.
ηf – Fused affinity measure used for rank aggregation.
ηr(i, x) – Function that assigns a weight to image x according to its position in τi.
γ – An effectiveness estimation measure.
γA – Authority effectiveness estimation measure.
γR – Reciprocal Density effectiveness estimation measure.
Â – Normalized adjacency matrix of a graph.
λ – A correlation measure (e.g., RBO).
R – The set of real numbers.
A – Affinity matrix (RFE) or adjacency matrix (GCNs).
C – Similarity measure matrix based on Cartesian product.
D – Distance matrix.
HG – HRSF hypergraph model.
H – Incidence matrix for HRSF or matrix encoding the similarity information of h-embeddings for RFE.
I – Identity matrix.
S – Similarity matrix.
W – Affinity matrix (HRSF) or weight matrix in the definition of GCNs.
X – Feature vectors provided as input to the GCN.
Z – Matrix of embeddings learned by the GCN model.
b – Reciprocal neighborhood binary vector used in the computation of RQPPF meta-features.
bi – Reciprocal neighborhood binary vector for image i.
ci – Connected component of index i.
cq – CC-embedding of a connected component q computed by RFE.
ei – Representation vector (embedding) of the element of index i from the dataset.
f – Contextual rank-based feature (meta-feature) vector.
fs – Set of synthetic features used for training the regression model.
ft – Set of test features used for testing the regression model.
hi – Row i of matrix H, named h-embedding.
p – Reciprocal rank position vector used in the computation of RQPPF meta-features.
pi – Reciprocal rank position vector for image i.
q – Effectiveness estimation vector used in the computation of RQPPF meta-features.
qi – Effectiveness estimation vector for image i.
xi – Feature vector representing the image oi, a row of matrix X.
zi – Row i of matrix Z, embedding representation for the node vi.
C – Image dataset.
CL – Set containing the L most similar images to image oq in the collection C.
Ec – Set of candidate edges defined by RFE.
Lccl – Proposed contextual contrastive loss.
Lsup – Supervised contrastive loss.
N – Neighborhood set.
N(oq, k) – Neighborhood set containing the k most similar elements to oq.
Nr(oq, k) – Reciprocal neighborhood set for image oq.
Nr – Reciprocal neighborhood set.
R – Set of rankers.
S – Set of connected components.
T – Set of ranked lists for all the images in the dataset.
Ti – Set of ranked lists produced by ranker Ri.
Tj – The set of ranked lists produced by the ranker Rj.
X – A subset of C.
Y – A set of labels (classes).
R – Set of rankers provided as input to the method.
X∗ – Selected combination among all sizes.
X∗n – Selected combination composed of n rankers.
Xn – Candidate combination composed of n rankers.
µ – Constant used in RBO correlation measure.
ϕ – Regression model for query performance prediction.
ψ – The contrastive loss temperature parameter.
ρ – A similarity measure.
σ – Normalization function used in RFE.
τRn – Ordered list of combinations of size n. Also referred to as the selection list.
τRn(Xin) – Position of the combination Xin in the selection list τRn.
τq – Ranked list of image q.
τq(i) – The position of image oi in the ranked list τq.
τi,q – Ranked list of image of index q calculated by ranker i.
τi – Ranked list of image of index i.
τq,f(i) – Position of image oi in the ranked list of oq according to feature f.
τq(i) – Position of the image i in the ranked list of query image q.
xℓ – The ℓ-th image in the batch.
x̃ℓ – The ℓ-th augmented image in the batch.
yℓ – The label corresponding to the ℓ-th image.
ỹℓ – The label corresponding to the ℓ-th augmented image.
Ã – Adjusted adjacency matrix, A + I.
D̃ – Degree matrix of Ã.
× – Multiplication operator.
ξ – The current epoch number.
ξtotal – The total number of epochs.
aij – Entry in the adjacency matrix indicating the presence (1) or absence (0) of an edge between vertices oi and oj.
c – Number of classes (or categories).
c(i, j) – Element of matrix C.
cp – Pairwise similarity relationship based on Cartesian product.
d – Number of vector dimensions.
de – The dimensionality of the RFE embedding space in which each object is represented.
ei – A hyperedge of index i.
fg – Function that, given a hypergraph and an incidence matrix, calculates a graph (RFE).
fh – Function that, given ranked lists, calculates a hypergraph and an incidence matrix by re-ranking through hypergraph embeddings (RFE).
fm – Manifold learning function that processes a set of ranked lists T.
fp(oq, i) – Function returning the i-th neighbor of image q.
fr – Function representing unsupervised similarity learning.
fs – Function for ranker selection.
fgcn – Function representing the graph convolutional network model.
h(ei, vj) – Reliance of vertex vj to belong to a hyperedge ei.
hp(eq) – Weight of hyperedge eq.
hij – An element of H representing the similarity of object oj in the context of the hyperedge ei.
k – Size of the neighborhood set.
kd – A variable representing a specific depth for computing a correlation measure.
kstart – The initial value of k for the first epoch.
kv – Size of virtual classes for synthetic data.
m – Number of features, i.e., size of the set R.
n – Size of a combination.
nk – Number of candidate edges for RFE graph.
oi – Indicates any object (element) belonging to the dataset, whose index is i.
obji – Object of index i, often abbreviated as oi.
p – Pairwise relationship function defined by RFE.
px – Pixel of position x in a grayscale image.
sc – RFE similarity measure attributed to pairs based on the similarity between h-embeddings and confidence of the hyperedge.
tc – Threshold for edge computation in the connected components stage of the RFE.
thend – Final threshold of the Correlation Graph.
thinc – Correlation Graph threshold increment.
thstart – Initial threshold of the Correlation Graph.
vi – A node in the node set V representing an image oi.
vl – A labeled node.
w – Selection measure for combinations of rankers.
w(ei) – A positive weight assigned to a hyperedge ei.
wp(i, x) – A weight function that assigns relevance to a vertex ox based on its position in a ranked list.
wp – Selection measure for pairs of rankers proposed by HRSF.
Th(T) – Set of ranked lists after T iterations of RFE.
yi – Label (class) of object oi, i.e., i-th row of Y.

Contents

1 INTRODUCTION
1.1 Motivation
1.2 Research Challenges
1.3 Dissertation Statement
1.4 Goals and Contributions
1.5 Organization
2 BACKGROUND
2.1 Machine Learning and Categories of Supervision
2.2 Content-Based Image Retrieval (CBIR)
2.2.1 Feature Extraction and Ranking
2.2.2 Unsupervised Similarity Learning for Re-Ranking
2.2.3 Formal Definitions and Notations
2.3 Feature Selection and Fusion
2.3.1 Query Performance Prediction
2.3.2 Rank Correlation Measures
2.3.3 Rank-Aggregation
2.4 Person Re-Identification
2.4.1 Concepts and Terminologies
2.5 Graph-Based Semi-Supervised Classification
2.5.1 Graph Convolutional Networks (GCNs)
2.5.2 Formal Definitions and Notations
2.6 Hypergraph Model
2.6.1 Formal Definitions and Notations
3 RELATED WORK
3.1 Similarity Learning in Image Retrieval
3.2 Person Re-Identification
3.2.1 Feature Extraction
3.2.2 Metric Learning
3.2.3 Evolution of the State-of-the-Art
3.3 Query Performance Prediction
3.4 Semi-Supervised Classification and Graph Convolutional Networks
3.5 Contrastive Learning
4 EXPERIMENTAL PROTOCOL
4.1 Effectiveness Measures
4.1.1 Retrieval
4.1.2 Query Performance Prediction
4.1.3 Classification
4.2 Datasets and Descriptors
4.2.1 General-Purpose
4.2.2 Person Re-Identification
4.2.3 Summary and Discussion
5 SELF-SUPERVISED CONTEXTUAL EFFECTIVENESS ESTIMATION MEASURES
5.1 Synthetic Data Generation
5.2 Deep Rank Noise Estimator (DRNE)
5.2.1 Computing Contextual Images from Ranked Lists
5.2.2 Denoising Convolutional Neural Network for Effectiveness Estimation
5.3 Regression for Query Performance Prediction Framework (RQPPF)
5.3.1 Background Formulation
5.3.2 Contextual Rank-based Features
5.3.3 Regression Models
5.4 Experimental Evaluation
5.4.1 Experimental Protocol
5.4.2 DRNE Parameter Analysis
5.4.3 DRNE Results
5.4.4 RQPPF Parameter Analysis
5.4.5 RQPPF Results
5.4.6 Joint Comparison and Discussion
6 RANK CORRELATION MEASURES FOR MANIFOLD LEARNING ON IMAGE RETRIEVAL
6.1 Proposed Method
6.1.1 Jaccard Max Definition
6.1.2 Application on Manifold Learning
6.2 Experimental Evaluation
6.2.1 Experimental Protocol
6.2.2 Results
6.2.3 Visual Analysis
7 HYPERGRAPH RANK SELECTION AND FUSION (HRSF)
7.1 Proposed Method
7.1.1 Unsupervised Ranker Selection
7.1.2 Hypergraph Query Performance Prediction
7.1.3 Hypergraph Manifold Rank Aggregation
7.2 Experimental Evaluation
7.2.1 Experimental Analysis
7.2.2 Comparison with Fusion Baselines
7.2.3 State-of-the-Art
7.2.4 Visual Results
8 RANK FLOW EMBEDDING (RFE)
8.1 Proposed Method
8.1.1 Formal Definition for Rank-based Manifold and Representation Learning
8.1.2 Rank Normalization by Reciprocal Sigmoid
8.1.3 Re-Ranking by Hypergraph Embeddings
8.1.4 Re-Ranking by Cartesian Product
8.1.5 Graph over Hypergraph and Connected Components
8.1.6 Embeddings for Classification
8.1.7 Unseen Queries
8.1.8 Rank Aggregation
8.2 Experimental Evaluation
8.2.1 Experimental Protocol
8.2.2 Parametric Space Analysis
8.2.3 Ablation Study
8.2.4 Retrieval Results
8.2.5 Classification Results
8.2.6 Unseen Queries
8.2.7 Comparison with State-of-the-art for Unsupervised Image Retrieval
8.2.8 Comparison with State-of-the-art for Semi-Supervised Image Classification
8.2.9 Visual Analysis
9 CONTEXTUAL MANIFOLD LEARNING ON GRAPH CONVOLUTIONAL NETWORKS (MANIFOLD-GCN)
9.1 Proposed Method
9.1.1 Similarity Measurement and Ranking Model
9.1.2 Unsupervised Manifold Learning
9.1.3 Graph Building
9.1.4 Graph Convolutional Networks
9.1.5 GCN Models
9.1.6 Manifold Learning Methods
9.2 Experimental Evaluation
9.2.1 Experimental Protocol
9.2.2 Classification Results
9.2.3 Person Re-ID Results
9.2.4 Visualization Results
9.2.5 Comparison with Other Approaches
9.2.6 Efficiency Results
10 CONTEXTUAL CONTRASTIVE LOSS (CCL)
10.1 Supervised Contrastive Loss
10.2 Proposed Method
10.2.1 Pairwise Similarity and Contextual Information
10.2.2 Neighborhood Definition
10.2.3 Contextual Similarity and Symmetry Discussion
10.2.4 Neighborhood Size and Logarithmic Decay
10.2.5 Proposed Contextual Contrastive Loss (CCL)
10.2.6 Proposed Training Workflow
10.3 Experimental Evaluation
11 CONCLUSIONS
11.1 Discussion of Results
11.1.1 Query Performance Prediction
11.1.2 Image Retrieval
11.1.3 Image Classification
11.2 Contributions and Research Questions
11.3 Publications and International Fellowship
11.4 Code Availability
11.5 Future Work
Bibliography

1 Introduction

Effectively defining the similarity between images is a central challenge in retrieval and machine learning applications. This issue is deeply connected to (i) how information is represented and (ii) the measures used to compare these representations [321, 294, 371, 250]. This work presents contributions in both directions, proposing seven novel approaches.
This dissertation discusses and presents contributions aimed at improving the effectiveness of content-based image retrieval and classification tasks using contextual similarity learning. This introductory chapter provides an overview of this work and is organized as follows: Section 1.1 discusses the motivations of the conducted research. Section 1.2 presents the challenges and research questions addressed. Section 1.3 states the main hypothesis validated in this dissertation. Section 1.4 discusses the objectives and contributions of the study. Section 1.5 describes the overall structure of this document, including a summary of each chapter's content and an illustration of how concepts and terms relate to the contributions.

1.1 Motivation

In recent years, there has been an exponential increase in the volume of image data, primarily due to advancements in technologies for generating, storing, and sharing visual information [320, 80]. Additionally, numerous applications (e.g., surveillance cameras [390, 137, 426], medical imaging [341, 6, 1], remote sensing systems [160], social media [140]) generate vast amounts of visual data. In this scenario, image retrieval and machine learning tasks such as image classification are increasingly utilized in many applications [255]. Remarkable progress has been made in these methods, particularly due to the consistent evolution of deep learning [109, 31]. However, most of these methods are supervised and depend on large volumes of labeled data for training. Producing labeled data, in contrast, is challenging, since it is often expensive and time-consuming [91]; it may also require a specialist for labeling, depending on the specificity of the domain. Aiming to fill this gap, many unsupervised, semi-supervised, and even self-supervised approaches have been proposed to deal with this challenge [107]. In most of these methods, effectively modeling data is crucial for exploiting the information available in the unlabeled data.

For most approaches, the essence of learning hinges on the ability to model data accurately, which involves different concepts, in particular, representation approaches and distance or similarity measures [321]. This is especially important for Content-Based Image Retrieval (CBIR) systems, which retrieve images based on visual content rather than metadata [316]. These systems usually employ feature extraction and representation methods, which have evolved considerably [294], transitioning from traditional hand-crafted features [254] to more advanced deep learning approaches [294, 438], including Vision Transformers [77, 202]. However, most comparison tasks still rely on pairwise measures [294, 80], which do not exploit contextual information [245].

In general, the term context can be broadly understood as all the relevant information pertinent to an application and its users. This work considers the idea of contextual similarity, which consists of exploiting relationships beyond pairwise analysis, involving other elements, such as the neighborhood, or additional related information [246, 245]. The term contextual similarity learning denotes the learning process that employs contextual similarity to more effectively capture the underlying relationships among elements.
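Before turning to how contextual information can be modeled, it is useful to make the purely pairwise baseline concrete. The sketch below is an illustrative toy, not the implementation used in this dissertation, and all names are assumptions: it performs the standard CBIR ranking step by comparing feature vectors with a pointwise distance and sorting the results per query.

```python
import numpy as np

def ranked_lists(features: np.ndarray, top_l: int) -> np.ndarray:
    """Builds one ranked list per query using plain pairwise Euclidean distances.

    features: (n, d) matrix with one feature vector per image.
    Returns an (n, top_l) matrix: row q holds the indices of the top_l
    images most similar to query q (the query itself comes first).
    """
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2ab + ||b||^2.
    sq = (features ** 2).sum(axis=1)
    dists = sq[:, None] - 2.0 * (features @ features.T) + sq[None, :]
    # Each decision depends only on the pair (query, candidate);
    # no neighborhood information is taken into account.
    return np.argsort(dists, axis=1)[:, :top_l]

# Toy usage: 100 random "images" described by 64-dimensional vectors.
feats = np.random.default_rng(0).normal(size=(100, 64))
tau = ranked_lists(feats, top_l=10)
print(tau[0])  # ranked list of query 0
```

The contextual strategies discussed next operate precisely on the output of this step, redefining similarities from the ranked lists themselves.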
An essential aspect is that contextual similarity information can be modeled in many different forms, using different representations and structures [235, 416], among them:

(i) Graphs [86, 366]: they can be used to exploit the relationships between neighbors, which is a key aspect for understanding the local context and the influence among interconnected entities;

(ii) Ranked lists [246, 245]: in image retrieval, each ranked list contains the most similar elements for a given query. The similarities between elements can be redefined according to the analysis of the neighborhoods available in these lists. The position of each element in each list also carries valuable information;

(iii) Clustering [404, 373]: identifying and grouping data points that are similar according to predefined criteria. These groups allow the discovery of inherent patterns or relationships that may not be apparent upon initial observation.

Besides these approaches, a wide range of methodologies can still be proposed to exploit contextual information in numerous scenarios.

Similarity learning applied to retrieval is generally explored through re-ranking tasks. Despite the growing popularity of these methods, more robust structures have not yet been extensively employed in most cases. Structures that represent higher-order similarities, such as relationships among neighbors of neighbors, can be particularly advantageous. Hypergraphs, for example, allow edges to connect multiple vertices, offering a sophisticated technique for capturing these relationships [403, 251]. Additionally, most unsupervised re-ranking approaches [18, 282, 108] provide a new ranked list representation as output but do not produce new features in return, which could otherwise be used to encode contextual information for classifiers, for example.

Another application that could greatly benefit from the enrichment of contextual information is feature selection and fusion [260, 389, 424, 329, 327]. Among different strategies, the selection of features can be done through effectiveness estimation and correlation measures [329]. They build on the idea that fusion benefits from elements that have high effectiveness and are also complementary. The creation and usage of contextual structures for estimating effectiveness and measuring correlations is still an area that requires further research.

This work exploits contextual similarity learning for general-purpose image retrieval and person re-identification, usually abbreviated as person Re-ID. Person Re-ID is a surveillance application that has been gaining considerable attention and is nowadays of fundamental importance in many camera surveillance systems. The task consists of identifying individuals across multiple cameras that have no overlapping views [137]. A Re-ID system broadly consists of three main steps [426]: person detection, feature extraction, and person retrieval or matching. This work focuses on the final step, which can be viewed as a specific image retrieval application [137].
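As a minimal illustration of the ranked-list modeling in item (ii) above, and of the reciprocal-neighborhood analysis that reappears in the Re-ID discussion below, the following sketch replaces the pairwise score with a neighborhood-overlap score. It is a generic, hypothetical example, not one of the methods proposed in this dissertation.

```python
import numpy as np

def neighborhood_similarity(tau: np.ndarray, i: int, j: int, k: int) -> float:
    """Jaccard-style overlap between the top-k neighbor sets of images i and j.

    tau: (n, L) matrix of ranked lists (row q lists the neighbors of q by rank).
    Unlike a pairwise distance, this score depends on the neighborhoods of
    i and j, i.e., on other elements of the collection (their context).
    """
    ni, nj = set(tau[i, :k]), set(tau[j, :k])
    return len(ni & nj) / len(ni | nj)

def is_reciprocal(tau: np.ndarray, i: int, j: int, k: int) -> bool:
    """k-reciprocal test: i and j each appear in the other's top-k list."""
    return (j in tau[i, :k]) and (i in tau[j, :k])
```

Hypergraphs generalize this idea further: instead of scoring one pair at a time, each top-k list can define a hyperedge connecting k vertices at once, encoding higher-order relations such as neighbors of neighbors.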
Person Re-ID is a complex task that presents numerous difficulties [390, 137, 426], including: (i) varying angles of view between cameras; (ii) low-resolution images; (iii) changes in lighting conditions; (iv) occlusions blocking part of the view; (v) the difficulty of manually labeling images for use in training algorithms; (vi) unbalanced classes or classes with very few elements; (vii) the complexity of modeling data; and (viii) the extensive volume of data that needs to be processed.

To address these challenges, many approaches have introduced more robust deep learning models [390], such as Vision Transformers [111, 163, 221], metric learning [185, 156, 396], and Siamese networks [340, 339]. Other strategies include dataset expansion with augmentations [132, 291] or artificial data, considering appearance attributes, body parts, temporal information, and different types of multimodal information. Additionally, metric learning is often employed for Re-ID due to its capacity to remain effective when dealing with unseen data [185], since it focuses on learning distances or similarities rather than features specific to the training data. This allows the model to generalize better to new examples that were not present in the training set.

Beyond all these advancements, post-processing methods that exploit contextual information have gained significant attention due to their ability to improve the results provided by the latent features of different deep learning models. Various unsupervised post-processing strategies are based on the idea of exploiting the information of reciprocal neighborhoods and measuring the co-occurrence of elements in ranked lists [429, 174, 225, 165, 211, 96, 388, 95, 108], demonstrating substantial improvements. Although these approaches are becoming increasingly common in Re-ID, methods for selection and fusion remain relatively scarce. This scenario highlights the importance of investigating methods capable of effectively exploiting contextual information.

In addition to retrieval, it is also imperative to address scenarios with limited labeled data for classification [135]. Graph Convolutional Networks (GCNs) offer a promising solution for semi-supervised classification by learning from both labeled and unlabeled data considering graph structures [412]. Moreover, GCNs can learn node and graph embeddings that capture complex dependencies and structural relationships [141]. However, GCNs are not widely used for image classification, since graphs are typically not available in image domains [274, 343, 307]. Therefore, effectively modeling these graphs, which can be utilized to exploit contextual information, is a crucial research topic.

Another approach that has recently demonstrated continuous advances in improving classification results is contrastive learning [143, 47]. Unlike the commonly used cross-entropy loss, which aims to minimize the difference between the predicted and true class probabilities, contrastive losses focus on learning similarities and dissimilarities between data points rather than merely categorizing them [47]. Despite this, most contrastive losses consider only pairwise measures [143, 47, 49], with only a few incorporating some type of neighborhood information [441, 82, 183]. Moreover, these approaches often require huge volumes of data (labeled or unlabeled) for training [47, 49], even in self-supervised scenarios, which is a challenge in circumstances where data is scarce.
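For reference, the supervised contrastive (SupCon) loss [143] mentioned above can be written, using the notation adopted in this dissertation (see the List of Symbols), as

$$ \mathcal{L}_{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \psi)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \psi)}, $$

where P(i) is the set of positives for sample i, A(i) contains all other samples in the batch, and ψ is the temperature. Every term depends only on dot products between pairs of embeddings, which is exactly the pairwise limitation discussed above and the point where neighborhood information can be injected.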
In light of the presented discussion and all challenges, the focus of this dissertation is to exploit the use of contextual similarity information with the objective of improving the effectiveness of image retrieval and classification, particularly in cases where labeled data is limited or non-existent. This dissertation primarily concentrates on unsupervised learning, while also proposing semi-supervised and supervised approaches.

1.2 Research Challenges

Contextual similarity information can be applied in a variety of fields. However, appropriately representing and exploiting contextual information in each scenario poses significant difficulties. There are many research challenges related to various applications that can be used to improve the effectiveness of image retrieval and classification tasks. In the following, several topics are discussed and corresponding research questions are presented for each:

• Selection and fusion in person Re-ID: Selection and fusion involve choosing the most relevant features from the data and combining them to enhance retrieval effectiveness [260]. There are various feature extractors and many possible combinations between them. Selecting the right features is crucial because manually evaluating all combinations becomes impractical: while the number of features increases linearly, the number of combinations increases exponentially. The concept is based on the idea that fusion is most effective when it involves elements that are both highly effective and complementary. For person Re-ID, the complexity of accurately matching individuals across different camera views becomes significantly more challenging in unsupervised applications due to the absence of labeled data [137]. Effectively modeling and exploiting patterns in the data is crucial in this scenario. Research question:
– How can contextual similarity information be used for selection and fusion in unsupervised person Re-ID?

• Query performance prediction: Also known as effectiveness estimation, query performance prediction (QPP) encompasses techniques for assessing the quality of ranked lists in scenarios where no labels are provided. In this context, the ability to assess the effectiveness of the retrieval process provides a significant advantage for different tasks, including enabling the selection of more effective ranked lists. However, QPP is very challenging, especially in unsupervised tasks. One of the main difficulties is elaborating an approach that effectively generalizes across diverse scenarios [262]. Bridging this gap represents a major challenge that can be mitigated by incorporating contextual similarity information. Research question:
– How can data be modeled using contextual similarity information for query performance prediction?

• Synthetic data: Recently, self-supervised approaches have been proposed to address scenarios where labeled data is scarce. Among the different means of self-supervision, one of them is the use of synthetic data. There are many advantages and benefits to using synthetic data, primarily due to its flexibility and control in generating large volumes of annotated data. In domains where safety and privacy are relevant, using real data can raise privacy concerns and legal issues; synthetic data does not carry these risks. However, creating representative synthetic data presents many difficulties. One of the primary challenges is to accurately reflect the complexity and variability of real-world data [72].
The generated synthetic data is expected to encompass a wide range of scenarios, including rare events and edge cases, to ensure comprehensive learning. Research questions:
– How can contextual similarity information be used to generate synthetic data?
– How can contextual similarity learning be employed on synthetically generated data?

• Unsupervised similarity learning methods: Despite the potential of unsupervised similarity learning methods to improve retrieval results, effectively representing and encoding the maximum amount of contextual information remains a challenge. This difficulty is amplified because these methods operate without labels and cannot utilize relevance feedback [312] as supervised algorithms do. These methods usually exploit the relationships among images through ranked lists and the similarity among elements [95, 108, 250]. The primary challenge lies in modeling and leveraging this similarity information, which can be approached through various strategies such as graph structures [381], contextual measures [128], and others [382, 384]. Utilizing more complex structures to represent second-order similarity (i.e., relationships such as neighbors of neighbors) can be particularly relevant, for example. Research question:
– How can more complex structures, which encode contextual information more effectively, be applied to unsupervised similarity learning?

• Representation learning and embeddings: Feature learning is of fundamental importance in many retrieval and classification applications [321]. However, encoding the information of an image into an embedding is very challenging. When converting an image into an embedding, some information is inevitably lost. This loss must be minimized to ensure that the most critical features of the image are retained. There is also the semantic gap [24, 115] between the raw pixel data of an image and the human interpretation of the image's content. Unsupervised similarity learning approaches usually post-process ranked lists to enhance image retrieval results but do not provide any form of embeddings that can be used for other tasks, such as classification. Research question:
– How can contextual information from similarity learning approaches be encoded to generate embeddings that are useful for tasks beyond retrieval, such as classification?

• Contextual similarity and Graph Convolutional Networks (GCNs): GCNs effectively capture relationships and interactions within complex networks, enhancing results in tasks involving structured data. However, graphs are not inherently available for most image datasets, and GCNs heavily rely on these structures to deliver significant results [141, 307]. The main challenge involves accurately modeling the graph for effective use by the GCN. Research question:
– How can contextual similarity information be incorporated into the input graph utilized by Graph Convolutional Networks (GCNs) to improve their classification results?

• Correlation measures and manifold learning: Manifold learning is a technique for uncovering simpler, underlying structures in complex high-dimensional data [133]. Correlation measures quantify the similarity between data points, which is very useful for modeling relationships in the data. However, this is challenging since data can be complex and heterogeneous, involving multiple variables with nonlinear relationships that are difficult to capture [16].
Also, outliers may present significant challenges in data analysis. Research questions:
– Can rank-based information be utilized to measure the correlation between images more effectively?
– Can a correlation measure be proposed and applied to enhance image retrieval with manifold learning?

• Contrastive learning: It has been extensively used in self-supervised and supervised learning due to its effectiveness in learning representations that distinguish between similar and dissimilar images. It offers an alternative to cross-entropy by yielding more semantically meaningful image embeddings. However, most contrastive losses rely on pairwise measures to assess the similarity between elements [143, 47], ignoring more general neighborhood information that can be leveraged to enhance model robustness and generalization [441]. Research question:
– How can contextual similarity information be incorporated into metric learning, including its direct integration into losses such as contrastive loss?

The contributions presented and discussed in this work address these important research challenges.

1.3 Dissertation Statement

Driven by the challenges identified in the literature, primarily the difficulty of obtaining a large amount of labeled data and the increasing need for methods that exploit contextual information, we explore the application of contextual similarity learning in different scenarios. The main hypothesis of this work is briefly stated as follows:

Contextual similarity learning can improve the effectiveness of image retrieval and classification tasks across general-purpose and person re-identification (Re-ID) applications. This concept is applicable to unsupervised, semi-supervised, and supervised approaches, particularly in contexts where labeled data is limited.

The hypothesis is validated by the proposed approaches and the comprehensive experimental evaluation presented in this dissertation.

11 Conclusions

This chapter concludes this dissertation by discussing the contributions and other relevant aspects. Section 11.1 reviews the main results obtained for each task: query performance prediction, image retrieval, and image classification. Additionally, it provides a comparative analysis of the proposed methods alongside other approaches from the literature. Section 11.2 discusses how the contributions address the research questions of this study. Section 11.3 lists the publications and submissions obtained, along with the international Fulbright fellowship. Section 11.4 mentions the code made available for the proposed approaches. Finally, Section 11.5 presents potential extensions and future work, describing their connections to the contributions achieved in this research.

11.1 Discussion of Results

Given the notable outcomes achieved by contextual similarity learning across all the scenarios considered, this section discusses the results for each task. For query performance prediction, the results of RQPPF and DRNE are jointly compared and discussed in Section 11.1.1. Section 11.1.2 compares the approaches evaluated for image retrieval on person Re-ID and general-purpose datasets, including comparisons with the state-of-the-art. Section 11.1.3 overviews the Manifold-GCN and RFE semi-supervised classification results obtained and a comparison with the state-of-the-art. Moreover, the gains achieved by CCL are briefly discussed.
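Since the QPP discussion below repeatedly refers to rank-based effectiveness estimators, a brief sketch may help ground it: baselines such as Authority [243] and Reciprocal Density [248] both score a query by how internally consistent its top-k neighborhood is. The code below is a simplified illustration of this general idea, not the exact formulation of either cited measure, and assumes ranked lists stored as NumPy rows.

import numpy as np

def neighborhood_consistency(ranks: np.ndarray, query: int, k: int) -> float:
    """Density of the query's top-k neighborhood, in [0, 1].

    Counts how many ordered pairs (i, j) of the query's top-k neighbors
    are such that j also appears in the top-k ranked list of i.
    """
    neighborhood = ranks[query, :k]
    edges = sum(
        1
        for i in neighborhood
        for j in neighborhood
        if j in ranks[i, :k]
    )
    return edges / float(k * k)  # 1.0 means a fully self-consistent neighborhood

# Usage: higher scores flag ranked lists that are expected to be more effective,
# supporting unsupervised selection among descriptors or queries.
ranks = np.array([[0, 1, 2, 3],
                  [1, 0, 2, 3],
                  [2, 3, 1, 0],
                  [3, 2, 0, 1]])
print(neighborhood_consistency(ranks, query=0, k=2))  # -> 1.0 (0 and 1 agree fully)

Scores of this kind are intended to correlate with the actual retrieval effectiveness of each ranked list, which is precisely what the experiments summarized next evaluate.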
11.1.1 Query Performance Prediction

A great variety of experiments was conducted to evaluate DRNE and RQPPF, showing their capacity to effectively perform QPP on various datasets. Table 11.1 presents a summary of the relative gains of both approaches in comparison to Authority [243] and Reciprocal Density [248], which are used as baselines. Notice that RQPPF provided gains in all the evaluated scenarios, while DRNE showed inferior performance in some cases, especially when compared to Reciprocal Density on the MPEG-7 dataset. However, for the most part, DRNE provided higher and more consistent gains when compared to Reciprocal Density. Specifically for the AIR descriptor, DRNE revealed superior results in all cases. In general, the results showed that the proposed methods are better than the baselines in most cases. Additionally, the choice of the best method depends not only on the dataset but also on the descriptor. RQPPF is more flexible and also uses Authority and Reciprocal Density as part of its formulation, while DRNE does not; however, DRNE seems more robust to outlier descriptors, whereas RQPPF is less so. Among potential extensions, combining DRNE and RQPPF is one of the possibilities for future work.

Table 11.1 – Relative gains of DRNE and RQPPF when compared to Authority (Auth.) [243] and Reciprocal Density (Rec.) [248]. Average gains are reported for each dataset.

Dataset   Descriptor     Original MAP   DRNE vs Auth.   RQPPF vs Auth.   DRNE vs Rec.   RQPPF vs Rec.
MPEG-7    AIR [103]      89.39%         +14.81%         +3.50%           +16.84%        +12.99%
          ASC [191]      85.28%         -2.50%          +3.76%           -8.29%         +1.84%
          IDSC [190]     81.70%         -3.93%          +3.69%           -7.58%         +1.88%
          CFD [244]      80.71%         +3.47%          +3.99%           -1.24%         +2.71%
          BAS [13]       71.52%         +0.85%          +3.69%           -5.13%         +1.09%
          SS [317]       37.67%         +7.08%          +6.01%           +3.32%         +3.52%
          Average Gain                  +3.30%          +4.11%           -0.35%         +4.01%
Brodatz   LAS [308]      75.15%         +7.50%          +9.01%           +9.59%         +11.15%
          CCOM [148]     57.57%         +3.30%          +7.33%           +8.60%         +11.18%
          LBP [231]      48.40%         +0.75%          +5.10%           +18.42%        +15.36%
          Average Gain                  +3.85%          +7.15%           +12.20%        +12.56%
Market    OSNET [436]    43.30%         -2.63%          +1.19%           +5.40%         +5.45%
          ResNet [110]   22.82%         -0.89%          +0.13%           +7.95%         +5.29%
          Average Gain                  -1.76%          +0.66%           +6.68%         +5.37%
Duke      OSNET [436]    52.69%         +0.71%          +2.35%           +1.00%         +3.37%
          ResNet [110]   32.00%         -2.14%          +0.52%           -0.12%         +2.46%
          Average Gain                  -0.72%          +1.44%           +0.44%         +2.92%

11.1.2 Image Retrieval

Four of the seven proposed methods were evaluated in image retrieval: HRSF, JaccardMax, RFE, and Manifold-GCN. The results obtained are reviewed for both person Re-ID and general-purpose datasets, including comparisons against each other and with the state-of-the-art. A brief discussion about the gains is also presented.

• Person Re-ID

Considering the wide variety of descriptors employed, and to provide a fair comparison, Table 11.2 presents the best results obtained by each method using only the OSNET model and its variants (i.e., OSNET, OSNET-IBN, and OSNET-AIN) on the Market, DukeMTMC, and CUHK03 datasets. For the Market and CUHK03 datasets, HRSF leads with the best results for both R1 and MAP. HRSF is the only method that performs selection, which is an advantage over the others since it can select the best combination of descriptors among the OSNET variants. For the DukeMTMC dataset, RFE and JaccardMax compete for the best results. The worst results in this table are the ones obtained by Manifold-GCN. Considering that Manifold-GCN is semi-supervised, while all the other approaches are unsupervised, this result highlights the importance of future research on this method.
Since it was mainly proposed for classification, its retrieval results are significantly behind the others, probably due to the features not being properly distributed in the latent space, which requires further investigation in future work. Table 11.3 presents the methods ranked according to their results for each measure and dataset. The average rank reveals that, while HRSF shows the best results in most cases, JaccardMax and RFE follow closely, with average ranks of 2.0 and 2.2, respectively. As previously discussed, Manifold-GCN is behind, with an average rank of 3.7.

Table 11.2 – Comparison between the proposed approaches on person Re-ID considering R1 (%) and MAP (%). The best results obtained with the OSNET descriptor and its variants are reported.

Method                                  Year   Market1501      DukeMTMC        CUHK03
                                               R1     MAP      R1     MAP      R1     MAP
HRSF (X∗, best result) [331]            2022   75.71  62.94    77.24  68.88    39.04  39.69
Correlation Graph + Jaccard Max [324]   2022   73.25  59.84    76.21  69.27    —      —
RFE [334]                               2023   72.42  59.51    77.69  69.21    36.89  39.24
Manifold-GCN [333]                      2023   70.30  57.48    74.22  65.83    35.19  35.99

Table 11.3 – Proposed approaches on person Re-ID ranked according to their effectiveness (R1 and MAP). The best results obtained with the OSNET descriptor and its variants were considered.

Method                                  Year   Market1501   DukeMTMC   CUHK03    Average
                                               R1   MAP     R1   MAP   R1   MAP  Rank
HRSF (X∗, best result) [331]            2022   1    1       2    3     1    1    1.5
Correlation Graph + Jaccard Max [324]   2022   2    2       3    1     —    —    2.0
RFE [334]                               2023   3    3       1    2     2    2    2.2
Manifold-GCN [333]                      2023   4    4       4    4     3    3    3.7

To compare the proposed methods with the state-of-the-art in person Re-ID, presented in Table 11.4, the best results for each approach were considered. For HRSF, RFE, and Manifold-GCN, the best results used OSNET and its variants. Unlike the others, the JaccardMax evaluation employed the TransReID descriptor, which provided better results for this method. All the baseline results are the ones reported in the literature, following the same protocol as ours. In general, it can be observed that the proposed approaches provide better results for MAP than for R1 when compared to other methods. This evinces that they can significantly improve the top positions of ranked lists, but not necessarily achieve the best result when considering only the first position. In this case, Market1501 was revealed to be the most challenging dataset, where the proposed methods are better than or comparable to the baselines up to 2020; after that, the baselines show a considerable improvement. In contrast, for the DukeMTMC dataset, the MAP of 73.96% obtained by the proposed JaccardMax (2022) is the second-best result achieved, surpassed only by VAL-PAT, a very recent approach from 2023. For the CUHK03 dataset, many of the methods have no results reported in the literature, since this dataset is not as commonly evaluated as the others. However, all the proposed methods provided a better MAP than the baselines, being behind only UTAL.

Table 11.4 – Proposed approaches compared to the state-of-the-art on person Re-ID considering R1 (%) and MAP (%).
Method                                  Year   Market1501      DukeMTMC        CUHK03
                                               R1     MAP      R1     MAP      R1     MAP
Unsupervised Methods
ARN [181]                               2018   70.3   39.4     60.2   33.4     —      —
EANet [118]                             2018   66.4   40.6     45.0   26.4     51.4   31.7
TAUDL [170]                             2018   63.7   41.2     61.7   43.5     44.7   31.2
ECN [431]                               2019   75.1   43.0     63.3   40.4     —      —
MAR [397]                               2019   67.7   40.0     87.1   48.0     —      —
UTAL [171]                              2019   69.2   46.2     62.3   44.6     56.3   42.3
SSL [189]                               2020   71.7   37.8     52.5   28.6     —      —
HCT [402]                               2020   80.0   56.4     69.6   50.7     —      —
CAP [353]                               2021   91.4   79.2     81.1   67.3     —      —
IICS [376]                              2021   89.5   72.9     80.0   64.4     —      —
RLCC [415]                              2021   90.8   77.7     83.2   69.2     —      —
ICE [43]                                2021   93.8   82.3     83.3   69.9     —      —
MGH [368]                               2021   93.2   81.7     83.7   70.2     —      —
MGCE-HCL [297]                          2022   92.1   79.6     82.5   67.5     —      —
MCRN [367]                              2022   92.5   80.8     83.5   69.9     —      —
O2CAP [354]                             2022   92.5   82.7     83.9   71.2     —      —
DIDAL [201]                             2023   94.2   84.8     —      —        —      —
VAL-PAT [23]                            2023   —      —        86.1   74.9     —      —
Domain Adaptive Methods
HHL (D,M) [430]                         2018   62.2   31.4     46.9   27.2     —      —
HHL (C03) [430]                         2018   56.8   29.8     42.7   23.4     —      —
ATNet (D,M) [197]                       2019   55.7   25.6     45.1   24.9     —      —
CSGLP (D,M) [273]                       2019   63.7   33.9     56.1   36.0     —      —
ISSDA (D,M) [306]                       2019   81.3   63.1     72.8   54.1     —      —
ECN++ (D,M) [432]                       2020   84.1   63.8     74.0   54.4     —      —
MMCL (D,M) [348]                        2020   84.4   60.4     72.4   51.4     —      —
JVCT+ (D,M) [44]                        2021   90.5   75.4     81.9   67.6     —      —
MCRN (D,M) [367]                        2022   93.8   83.8     84.5   71.5     —      —
Cross-Domain Methods (single-source)
EANet (C03) [118]                       2018   59.4   33.3     39.3   22.0     —      —
EANet (D,M) [118]                       2018   61.7   32.9     51.4   31.7     —      —
SPGAN (D,M) [71]                        2018   43.1   17.0     33.1   16.7     —      —
DAAM (D,M) [121]                        2019   42.3   17.5     29.3   14.5     —      —
AF3 (D,M) [195]                         2019   67.2   36.3     56.8   37.4     —      —
AF3 (MT) [195]                          2019   68.0   37.7     66.3   46.2     —      —
PAUL (MT) [380]                         2019   68.5   40.1     72.0   53.2     —      —
Cross-Domain Methods (multi-source)
CAMEL [396]                             2017   54.5   26.3     —      —        31.9   —
EMTL [370]                              2018   52.8   25.1     39.7   22.3     —      —
Baseline by [153]                       2019   80.5   56.8     67.4   46.9     29.4   27.4
Proposed Methods (contributions)
HRSF (X∗, best result) [331]            2022   75.71  62.94    77.24  68.88    39.04  39.69
Correlation Graph + Jaccard Max [324]   2022   75.42  63.53    78.59  73.96    —      —
RFE [334]                               2023   72.42  59.51    77.69  69.21    36.89  39.24
Manifold-GCN [333]                      2023   70.30  57.48    74.22  65.83    35.19  35.99

The presented comparisons raise two topics for discussion: (i) Why are the obtained results significantly better on DukeMTMC, and why does Market1501 appear to be considerably more difficult? (ii) RFE and CG [249] + JaccardMax exhibit close results when using the same descriptors for Re-ID; is this also true in other scenarios? The first topic is challenging to answer, especially because the Market1501 and DukeMTMC datasets have very similar characteristics (e.g., dataset size, number of individuals, images per person, size of the train and evaluation sets, and number of cameras). However, one particular difference might explain it: the Market1501 dataset was annotated using an automated detector, the Deformable Part Model (DPM), which is known to be prone to noise and potential misalignment. Conversely, DukeMTMC was manually annotated by humans, providing cleaner data with well-aligned bounding boxes. Further investigation of this aspect can be conducted as future work. Regarding the close results of RFE and CG [249] + JaccardMax for Re-ID, these methods are compared on general-purpose datasets to evaluate whether they exhibit similar behavior.
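For reference, the R1 and MAP measures reported throughout these comparisons can be computed from ranked lists as sketched below. This is a generic illustration assuming binary relevance sets, not the dataset-specific evaluation protocols (which, for instance, handle camera identifiers and junk images in Re-ID).

import numpy as np

def rank1(ranked: list, relevant: set) -> float:
    """R1: 1.0 if the first retrieved element is relevant, else 0.0."""
    return 1.0 if ranked[0] in relevant else 0.0

def average_precision(ranked: list, relevant: set) -> float:
    """Mean of the precision values at the positions of relevant elements."""
    hits, precisions = 0, []
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / pos)
    return float(np.mean(precisions)) if precisions else 0.0

# MAP averages the per-query average precision; relative gains are then
# computed as (new_map - baseline_map) / baseline_map * 100.
queries = [([3, 1, 2, 0], {3, 2}), ([0, 2, 1, 3], {1})]
print(np.mean([rank1(r, rel) for r, rel in queries]))             # mean R1
print(np.mean([average_precision(r, rel) for r, rel in queries])) # MAP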
• General-Purpose Datasets

Tables 11.5 and 11.6 compare RFE and CG [249] + JaccardMax with the state-of-the-art in image retrieval tasks on the Holidays and UKBench datasets, respectively. On both datasets, RFE outperformed all the baselines. For Holidays, CG [249] + JaccardMax is behind RFE with 91.12%, but still surpasses most of the other methods. For UKBench, both achieved the same result of 3.97, which is very close to the maximum score (i.e., 4).

Table 11.5 – State-of-the-art (SOTA) comparison on the Holidays dataset (MAP).

Jégou et al. [127]         75.07%    Tolias et al. [315]        82.20%
Paulin et al. [238]        82.90%    Qin et al. [268]           84.40%
Zheng et al. [425]         85.20%    Sun et al. [299]           85.50%
Zheng et al. [423]         85.80%    Pedronette et al. [241]    86.16%
Arandjelovic et al. [12]   87.50%    Li et al. [178]            89.20%
Razavian et al. [271]      89.60%    Pedronette et al. [253]    90.02%
Gordo et al. [104]         90.30%    Valem et al. [329]         90.51%
Valem et al. [328]         90.51%    Liu et al. [203]           90.89%
Pedronette et al. [251]    90.94%    Pedronette et al. [252]    91.25%
Yu et al. [398]            91.40%    Berman et al. [26]         91.80%
Proposed: CG + JacMax      91.12%    Proposed: RFE              91.97%

Table 11.6 – State-of-the-art (SOTA) comparison on the UKBench dataset (N-S score).

Qin et al. [267]           3.67      Zhang et al. [413]         3.83
Zheng et al. [424]         3.84      Bai et al. [16]            3.86
Xie et al. [371]           3.89      Lv et al. [210]            3.91
Liu et al. [203]           3.92      Pedronette et al. [241]    3.93
Bai et al. [20]            3.93      Liu et al. [159]           3.93
Valem et al. [328]         3.93      Bai et al. [17]            3.94
Valem et al. [329]         3.94      Valem et al. [327]         3.95
Chen et al. [50]           3.96
Proposed: CG + JacMax      3.97      Proposed: RFE              3.97

• Discussion about Gains

From the observed results, we can notice that the proposed approaches are comparable to or better than state-of-the-art approaches in most cases. The best method in each scenario varies, since each dataset and descriptor presents different aspects. An important attribute of the proposed approaches is their capacity to improve the input data by employing contextual similarity learning. Figure 11.1 presents the relative gains of RFE and JaccardMax for different datasets and descriptors, demonstrating the capacity of contextual similarity learning to improve the results across multiple scenarios. The Holidays and UKBench datasets exhibited smaller gains because their descriptors already achieved higher results, making further enhancements more challenging compared to the other datasets. Even so, it is remarkable that, despite the advancements in feature extraction across different deep learning models, from CNNs to Vision Transformers, improved results were obtained in all cases.

Figure 11.1 – RFE and JaccardMax relative gains (%) over the MAP of the original descriptors, for the Corel5k, Holidays, UKBench, Market, and Duke datasets with different descriptors.

11.1.3 Image Classification

Since both Manifold-GCN and RFE were employed for semi-supervised classification, Table 11.7 compares them to baselines, both traditional and recent, on the Flowers and Corel5k datasets.
The values achieved by Manifold-GCN are the highest in all cases, closely followed by RFE. These results reveal the high effectiveness of the proposed approaches, which, besides providing significant gains, are also comparable or superior to various methods in the literature.

Table 11.7 – Accuracy comparison (%) on the Flowers and Corel5k datasets. The proposed RFE and Manifold-GCN are compared with semi-supervised classification baselines using different input features.

Method                     Input              Flowers   Corel5k
CoMatch [169]              Images             82.55     85.70
kNN                        ResNet features    63.67     76.80
SVM [54]                   ResNet features    80.54     88.73
OPF [8]                    ResNet features    71.77     83.56
SL-Perceptron              ResNet features    75.44     83.56
ML-Perceptron              ResNet features    78.88     87.10
PseudoLabel+SGD [162]      ResNet features    82.69     89.76
LS+kNN [433]               ResNet features    73.49     83.98
LS+SVM [433, 54]           ResNet features    73.53     83.26
LS+OPF [433, 8]            ResNet features    72.66     82.32
LS+SL-Perceptron [433]     ResNet features    72.34     82.38
LS+ML-Perceptron [433]     ResNet features    73.03     82.53
GNN-LDS [90]               ResNet features    54.98     62.69
GNN-KNN-LDS [90]           ResNet features    79.32     88.94
WSEF [264]                 ResNet features    85.12     91.68
RFE                        ResNet features    84.95     91.54
Manifold-GCN               ResNet features    85.88     93.08
kNN                        SENet features     48.71     58.78
SVM [54]                   SENet features     73.30     85.89
OPF [8]                    SENet features     64.00     81.33
SL-Perceptron              SENet features     71.84     82.28
ML-Perceptron              SENet features     72.62     86.90
PseudoLabel+SGD [162]      SENet features     76.87     89.85
LS+kNN [433]               SENet features     58.05     72.16
LS+SVM [433, 54]           SENet features     59.84     72.79
LS+OPF [433, 8]            SENet features     59.25     72.20
LS+SL-Perceptron [433]     SENet features     59.27     72.19
LS+ML-Perceptron [433]     SENet features     59.39     72.24
GNN-LDS [90]               SENet features     52.24     65.80
GNN-KNN-LDS [90]           SENet features     73.69     89.95
WSEF [264]                 SENet features     76.16     89.74
RFE                        SENet features     77.56     92.20
Manifold-GCN               SENet features     78.82     92.79

CCL was also proposed and evaluated for classification, but in supervised scenarios. Since it is the only method proposed in this category and its protocol uses different datasets, a direct comparison with the other proposed methods is not feasible; therefore, a discussion of its gains is presented instead. The experimental evaluation in Chapter 10 showed that its results are consistently better than those of SupCon, on which CCL is based, and SimCLR, another method commonly used as a baseline in this task. Figure 11.2 evinces the capacity of CCL to provide gains when compared to SupCon on three datasets, with higher values as the training set size decreases. The integration of contextual information within the contrastive loss significantly improved the results, as initially hypothesized, with gains of up to 10.759%.

Figure 11.2 – Relative gains (%) obtained by CCL in comparison to SupCon on Food101, MiniImageNet, and CIFAR-100 for different train/test splits, considering 100 training epochs.

Bibliography

[1] Agarwal, M. and Mostafa, J. (2011). Content-based image retrieval for alzheimer's disease detection. In 2011 9th International Workshop on Content-Based Multimedia Indexing (CBMI), pages 13–18.
[2] Albawi, S.; Mohammed, T. A.; and Al-Zawi, S. (2017). Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET), pages 1–6.
[3] Ali, N.; Zafar, B.; Iqbal, M. K.; Sajid, M.; Younis, M. Y.; Dar, S. H.; Mahmood, M. T.; and Lee, I. H. (2019). Modeling global geometric spatial information for rotation invariant classification of satellite images. PLoS One, 14(7):e0219833.
[4] Alnissany, A. and Dayoub, Y. (2023). Modified centroid triplet loss for person re-identification. Journal of Big Data, 10(1):74.
[5] Alqasemi, F. A.; Alabbasi, H. Q.; Sabeha, F. G.; Alawadhi, A.; Kahlid, S.; and Zahary, A. (2019). Feature selection approach using knn supervised learning for content-based image retrieval. In 2019 First International Conference of Intelligent Computing and Engineering (ICOICE), pages 1–5.
[6] Alves, C. and Traina, A. J. M. (2022). Variational autoencoders for medical image retrieval. In 2022 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pages 1–6.
[7] Alvin, Y. H. Y. and Chakraborty, D. (2023). Approximate maximum rank aggregation: Beyond the worst-case. In Bouyer, P. and Srinivasan, S., editors, 43rd IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2023), volume 284 of Leibniz International Proceedings in Informatics (LIPIcs), pages 12:1–12:21, Dagstuhl, Germany. Schloss Dagstuhl – Leibniz-Zentrum für Informatik.
[8] Amorim, W. P.; Falcão, A. X.; and de Carvalho, M. H. (2014). Semi-supervised pattern classification using optimum-path forest. In 2014 27th SIBGRAPI Conference on Graphics, Patterns and Images, pages 111–118.
[9] An, L.; Chen, X.; Yang, S.; and Li, X. (2017). Person re-identification by multi-hypergraph fusion. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2763–2774.
[10] Anand, A.; Leonhardt, J.; Rudra, K.; and Anand, A. (2022). Supervised contrastive learning approach for contextual ranking. In Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR '22, pages 61–71, New York, NY, USA. Association for Computing Machinery.
[11] Antelmi, A.; Cordasco, G.; Polato, M.; Scarano, V.; Spagnuolo, C.; and Yang, D. (2023). A survey on hypergraph representation learning. ACM Computing Surveys, 56(1).
[12] Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; and Sivic, J. (2016). NetVLAD: CNN architecture for weakly supervised place recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5297–5307.
[13] Arica, N. and Vural, F. T. Y. (2003). BAS: a perceptual shape descriptor based on the beam angle statistics. Pattern Recognition Letters, 24(9-10):1627–1639.
[14] Awad, M. and Khanna, R. (2015). Machine Learning, pages 1–18. Apress, Berkeley, CA.
[15] Baeza-Yates, R. and Ribeiro-Neto, B. (2013). Recuperação de Informação: Conceitos e Tecnologia das Máquinas de Busca. Editora Bookman.
[16] Bai, S. and Bai, X. (2016). Sparse contextual activation for efficient visual re-ranking. IEEE Transactions on Image Processing (TIP), 25(3):1056–1069.
[17] Bai, S.; Bai, X.; Tian, Q.; and Latecki, L. J. (2017). Regularized diffusion process for visual retrieval. In Conference on Artificial Intelligence (AAAI), pages 3967–3973.
[18] Bai, S.; Bai, X.; Tian, Q.; and Latecki, L. J. (2019). Regularized diffusion process on bidirectional context for object retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5):1213–1226.
[19] Bai, S.; Zhang, F.; and Torr, P. H. (2021a). Hypergraph convolution and hypergraph attention. Pattern Recognition, 110:107637.
[20] Bai, S.; Zhou, Z.; Wang, J.; Bai, X.; Latecki, L. J.; and Tian, Q. (2017). Ensemble diffusion for retrieval. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 774–783.
[21] Bai, X.; Bai, S.; and Wang, X. (2015). Beyond diffusion process: Neighbor set similarity for fast re-ranking. Information Sciences, 325:342–354.
[22] Bai, Z.; Wang, Z.; Wang, J.; Hu, D.; and Ding, E. (2021b). Unsupervised multi-source domain adaptation for person re-identification. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12909–12918.
[23] Bao, L.-N.; Wei, L.; Qiu, X.; Zhou, W.-g.; Li, H.; and Tian, Q. (2023). Learning transferable pedestrian representation from multimodal information supervision. ArXiv, abs/2304.05554.
[24] Barz, B. and Denzler, J. (2021). Content-based image retrieval and the semantic gap in the deep learning era. In Del Bimbo, A.; Cucchiara, R.; Sclaroff, S.; Farinella, G. M.; Mei, T.; Bertini, M.; Escalante, H. J.; and Vezzani, R., editors, Pattern Recognition. ICPR International Workshops and Challenges, pages 245–260, Cham. Springer International Publishing.
[25] Bedagkar-Gala, A. and Shah, S. K. (2014). A survey of approaches and trends in person re-identification. Image and Vision Computing, 32(4):270–286.
[26] Berman, M.; Jégou, H.; Vedaldi, A.; Kokkinos, I.; and Douze, M. (2019). MultiGrain: a unified image embedding for classes and instances. arXiv e-prints.
[27] Berthelot, D.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Sohn, K.; Zhang, H.; and Raffel, C. (2020). ReMixMatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations.
[28] Berthelot, D.; Carlini, N.; Goodfellow, I. J.; Papernot, N.; Oliver, A.; and Raffel, C. (2019). MixMatch: A holistic approach to semi-supervised learning. CoRR, abs/1905.02249.
[29] Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; and Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468.
[30] Bianchi, F. M.; Grattarola, D.; Livi, L.; and Alippi, C. (2021). Graph neural networks with convolutional ARMA filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1.
[31] Black, E. and Fredrikson, M. (2021). Leave-one-out unfairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 285–295, New York, NY, USA. Association for Computing Machinery.
[32] Bolme, D. S.; Beveridge, J. R.; Draper, B. A.; and Lui, Y. M. (2010). Visual object tracking using adaptive correlation filters. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2544–2550.
[33] Bossard, L.; Guillaumin, M.; and Van Gool, L. (2014). Food-101 – mining discriminative components with random forests. In Fleet, D.; Pajdla, T.; Schiele, B.; and Tuytelaars, T., editors, Computer Vision – ECCV 2014, pages 446–461, Cham. Springer International Publishing.
[34] Bretto, A. (2013). Hypergraph Theory: An Introduction. Springer International Publishing.
[35] Brodatz, P. (1966). Textures: A Photographic Album for Artists and Designers. Dover.
[36] Cai, D.; Zhang, C.; and He, X. (2010). Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 333–342, New York, NY, USA. Association for Computing Machinery.
[37] Camps, O.; Gou, M.; Hebble, T.; Karanam, S.; Lehmann, O.; Li, Y.; Radke, R. J.; Wu, Z.; and Xiong, F. (2017). From the lab to the real world: Re-identification in an airport camera network. IEEE Transactions on Circuits and Systems for Video Technology, 27(3):540–553.
[38] Chakraborty, D.; Das, S.; Khan, A.; and Subramanian, A. (2022). Fair rank aggregation. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022).
[39] Chang, X.; Hospedales, T. M.; and Xiang, T. (2018). Multi-level factorisation net for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Chatzichristofis, S. A. and Boutalis, Y. S. (2008a). CEDD: color and edge directivity descriptor: a compact descriptor for image indexing and retrieval. In Proceedings of the 6th International Conference on Computer Vision Systems, ICVS'08, pages 312–322.
[41] Chatzichristofis, S. A. and Boutalis, Y. S. (2008b). FCTH: Fuzzy color and texture histogram - a low level feature for accurate image retrieval. In WIAMIS, pages 191–196.
[42] Chaudhuri, U.; Banerjee, B.; and Bhattacharya, A. (2019). Siamese graph convolutional network for content based remote sensing image retrieval. Computer Vision and Image Understanding, 184:22–30.
[43] Chen, H.; Lagadec, B.; and Bremond, F. (2021a). ICE: Inter-instance contrastive encoding for unsupervised person re-identification. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 14940–14949.
[44] Chen, H.; Wang, Y.; Lagadec, B.; Dantcheva, A.; and Bremond, F. (2021b). Joint generative and contrastive learning for unsupervised person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2004–2013.
[45] Chen, S.-B.; Tian, X.-Z.; Ding, C. H. Q.; Luo, B.; Liu, Y.; Huang, H.; and Li, Q. (2020a). Graph convolutional network based on manifold similarity learning. Cognitive Computation, 12(6):1144.
[46] Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA. Association for Computing Machinery.
[47] Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. (2020b). A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org.
[48] Chen, W.; Liu, Y.; Wang, W.; Bakker, E. M.; Georgiou, T.; Fieguth, P. W.; Liu, L.; and Lew, M. S. (2021c). Deep image retrieval: A survey. CoRR, abs/2101.11282.
[49] Chen, X. and He, K. (2021). Exploring simple siamese representation learning. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15745–15753.
[50] Chen, X. and Li, Y. (2020). Deep feature learning with manifold embedding for robust image retrieval. Algorithms, 13(12).
[51] Chen, Y.; Li, J.; Xiao, H.; Jin, X.; Yan, S.; and Feng, J. (2017). Dual path networks. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 4467–4475. Curran Associates, Inc.
[52] Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807.
[53] Cieplinski, L. (2001). MPEG-7 color descriptors and their applications. In Skarbek, W., editor, Computer Analysis of Images and Patterns, pages 11–20, Berlin, Heidelberg. Springer Berlin Heidelberg.
[54] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.
[55] Cristani, M. and Murino, V. (2018). Chapter 10 - person re-identification. In Chellappa, R. and Theodoridis, S., editors, Academic Press Library in Signal Processing, Volume 6, pages 365–394. Academic Press.
[56] Cronen-Townsend, S.; Zhou, Y.; and Croft, W. B. (2002). Predicting query performance. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '02, pages 299–306.
[57] Dabov, K.; Foi, A.; Katkovnik, V.; and Egiazarian, K. (2007). Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095.
[58] Dai, J.; Zhang, P.; Lu, H.; and Wang, H. (2018). Video person re-identification by temporal residual learning. IEEE Transactions on Image Processing, PP.
[59] Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 886–893.
[60] Datta, R.; Joshi, D.; Li, J.; and Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2):5:1–5:60.
[61] Datta, S.; Ganguly, D.; Greene, D.; and Mitra, M. (2022). Deep-QPP: A pairwise interaction-based deep learning model for supervised query performance prediction. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM '22, pages 201–209, New York, NY, USA. Association for Computing Machinery.
[62] De Almeida, L. B.; Pereira-Ferrero, V. H.; Valem, L. P.; Almeida, J.; and Pedronette, D. C. G. (2021). Representation learning for image retrieval through 3D CNN and manifold ranking. In 2021 34th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pages 417–424.
[63] De Almeida, L. B.; Valem, L. P.; and Pedronette, D. C. G. (2022). Graph convolutional networks and manifold ranking for multimodal video retrieval. In 2022 IEEE International Conference on Image Processing (ICIP), pages 2811–2815.
[64] De Fernando, F. A.; Pedronette, D. C. G.; de Sousa, G. J.; Valem, L. P.; and Guilherme, I. R.