Wilson Estécio Marcílio Júnior

Toward Interpretable and Hierarchical Methods for Dimensionality Reduction

São José do Rio Preto
2023

Câmpus de São José do Rio Preto

Wilson Estécio Marcílio Júnior

Toward Interpretable and Hierarchical Methods for Dimensionality Reduction

Tese apresentada como parte dos requisitos para obtenção do título de Doutor em Ciência da Computação, junto ao Programa de Pós-Graduação em Ciência da Computação, do Instituto de Biociências, Letras e Ciências Exatas da Universidade Estadual Paulista “Júlio de Mesquita Filho”, Câmpus de São José do Rio Preto.

Financiadora: CAPES

Orientador: Prof. Dr. Danilo Medeiros Eler

São José do Rio Preto
2023

M319t Marcílio Júnior, Wilson Estécio
Toward Interpretable and Hierarchical Methods for Dimensionality Reduction / Wilson Estécio Marcílio Júnior. -- São José do Rio Preto, 2023
178 f. : il., tabs.
Tese (doutorado) - Universidade Estadual Paulista (Unesp), Instituto de Biociências, Letras e Ciências Exatas, São José do Rio Preto
Orientador: Danilo Medeiros Eler
1. Dimensionality Reduction. 2. Unsupervised Learning. 3. Data Visualization. 4. Hierarchical Exploration. I. Título.

Sistema de geração automática de fichas catalográficas da Unesp. Biblioteca do Instituto de Biociências, Letras e Ciências Exatas, São José do Rio Preto. Dados fornecidos pelo autor(a). Essa ficha não pode ser modificada.

Wilson Estécio Marcílio Júnior

Toward Interpretable and Hierarchical Methods for Dimensionality Reduction

Tese apresentada como parte dos requisitos para obtenção do título de Doutor em Ciência da Computação, junto ao Programa de Pós-Graduação em Ciência da Computação, do Instituto de Biociências, Letras e Ciências Exatas da Universidade Estadual Paulista “Júlio de Mesquita Filho”, Câmpus de São José do Rio Preto.

Financiadora: CAPES

Comissão Examinadora

Prof. Dr. Danilo Medeiros Eler
UNESP – Câmpus de Presidente Prudente
Orientador

Profª. Drª. Maria Cristina Ferreira de Oliveira
USP - ICMC - São Carlos

Prof. Dr. Jorge Luis Poco Medina
FGV - EMAp - Rio de Janeiro

Prof. Dr. Wallace Correa de Oliveira Casaca
UNESP - Câmpus de Rosana

Prof. Dr. Milton Hirokazu Shimabukuro
UNESP – Câmpus de Presidente Prudente

Presidente Prudente, 07 de fevereiro de 2023

Para minha mãe, Raquel Rocha.

AGRADECIMENTOS

Aos meus pais, Raquel e Wilson, agradeço por todo o apoio que me deram para superar os desafios da vida acadêmica e por acreditarem que eu posso sempre mais—vocês são as pessoas mais especiais do mundo, muito obrigado por garantir tranquilidade em muitas das vezes em que eu pensei que não iria conseguir. Aos meus irmãos, Larissa e Peterson, agradeço pela parceria de sempre e pelo interesse nos assuntos do meu trabalho—fiquem sabendo que meus sobrinhos serão todos evangelizados para seguirem carreira na pesquisa; muito obrigado pelas conversas e momentos de descontração durante esse período. À minha noiva, Karla, que esteve comigo a maior parte do tempo durante o desenvolvimento deste trabalho e, querendo ou não, teve de observar uma pessoa lidando com os altos e baixos da vida acadêmica—agradeço por todo o apoio e por mudar as perspectivas quando eu me encontrava em uma rua sem saída. Agradeço ao meu orientador, Danilo, pela parceria (já são dez anos desde a graduação) e pela ajuda a me tornar o pesquisador que sou hoje—agradeço muito pelas oportunidades e pela paciência em todos esses anos.
Gostaria também de agradecer aos professores que contribuíram para as atividades de pesquisa (especialmente Fernando Paulovich, Rafael Martins e Rogério Garcia) e ao pessoal do laboratório pelas diversas conversas e momentos de descontração. Por fim, agradeço aos funcionários da FCT-UNESP, que indiretamente forneceram um ambiente confortável para o desenvolvimento deste trabalho. Agradeço a Deus por estar ao meu lado todo o tempo.

Este trabalho foi financiado por: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), processos #2018/17881-3 e #2018/25755-8; e Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), processo #88887.487331/2020-00.

“There will be time, there will be time
To prepare a face to meet the faces that you meet;
There will be time to murder and create,
And time for all the works and days of hands
That lift and drop a question on your plate;
Time for you and time for me,
And time yet for a hundred indecisions,
And for a hundred visions and revisions,
Before the taking of a toast and tea.”
– T.S. Eliot, The Love Song of J. Alfred Prufrock

RESUMO

Análise Exploratória de Dados (EDA) é uma ferramenta valiosa para descobrir novos insights a partir de dados de alta dimensão. Os procedimentos de EDA ajudam cientistas e profissionais a entender a relação entre instâncias de dados e estruturas usando metáforas visuais que enfatizam informações sobre o objetivo da análise. Amplamente empregadas para análise exploratória, técnicas de Redução de Dimensionalidade mapeiam o relacionamento presente em espaços de alta dimensão para espaços de dimensão menor, normalmente no R^2, e permitem descobertas científicas que vão do entendimento do comportamento de redes neurais à anotação de tipos celulares. Esta tese apresenta um conjunto de abordagens para estender o poder de análise utilizando técnicas de Redução de Dimensionalidade, com o objetivo principal de ajudar no entendimento dos embeddings gerados por técnicas não lineares, além do estabelecimento de abordagens de exploração hierárquica para análise de acordo com a demanda de informações. Tais abordagens foram resultado das seguintes atividades de pesquisa principais: (i) seleção de representativos no espaço visual com preservação do contexto e das estruturas mapeadas; (ii) técnicas para análise de clustering utilizando contrastividade, permitindo o entendimento das diferenças entre instâncias de dados para a formação de clusters; (iii) explicação de forma aditiva do layout gerado por técnicas não lineares; (iv) técnica de redução hierárquica de dimensionalidade para análise granular de conjuntos de dados, que permite a preservação das estruturas entre níveis hierárquicos.

Palavras-chave: Redução hierárquica de dimensionalidade. Interpretação. Análise contrastiva.

ABSTRACT

Exploratory Data Analysis (EDA) is a valuable tool for discovering new insights from high-dimensional data. EDA procedures help scientists and practitioners understand the relationship between data instances and structures using visual metaphors that emphasize information about the purpose of analysis. Widely used for exploratory analysis, Dimensionality Reduction techniques map the relationships present in high-dimensional spaces to lower-dimensional spaces, usually R^2, and enable scientific discoveries that range from understanding the behavior of neural networks to the annotation of cell types.
This thesis presents a set of approaches to extend the power of analysis using Dimensionality Reduction techniques, with the main objective of helping to understand the embeddings generated by non-linear techniques, in addition to establishing hierarchical exploration approaches for analysis according to the demand for information. Such approaches resulted from the following main research activities: (i) selection of representatives in the visual space with preservation of the context and mapped structures; (ii) techniques for analyzing clustering using contrastivity, allowing the understanding of the differences between data instances that lead to the formation of clusters; (iii) additive explanation of the layout generated by non-linear techniques; (iv) a hierarchical dimensionality reduction technique for granular analysis of datasets, which allows the preservation of structures between hierarchical levels.

Keywords: Hierarchical dimensionality reduction. Interpretation. Contrastive analysis.

LIST OF FIGURES

Figure 1 – Defining the grid and selecting samples. (a) A grid with cells of size k by k is used to divide the R^2 space; (b) Only cells with at least one data point are retrieved; (c) The resulting representative set after redundancy removal. . . . 33
Figure 2 – Filtering redundant points. In this case, with a window of size α = 3. . . . 34
Figure 3 – Initial projections of the datasets and their respective sampled sets according to the sampling techniques. The color of the circles depicts the different classes in the dataset. . . . 37
Figure 4 – The redundancy metric measures how redundant the sampled set is by assessing how the points correlate with each other. The higher the value, the less redundant the set. . . . 38
Figure 5 – Coverage metric to assess whether a sampled set is able to represent all data structures. The standard deviation is represented by error bars. Notice that a greater standard deviation indicates that some instances are far from a representative (an instance from the sampled set). . . . 39
Figure 6 – Silhouette coefficient of the techniques applied for clustering data. . . . 40
Figure 7 – Runtime of the SADIRE algorithm. We can see that the algorithm takes more time for sampling from high-density embeddings. . . . 41
Figure 8 – UMAP and t-SNE projections. . . . 42
Figure 9 – Influence of the k and α parameters on the returned representative set. . . . 43
Figure 10 – Initial embeddings and the layouts that result from reducing visual disorder with representative selection. SADIRE was applied in several configurations by varying the k and α parameters. . . . 44
Figure 11 – The outcomes of the user experiments. Each line denotes a unique representative selection configuration. The heatmap cells encode how many times a representative layout was placed in that order position (visual disorder) and how many times a representative layout was assigned a rank number for each dataset (similarity). . . . 45
Figure 12 – ExplorerTree using different sampling algorithms. On the left, the traditional scatterplot representation. On the right, the first level of ExplorerTree using four sampling strategies. . . . 46
Figure 13 – The cExpression pipeline.
The user submits a high-dimensional dataset with R^2 coordinates and cluster labels after dimensionality reduction (A). The contrastive information for each pair of cluster and feature is computed using the clustering labels and the high-dimensional data (B). The data generated from this process is used to create visualization approaches that highlight the unique characteristics of each cluster (C). . . . 50
Figure 14 – Defining distinctive features. The feature’s distribution (a) in a cluster of interest is compared to its distribution on the remaining dataset (b). Then, Student’s t-test (c) returns the probability of these two distributions being different (d). Finally, the features are arranged in decreasing order according to their t-score to communicate discriminative power (e). . . . 52
Figure 15 – UMAP projection of the Vertebral Column dataset and two distinctive features for the Spondylolisthesis class. . . . 52
Figure 16 – The cExpression visualization tool. Users provide high-dimensional points with annotated clusters and positions in R^2 (a). The R^2 coordinates are used for the scatterplot layout (c), from which the analysis starts. Users click on points in a cluster to investigate their contrastive information. For instance, the bipartite graph (c) shows the distinctive terms for the dark-orange cluster. There is also an interaction mechanism to visualize the feature values on the scatterplot visualization (d, e). Finally, an overview analysis can be achieved through the heatmap encoding. . . . 54
Figure 17 – The bipartite graph encoding feature importance. The p-values and t-values are encoded using red-to-blue and continuous color scales, respectively. It is worth noting that we display the number of decimal places after zero rather than the p-value itself. The first axis (a) represents the context of the available features, while the second axis (b) represents the focus—in this case, they are the same. The connections between the focus axis and the t-score represent the relationship between p-values and t-scores, with green representing positive values and red representing negative values. Finally, the four most distinguishing characteristics are pre-selected and displayed as a histogram plot (c). . . . 55
Figure 18 – Construction steps for cell-based encoding. For each cell, the sum of each feature is obtained (a). The cell is then divided based on user demand to communicate the features. The space assigned to a feature is proportional to the number of features in the cell. For example, one feature in (b), two in (c), and three in (d). . . . 57
Figure 19 – Cell-based visualization for feature distribution on the scatterplot. After inspecting the distinctive features on the scatterplot (a), users can visualize the feature values using cell-based visualizations. For instance, when one feature is selected (b, syrian), a simple continuous color scale is sufficient. For more than one feature (c, syrian, kill, protect), cell-based encoding is employed. . . . 58
Figure 20 – A comparison of two clusters. The visualization strategy of Fig. 17 is adapted to communicate information when comparing two clusters.
In essence, while the information for the first cluster (light-blue in the figure) is encoded as usual (c), the information for the second is positioned on top (a), and the context axis is positioned in the center of the visualization (b). Furthermore, the four most distinguishing features of both clusters are depicted in (d). . . . 59
Figure 21 – Users can inspect the histograms of available features through lasso selection over the points encoding p-values. . . . 60
Figure 22 – Focus+context operation. The visualization starts with the context and focus axes as the same (1). Then, users can select a range of the context axis to focus the analysis (2), resulting in an update of the focus axis (3). . . . 61
Figure 23 – UMAP projection of the news dataset. . . . 62
Figure 24 – After selecting the five most discriminative features, highlighting the terms nuclear, plant, and fukushima shows that UMAP separated the news articles related to the nuclear accident on the left side of the projection. . . . 62
Figure 25 – Inspecting terms with lower confidence. Terms with lower confidence in the light-orange cluster are related to political aspects (a) and were positioned in a different subcluster. . . . 63
Figure 26 – Contrastive information showing levels of understanding. The terms highlighted in red show high-level information, while the terms highlighted in blue give specific hints about the news articles. . . . 64
Figure 27 – Inspection of the terms germani and europ shows that the dimensionality reduction separated news articles focused on the European continent from the news articles focused on Germany. . . . 64
Figure 28 – UMAP projection of the tweets about COVID-19. . . . 65
Figure 29 – Discriminative terms for the dark-orange cluster—terms related to respiratory problems. . . . 66
Figure 30 – Discriminative terms for the light-orange cluster. Terms indicate users tweeting their complaints about symptoms expressed during COVID-19 infection. . . . 66
Figure 31 – Comparing two clusters with fever as the main subject. . . . 67
Figure 32 – Distinctive terms for the light-blue cluster. In this case, there are more general aspects of symptoms related to fever. For instance, the term sore could be related to sore throat or a new COVID-19 symptom. . . . 68
Figure 33 – Distinctive terms from the dark-blue cluster can be considered indicative of Dengue symptoms. . . . 68
Figure 34 – Evaluating another term from the dark-blue cluster. Terms such as difficulty and breathing indicate one of the most severe COVID-19 symptoms: shortness of breath. . . . 69
Figure 35 – Heatmap showing that the red cluster comprises the tweets where people are worried about wearing masks and coryza. . . . 69
Figure 36 – UMAP projection of the Vertebral Column dataset next to the heatmap visualization. . . . 70
Figure 37 – The class of patients with Spondylolisthesis shows greater values for all of the features, except for Pelvic Radius. . . . 70
Figure 38 – Inspection of the distribution values of the features shows that the features are truly expressed more in the cluster of patients with Spondylolisthesis. . . . 71
Figure 39 – Retrieving topics using cExpression. Given a bag-of-words representation of a document collection (a), we compute the dataset clusters (b). cExpression is executed to retrieve distinctive terms for each cluster (c), and the topics correspond to the first m terms of each cluster. . . . 72
Figure 40 – cExpression shows competitive results for topics with a low number of terms (≤ 20). . . . 73
Figure 41 – cExpression was able to surpass LDA and NMF for most of the evaluated numbers of topics. . . . 73
Figure 42 – Execution time on a logarithmic scale for retrieving distinctive features using contrastive analysis. . . . 74
Figure 43 – The ClusterShapley framework. In the “dataset annotation” component (1), a dimensionality reduction (1.a) process helps users analyze and annotate clusters perceived in the visual space (R^2); then, the clusters defined in R^2 annotate the high-dimensional dataset (1.c). In the “Shapley estimation” component (2), the annotated high-dimensional dataset is used to generate cluster probabilities for each data sample (2.a), which are employed for Shapley value estimation (2.b). . . . 81
Figure 44 – Cluster definition through manual selection. Users receive a projected dataset with no clustering imposed, and then define clusters by lassoing the groups perceived in the visual space. Notice that black circles indicate data samples not belonging to any cluster, while circles with different hues indicate already assigned data samples. . . . 82
Figure 45 – An overview of the prototype system. In (A), users can specify datasets and load previously generated explanations. In (B), we draw a scatterplot representation and use color to encode classes. Hovering over specific class instances generates summary representations of Shapley values, which are used to indicate importance (C). The detailed examination of Shapley values and feature importance assists analysts in comprehending the relationship between feature value and Shapley value (D). The Importance Summary is shown in (E), which summarizes the mean of the absolute Shapley values for each pair (class, feature). The importance overview is depicted by a heatmap with the sum of the absolute Shapley values (F). . . . 84
Figure 46 – Summarization approach encoding two histograms for a given feature i: H_i^e for feature values equal to or greater than the mean (σ_i) and H_i^o for feature values lower than the mean. The two histograms are encoded together by alternating their bins. . . . 85
Figure 47 – A tool-tip summarizing information from the blue cluster of Fig. 45. . . . 85
Figure 48 – Coordination between the scatterplot and the dot plot encoding the Shapley values. Each line segment represents a point in the scatterplot representation. . . . 87
Figure 49 – Two clusters with low separation. While the first three features show the same overall importance for these two clusters, sepal width plays a significant role in differentiating the orange cluster (a). . . . 88
Figure 50 – Explanations for the Vertebral Column dataset. There is a clear separation between patients with complications (Spondylolisthesis or Hernia) and the healthy ones. The Importance Summary also supports visualization of such characteristics. . . . 89
Figure 51 – Correlation between feature values and their importance for categorizing cluster 1. The four most important features contributed to differentiating patients with Spondylolisthesis from the others. . . . 90
Figure 52 – Comparison between regular patients (A) and patients with Hernia (B). The difference in values for sacral slope was not sufficient to impose a good separation on the dimensionality reduction result. . . . 91
Figure 53 – Dimensionality reduction results for the Red Wine Quality dataset, using intensity to encode the quality index and colors to depict clusters. . . . 92
Figure 54 – Cluster 4 is well defined and distinguishable due to the salty flavor (high levels of chlorides) of the wines. . . . 92
Figure 55 – Comparison between two clusters projected far apart that share two important features. The difference between alcohol levels was determinant for differentiating the quality of the wine in these two clusters. . . . 93
Figure 56 – A combination of higher values of Fixed Acidity and lower values of Volatile Acidity (which can be unpleasant at higher levels) can explain the higher quality of the data instances. . . . 94
Figure 57 – At higher levels, Total Sulfur Dioxide tends to add flavor to the wine and be noticed in the nose. This important feature explains the higher concentration of low-quality wines in cluster 1. . . . 94
Figure 58 – Feature contributions and the dimensionality reduction result for the Communities and Crime dataset. . . . 95
Figure 59 – This community is clearly distinguishable from the others for having a lower percentage of occupied houses (PctHousOccup). . . . 95
Figure 60 – Detailed analysis of clusters 4 and 5. . . . 96
Figure 61 – A community of immigrants, due to the high percentage of immigrants (PctForeignBorn, PctRecImmig8, and PctRecImmig10) who speak more than one language besides English (PctSpeakEngOnly). . . . 97
Figure 62 – Using clustering techniques to feed ClusterShapley with different clustering results. The original dataset (original labels) has three labels, and we use three clustering configurations: four clusters using k-Means (k-means (1)), eight clusters using k-Means (k-means (2)), and nine clusters using Agglomerative clustering (agglomerative). . . . 98
Figure 63 – Comparison of traditional dimensionality reduction (DR) techniques (A) vs. hierarchical DR techniques (B). Traditional techniques operate at a single level of detail and focus on delivering a big picture that depicts the dataset as a whole, ignoring fine-grained details of intra-cluster distributions and making it difficult to proceed with more detailed investigations. On the other hand, hierarchical DR (HDR) approaches summarize important intra-cluster information already on the top-level representations of the hierarchy imposed on the input data, without affecting the layout’s overall organization. . . . 102
Figure 64 – Hierarchy construction and projection with HUMAP. The hierarchy is constructed from bottom to top. First, a k-nearest neighbor graph approximates the connection strength among data points in the high-dimensional space (A). Then, the graph structure and the connections’ strengths are used to find the most visited nodes (landmarks) after several random walk steps (B), which correspond to points on superior levels (C). The representation neighborhood—local and global relationships (D)—is used for computing the similarity among landmarks (F). For projecting hierarchy levels (or subsets of them), HUMAP applies—as in UMAP—a symmetrization to the nearest neighbor graph (H) and uses data points from higher levels to preserve the mental map (I). . . . 106
Figure 65 – Mental map preservation across hierarchical levels. . . . 113
Figure 66 – Explaining dimensionality reduction results. (A) Explanation of the whole dataset provides one story about the clusters defined by the user. (B) Explanation of the top-level HUMAP representation gives an overview of the explanations that can be refined according to user demands. . . . 114
Figure 67 – User interaction with HUMAP to get more detailed information about the data. At each level explanation (C and F), users can generate diverse insights by focusing on distinct subsets of data to investigate using HUMAP’s drill-down operation. . . . 115
Figure 68 – Hierarchical exploration of the learning weights and explanations of a CNN trained on the Fashion MNIST dataset. The top hierarchy level (A) encodes the overview of the dataset. Exploring lower levels by drilling down the hierarchy (B-F) reveals more detailed information while preserving the mental map. . . . 116
Figure 69 – Visual analysis of the embeddings generated for the top and lowest hierarchical levels using a three-level hierarchy. For each dataset, the top-level embedding appears on the left, and the lowest level (whole dataset) appears on the right. . . . 118
Figure 70 – Run-time execution in seconds using a log10 scale. The boxen-plots show the run-time for 20 runs of each method under the same condition – we measure the run-time to fit the hierarchy (Hierarchy fit), to embed the top level (Level 2), and to embed the whole dataset (Level 0). . . . 120
Figure 71 – Neighborhood Preservation after embedding in R^2. HSNE outperforms HUMAP by a large margin only for the EB dataset. . . . 120
Figure 72 – Evaluation of the techniques’ ability to represent complex structures such as clusters, manifolds, and other relationships. . . . 121
Figure 73 – Mental map preservation measured using Procrustes Analysis. HUMAP decreases the variation when positioning points across hierarchy levels. . . . 122
Figure 74 – Continuity values for varying neighborhoods. All of the techniques performed well at preserving clustering structures. . . . 122
Figure 75 – Trustworthiness values showing similar information to the Neighborhood Preservation values. . . . 123
Figure 76 – Using HUMAP, UMAP, and HSNE to project subsets of data. . . . 124
Figure 77 – Mental map preservation when projecting subsets of data points. HUMAP shows stability across subsequent hierarchy levels. . . . 125
Figure 78 – A sampling technique’s ability to uncover the projection structures must take into account the outliers and less dense clusters. The results show that SADIRE is the most stable of the analyzed methods, covering the projected dataset while taking into account all of its structures. . . . 143
Figure 79 – Results for Execution time (s), Coverage, and Redundancy for CBR-ILP-IR, Shuttle, Corel-1k, Corel-5k, and Diabetes (each line). . . . 144
Figure 80 – The overview of the ExplorerTree tool. . . . 145
Figure 81 – Choosing different dimensionality reduction techniques: a. IDMAP; b. LSP. . . . 145
Figure 82 – Choosing different sampling techniques: a. SADIRE (ours); b. Affinity Propagation. . . . 146
Figure 83 – Traditional scatterplot representations of dimensionality reduction results next to the first level of ExplorerTree using different sampling strategies. . . . 147
Figure 84 – SADIRE shows slightly lower results compared to Bisecting k-means. However, SADIRE is superior to Bisecting k-means in run-time execution, an important characteristic for our scenario. . . . 149
Figure 85 – Coherence values for different numbers of topics. . . . 151
Figure 86 – Run-time execution for cExpression and ccPCA when varying the number of data points, the dimensionality, and the number of clusters. . . . 152
Figure 87 – Interpretation of two dimensionality reduction techniques using cExpression. Since the sets are used to compare the feature distributions, the result remains the same. . . . 153
Figure 88 – Analyzing the same dataset with different clustering results. cExpression merges the unique features from the two clusters (dark-blue and dark-orange) from Figure 87. . . . 154
Figure 89 – Results for Time, DEMaP, and Disparity for twenty executions of each HDR technique. . . . 156
Figure 90 – Results for Neighborhood Preservation, Continuity, and Trustworthiness for twenty executions of each HDR technique. . . . 156
Figure 91 – Results for Time, DEMaP, and Disparity for twenty executions of each HDR technique. . . . 158
Figure 92 – Results for Neighborhood Preservation, Continuity, and Trustworthiness for twenty executions of each HDR technique. . . . 159
Figure 93 – Projections for the Embryoid Body and Mammals datasets. . . . 160
Figure 94 – Projections for the FMNIST and MNIST datasets. . . . 160
Figure 95 – Numerical analysis for twenty executions of each HDR technique. . . . 161
Figure 96 – Mental map preservation measured using Procrustes Analysis. HUMAP decreases the variation when positioning points across hierarchy levels. . . . 162
Figure 97 – Continuity values for varying neighborhoods. All of the techniques performed well at preserving clustering structures. . . . 163
Figure 98 – Trustworthiness values showing similar information to the Neighborhood Preservation values. . . . 163
Figure 99 – Scalability analysis for the main steps of the HUMAP algorithm. . . . 164
Figure 100 – Scalability analysis for the main hierarchical dimensionality reduction techniques. . . . 165
Figure 101 – Results for reproducibility following the procedures of Becht et al. (BECHT et al., 2018). . . . 166
Figure 102 – Run-time distributions for HUMAP when generating reproducible results. . . . 167
Figure 103 – Embeddings of the MNIST dataset for different values of “min_dist” and “n_neighbors”. . . . 168
Figure 104 – Embeddings of the Embryoid Body dataset for different values of “min_dist” and “n_neighbors”. . . . 169
Figure 105 – Metrics for HUMAP with different β values. . . . 170
Figure 106 – Projections for HUMAP with different β values. . . . 171
Figure 107 – Embeddings of the MNIST dataset for different values for . . . 172
Figure 108 – Embeddings of the Embryoid Body dataset for different values for . . . 173
Figure 109 – HUMAP exploration and annotation of a document collection of COVID-19 tweets. The top hierarchy level shows unlabeled data points and three major structures (A). We annotate these three clusters (B) and compute their topics (F). For each cluster in (B), we also project their corresponding (and final) hierarchy level to look for other patterns, annotating the dataset (C, D, E) and computing their topics. . . . 174

LIST OF TABLES

Table 1 – Datasets used for analysis. . . . 35
Table 2 – Time for estimating Shapley values. . . . 100
Table 3 – Datasets used for experimentation. . . . 118
Table 4 – The datasets used in the experiments. All the projections were performed by LSP (PAULOVICH et al., 2008), except for Shuttle, in which we used PLMP (PAULOVICH; SILVA; NONATO, 2010a). . . . 142
Table 3 – Scores given by the participants. . . . 148
Table 4 – p-values for the running time metric. . . . 155
Table 5 – p-values for the DEMaP metric. . . . 157
Table 6 – p-values for the disparity metric. . . . 157
Table 7 – p-values for the running time metric. . . . 158
Table 8 – p-values for the running time metric. . . . 158
Table 9 – p-values for the DEMaP metric. . . . 159
Table 10 – p-values for the disparity metric. . . . 159

LIST OF ALGORITHMS

1 Removing redundant data points. . . . 34
2 Shapley estimation. . . . 83
3 Associate Landmarks. . . . 111

CONTENTS

Resumo . . . iv
Abstract . . . v
List of Figures . . . vi
List of Tables . . . xv
List of Algorithms . . . xvi
1 INTRODUCTION . . . 23
1.1 Context . . . 23
1.2 Problems & Motivation . . . 24
1.2.1 Summarization & Interpretability . . . 24
1.2.2 Hierarchical Dimensionality Reduction . . . 25
1.3 Questions & Hypotheses . . . 26
1.4 Contributions . . . 27
1.5 Document structure . . . 28
2 SADIRE: SAMPLING FROM DIMENSIONALITY REDUCTION . . . 30
2.1 Related Works . . . 31
2.2 Sampling based on dimensionality reduction . . . 32
2.2.1 Selecting candidates for the representative set . . . 32
2.2.2 Removing redundant data instances . . . 33
2.2.2.1 Time complexity . . . 34
2.3 Experiments . . . 35
2.3.1 Redundancy . . . 36
2.3.2 Coverage . . . 38
2.3.3 Clustering . . . 39
2.3.4 Running time . . . 40
2.3.5 Parameter Analysis . . . 41
2.4 User Experiment . . . 42
2.5 Practical Application . . . 45
2.6 Final Remarks . . . 46
3 CONTRASTIVE ANALYSIS FOR SCATTERPLOT REPRESENTATIONS OF DIMENSIONALITY REDUCTION . . . 47
3.1 Related Works . . . 48
3.2 cExpression—Tool for contrastive analysis of DR results . . . 50
3.2.1 Computing contrastive information . . . 51
3.2.2 Visualization Design . . . 53
3.2.2.1 Visual and task requirements . . . 53
3.2.2.2 Visualizing the feature importance . . . 55
3.2.2.3 Visualizing feature value distribution . . . 56
3.2.2.4 Comparing clusters . . . 58
3.2.2.5 Summary View . . . 59
3.2.2.6 Interaction mechanisms . . . 60
3.3 Case studies . . . 61
3.3.1 The news dataset . . . 61
3.3.2 Exploring tweets about COVID-19 symptoms . . . 64
3.3.3 Vertebral Column . . . 69
3.4 Evaluation . . . 71
3.4.1 Coherence . . . 71
3.4.2 Run-time execution . . . 73
3.5 Final Remarks . . . 74
4 EXPLAINING DIMENSIONALITY REDUCTION RESULTS USING SHAPLEY VALUES . . . 76
4.1 Related Works . . . 78
4.2 Review of Shapley values . . . 79
4.2.1 Shapley values illustration . . . 80
4.3 Using Shapley values to explain clusters of dimensionality reduction results . . . 80
4.3.1 Dataset annotation . . . 81
4.3.2 Shapley estimation . . . 82
4.4 Visualization Design . . . 83
4.4.1 Scatterplot component . . . 84
4.4.2 Coordination between scatterplot and Shapley values . . . 85
4.4.3 Importance Summary and Importance Heatmap . . . 86
4.4.4 Analysis example . . . 87
4.5 Case Studies . . . 89
4.5.1 Vertebral Column . . . 89
4.5.2 Red Wine Quality . . . 91
4.5.3 Communities and Crime . . . 94
4.5.4 Automatic annotation with clustering techniques . . . 97
4.6 Final Remarks . . . 99
4.6.1 Supporting exploratory analysis . . . 99
4.6.2 Run-time execution . . . 99
4.6.3 Cluster-oriented analysis . . . 100
5 HUMAP: HIERARCHICAL UNIFORM MANIFOLD APPROXIMATION AND PROJECTION . . . 101
5.1 Related Work . . . 103
5.2 Hierarchical Dimensionality Reduction . . . 106
5.2.1 Landmark selection . . . 107
5.2.2 Dissimilarity definition . . . 108
5.2.3 Generating the embedding . . . 110
5.2.3.1 Subset projection . . . 111
5.2.3.2 Mental map preservation . . . 112
5.3 Case studies . . . 112
5.3.1 Supporting hierarchical explanation of dimensionality reduction results . . . 113
5.3.2 Exploring latent features . . . 116
5.4 Numerical evaluation . . . 117
5.4.1 Running-time . . . 119
5.4.2 Neighborhood Preservation . . . 119
5.4.3 DEMaP . . . 121
5.4.4 Mental map preservation . . . 122
5.4.5 Continuity . . . 122
5.4.6 Trustworthiness . . . 123
5.4.7 Projecting subsets of data . . . 123
5.5 Final Remarks . . . 124
5.5.1 Embedding initialization . . . 124
5.5.2 Generalizability and scalability . . . 125
6 CONCLUSION . . . 126
6.1 Further Discussion and Open Questions . . . 126
6.1.1 SADIRE . . . 126
6.1.2 cExpression . . . 126
6.1.3 ClusterShapley . . . 127
6.1.4 HUMAP . . . 127
6.2 Practical Implications . . . 128
6.3 Scientific production . . . 129
6.3.1 First-authored publications . . . 129
6.3.1.1 Journal articles . . . 129
6.3.1.2 Conference papers . . . 129
6.3.1.3 Preprint . . . 130
6.3.2 Co-authored publications . . . 130
6.3.2.1 Conference papers . . . 130
6.3.2.2 Preprint . . . 130
BIBLIOGRAPHY . . . 131
A SADIRE - EMPLOYING SADIRE INTO EXPLORERTREE & ADDITIONAL USER EXPERIMENTS . . . 141
A.1 Execution Time, Coverage and Redundancy . . . 141
A.2 ExplorerTree Tool . . . 143
A.2.1 Overview . . . 143
A.2.2 Using different dimensionality reduction techniques . . . 143
A.2.3 Using different sampling selection mechanisms . . . 144
A.3 ExplorerTree using different sampling strategies . . . 145
A.4 User Experiment - Evaluating sampling strategies for ExplorerTree . . . 146
B CEXPRESSION . . . 150
B.1 Topic evaluation . . . 150
B.2 Run-time evaluation . . . 152
B.3 Different dimensionality reduction techniques . . . 153
C HUMAP . . . 155
C.1 Hierarchical Dimensionality Reduction comparison . . . 155
C.2 Drill-down analysis . . . 158
C.3 Hierarchical Dimensionality Reduction without preserving the mental map . . . 160
C.4 Additional local/global analysis . . . 162
C.4.1 Mental map preservation . . . 162
C.4.2 Continuity . . . 162
C.4.3 Trustworthiness . . . 162
C.5 Run time analysis - HUMAP steps and scalability . . . 164
C.5.1 Scalability . . . 165
C.5.2 Reproducibility . . . 166
C.6 Generalizability . . . 168
C.6.1 min_dist and n_neighbors . . . 168
C.6.2 β for local contribution . . . 170
C.6.3 Number of data points in each level . . . 172
C.7 Additional use case - Exploration of COVID-19 tweets . . . 174

1 INTRODUCTION

1.1 CONTEXT

Data visualization is critical in assisting humans in understanding natural phenomena as well as discovering ubiquitous information. Exploratory Data Analysis (EDA) (TUKEY, 1962), a set of methods and strategies for understanding results and simplifying analysis, is the primary reason scientists and other practitioners use data visualization. EDA is typically used when the information to be uncovered is unknown.

Exploratory Data Analysis is especially important when the dataset contains multiple attributes (or features), in which case it is referred to as high-dimensional data. The process of evaluating high-dimensional data presents unique challenges due to its redundancy, sparsity, and complex structures. Furthermore, as data such as text, photos, and the inner weights or logits of deep learning models become more widely available, high-dimensional data analysis has become a common activity. Efforts are being made to reduce the cognitive overload required for analysis in order to comprehend and make sense of these high-dimensional landscapes as well as the inherent relationships among the data points. One popular strategy is to reduce the number of dimensions while retaining important details about the structures and similarities found in the high-dimensional space.

Dimensionality Reduction (DR) techniques are commonly used to generate a low-dimensional representation (R^d) from a high-dimensional dataset (R^m). The end result may be referred to as embeddings, projections, or DR layouts. These methods attempt to preserve the relationships between the data points that existed in the original space. Then, while conducting exploratory data analysis, researchers use scatterplot-based representations to look for trends and other relevant information. There are numerous studies that use DR techniques for exploratory data analysis, such as understanding the features learned by ConvNets (PEZZOTTI et al., 2018) at different epochs, investigating gene expression patterns to discover new cell types (UNEN; HöLLT; PEZZOTTI, 2018), and many others. In addition to providing an excellent opportunity for high-dimensional data analysis, these representations let analysts interpret the decisions made by learning algorithms, for instance to determine whether the features fed into machine learning models are relevant to class separation (MARCILIO-JR et al., 2020).

Dimensionality reduction strategies are almost always present in high-dimensional data pipelines, but they come with their own set of challenges, such as the inability to integrate categorical data and the runtime cost of non-linear algorithms, which are thoroughly discussed in recent surveys. In the following section, we delineate the problems and motivations in the scope of this thesis.
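As a concrete reference for the pipeline just described, the sketch below reduces a high-dimensional dataset to R^2 and draws the scatterplot typically used in EDA. It is a minimal illustration that assumes the scikit-learn, umap-learn, and matplotlib packages are available; it does not reproduce any specific experiment in this thesis.

```python
# Minimal DR-based EDA sketch: data in R^64 -> embedding in R^2 -> scatterplot.
# Assumes scikit-learn, umap-learn, and matplotlib are installed.
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)            # 1,797 points in R^64
reducer = umap.UMAP(n_components=2, random_state=42)
embedding = reducer.fit_transform(X)           # one (x, y) pair per data point

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("UMAP embedding of the digits dataset")
plt.show()
```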
1.2 PROBLEMS & MOTIVATION

Dimensionality Reduction (DR) techniques are used in a wide range of scientific fields. Embeddings encoded by scatterplots aid in the analysis of document collections (ELER et al., 2009), gene expression data (UNEN et al., 2017; SALAMON et al., 2018; HÖLLT et al., 2018; SOMARAKIS et al., 2019; KRUEGER et al., 2020), and deep learning model features (RAUBER et al., 2017; BENATO; TELEA; FALCãO, 2018). However, non-linear techniques make it difficult to understand embeddings in terms of the feature contributions to layout formation. Furthermore, one-level approaches, which represent all data at once, may conceal important information about the relationship between dominant structures. These issues are summarized in the two sections that follow.

1.2.1 SUMMARIZATION & INTERPRETABILITY

Scatterplot representations of embeddings become visually cluttered as the dataset size increases, limiting understanding of the relationships among data points. This limitation can be addressed by summarizing the data during analysis using common sampling techniques. Traditional techniques (VITTER, 1985; KNUTH, 1997) use probabilistic mechanisms to select samples, whereas deterministic methods, such as CSM (JOIA; PETRONETTO; NONATO, 2015), use PCA (JOLLIFFE, 1986) loadings to select samples. These techniques are problematic for scatterplot representations because probabilistic approaches are influenced by denser areas of the projection and fail to account for outliers, whereas PCA loadings are influenced by global information. A recurring shortcoming of these two families of sampling techniques, as commonly used for scatterplot representations, is their inability to convey information about class boundaries of varying sizes.

Applying interpretation techniques on R^2 is another method for improving the understanding and visualization of embeddings. While many approaches have been proposed for explaining and interpreting the decisions of common machine learning models (RIBEIRO; SINGH; GUESTRIN, 2016; LUNDBERG et al., 2020), dimensionality reduction interpretation has been performed with layout-enrichment techniques (NONATO; AUPETIT, 2018), in which visualizations of feature distributions and other statistical charts representing attribute information aid analysts in making sense of clusters and other patterns perceived in the visual space. Simple statistics, on the other hand, cannot fully capture the differentiation among the clusters of the embedding produced by a DR technique. To address this issue, a few approaches based on contrastive analysis (FUJIWARA; KWON; MA, 2019; LE; AKOGLU, 2019) have been proposed, with the main goal of identifying unique characteristics of the clusters perceived in the visual space R^2. These techniques’ main limitations can be summarized as prohibitive runtime on high-dimensional datasets (FUJIWARA; KWON; MA, 2019) or the inability to extend the analysis to different DR techniques (LE; AKOGLU, 2019).

Rather than emphasizing the differences between the clusters, one possible solution is to assess how much each feature contributed to the embedding, that is, how much each feature contributed to the formation of clusters or other structures in the projected space (e.g., R^2). In gene expression analysis, for example, bioinformaticians want to know which genes influence each cluster in order to annotate cell types or discover new ones (LäHNEMANN; KöSTER; SZCZUREK, 2020).
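To make the contrastive idea concrete, the sketch below ranks features by how strongly their distribution inside a cluster of interest differs from their distribution in the rest of the data, using one two-sample t-test per feature. This is a simplified illustration of the statistical-test scheme explored in Chapter 3, assuming SciPy and NumPy arrays as input; it is not the exact procedure of the techniques cited above.

```python
# Rank features by how much they distinguish one cluster from the rest of the
# data, using one two-sample (Welch's) t-test per feature.
from scipy.stats import ttest_ind

def distinctive_features(X, labels, cluster):
    """X: (n_samples, n_features) NumPy array; labels: cluster id per sample."""
    inside = X[labels == cluster]
    outside = X[labels != cluster]
    scores = []
    for j in range(X.shape[1]):
        t, p = ttest_ind(inside[:, j], outside[:, j], equal_var=False)
        scores.append((j, t, p))
    # A larger |t| means the feature separates the cluster more strongly.
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)

# Usage: the five most distinctive features of cluster 0.
# top5 = distinctive_features(X, labels, cluster=0)[:5]
```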
Finding the contribution of these features to the dimensionality reduction result is an issue mainly for non-linear DR techniques (MAATEN; HINTON, 2008; MCINNES; HEALY; MELVILLE, 2018), where there is currently no way to invert the calculations and keep track of feature contributions while the algorithm is being executed (FUJIWARA; KWON; MA, 2019). Nonetheless, non-linear DR techniques are best suited for dealing with most datasets due to their ability to uncover complex structures such as manifolds. Existing techniques for DR interpretation have a few issues. Studies on feature values (COIMBRA et al., 2016; PAGLIOSA; PAGLIOSA; NONATO, 2016; MARCILIO-JR et al., 2020), for example, do not account for the dimensionality reduction process and instead concentrate on the reduced low-dimensional space. Other, more elaborate works (TURKAY et al., 2012; JOIA; PETRONETTO; NONATO, 2015) obtain feature importance through principal components (PCs). The PCs produce biased results for classes with high variation (JOIA; PETRONETTO; NONATO, 2015), and their inability to focus on local information (FUJIWARA; KWON; MA, 2019) hinders cluster-oriented analysis. These techniques cannot explain how much each feature contributed to the embedding, for example, in terms of clusters. The importance assigned to the features does not account for their contribution to clusters and other structures. Furthermore, these feature importance measures do not interact with one another to form an explanation measure. Instead, the importance of each feature is independent of the others.

1.2.2 HIERARCHICAL DIMENSIONALITY REDUCTION

Due to the increasing popularity of deep learning applications (RAUBER et al., 2017; BENATO; TELEA; FALCãO, 2018) and gene expression data analysis (UNEN et al., 2017; KRUEGER et al., 2020) with dimensionality reduction, practitioners typically seek DR techniques that can uncover complex structures (e.g., manifolds), retain clusters, and provide meaningful representations of the global relationship among clusters.

Traditional DR techniques operate at a single level of detail, representing the relationship between data points in a single low-dimensional scatterplot representation—which limits analysis when important information should be prioritized and details added as needed. This limitation can be circumvented using hierarchical DR (HDR) techniques, which project subsets of the input dataset and allow the user to select where to focus during the analysis. The idea behind HDR techniques is to project dominant structures in areas where details about their relationship would be obscured by traditional DR techniques. Current HDR techniques have limitations in terms of scaling to thousands of data points (PAULOVICH et al., 2008), mental map preservation (PEZZOTTI et al., 2016), or are primarily aimed at mass cytometry data applications, not handling datasets with hundreds or thousands of dimensions (KUCHROO, 2020). These HDR techniques cannot fully address two major issues: a good trade-off between preserving global and local structures in the low-dimensional representation of the dataset, and mental map preservation across hierarchy levels. While preserving global and local information allows for a better understanding of clusters (local structures) and their relationships (global structures), preserving the mental map makes exploration easier by reducing the number of geometric transformations on the dataset (FUJIWARA et al., 2020).
In addition to being able to deal with high dimensionality and to be flexible in terms of hierarchy manipulation and dataset type, HDR approaches must be able to unfold complex structures present in real-world datasets. Finally, there is a current need for techniques that depict complex structures while preserving global and local relationships between data samples. For example, when analyzing deep learning models (RAUBER et al., 2017; MARCíLIO-JR et al., 2020) and gene expression data (LäHNEMANN; KöSTER; SZCZUREK, 2020), the trade-off between global and local relations is critical. This thesis focuses on the issue of encoding global and local relationships among data points during the embedding process, as well as the preservation of the mental map for hierarchical dimensionality reduction techniques.

1.3 QUESTIONS & HYPOTHESES

With regard to the aforementioned issues and the motivations behind each of them, this thesis is built around the following questions, from which we conducted differing but interconnected research activities:

Q1. Can non-linear dimensionality reduction techniques be interpreted for information discovery?

Q2. How should information be summarized while interpreting the main causes of cluster formation?

Q3. How can global and local information, as well as the mental map, be preserved when interacting with hierarchical embeddings of high-dimensional data?

The following research activities (RAs) address each of the aforementioned questions.

RA1: We propose to adapt and advance state-of-the-art model-agnostic explainable dimensionality reduction techniques. More specifically, we combine SHAP (LUNDBERG; LEE, 2017) with carefully designed visualization approaches to provide interpretability features as well as interaction mechanisms on scatterplot-represented embeddings. To answer Q1, we hypothesize that additive explanations of the contributions to cluster formation, combined with feature analysis, provide a more detailed understanding of the phenomenon.

RA2: We used contrastive analysis on scatterplots to address the challenge of summarizing information and interpreting it for cluster analysis. To accomplish this, we employ statistical tests on feature values in conjunction with focus+context visualizations that enrich scatterplot representations of embeddings. Users can comprehend the positioning strategies of various dimensionality reduction techniques while retaining only a portion of the information. To specifically answer Q2, we hypothesize that only a few features are required to begin the exploration and interpretation of embeddings on R^2, and that algorithms that can order features according to relevance can filter out a lot of irrelevant information.

RA3: We proposed a novel hierarchical dimensionality reduction technique that uses a tunable hyper-parameter to choose between global and local preservation. In addition, the novel technique advances the state-of-the-art in preserving the mental map when projecting subsequent hierarchy levels. To specifically answer Q3, we hypothesize that by using different neighborhood concepts (local and global) and by fixing previously projected data points in the optimization step of the hierarchical dimensionality reduction, we can reduce these specific limitations.

Based on the above, this thesis considers the following main hypothesis:

Main Hypothesis.
Hierarchical exploration and interpretation of embeddings improve the analysis power of dimensionality reduction techniques by helping to identify and emphasize the various facets present in data. While mental map preservation in hierarchical embeddings allows for context preservation throughout exploration, dimensionality reduction interpretation explains feature contributions to layout formation in the visual space.

1.4 CONTRIBUTIONS

The main contribution of this thesis is to make progress on dimensionality reduction interpretation and hierarchical exploration. In terms of interpretation, we present two methods that use visualization approaches to highlight feature contributions to layout formation, primarily in the form of clusters. Then, to deal with large datasets, we propose a novel hierarchical dimensionality reduction technique that overcomes a few limitations of existing approaches. Despite their differences, these contributions can be combined to deal with large-scale datasets while providing interpretation for their structures.

The first contribution is concerned with the information summarization of scatterplot representations of dimensionality reduction results (see Chapter 2). We contribute a fast sampling method in visual space (R2) that preserves class boundary structures while keeping outliers visible. Aside from the method itself, we also conducted experiments to investigate how changing the parameters affects the user's ability to assess the information conveyed. Finally, the method serves as the foundation for ExplorerTree (MARCÍLIO-JR et al., 2021a), for which we conducted a user experiment comparing it to a few other sampling techniques.

The second contribution is concerned with the visual summarization problem and the interpretation of DR results (see Chapter 3). We address the problem of understanding cluster formation by leveraging statistical tests on the feature values of the input datasets. After calculating separation measures for each feature, we use a focus+context visualization tool to assist analysts in determining which features contributed the most to layout formation after dimensionality reduction. Finally, the technique enables the understanding of global positioning and intra-cluster differentiation through various interaction mechanisms.

The third contribution deals with the issue of explaining dimensionality reduction results (see Chapter 4). We advance the state of the art by adapting SHAP to explain cluster formation after dimensionality reduction. Unlike other methods for interpreting DR results in the literature, our approach provides explanations using additive contributions that are directly related to the distance measure used to convey similarity on the scatterplot representation. In addition to the core technique, we propose a visualization tool to support the exploratory and explanatory capabilities of scatterplots.

The fourth contribution advances hierarchical dimensionality reduction by employing a tunable hyper-parameter to control the trade-off between global and local structure preservation (see Chapter 5). HUMAP employs two neighborhood concepts to capture both local and global information about manifolds, while the hierarchy is built using a finite Markov chain. Finally, HUMAP preserves the mental map when projecting different hierarchy levels for fine-grained information.
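Returning to the third contribution above, the general recipe of additive cluster explanations can be made concrete with a minimal sketch. The snippet below is not the thesis's formulation, which ties Shapley values directly to the distance measure (see Chapter 4 and the ClusterShapley repository); it merely illustrates how Shapley values can attribute embedding clusters to input features by explaining a surrogate classifier. It assumes the shap, umap-learn, and scikit-learn packages, and all variable names are ours.

```python
# Minimal sketch (not the thesis's ClusterShapley formulation): explain
# clusters found in a 2D embedding via SHAP values of a surrogate classifier.
import numpy as np
import shap
import umap
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(500, 10)                          # stand-in dataset
emb = umap.UMAP(n_components=2).fit_transform(X)     # non-linear DR
labels = KMeans(n_clusters=4, n_init=10).fit_predict(emb)

# Surrogate model: predict the visual-space clusters from the original features.
clf = RandomForestClassifier(n_estimators=200).fit(X, labels)

# Additive explanations: per-feature contributions to each cluster assignment.
sv = shap.TreeExplainer(clf).shap_values(X)
# Depending on the shap version, sv is a list (one array per cluster) or a
# 3D array; normalize to shape (clusters, samples, features).
sv = np.stack(sv) if isinstance(sv, list) else np.moveaxis(np.asarray(sv), -1, 0)

# Rank features by mean |contribution| for cluster 0.
print(np.argsort(np.abs(sv[0]).mean(axis=0))[::-1])
```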
Aside from the contributions discussed in this document, Section 6.3 presents other first-authored publications that were part of the research efforts during the development of this thesis.

1.5 DOCUMENT STRUCTURE

The main structure of this document is based on the research activities carried out to answer the questions in Section 1.3, previously presented as our main contributions. As a result, each chapter, aside from the Introduction and Conclusion, is devoted to detailing one of this thesis's research contributions. In summary, the structure of this document is as follows:

• Chapter 2 introduces the SADIRE sampling technique for scatterplot representations. It covers the following article, which is related to Q2:

– Marcilio-Jr, W.E., Eler, D.M. SADIRE¹: a context-preserving sampling technique for dimensionality reduction visualizations. Journal of Visualization 23, 999–1013 (2020).

• Chapter 3 introduces two techniques for visual summarization and interpretation of dimensionality reduction plots. It covers the following article, which is related to Q1 and Q2:

– Marcílio-Jr, W.E., Eler, D.M., Garcia, R.E. Contrastive analysis for scatterplot-based representations of dimensionality reduction. Computers & Graphics 101, 46–58 (2021).

• Chapter 4 addresses the problem of interpretation of dimensionality reduction techniques. It covers the following article, which is related to Q1 and Q2:

– Marcílio-Jr, W.E., Eler, D.M. Explaining dimensionality reduction results using Shapley values². Expert Systems with Applications 178, 2021.

• Chapter 5 describes a novel technique for hierarchical dimensionality reduction that addresses the issues of local/global structure and mental map preservation. It covers the following article, which is related to Q3:

– Marcílio-Jr, W.E., Eler, D.M., Paulovich, F.V., Martins, R.M. HUMAP³: Hierarchical Uniform Manifold Approximation and Projection. arXiv preprint 2106.07718, 2021. Submitted to IEEE Transactions on Visualization and Computer Graphics.

This document is concluded in Chapter 6, which presents final discussions about the research efforts highlighted in this thesis, along with practical implications and future research.

¹ https://github.com/wilsonjr/SADIRE
² https://github.com/wilsonjr/ClusterShapley
³ https://github.com/wilsonjr/humap

2 SADIRE: SAMPLING FROM DIMENSIONALITY REDUCTION

Wilson E. Marcílio-Jr, Danilo M. Eler. SADIRE: a context-preserving sampling technique for dimensionality reduction visualizations, Journal of Visualization, Volume 23, 2020.

Although DR techniques can represent high-dimensional data structures and relations in embedded spaces (typically R2), the ability to scale computational methods to massive datasets is a frequently debated topic of increasing importance. Scientists have long recognized that simply increasing raw processing power cannot guarantee method interactivity (HELLERSTEIN et al., 1999). To address this issue, sampling mechanisms are frequently used, because approximate answers are often just as useful as exact ones (KWON et al., 2017). The visualization community also makes use of sampling techniques, for instance to propose summary visualizations that depict large amounts of data (SARIKAYA; GLEICHER; SZAFIR, 2018). Well-known sampling techniques such as Knuth's (KNUTH, 1997) and Reservoir (VITTER, 1985) rely on probabilistic mechanisms to extract samples that depict the dominant structures of a dataset.
On the other hand, a deterministic method called Column Selection Method (CSM), proposed by Joia et al. (JOIA; PETRONETTO; NONATO, 2015), was shown to be effective in selecting representative samples based on a correlation index computed using matrix decomposition. Although it outperforms commonly used sampling mechanisms, it selects unbalanced samples when classes have different coefficients of variation; in this case, important structures of the sampled dataset, such as class boundaries or outliers, are lost during the sampling process. Furthermore, when data is not uniformly distributed, random techniques (e.g., Knuth's and Reservoir) cannot reflect all data structures, impairing visual assessment of the relationship among data points.

The problem of reducing the visual clutter of scatterplot representations of dimensionality reduction results in R2 is the focus of this chapter. SADIRE, which stands for SAmpling based on DImensionality REduction, takes a dataset embedded in 2D space and selects a limited number of points from each cell of a grid built on the embedded space. As demonstrated in the experiments, this process reduces the dataset size while preserving the overall structure and class boundaries of a scatterplot-visualized projection. Furthermore, this chapter discusses a user experiment conducted to better understand how users perceive the use of representative data points for visual summarization.

The remainder of this chapter is organized as follows: Section 2.1 discusses related works; Section 2.2 details the sampling technique; Section 2.3 demonstrates the effectiveness of the proposed approach through experiments; Section 2.4 discusses the results of the user experiment; Section 2.5 presents a practical application; and Section 2.6 concludes the chapter with final discussions.

2.1 RELATED WORKS

Sampling techniques are becoming increasingly important. They provide a means to quickly infer patterns and trends in data, in addition to reducing the burden of running complex methods on large datasets. Well-known sampling algorithms (TILLÉ, 2011) can be used for general purposes; that is, they can be adapted to most problems requiring the reduction of a dataset to a representative sample. This section focuses on sampling mechanisms designed to solve specific problems. These techniques are intended to help computational methods and interactive approaches scale when dealing with large datasets.

Random sampling mechanisms are frequently used in machine learning to split datasets into train and test sets or to perform cross-validation to fine-tune hyperparameters. To reduce the computational burden of performing projections, dimensionality reduction techniques also frequently employ sampling. LSP (PAULOVICH et al., 2008) uses either random sampling or clustering medoids in its first stage to define control points, projecting the sampled instances so that the remaining dataset can be projected using interpolation. LAMP (JOIA et al., 2011) constructs a family of orthogonal affine mappings, one for each instance to be projected, using the same kind of randomly selected control points. Other dimensionality reduction techniques, such as PLMP (PAULOVICH; SILVA; NONATO, 2010b), Pekalska's (PEKALSKA et al., 1999), and PLP (PAULOVICH et al., 2011), also use sampling as preprocessing. Sampling is also a technique for reducing the cognitive overload caused by large datasets represented by traditional visualizations.
In fact, such strategies are often used in conjunction with DR methods. The HSNE (PEZZOTTI et al., 2016) technique projects dominant structures, found through Monte Carlo sampling on a finite Markov chain, to provide an exploration process based on information demand while reducing visual clutter. Nguyen and Song (NGUYEN; SONG, 2016) adopt centrality-based cluster sampling approaches, which employ centrality measures to obtain more informed samples than random sampling methods. Joia et al. (JOIA; PETRONETTO; NONATO, 2015) proposed a sampling strategy based on correlation scores generated by singular value decomposition that is sensitive to class variability.

Unlike the methods discussed previously, the one presented in this chapter builds on the advances made by dimensionality reduction techniques over the years. The method is based on the assumption that DR techniques preserve the similarity and class structures of the multidimensional space. As a result, unlike other techniques, our sampling mechanism can accurately reflect the data's structures and class boundaries when given a reliable projection.

2.2 SAMPLING BASED ON DIMENSIONALITY REDUCTION

When the target dimension is in visual space, dimensionality reduction results are typically represented by scatterplots, with R2 being the most common choice. This type of visualization uses spatial proximity to depict similarity, so that similar data points are close together and dissimilar data points are far apart. To select samples, SADIRE considers such a projection X ⊂ R2, and its operation can be divided into two main stages. It begins by shrinking the dataset and selecting candidates for the representative set using a grid structure on the projection plane. Second, it eliminates redundant data instances from the sampled set. These steps are described in detail in the sections that follow.

2.2.1 SELECTING CANDIDATES FOR THE REPRESENTATIVE SET

The first stage of the sampling strategy involves sampling from an R2 projection. SADIRE divides the projected space by generating a grid with cells of size k × k (note that k is measured in pixels). The sampled set is then created by using only the cells with at least one data point, from which a number m < n of data points can be retrieved, where n is the number of data points in each cell. While our explanation and experiments are geared toward m = 1, note that using m > 1 conveys the density of the clusters. These procedures are as follows:

1. Divide the projected space (R2) with a grid whose cells have size k × k, where k is measured in pixels;
2. Add to the set S every cell s that contains at least one data point (|s| ≥ 1);
3. Add to the set A m data points from each cell s ∈ S (if m = 1, add the medoid to A), and store the density |s| of each cell.

Fig. 1 depicts how sampling works when m = 1, i.e., when retrieving one data point from each grid cell. In the first step (a), a grid structure with cells of size k × k (in pixels) is used to group data points; in the second step (b), the density and m data points are stored for each cell; finally, using the density information, explained in the following section, SADIRE reduces the redundancy of the embedding.

Figure 1 – Defining the grid and selecting samples. (a) A grid with cells of size k × k is used to divide the R2 space; (b) only cells with at least one data point are retrieved; (c) the resulting representative set after redundancy removal.
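The following is a minimal sketch of this candidate-selection stage, assuming the projection is given as an (n, 2) NumPy array; the function and variable names are illustrative and do not come from the released SADIRE implementation.

```python
# Hedged sketch of SADIRE's first stage: bin a 2D projection into k x k cells
# and keep m = 1 representative (the medoid) per non-empty cell.
from collections import defaultdict
import numpy as np

def select_candidates(P: np.ndarray, k: float):
    """P: (n, 2) projected coordinates; k: cell size (pixels)."""
    cells = defaultdict(list)                       # grid cell -> point indices
    for i, (x, y) in enumerate(P):
        cells[(int(x // k), int(y // k))].append(i)

    reps, density = [], {}
    for cell, idx in cells.items():                 # only non-empty cells exist
        pts = P[idx]
        # Medoid: the point minimizing the sum of distances to the others.
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1).sum(axis=0)
        reps.append(idx[int(d.argmin())])
        density[cell] = len(idx)                    # stored for stage two
    return reps, density
```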
Because the width and height of a cell are smaller than the diagonal distance between two of its opposite vertices, using a grid structure to sample data points from the scatterplot representation may lose visual connections at 45 degrees. While we argue that such a problem is too insignificant to mislead users' exploration, the parameter k can be used to control it: lowering the value of k yields a representative set that is more similar to the original set.

2.2.2 REMOVING REDUNDANT DATA INSTANCES

A windowing process is used in the second main step of the algorithm to reduce redundancy among neighboring cells. For a window of size α = 3, Fig. 2 shows how the density information is used to keep useful information about the embedding structures while reducing redundancy. Starting with the densest cell (marked 1 in the figure), all of its neighboring cells are eliminated from the competition for inclusion in the resulting representative set; the result in the scatterplot for m = 1 is indicated by a black arrow. The same procedure is applied to the second densest cell (marked 2), and all of its neighboring cells are eliminated from the competition. This process is repeated for all active cells in the scatterplot that were not removed.

Figure 2 – Filtering redundant points, in this case with a window of size α = 3.

Algorithm 1 describes the process depicted in Fig. 2, which takes as input the cells ordered decreasingly by density (indicated by A). If a cell aᵢ was not removed from the set A (aᵢ.active is true in line 3), the algorithm looks for at least one cell with data points within the α-sized window centered on aᵢ (aᵢ is an outlier if it has no neighboring cells). If neighboring cells exist, they are removed (line 7) and aᵢ is added to the representative set (line 6). The output of the algorithm, G, is a set of cells that contains information about position and density.

Algorithm 1 – Removing redundant data points.
1: procedure mergeInstances(A)
2:     for each aᵢ in A do
3:         if aᵢ.active then
4:             g ← cells with data points inside the window centered on aᵢ
5:             if g ≠ ∅ then
6:                 G ← G ∪ {aᵢ}
7:                 g.active ← false
8:                 aᵢ.active ← false
9:     return G

While it is important to reduce redundancy among representative samples, outliers must be preserved on the projection plane (i.e., added to the representative set) because they can be useful in analytic tasks. Note that we consider data points that are far from clustered data points to be outliers. Thus, given the α-sized window used to remove redundant data points, our heuristic for detecting and adding outliers works as follows: because the candidate outliers are in the cells of set A that remain active (the cells with no neighboring cells), we select as outliers only the data points that are truly far away from a representative data point, for example, at least twice the cell size (2k) away.
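A possible reading of Algorithm 1 together with the outlier heuristic is sketched below. This is our illustrative interpretation, not the released implementation; in particular, approximating a cell's position by its grid coordinates scaled by k is an assumption of the sketch.

```python
# Hedged sketch of Algorithm 1 plus the outlier heuristic: cells are visited
# in decreasing density order; a kept cell deactivates its alpha-window
# neighborhood, and isolated cells survive only if far (>= 2k) from every kept cell.
import numpy as np

def merge_instances(cells: dict, alpha: int, k: float):
    """cells: {(cx, cy): density}; returns kept cells and outlier cells."""
    active = dict.fromkeys(cells, True)
    order = sorted(cells, key=cells.get, reverse=True)     # densest first
    kept, candidates = [], []
    r = alpha // 2                                         # window radius
    for c in order:
        if not active[c]:
            continue
        neigh = [(c[0] + dx, c[1] + dy)
                 for dx in range(-r, r + 1) for dy in range(-r, r + 1)
                 if (dx, dy) != (0, 0)
                 and (c[0] + dx, c[1] + dy) in cells
                 and active[(c[0] + dx, c[1] + dy)]]
        if neigh:                                          # lines 5-8 of Alg. 1
            kept.append(c)
            for nb in neigh:
                active[nb] = False
            active[c] = False
        else:                                              # isolated: outlier candidate
            candidates.append(c)
    # Outlier heuristic: keep candidates at least 2k away from every kept cell.
    kc = np.array(kept, dtype=float) * k                   # approximate positions
    outliers = [c for c in candidates
                if kc.size == 0
                or np.linalg.norm(kc - np.array(c, dtype=float) * k, axis=1).min() >= 2 * k]
    return kept, outliers
```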
2.2.2.1 Time complexity

In the worst-case scenario, when all points are far from each other and none has neighbors, the complexity of Algorithm 1 is O(α²|A|), where |A| is the size of the sampled set retrieved in the first step and α is the window used to reduce redundant points; in the same scenario, the complexity of searching for outliers is O(|A|²). Because the process of creating the grid and retrieving all points can be performed using a quad-tree, whose worst-case complexity for insertion and search operations is known to be O(n), the overall complexity of our method is O(|A|²). However, because of the similarity constraint imposed by DR techniques, Algorithm 1 reduces the number of instances in the set A. As a result, encountering the worst-case scenario is unlikely in practice.

2.3 EXPERIMENTS

This section evaluates SADIRE on its ability to extract representative data points that inherently reflect the structures imposed by a DR technique. SADIRE was compared against three sampling mechanisms: Reservoir (VITTER, 1985), Column Selection Method (CSM) (JOIA; PETRONETTO; NONATO, 2015), and Knuth (KNUTH, 1997). In the following, we provide a description of the datasets used in the experiments (see Table 1):

• CBR-ILP-IR is a document collection from three fields of artificial intelligence;
• corel-1k and corel-5k are image datasets of natural scenes and artificial objects;
• segmentation is a dataset of seven outdoor images manually segmented;
• mammals-8k and mammals-20k are datasets composed of mammal features;
• photographers-9k and photographers-36k are image datasets of different photographers²;
• shuttle is a dataset of shuttle features³;
• fiber is a dataset obtained from the 2009 Pittsburgh Brain Competition⁴;
• new-wavelet is an x-ray image dataset⁵.

Table 1 – Datasets used for analysis.

Dataset             Size     Dimensions   Classes   DR technique
CBR-ILP-IR          574      35           3         LSP
corel-1k            1000     150          10        LSP
segmentation        2100     19           8         LAMP
corel-5k            5000     4096         50        t-SNE
mammals-8k          8000     72           4         LAMP
new-wavelet         9000     1024         116       LAMP
photographers-9k    9077     4096         41        LAMP
mammals-20k         20000    72           4         LAMP
photographers-36k   36364    4096         41        PLMP
shuttle             43500    9            7         PLMP
fiber               250000   30           9         PLMP

² http://people.cs.pitt.edu/~chris/photographer/
³ https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
⁴ http://pbc.lrdc.pitt.edu/
⁵ http://ir.shef.ac.uk/imageclef/2006/

For the projection of these datasets, four dimensionality reduction techniques were used: LSP (PAULOVICH et al., 2008), t-SNE (MAATEN; HINTON, 2008), LAMP (JOIA et al., 2011), and PLMP (PAULOVICH; SILVA; NONATO, 2010b). Then, since there is currently no way to control the number of data points returned by SADIRE, the size of its resulting set was used to define the number of data points selected by CSM, Reservoir, and Knuth. We used k = 5 and α = 3 as SADIRE's parameters. The experiments were performed on a computer with the following configuration: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 32GB RAM, Windows 10 64 bits.
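As a usage note, this protocol can be expressed with the illustrative sketches from Section 2.2 (the function names below come from those sketches, not from the released code); the baselines simply receive SADIRE's output size as their sampling budget.

```python
# Hedged sketch of the comparison protocol: SADIRE with k = 5 and alpha = 3
# fixes the sample size, and each baseline draws the same number of points.
# P is the (n, 2) projected dataset from the earlier sketches.
import numpy as np

reps, density = select_candidates(P, k=5)                # stage one (Sec. 2.2.1)
kept, outliers = merge_instances(density, alpha=3, k=5)  # stage two (Sec. 2.2.2)
budget = len(kept) + len(outliers)

# Stand-in for the probabilistic baselines (Knuth's and Reservoir):
rng = np.random.default_rng(0)
baseline_idx = rng.choice(len(P), size=budget, replace=False)
```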
Fig. 3 shows the R2 projections and the data points retrieved by each sampling technique. While the points retrieved by the Knuth and Reservoir techniques concentrate in denser areas due to their probabilistic design, the CSM technique prioritizes the elements with the highest correlation scores with respect to the original dataset, a characteristic that retrieves more representatives from classes with higher variation (JOIA; PETRONETTO; NONATO, 2015). Thus, a few structures may not be preserved and the retrieved data points are not well suited to be representatives; see, e.g., the sampled points of mammals-8k, mammals-20k, shuttle, and fiber. On the other hand, SADIRE reflects the boundaries imposed by the class separation and shows the layout structure of the initial projection, preserving the structures imposed by the DR technique. In this sense, SADIRE resembles the structure of the dataset, provided that the projection technique truly reflects the structure of the high-dimensional space (RAUBER et al., 2017). It is important to emphasize that the result of SADIRE is tied to the R2 embedding generated by the dimensionality reduction technique; that is, the main purpose of SADIRE is to preserve the structures of the projected dataset in R2. Consequently, the evaluation of how well a method reflects the high-dimensional space must be performed considering the high-dimensional space and the dimensionality reduction technique.

Figure 3 – Initial projections of the datasets and their respective sampled sets according to the sampling techniques. The color of the circles depicts the different classes in the dataset.

The numerical evaluation uses the concept of representativeness (MA; WEI; CHEN, 2011), given by the Redundancy and Coverage metrics. While the Redundancy metric is used to assess how redundant a sampled set is, the Coverage metric assesses how well a sampled set represents the information in the original set. The following sections provide analyses based on the Redundancy and Coverage of the representative sets extracted from DR results visualized by scatterplots. Then, SADIRE is evaluated on a clustering task using the silhouette coefficient.

2.3.1 REDUNDANCY

The idea behind measuring the redundancy of a sampled set is to verify whether sampled points can be removed without information loss. That is, if two or more points are too close to each other, a single point can represent them due to their similarity. Equation 2.1 shows how to compute Redundancy, which is simply the mean of the lowest distances from each sampled point to the other points in the sampled set. In the equation, $\lVert \cdot \rVert$ stands for the Euclidean distance, X is the sampled set with N = |X| points, and x and x′ are distinct sampled points in the visual space. Notice that higher values encode better quality, since they mean the sampled points are far from each other.

$$\frac{1}{N} \sum_{x \in X} \min_{x' \in X \setminus \{x\}} \lVert x' - x \rVert \qquad (2.1)$$

Fig. 4 shows the results for the redundancy metric. Due to their design, the Knuth and Reservoir techniques perform poorly according to redundancy: the sampled points are too close to each other. Notice in Fig. 3 that the denser the area of the projection, the greater the number of points retrieved by these techniques. The CSM technique, in turn, tends to select more points from high-variability classes (see the sampled points of mammals-8k, mammals-20k, and shuttle in Fig. 3). Our technique is designed not to return redundant points and therefore achieved the best results.

Figure 4 – Redundancy metric, which measures how redundant a sampled set is through the distance from each sampled point to its nearest sampled neighbor. The higher the value, the less redundant the set.
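Equation 2.1 can be computed directly; the following is a minimal NumPy sketch with names of our choosing.

```python
# Hedged sketch: Redundancy (Eq. 2.1) -- mean nearest-neighbor distance
# within the sampled set. Higher is better (less redundancy).
import numpy as np

def redundancy(X: np.ndarray) -> float:
    """X: (N, 2) sampled points in visual space."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)                           # exclude x' == x
    return float(D.min(axis=1).mean())
```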
2.3.2 COVERAGE

We also rely on spatial proximity to assess the coverage of a given representative set. In this case, we evaluate how far each non-sampled point is from the nearest sampled point. Equation 2.2 shows the Coverage metric, where lower values encode better quality; x represents a sampled point in the sampled set X, y represents a non-sampled point in the original set Y, and N = |Y|. Besides analyzing the returned value, the standard deviation of the distances in Equation 2.2 is used to assess whether there are points too far from the sampled set.

$$\frac{1}{N} \sum_{y \in Y} \min_{x \in X} \lVert y - x \rVert \qquad (2.2)$$

Fig. 5 shows the results for the coverage metric. For a few cases (fiber and photographers-36k, for example), SADIRE presents higher values than Reservoir and Knuth. This is due to the fact that the majority of the points selected by those techniques are concentrated in denser areas, so the mean of the lowest distances yields low values. However, the standard deviation shows that a few points are too far from any representative (sampled point), which implies bad results for the Knuth and Reservoir techniques. CSM's results are related to its formulation, in which there is no way to control where to focus when extracting sample points.

Figure 5 – Coverage metric, which assesses whether a sampled set is able to represent all data structures. The standard deviation is represented by error bars; a greater standard deviation indicates that some instances are far from a representative (an instance from the sampled set).

2.3.3 CLUSTERING

As in Joia et al.'s work (JOIA; PETRONETTO; NONATO, 2015), we also evaluated our technique on a clustering task. The k-Means (MACQUEEN, 1967) and Expectation Maximization (EM) (DEMPSTER; LAIRD; RUBIN, 1977) algorithms were applied to the SADIRE and CSM sampled points, and each non-sampled point was then associated with the cluster of the closest medoid; we only compared our technique against CSM because Joia et al. (JOIA; PETRONETTO; NONATO, 2015) had already compared CSM with other methods. The number of clusters was set to the number of classes of each dataset. Fig. 6 shows the silhouette coefficient (TAN; STEINBACH; KUMAR, 2006) for the techniques combined with the EM and k-Means algorithms; values range from −1 to 1, and greater values indicate better cohesion and separation among clusters.

Figure 6 – Silhouette coefficient of the techniques applied for clustering data.

SADIRE performed similarly to the plain clustering processes, reflecting the structures of these datasets. The CSM technique, on the other hand, tends to neglect classes with lower variation, which negatively affects the clustering process.

2.3.4 RUNNING TIME

Fig. 7 shows the running time of SADIRE for the datasets in Table 1. Note that rather than correlating with the size of the datasets, the running times are more related to the density of the clusters imposed by the dimensionality reduction techniques on the R2 embedding; see, for example, the time taken for sampling from the shuttle dataset, which is the second biggest dataset in terms of number of data points. This analysis gives us the insight that, although the worst-case complexity of our algorithm lies in the search for outliers, in practice the complexity of inserting and searching data points in a quad-tree dominates the other steps.

Figure 7 – Runtime of the SADIRE algorithm. The algorithm takes more time when sampling from high-density embeddings.
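For completeness, the Coverage metric of Section 2.3.2 (Equation 2.2), including the standard deviation shown as error bars in Fig. 5, admits the same direct treatment as Equation 2.1; again a minimal sketch with illustrative names.

```python
# Hedged sketch: Coverage (Eq. 2.2) -- mean distance from each non-sampled
# point to its nearest representative. Lower is better; a large standard
# deviation flags points left far from any representative.
import numpy as np

def coverage(X: np.ndarray, Y: np.ndarray):
    """X: (|X|, 2) sampled points; Y: (|Y|, 2) non-sampled points."""
    d = np.linalg.norm(Y[:, None] - X[None, :], axis=-1).min(axis=1)
    return float(d.mean()), float(d.std())
```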
2.3.5 PARAMETER ANALYSIS

This section inspects the influence of SADIRE's k and α parameters. We used the 20 Newsgroups⁶ and MNIST (LECUN; CORTES, 2010) datasets projected with UMAP (MCINNES; HEALY; MELVILLE, 2018) and t-SNE (MAATEN; HINTON, 2008), as shown in Fig. 8.

Figure 8 – UMAP and t-SNE projections: (a) 20 Newsgroups projected using UMAP; (b) MNIST projected using t-SNE.

The projection of the 20 Newsgroups dataset is messier in terms of class boundaries. However, as seen in Fig. 9a, SADIRE conveys the structures and even the outliers of the embedding for lower values of k and α. Figs. 9a and 9b show the influence of the parameters k and α for the 20 Newsgroups and MNIST datasets, respectively. The first characteristic is that the k and α parameters are inversely proportional to the number of representatives and to the redundancy. Moreover, the layout formed by the representative set loses its similarity to the original embedding for greater values of these parameters, misleading user exploration. The second interesting pattern is that all of the layouts using k = 5 show high similarity to the original embedding, while combinations of higher values of k and α produce layouts with poor similarity to the original embedding. These characteristics indicate that k is the most critical parameter.

Figure 9 – Influence of the k and α parameters on the returned representative set: (a) influence of the parameters on the 20 Newsgroups dataset; (b) influence of the parameters on the MNIST dataset.

Depending on the application, SADIRE can be used with various combinations of parameters. However, a suggestion that balances similarity to the original embedding with low redundancy is to use k = 5 or k = 10. To support this guideline, the following section provides a user evaluation of the similarity between SADIRE samples and the original scatterplots.

⁶ http://qwone.com/~jason/20Newsgroups/

2.4 USER EXPERIMENT

In this user experiment, the representative sets were evaluated according to their similarity to the initial embeddings and according to visual clutter. We used four datasets with increasing visual disorder. For each dataset, we used six SADIRE parameter configurations: k = 5, α = 3; k = 10, α = 3; k = 10, α = 5; k = 10, α = 7; k = 20, α = 5; and k = 20, α = 7. The initial embeddings and the layouts produced by sampling with SADIRE are shown in Fig. 10, where circle area encodes density. Due to their similarity to the other layouts, a few configurations were omitted from the figure, such as k = 5, α = 3 and k = 20, α = 3.

Figure 10 – Initial embeddings and the layouts that result from reducing visual disorder with representative selection. SADIRE was applied with several configurations by varying the k and α parameters.

The experiment was performed by 42 participants: two graduate students and 40 computer science students. The participants were given two tasks based on the six representative layouts in Fig. 10:

T1: To order the representative layouts from the least visually cluttered to the most visually cluttered;

T2: To rate (from 0 to 10) the representative layouts according to their similarity to the initial embedding, where 0 stands for the least similar and 10 for the most similar.
For T1, the six layout configurations were presented all at once for the participants to order; we did not provide the initial embedding, to avoid biasing the participants. To facilitate the rating in T2, the six layouts were presented to the participants one by one, after showing the initial embedding.

Fig. 11 depicts the users' perceptions of the layouts produced by the sampling technique. The rows in each heatmap represent the six representative layouts (see Fig. 10) generated by varying SADIRE's parameters, while the columns represent the visual clutter ordering (from least to most visually cluttered) and the similarity rating. The color of each cell encodes how many times participants placed a representative layout in that position (for visual disorder) or assigned that rating to it (for similarity). It should be noted that the majority of the participants agreed on the position or rating assigned to each layout.

In the heatmaps for T1, we can see a pattern indicating that, according to the participants, the visual disorder decreases as the number of representatives decreases. In other words, the participants related the visual clutter of the projection to the number of circles on the plane. For T2, we can see that the similarity of the representative layouts to the initial layout decreases as more points are removed, implying that as the number of data instances projected onto the plane decreases, information is lost. Given these two main patterns, representative layouts that achieve a balance between visual disorder and layout similarity are the best option. Another intriguing pattern can be seen in the lower triangular portion of the heatmaps depicting visual disorder. This pattern can be explained by the fact that users may struggle to analyze representative layouts with fewer data points, at which point the embeddings begin to lose information.

Figure 11 – The outcomes of the user experiments. Each row denotes a unique representative-selection configuration. The heatmap cells encode how many times a representative layout was placed in each order position (visual disorder) and how many times it was assigned each rating (similarity) for each dataset.

There is a link between coverage and layout similarity, as well as between redundancy and visual clutter/disorder. According to the participants' responses, as the number of representatives increases, the resulting set becomes more similar to the initial embedding (a natural observation). Furthermore, the initial embedding is better covered because it is represented by a greater number of data instances. On the other hand, we can see that redundancy is higher in visually cluttered layouts (i.e., very similar data instances are projected close to each other). This analysis backs up our claim that a good representative-selection technique imposes a balance between coverage (as layout similarity) and redundancy (as visual clutter).

2.5 PRACTICAL APPLICATION

Fig. 12 depicts SADIRE's suitability for representing the first level of ExplorerTree (MARCÍLIO-JR et al., 2021a), a hierarchical scatterplot navigation approach.
On the left is an LSP (PAULOVICH et al., 2008) projection of the Corel dataset, with 5,000 data points colored by class. On the right, we show the first level of ExplorerTree using different sampling strategies on the same dataset. Unlike the other techniques, SADIRE uses the sampled data points to cover the entire projected space, while the circle area accurately depicts the densest parts of the projection. Many structures encoded by SADIRE are lost with Reservoir and CSM; these structures are highlighted in orange. Finally, bisecting k-means fails to encode density information well, aggregating data points in a way that does not correspond to the density of the projection; see the structures highlighted in blue.

Figure 12 – ExplorerTree using different sampling algorithms. On the left, the traditional scatterplot representation. On the right, the first level of ExplorerTree using four sampling strategies.

Finally, Appendix A documents the results of additional user experiments and the rationale for using SADIRE in the ExplorerTree technique.

2.6 FINAL REMARKS

Sampling mechanisms are an important tool for dealing with increasing data availability, the high complexity of computational methods, and the interaction requirements of user-centered approaches. In the preceding sections, we showed how our sampling method can be used to generate meaningful representative sets from visual representations of dimensionality reduction (DR) results. As a consequence, SADIRE depends on the ability of DR techniques to represent the underlying data structures of the multidimensional space. This means that our technique is appropriate for visual tasks in which datasets are represented by R2 embeddings and a thorough examination of the DR technique and the feature space has been performed. Another interesting feature of our proposal is that the sampling mechanism can be modified to convey cluster density; this can be accomplished by returning a percentage of data points (or simply more than one data point) from each cell (see Section 2.2.1). Finally, the parameters k and α can be used to adjust the size and redundancy of the representative set.

3 CONTRASTIVE ANALYSIS FOR SCATTERPLOT REPRESENTATIONS OF DIMENSIONALITY REDUCTION

Wilson E. Marcílio-Jr, Danilo M. Eler, Rogério E. Garcia. Contrastive analysis for scatterplot-based representations of dimensionality reduction, Computers & Graphics, Volume 101, 2021.

Analysts inspect clusters in scatterplot representations of Dimensionality Reduction (DR) results to understand data nuances and the features' contributions to the embedding in the projected space. Contrastive analysis (FUJIWARA; KWON; MA, 2019; LE; AKOGLU, 2019) is a promising strategy for analyzing DR results; it consists of understanding how features differentiate clusters and contribute to their formation in the visual space (R2). Such an approach has a wide range of practical applications. A tool for labeling textual data, for example, would benefit from contrastive analysis to highlight the differences among clusters (e.g., topics); due to their distinct characteristics, these clusters can represent candidates for new classes. Another important application is determining which characteristics describe two distinct groups of patients after medical experiments (ABID et al., 2018). Only a few studies provide contrastive analysis (FUJIWARA; KWON; MA, 2019; LE; AKOGLU, 2019).
Instead, the literature presents a small number of studies that support the analysis of DR results via visualization of global information (TURKAY et al., 2012; JOIA; PETRONETTO; NONATO, 2015; COIMBRA et al., 2016; STAHNKE et al., 2016; PAGLIOSA; PAGLIOSA; NONATO, 2016), emphasizing the importance of the features (attributes) on the embedded space. The main approaches use the principal components (PCs) from PCA (JOLLIFFE, 1986) to determine which features contribute to cluster formation (TURKAY et al., 2012; JOIA; PETRONETTO; NONATO, 2015). These contributions, however, do not highlight the clusters' distinct characteristics. ccPCA (FUJIWARA; KWON; MA, 2019), on the other hand, uses contrastive PCA (ABDI; VALENTIN, 2007) to find contrastive information (i.e., the unique characteristics) of each cluster. The main limitation of ccPCA stems from PCA itself, specifically the prohibitively long execution time for high-dimensional datasets: it takes O(m³) on the number of dimensions m. ContraVis (LE; AKOGLU, 2019), another approach, is only applicable to textual data and cannot assist in the interpretation of DR layouts because it is itself a dimensionality reduction approach. Furthermore, that study does not present a strategy for comprehending the contributions of the terms to the embedding organization. In this chapter, we discuss cExpression, a method for analyzing DR results that employs contrastive analysis and a