Wilson Estécio Marcílio Júnior

Toward Interpretable and Hierarchical Methods for Dimensionality Reduction

São José do Rio Preto
2023

Câmpus de São José do Rio Preto

Wilson Estécio Marcílio Júnior

Toward Interpretable and Hierarchical Methods for Dimensionality Reduction

Tese apresentada como parte dos requisitos para obtenção do título de Doutor em Ciência da Computação, junto ao Programa de Pós-Graduação em Ciência da Computação, do Instituto de Biociências, Letras e Ciências Exatas da Universidade Estadual Paulista “Júlio de Mesquita Filho”, Câmpus de São José do Rio Preto.

Financiadora: CAPES

Orientador: Prof. Dr. Danilo Medeiros Eler

São José do Rio Preto
2023

M319t Marcílio Júnior, Wilson Estécio
Toward Interpretable and Hierarchical Methods for Dimensionality Reduction / Wilson Estécio Marcílio Júnior. -- São José do Rio Preto, 2023
178 f. : il., tabs.
Tese (doutorado) - Universidade Estadual Paulista (Unesp), Instituto de Biociências, Letras e Ciências Exatas, São José do Rio Preto
Orientador: Danilo Medeiros Eler
1. Dimensionality Reduction. 2. Unsupervised Learning. 3. Data Visualization. 4. Hierarchical Exploration. I. Título.

Sistema de geração automática de fichas catalográficas da Unesp. Biblioteca do Instituto de Biociências, Letras e Ciências Exatas, São José do Rio Preto. Dados fornecidos pelo autor(a). Essa ficha não pode ser modificada.

Wilson Estécio Marcílio Júnior

Toward Interpretable and Hierarchical Methods for Dimensionality Reduction

Tese apresentada como parte dos requisitos para obtenção do título de Doutor em Ciência da Computação, junto ao Programa de Pós-Graduação em Ciência da Computação, do Instituto de Biociências, Letras e Ciências Exatas da Universidade Estadual Paulista “Júlio de Mesquita Filho”, Câmpus de São José do Rio Preto.

Financiadora: CAPES

Comissão Examinadora

Prof. Dr. Danilo Medeiros Eler
UNESP – Câmpus de Presidente Prudente
Orientador

Profª. Drª. Maria Cristina Ferreira de Oliveira
USP - ICMC - São Carlos

Prof. Dr. Jorge Luis Poco Medina
FGV - EMAp - Rio de Janeiro

Prof. Dr. Wallace Correa de Oliveira Casaca
UNESP - Câmpus de Rosana

Prof. Dr. Milton Hirokazu Shimabukuro
UNESP – Câmpus de Presidente Prudente

Presidente Prudente, 07 de fevereiro de 2023

Para minha mãe, Raquel Rocha.

AGRADECIMENTOS

Aos meus pais, Raquel e Wilson, agradeço por todo o apoio que me deram para superar os desafios da vida acadêmica e por acreditarem que eu posso sempre mais—vocês são as pessoas mais especiais do mundo, muito obrigado por garantir tranquilidade em muitas das vezes em que eu pensei que não iria conseguir. Aos meus irmãos, Larissa e Peterson, agradeço pela parceria de sempre e pelo interesse nos assuntos do meu trabalho—fiquem sabendo que meus sobrinhos serão todos evangelizados para seguirem carreira na pesquisa; muito obrigado pelas conversas e momentos de descontração durante esse período. À minha noiva, Karla, que esteve comigo a maior parte do tempo durante o desenvolvimento deste trabalho e, querendo ou não, teve de observar uma pessoa lidando com os altos e baixos da vida acadêmica—agradeço por todo o apoio e por mudar as perspectivas quando eu me encontrava em uma rua sem saída. Agradeço ao meu orientador, Danilo, pela parceria (já são dez anos desde a graduação) e pela ajuda a me tornar o pesquisador que sou hoje—agradeço muito pelas oportunidades e pela paciência em todos esses anos.
Gostaria também de agradecer aos professores que contribuíram para as atividades de pesquisa (especialmente Fernando Paulovich, Rafael Martins e Rogério Garcia) e ao pessoal do laboratório pelas diversas conversas e momentos de descontração. Por fim, agradeço aos funcionários da FCT-UNESP, que indiretamente forneceram um ambiente confortável para o desenvolvimento deste trabalho. Agradeço a Deus por estar ao meu lado todo o tempo.

Este trabalho foi financiado por: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), processos #2018/17881-3 e #2018/25755-8; e Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), processo #88887.487331/2020-00.

“There will be time, there will be time
To prepare a face to meet the faces that you meet;
There will be time to murder and create,
And time for all the works and days of hands
That lift and drop a question on your plate;
Time for you and time for me,
And time yet for a hundred indecisions,
And for a hundred visions and revisions,
Before the taking of a toast and tea.”
– T.S. Eliot, The Love Song of J. Alfred Prufrock

RESUMO

Análise Exploratória de Dados (EDA) é uma ferramenta valiosa para descobrir novos insights a partir de dados de alta dimensão. Os procedimentos de EDA ajudam cientistas e profissionais a entender a relação entre instâncias de dados e estruturas usando metáforas visuais que enfatizam informações sobre o objetivo da análise. Amplamente empregadas para análise exploratória, técnicas de Redução de Dimensionalidade mapeiam o relacionamento presente em espaços de alta dimensão para espaços de dimensão menor, normalmente no R^2, e permitem descobertas científicas que vão do entendimento do comportamento de redes neurais à anotação de tipos celulares. Esta tese apresenta um conjunto de abordagens para estender o poder de análise utilizando técnicas de Redução de Dimensionalidade, com o objetivo principal de ajudar no entendimento dos embeddings gerados por técnicas não lineares, além do estabelecimento de abordagens de exploração hierárquica para análise de acordo com a demanda de informações. Tais abordagens foram resultado das seguintes atividades de pesquisa principais: (i) seleção de representativos no espaço visual com preservação do contexto e das estruturas mapeadas; (ii) técnicas para análise de clustering utilizando contrastividade, permitindo o entendimento das diferenças entre instâncias de dados para a formação de clusters; (iii) explicação de forma aditiva do layout gerado por técnicas não lineares; (iv) técnica de redução hierárquica de dimensionalidade para análise granular de conjuntos de dados, que permite a preservação das estruturas entre níveis hierárquicos.

Palavras-chave: Redução hierárquica de dimensionalidade. Interpretação. Análise contrastiva.

ABSTRACT

Exploratory Data Analysis (EDA) is a valuable tool for discovering new insights from high-dimensional data. EDA procedures help scientists and practitioners understand the relationship between data instances and structures using visual metaphors that emphasize information about the purpose of analysis. Widely used for exploratory analysis, Dimensionality Reduction techniques map the relationships present in high-dimensional spaces to lower-dimensional spaces, usually R^2, and enable scientific discoveries that range from understanding the behavior of neural networks to the annotation of cell types.
This thesis presents a set of approaches to extend the power of analysis using Dimensionality Reduction techniques, with the main objective of helping to understand the embeddings generated by non-linear techniques, in addition to establishing hierarchical exploration approaches for analysis according to the demand for information. Such approaches resulted from the following main research activities: (i) selection of representatives in the visual space with preservation of the context and mapped structures; (ii) techniques for analyzing clustering using contrastivity, allowing the understanding of the differences between data instances that lead to the formation of clusters; (iii) additive explanation of the layout generated by non-linear techniques; (iv) a hierarchical dimensionality reduction technique for granular analysis of datasets, which allows the preservation of structures between hierarchical levels.

Keywords: Hierarchical dimensionality reduction. Interpretation. Contrastive analysis.

LIST OF FIGURES

Figure 1 – Defining the grid and selecting samples. (a) A grid with cells of size k by k is used to divide the R^2 space; (b) Only cells with at least one data point are retrieved; (c) The resulting representative set after redundancy removal. . . . 33
Figure 2 – Filtering redundant points. In this case, with a window of size α = 3. . . . 34
Figure 3 – Initial projections of the datasets and their respective sampled sets according to the sampling techniques. The color of the circles depicts the different classes in the dataset. . . . 37
Figure 4 – The redundancy metric measures how redundant the sampled set is by assessing how the points correlate with each other. The higher the value, the less redundant the set. . . . 38
Figure 5 – Coverage metric to assess whether a sampled set is able to represent all data structures. The standard deviation is represented by error bars. Notice that a greater standard deviation indicates that some instances are far from a representative (an instance from the sampled set). . . . 39
Figure 6 – Silhouette coefficient of the techniques applied for clustering data. . . . 40
Figure 7 – Runtime of the SADIRE algorithm. We can see that the algorithm takes more time for sampling from high-density embeddings. . . . 41
Figure 8 – UMAP and t-SNE projections. . . . 42
Figure 9 – Influence of the k and α parameters on the returned representative set. . . . 43
Figure 10 – Initial embeddings and the layouts that result from reducing visual disorder with representative selection. SADIRE was applied in several configurations by varying the k and α parameters. . . . 44
Figure 11 – The outcomes of the user experiments. Each line denotes a unique representative selection configuration. The heatmap cells encode how many times a representative layout was placed in that order position (visual disorder) and how many times a representative layout was assigned a rank number for each dataset (similarity). . . . 45
Figure 12 – ExplorerTree using different sampling algorithms. On the left, the traditional scatterplot representation. On the right, the first level of ExplorerTree using four sampling strategies. . . . 46
Figure 13 – The cExpression pipeline.
The user submits a high-dimensional dataset with R^2 coordinates and cluster labels after dimensionality reduction (A). The contrastive information for each pair of cluster and feature is computed using the clustering labels and the high-dimensional data (B). The data generated from this process is used to create visualization approaches that highlight the unique characteristics of each cluster (C). . . . 50
Figure 14 – Defining distinctive features. The feature’s distribution (a) in a cluster of interest is compared to its distribution on the remaining dataset (b). Then, Student’s t-test (c) returns the probability of these two distributions being different (d). Finally, the features are arranged in decreasing order according to their t-score to communicate discriminative power (e). . . . 52
Figure 15 – UMAP projection of the Vertebral Column dataset and two distinctive features for the Spondylolisthesis class. . . . 52
Figure 16 – The cExpression visualization tool. Users provide high-dimensional points with annotated clusters and positions in R^2 (a). The R^2 coordinates are used for the scatterplot layout (c), from which the analysis starts. Users click on points in a cluster to investigate their contrastive information. For instance, the bipartite graph (c) shows the distinctive terms for the dark-orange cluster. There is also an interaction mechanism to visualize the feature values on the scatterplot visualization (d, e). Finally, an overview analysis can be achieved through the heatmap encoding. . . . 54
Figure 17 – The bipartite graph encoding feature importance. The p-values and t-values are encoded using red-to-blue and continuous color scales, respectively. It is worth noting that we display the number of decimal places after zero rather than the p-value itself. The first axis (a) represents the context of the available features, while the second axis (b) represents the focus—in this case, they are the same. The connections between the focus axis and the t-score represent the relationship between p-values and t-scores, with green representing positive values and red representing negative values. Finally, the four most distinguishing characteristics are pre-selected and displayed as a histogram plot (c). . . . 55
Figure 18 – Construction steps for cell-based encoding. For each cell, the sum of each feature is obtained (a). The cell is then divided based on user demand to communicate the features. The space assigned to a feature is proportional to the number of features in the cell. For example, one feature in (b), two in (c), and three in (d). . . . 57
Figure 19 – Cell-based visualization for feature distribution on the scatterplot. After inspecting the distinctive features on the scatterplot (a), users can visualize the feature values using cell-based visualizations. For instance, when one feature is selected (b, syrian), a simple continuous color scale is sufficient. For more than one feature (c, syrian, kill, protect), cell-based encoding is employed. . . . 58
Figure 20 – A comparison of two clusters. The visualization strategy of Fig. 17 is adapted to communicate information when comparing two clusters.
In essence, while the information for the first cluster (light-blue in the figure) is encoded as usual (c), the information for the second is positioned on top (a), and the context axis is positioned in the center of the visualization (b). Furthermore, the four most distinguishing features of both clusters are depicted in (d). . . . 59
Figure 21 – Users can inspect the histograms of available features through lasso selection over the points encoding p-values. . . . 60
Figure 22 – Focus+context operation. The visualization starts with the context and focus axes as the same (1). Then, users can select a range of the context axis to focus the analysis (2), resulting in an update of the focus axis (3). . . . 61
Figure 23 – UMAP projection of the news dataset. . . . 62
Figure 24 – After selecting the five most discriminative features, highlighting the terms nuclear, plant, and fukushima shows that UMAP separated the news articles related to the nuclear accident on the left side of the projection. . . . 62
Figure 25 – Inspecting terms with lower confidence. Terms with lower confidence in the light-orange cluster are related to political aspects (a) and were positioned in a different subcluster. . . . 63
Figure 26 – Contrastive information showing levels of understanding. The terms highlighted in red show high-level information, while the terms highlighted in blue give specific hints about the news articles. . . . 64
Figure 27 – Inspection of the terms germani and europ shows that the dimensionality reduction separated news articles focused on the European continent from the news articles focused on Germany. . . . 64
Figure 28 – UMAP projection of the tweets about COVID-19. . . . 65
Figure 29 – Discriminative terms for the dark-orange cluster—terms related to respiratory problems. . . . 66
Figure 30 – Discriminative terms for the light-orange cluster. Terms indicate users tweeting their complaints about symptoms expressed during COVID-19 infection. . . . 66
Figure 31 – Comparing two clusters with fever as the main subject. . . . 67
Figure 32 – Distinctive terms for the light-blue cluster. In this case, there are more general aspects of symptoms related to fever. For instance, the term sore could be related to sore throat or a new COVID-19 symptom. . . . 68
Figure 33 – Distinctive terms from the dark-blue cluster can be considered indicative of Dengue symptoms. . . . 68
Figure 34 – Evaluating another term from the dark-blue cluster. Terms such as difficulty and breathing indicate one of the most severe COVID-19 symptoms: shortness of breath. . . . 69
Figure 35 – Heatmap showing that the red cluster comprises the tweets where people are worried about wearing masks and coryza. . . . 69
Figure 36 – UMAP projection of the Vertebral Column dataset next to the heatmap visualization. . . . 70
Figure 37 – The class of patients with Spondylolisthesis shows greater values for all of the features, except for Pelvic Radius. . . . 70
Figure 38 – Inspection of the distribution values of the features shows that the features are truly expressed more in the cluster of patients with Spondylolisthesis. . . . 71
Figure 39 – Retrieving topics using cExpression. Given a bag-of-words representation of a document collection (a), we compute the dataset clusters (b). cExpression is executed to retrieve distinctive terms for each cluster (c), and the topics correspond to the first m terms of each cluster. . . . 72
Figure 40 – cExpression shows competitive results for topics with a low number of terms (≤ 20). . . . 73
Figure 41 – cExpression was able to surpass LDA and NMF for most of the evaluated numbers of topics. . . . 73
Figure 42 – Execution time on a logarithmic scale for retrieving distinctive features using contrastive analysis. . . . 74
Figure 43 – The ClusterShapley framework. In the “dataset annotation” component (1), a dimensionality reduction (1.a) process helps users analyze and annotate clusters perceived in the visual space (R^2); then, the clusters defined in R^2 annotate the high-dimensional dataset (1.c). In the “Shapley estimation” component (2), the annotated high-dimensional dataset is used to generate cluster probabilities for each data sample (2.a), which are employed for Shapley value estimation (2.b). . . . 81
Figure 44 – Cluster definition through manual selection. Users receive a projected dataset with no clustering imposed, and then define clusters by lassoing the groups perceived in the visual space. Notice that black circles indicate data samples not belonging to any cluster, while circles with different hues indicate already assigned data samples. . . . 82
Figure 45 – An overview of the prototype system. In (A), users can specify datasets and load previously generated explanations. In (B), we draw a scatterplot representation and use color to encode classes. Hovering over specific class instances generates summary representations of Shapley values, which are used to indicate importance (C). The detailed examination of Shapley values and feature importance assists analysts in comprehending the relationship between feature value and Shapley value (D). The Importance Summary is shown in (E), which summarizes the mean of the absolute Shapley values for each pair (class, feature). The importance overview is depicted by a heatmap with the sum of the absolute Shapley values (F). . . . 84
Figure 46 – Summarization approach encoding two histograms for a given feature i: H_i^e for feature values equal to or greater than the mean (σ_i) and H_i^o for feature values lower than the mean. The two histograms are encoded together by alternating their bins. . . . 85
Figure 47 – A tool-tip summarizing information from the blue cluster of Fig. 45. . . . 85
Figure 48 – Coordination between the scatterplot and the dot plot encoding the Shapley values. Each line segment represents a point in the scatterplot representation. . . . 87
Figure 49 – Two clusters with low separation. While the first three features show the same overall importance for these two clusters, sepal width plays a significant role in differentiating the orange cluster (a). . . . 88
Figure 50 – Explanations for the Vertebral Column dataset. There is a clear separation between patients with complications (Spondylolisthesis or Hernia) and the healthy ones. The Importance Summary also supports visualization of such characteristics. . . . 89
Figure 51 – Correlation between feature values and their importance for categorizing cluster 1. The four most important features contributed to differentiating patients with Spondylolisthesis from the others. . . . 90
Figure 52 – Comparison between regular patients (A) and patients with Hernia (B). The difference in values for sacral slope was not sufficient to impose a good separation on the dimensionality reduction result. . . . 91
Figure 53 – Dimensionality reduction results for the Red Wine Quality dataset, using intensity to encode the quality index and colors to depict clusters. . . . 92
Figure 54 – Cluster 4 is well defined and distinguishable due to the salty flavor (high levels of chlorides) of the wines. . . . 92
Figure 55 – Comparison between two clusters projected far apart that share two important features. The difference between alcohol levels was determinant for differentiating the quality of the wine in these two clusters. . . . 93
Figure 56 – A combination of higher values of Fixed Acidity and lower values of Volatile Acidity (which can be unpleasant at higher levels) can explain the higher quality of the data instances. . . . 94
Figure 57 – At higher levels, Total Sulfur Dioxide tends to add flavor to the wine and be noticed in the nose. This important feature explains the higher concentration of low-quality wines in cluster 1. . . . 94
Figure 58 – Feature contributions and the dimensionality reduction result for the Communities and Crime dataset. . . . 95
Figure 59 – This community is clearly distinguishable from the others for having a lower percentage of occupied houses (PctHousOccup). . . . 95
Figure 60 – Detailed analysis of clusters 4 and 5. . . . 96
Figure 61 – A community of immigrants, due to the high percentage of immigrants (PctForeignBorn, PctRecImmig8, and PctRecImmig10) who speak more than one language besides English (PctSpeakEngOnly). . . . 97
Figure 62 – Using clustering techniques to feed ClusterShapley with different clustering results. The original dataset (original labels) has three labels, and we use three clustering configurations: four clusters using k-Means (k-means (1)), eight clusters using k-Means (k-means (2)), and nine clusters using Agglomerative clustering (agglomerative). . . . 98
Figure 63 – Comparison of traditional dimensionality reduction (DR) techniques (A) vs. hierarchical DR techniques (B). Traditional techniques operate at a single level of detail and focus on delivering a big picture that depicts the dataset as a whole, ignoring fine-grained details of intra-cluster distributions and making it difficult to proceed with more detailed investigations. On the other hand, hierarchical DR (HDR) approaches summarize important intra-cluster information already on the top-level representations of the hierarchy imposed on the input data, without affecting the layout’s overall organization. . . . 102
Figure 64 – Hierarchy construction and projection with HUMAP. The hierarchy is constructed from bottom to top. First, a k-nearest neighbor graph approximates the connection strength among data points in the high-dimensional space (A). Then, the graph structure and the connections’ strengths are used to find the most visited nodes (landmarks) after several random walk steps (B), which correspond to points on superior levels (C). The representation neighborhood—local and global relationships (D)—is used for computing the similarity among landmarks (F). For projecting hierarchy levels (or subsets of them), HUMAP applies—as in UMAP—a symmetrization to the nearest neighbor graph (H) and uses data points from higher levels to preserve the mental map (I). . . . 106
Figure 65 – Mental map preservation across hierarchical levels. . . . 113
Figure 66 – Explaining dimensionality reduction results. (A) Explanation of the whole dataset provides one story about the clusters defined by the user. (B) Explanation of the top-level HUMAP representation gives an overview of the explanations that can be refined according to user demands. . . . 114
Figure 67 – User interaction with HUMAP to get more detailed information about the data. At each level explanation (C and F), users can generate diverse insights by focusing on distinct subsets of data to investigate using HUMAP’s drill-down operation. . . . 115
Figure 68 – Hierarchical exploration of the learning weights and explanations of a CNN trained on the Fashion MNIST dataset. The top hierarchy level (A) encodes the overview of the dataset. Exploring lower levels by drilling down the hierarchy (B-F) reveals more detailed information while preserving the mental map. . . . 116
Figure 69 – Visual analysis of the embeddings generated for the top and lowest hierarchical levels using a three-level hierarchy. For each dataset, the top-level embedding appears on the left, and the lowest level (whole dataset) appears on the right. . . . 118
Figure 70 – Run-time execution in seconds using a log10 scale. The boxen-plots show the run-time for 20 runs of each method under the same condition – we measure the run-time to fit the hierarchy (Hierarchy fit), to embed the top level (Level 2), and to embed the whole dataset (Level 0). . . . 120
Figure 71 – Neighborhood Preservation after embedding in R^2. HSNE outperforms HUMAP by a large margin only for the EB dataset. . . . 120
Figure 72 – Evaluation of the techniques’ ability to represent complex structures such as clusters, manifolds, and other relationships. . . . 121
Figure 73 – Mental map preservation measured using Procrustes Analysis. HUMAP decreases the variation when positioning points across hierarchy levels. . . . 122
Figure 74 – Continuity values for varying neighborhoods. All of the techniques performed well at preserving clustering structures. . . . 122
Figure 75 – Trustworthiness values showing similar information to the Neighborhood Preservation values. . . . 123
Figure 76 – Using HUMAP, UMAP, and HSNE to project subsets of data. . . . 124
Figure 77 – Mental map preservation when projecting subsets of data points. HUMAP shows stability across subsequent hierarchy levels. . . . 125
Figure 78 – A sampling technique’s ability to uncover the projection structures must take into account the outliers and less dense clusters. The results show that SADIRE is the most stable of the analyzed methods, covering the projected dataset while taking into account all of its structures. . . . 143
Figure 79 – Results for Execution time (s), Coverage, and Redundancy for CBR-ILP-IR, Shuttle, Corel-1k, Corel-5k, and Diabetes (each line). . . . 144
Figure 80 – The overview of the ExplorerTree tool. . . . 145
Figure 81 – Choosing different dimensionality reduction techniques: a. IDMAP; b. LSP. . . . 145
Figure 82 – Choosing different sampling techniques: a. SADIRE (ours); b. Affinity Propagation. . . . 146
Figure 83 – Traditional scatterplot representations of dimensionality reduction results next to the first level of ExplorerTree using different sampling strategies. . . . 147
Figure 84 – SADIRE shows slightly lower results compared to Bisecting k-means. However, SADIRE is superior to Bisecting k-means in run-time execution, an important characteristic for our scenario. . . . 149
Figure 85 – Coherence values for different numbers of topics. . . . 151
Figure 86 – Run-time execution for cExpression and ccPCA when varying the number of data points, the dimensionality, and the number of clusters. . . . 152
Figure 87 – Interpretation of two dimensionality reduction techniques using cExpression. Since the sets are used to compare the feature distributions, the result remains the same. . . . 153
Figure 88 – Analyzing the same dataset with different clustering results. cExpression merges the unique features from the two clusters (dark-blue and dark-orange) from Figure 87. . . . 154
Figure 89 – Results for Time, DEMaP, and Disparity for twenty executions of each HDR technique. . . . 156
Figure 90 – Results for Neighborhood Preservation, Continuity, and Trustworthiness for twenty executions of each HDR technique. . . . 156
Figure 91 – Results for Time, DEMaP, and Disparity for twenty executions of each HDR technique. . . . 158
Figure 92 – Results for Neighborhood Preservation, Continuity, and Trustworthiness for twenty executions of each HDR technique. . . . 159
Figure 93 – Projections for the Embryoid Body and Mammals datasets. . . . 160
Figure 94 – Projections for the FMNIST and MNIST datasets. . . . 160
Figure 95 – Numerical analysis for twenty executions of each HDR technique. . . . 161
Figure 96 – Mental map preservation measured using Procrustes Analysis. HUMAP decreases the variation when positioning points across hierarchy levels. . . . 162
Figure 97 – Continuity values for varying neighborhoods. All of the techniques performed well at preserving clustering structures. . . . 163
Figure 98 – Trustworthiness values showing similar information to the Neighborhood Preservation values. . . . 163
Figure 99 – Scalability analysis for the main steps of the HUMAP algorithm. . . . 164
Figure 100 – Scalability analysis for the main hierarchical dimensionality reduction techniques. . . . 165
Figure 101 – Results for reproducibility following the procedures of Becht et al. (BECHT et al., 2018). . . . 166
Figure 102 – Run-time distributions for HUMAP when generating reproducible results. . . . 167
Figure 103 – Embeddings of the MNIST dataset for different values of “min_dist” and “n_neighbors”. . . . 168
Figure 104 – Embeddings of the Embryoid Body dataset for different values of “min_dist” and “n_neighbors”. . . . 169
Figure 105 – Metrics for HUMAP with different β values. . . . 170
Figure 106 – Projections for HUMAP with different β values. . . . 171
Figure 107 – Embeddings of the MNIST dataset for different values for . . . 172
Figure 108 – Embeddings of the Embryoid Body dataset for different values for . . . 173
Figure 109 – HUMAP exploration and annotation of a document collection of COVID-19 tweets. The top hierarchy level shows unlabeled data points and three major structures (A). We annotate these three clusters (B) and compute their topics (F). For each cluster in (B), we also project their corresponding (and final) hierarchy level to look for other patterns, annotating the dataset (C, D, E) and computing their topics. . . . 174

LIST OF TABLES

Table 1 – Datasets used for analysis. . . . 35
Table 2 – Time for estimating Shapley values. . . . 100
Table 3 – Datasets used for experimentation. . . . 118
Table 4 – The datasets used in the experiments. All the projections were performed by LSP (PAULOVICH et al., 2008), except for Shuttle, in which we used PLMP (PAULOVICH; SILVA; NONATO, 2010a). . . . 142
Table 3 – Scores given by the participants. . . . 148
Table 4 – p-values for the running time metric. . . . 155
Table 5 – p-values for the DEMaP metric. . . . 157
Table 6 – p-values for the disparity metric. . . . 157
Table 7 – p-values for the running time metric. . . . 158
Table 8 – p-values for the running time metric. . . . 158
Table 9 – p-values for the DEMaP metric. . . . 159
Table 10 – p-values for the disparity metric. . . . 159

LIST OF ALGORITHMS

1 Removing redundant data points. . . . 34
2 Shapley estimation. . . . 83
3 Associate Landmarks. . . . 111

CONTENTS

Resumo . . . iv
Abstract . . . v
List of Figures . . . vi
List of Tables . . . xv
List of Algorithms . . . xvi
1 INTRODUCTION . . . 23
1.1 Context . . . 23
1.2 Problems & Motivation . . . 24
1.2.1 Summarization & Interpretability . . . 24
1.2.2 Hierarchical Dimensionality Reduction . . . 25
1.3 Questions & Hypotheses . . . 26
1.4 Contributions . . . 27
1.5 Document structure . . . 28
2 SADIRE: SAMPLING FROM DIMENSIONALITY REDUCTION . . . 30
2.1 Related Works . . . 31
2.2 Sampling based on dimensionality reduction . . . 32
2.2.1 Selecting candidates for the representative set . . . 32
2.2.2 Removing redundant data instances . . . 33
2.2.2.1 Time complexity . . . 34
2.3 Experiments . . . 35
2.3.1 Redundancy . . . 36
2.3.2 Coverage . . . 38
2.3.3 Clustering . . . 39
2.3.4 Running time . . . 40
2.3.5 Parameter Analysis . . . 41
2.4 User Experiment . . . 42
2.5 Practical Application . . . 45
2.6 Final Remarks . . . 46
3 CONTRASTIVE ANALYSIS FOR SCATTERPLOT REPRESENTATIONS OF DIMENSIONALITY REDUCTION . . . 47
3.1 Related Works . . . 48
3.2 cExpression—Tool for contrastive analysis of DR results . . . 50
3.2.1 Computing contrastive information . . . 51
3.2.2 Visualization Design . . . 53
3.2.2.1 Visual and task requirements . . . 53
3.2.2.2 Visualizing the feature importance . . . 55
3.2.2.3 Visualizing feature value distribution . . . 56
3.2.2.4 Comparing clusters . . . 58
3.2.2.5 Summary View . . . 59
3.2.2.6 Interaction mechanisms . . . 60
3.3 Case studies . . . 61
3.3.1 The news dataset . . . 61
3.3.2 Exploring tweets about COVID-19 symptoms . . . 64
3.3.3 Vertebral Column . . . 69
3.4 Evaluation . . . 71
3.4.1 Coherence . . . 71
3.4.2 Run-time execution . . . 73
3.5 Final Remarks . . . 74
4 EXPLAINING DIMENSIONALITY REDUCTION RESULTS USING SHAPLEY VALUES . . . 76
4.1 Related Works . . . 78
4.2 Review of Shapley values . . . 79
4.2.1 Shapley values illustration . . . 80
4.3 Using Shapley values to explain clusters of dimensionality reduction results . . . 80
4.3.1 Dataset annotation . . . 81
4.3.2 Shapley estimation . . . 82
4.4 Visualization Design . . . 83
4.4.1 Scatterplot component . . . 84
4.4.2 Coordination between scatterplot and Shapley values . . . 85
4.4.3 Importance Summary and Importance Heatmap . . . 86
4.4.4 Analysis example . . . 87
4.5 Case Studies . . . 89
4.5.1 Vertebral Column . . . 89
4.5.2 Red Wine Quality . . . 91
4.5.3 Communities and Crime . . . 94
4.5.4 Automatic annotation with clustering techniques . . . 97
4.6 Final Remarks . . . 99
4.6.1 Supporting exploratory analysis . . . 99
4.6.2 Run-time execution . . . 99
4.6.3 Cluster-oriented analysis . . . 100
5 HUMAP: HIERARCHICAL UNIFORM MANIFOLD APPROXIMATION AND PROJECTION . . . 101
5.1 Related Work . . . 103
5.2 Hierarchical Dimensionality Reduction . . . 106
5.2.1 Landmark selection . . . 107
5.2.2 Dissimilarity definition . . . 108
5.2.3 Generating the embedding . . . 110
5.2.3.1 Subset projection . . . 111
5.2.3.2 Mental map preservation . . . 112
5.3 Case studies . . . 112
5.3.1 Supporting hierarchical explanation of dimensionality reduction results . . . 113
5.3.2 Exploring latent features . . . 116
5.4 Numerical evaluation . . . 117
5.4.1 Running-time . . . 119
5.4.2 Neighborhood Preservation . . . 119
5.4.3 DEMaP . . . 121
5.4.4 Mental map preservation . . . 122
5.4.5 Continuity . . . 122
5.4.6 Trustworthiness . . . 123
5.4.7 Projecting subsets of data . . . 123
5.5 Final Remarks . . . 124
5.5.1 Embedding initialization . . . 124
5.5.2 Generalizability and scalability . . . 125
6 CONCLUSION . . . 126
6.1 Further Discussion and Open Questions . . . 126
6.1.1 SADIRE . . . 126
6.1.2 cExpression . . . 126
6.1.3 ClusterShapley . . . 127
6.1.4 HUMAP . . . 127
6.2 Practical Implications . . . 128
6.3 Scientific production . . . 129
6.3.1 First-authored publications . . . 129
6.3.1.1 Journal articles . . . 129
6.3.1.2 Conference papers . . . 129
6.3.1.3 Preprint . . . 130
6.3.2 Co-authored publications . . . 130
6.3.2.1 Conference papers . . . 130
6.3.2.2 Preprint . . . 130
BIBLIOGRAPHY . . . 131
A SADIRE - EMPLOYING SADIRE INTO EXPLORERTREE & ADDITIONAL USER EXPERIMENTS . . . 141
A.1 Execution Time, Coverage and Redundancy . . . 141
A.2 ExplorerTree Tool . . . 143
A.2.1 Overview . . . 143
A.2.2 Using different dimensionality reduction techniques . . . 143
A.2.3 Using different sampling selection mechanisms . . . 144
A.3 ExplorerTree using different sampling strategies . . . 145
A.4 User Experiment - Evaluating sampling strategies for ExplorerTree . . . 146
B CEXPRESSION . . . 150
B.1 Topic evaluation . . . 150
B.2 Run-time evaluation . . . 152
B.3 Different dimensionality reduction techniques . . . 153
C HUMAP . . . 155
C.1 Hierarchical Dimensionality Reduction comparison . . . 155
C.2 Drill-down analysis . . . 158
C.3 Hierarchical Dimensionality Reduction without preserving the mental map . . . 160
C.4 Additional local/global analysis . . . 162
C.4.1 Mental map preservation . . . 162
C.4.2 Continuity . . . 162
C.4.3 Trustworthiness . . . 162
C.5 Run time analysis - HUMAP steps and scalability . . . 164
C.5.1 Scalability . . . 165
C.5.2 Reproducibility . . . 166
C.6 Generalizability . . . 168
C.6.1 min_dist and n_neighbors . . . 168
C.6.2 β for local contribution . . . 170
C.6.3 Number of data points in each level . . . 172
C.7 Additional use case - Exploration of COVID-19 tweets . . . 174

1 INTRODUCTION

1.1 CONTEXT

Data visualization is critical in assisting humans in understanding natural phenomena as well as discovering ubiquitous information. Exploratory Data Analysis (EDA) (TUKEY, 1962), a set of methods and strategies for understanding results and simplifying analysis, is the primary reason scientists and other practitioners use data visualization. EDA is typically used when the information to be uncovered is unknown.

Exploratory Data Analysis is especially important when the dataset contains multiple attributes (or features), in which case it is referred to as high-dimensional data. The process of evaluating high-dimensional data presents unique challenges due to its redundancy, sparsity, and complex structures. Furthermore, as data such as text, photos, and the inner weights or logits of deep learning models become more widely available, high-dimensional data analysis has become a common activity. Efforts are being made to reduce the cognitive overload required for analysis in order to comprehend and make sense of these high-dimensional landscapes as well as the inherent relationships among the data points. One popular strategy is to reduce the number of dimensions while retaining important details about the structures and similarities found in the high-dimensional space.

Dimensionality Reduction (DR) techniques are commonly used to generate a low-dimensional representation (R^d) from a high-dimensional dataset (R^m). The end result may be referred to as embeddings, projections, or DR layouts. These methods attempt to preserve the relationships between the data points that existed in the original space. Then, while conducting exploratory data analysis, researchers use scatterplot-based representations to look for trends and other relevant information. There are numerous studies that use DR techniques for exploratory data analysis, such as understanding the features learned by ConvNets (PEZZOTTI et al., 2018) at different epochs, investigating gene expression patterns to discover new cell types (UNEN; HöLLT; PEZZOTTI, 2018), and many others. In addition to providing an excellent opportunity for high-dimensional data analysis, these representations let analysts interpret the decisions made by learning algorithms, for instance to determine whether the features fed into machine learning models are relevant to class separation (MARCILIO-JR et al., 2020).

Dimensionality reduction strategies are almost always present in high-dimensional data pipelines, but they come with their own set of challenges, such as the inability to integrate categorical data and the runtime cost of non-linear algorithms, which are thoroughly discussed in recent surveys. In the following section, we delineate the problems and motivations in the scope of this thesis.
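As a concrete reference for the pipeline just described, the sketch below reduces a high-dimensional dataset to R^2 and draws the scatterplot typically used in EDA. It is a minimal illustration that assumes the scikit-learn, umap-learn, and matplotlib packages are available; it does not reproduce any specific experiment in this thesis.

```python
# Minimal DR-based EDA sketch: data in R^64 -> embedding in R^2 -> scatterplot.
# Assumes scikit-learn, umap-learn, and matplotlib are installed.
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)            # 1,797 points in R^64
reducer = umap.UMAP(n_components=2, random_state=42)
embedding = reducer.fit_transform(X)           # one (x, y) pair per data point

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("UMAP embedding of the digits dataset")
plt.show()
```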
1.2 PROBLEMS & MOTIVATION

Dimensionality Reduction (DR) techniques are used in a wide range of scientific fields. Embeddings encoded by scatterplots aid in the analysis of document collections (ELER et al., 2009), gene expression data (UNEN et al., 2017; SALAMON et al., 2018; HÖLLT et al., 2018; SOMARAKIS et al., 2019; KRUEGER et al., 2020), and deep learning model features (RAUBER et al., 2017; BENATO; TELEA; FALCãO, 2018). However, non-linear techniques make it difficult to understand embeddings in terms of the feature contributions to layout formation. Furthermore, one-level approaches, which represent all data at once, may conceal important information about the relationship between dominant structures. These issues are summarized in the two sections that follow.

1.2.1 SUMMARIZATION & INTERPRETABILITY

Scatterplot representations of embeddings become visually cluttered as the dataset size increases, limiting understanding of the relationships among data points. This limitation can be addressed by summarizing the data during analysis using common sampling techniques. Traditional techniques (VITTER, 1985; KNUTH, 1997) use probabilistic mechanisms to select samples, whereas deterministic methods, such as CSM (JOIA; PETRONETTO; NONATO, 2015), use PCA (JOLLIFFE, 1986) loadings to select samples. These techniques are problematic for scatterplot representations because probabilistic approaches are influenced by denser areas of the projection and fail to account for outliers, whereas PCA loadings are influenced by global information. A recurring shortcoming of these two families of sampling techniques, as commonly used for scatterplot representations, is their inability to convey information about class boundaries of varying sizes.

Applying interpretation techniques on R^2 is another method for improving the understanding and visualization of embeddings. While many approaches have been proposed for explaining and interpreting the decisions of common machine learning models (RIBEIRO; SINGH; GUESTRIN, 2016; LUNDBERG et al., 2020), dimensionality reduction interpretation has been performed with layout-enrichment techniques (NONATO; AUPETIT, 2018), in which visualizations of feature distributions and other statistical charts representing attribute information aid analysts in making sense of clusters and other patterns perceived in the visual space. Simple statistics, on the other hand, cannot fully capture the differentiation among the clusters of the embedding produced by a DR technique. To address this issue, a few approaches based on contrastive analysis (FUJIWARA; KWON; MA, 2019; LE; AKOGLU, 2019) have been proposed, with the main goal of identifying unique characteristics of the clusters perceived in the visual space R^2. These techniques’ main limitations can be summarized as prohibitive runtime on high-dimensional datasets (FUJIWARA; KWON; MA, 2019) or the inability to extend the analysis to different DR techniques (LE; AKOGLU, 2019).

Rather than emphasizing the differences between the clusters, one possible solution is to assess how much each feature contributed to the embedding, that is, how much each feature contributed to the formation of clusters or other structures in the projected space (e.g., R^2). In gene expression analysis, for example, bioinformaticians want to know which genes influence each cluster in order to annotate cell types or discover new ones (LäHNEMANN; KöSTER; SZCZUREK, 2020).
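To make the contrastive idea concrete, the sketch below ranks features by how strongly their distribution inside a cluster of interest differs from their distribution in the rest of the data, using one two-sample t-test per feature. This is a simplified illustration of the statistical-test scheme explored in Chapter 3, assuming SciPy and NumPy arrays as input; it is not the exact procedure of the techniques cited above.

```python
# Rank features by how much they distinguish one cluster from the rest of the
# data, using one two-sample (Welch's) t-test per feature.
from scipy.stats import ttest_ind

def distinctive_features(X, labels, cluster):
    """X: (n_samples, n_features) NumPy array; labels: cluster id per sample."""
    inside = X[labels == cluster]
    outside = X[labels != cluster]
    scores = []
    for j in range(X.shape[1]):
        t, p = ttest_ind(inside[:, j], outside[:, j], equal_var=False)
        scores.append((j, t, p))
    # A larger |t| means the feature separates the cluster more strongly.
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)

# Usage: the five most distinctive features of cluster 0.
# top5 = distinctive_features(X, labels, cluster=0)[:5]
```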
Finding the contribution of these features to the dimensionality reduction result is an issue mainly for non-linear DR techniques (MAATEN; HINTON, 2008; MCINNES; HEALY; MELVILLE, 2018), where there is currently no way to invert the calculations and keep track of feature contributions while the algorithm is being executed (FUJIWARA; KWON; MA, 2019). Nonetheless, non-linear DR techniques are best suited for dealing with most datasets due to their ability to uncover complex structures such as manifolds. Existing techniques for DR interpretation have a few issues. Studies on feature values (COIMBRA et al., 2016; PAGLIOSA; PAGLIOSA; NONATO, 2016; MARCILIO-JR et al., 2020), for example, do not account for the dimensionality reduction process and instead concentrate on the reduced low-dimensional space. Other, more elaborate works (TURKAY et al., 2012; JOIA; PETRONETTO; NONATO, 2015) obtain feature importance through principal components (PCs). The PCs produce biased results for classes with high variation (JOIA; PETRONETTO; NONATO, 2015), and their inability to focus on local information (FUJIWARA; KWON; MA, 2019) hinders cluster-oriented analysis. These techniques cannot explain how much each feature contributed to the embedding, for example, in terms of clusters. The importance assigned to the features does not account for their contribution to clusters and other structures. Furthermore, these feature importance measures do not interact with one another to form an explanation measure. Instead, the importance of each feature is independent of the others.

1.2.2 HIERARCHICAL DIMENSIONALITY REDUCTION

Due to the increasing popularity of deep learning applications (RAUBER et al., 2017; BENATO; TELEA; FALCãO, 2018) and gene expression data analysis (UNEN et al., 2017; KRUEGER et al., 2020) with dimensionality reduction, practitioners typically seek DR techniques that can uncover complex structures (e.g., manifolds), retain clusters, and provide meaningful representations of the global relationship among clusters.

Traditional DR techniques operate at a single level of detail, representing the relationship between data points in a single low-dimensional scatterplot representation—which limits analysis when important information should be prioritized and details added as needed. This limitation can be circumvented using hierarchical DR (HDR) techniques, which project subsets of the input dataset and allow the user to select where to focus during the analysis. The idea behind HDR techniques is to project dominant structures in areas where details about their relationship would be obscured by traditional DR techniques. Current HDR techniques have limitations in terms of scaling to thousands of data points (PAULOVICH et al., 2008), mental map preservation (PEZZOTTI et al., 2016), or are primarily aimed at mass cytometry data applications, not handling datasets with hundreds or thousands of dimensions (KUCHROO, 2020). These HDR techniques cannot fully address two major issues: a good trade-off between preserving global and local structures in the low-dimensional representation of the dataset, and mental map preservation across hierarchy levels. While preserving global and local information allows for a better understanding of clusters (local structures) and their relationships (global structures), preserving the mental map makes exploration easier by reducing the number of geometric transformations on the dataset (FUJIWARA et al., 2020).
In addition to being able to deal with high dimensionality and to be flexible in terms of hierarchy manipulation and dataset type, HDR approaches must be able to unfold complex structures present in real-world datasets. Finally, there is a current need for techniques that depict complex structures while preserving global and local relationships between data samples. For example, when analyzing deep learning models (RAUBER et al., 2017; MARCíLIO-JR et al., 2020) and gene expression data (LäHNEMANN; KöSTER; SZCZUREK, 2020), the trade-off between global and local relations is critical. This thesis focuses on the issue of encoding global and local relationships among data points during the embedding process, as well as the preservation of the mental map for hierarchical dimensionality reduction techniques.

1.3 QUESTIONS & HYPOTHESES

With regard to the aforementioned issues and the motivations behind each of them, this thesis is built around the following questions, from which we conducted differing but interconnected research activities:

Q1. Can non-linear dimensionality reduction techniques be interpreted for information discovery?

Q2. How should information be summarized while interpreting the main causes of cluster formation?

Q3. How can global and local information, as well as the mental map, be preserved when interacting with hierarchical embeddings of high-dimensional data?

The following research activities (RAs) address each of the aforementioned questions.

RA1: We propose to adapt and advance state-of-the-art model-agnostic explainable dimensionality reduction techniques. More specifically, we combine SHAP (LUNDBERG; LEE, 2017) with carefully designed visualization approaches to provide interpretability features as well as interaction mechanisms on scatterplot-represented embeddings. To answer Q1, we hypothesize that additive explanations of the contributions to cluster formation, combined with feature analysis, provide a more detailed understanding of the phenomenon.

RA2: We used contrastive analysis on scatterplots to address the challenge of summarizing information and interpreting it for cluster analysis. To accomplish this, we employ statistical tests on feature values in conjunction with focus+context visualizations that enrich scatterplot representations of embeddings. Users can comprehend the positioning strategies of various dimensionality reduction techniques while retaining only a portion of the information. To specifically answer Q2, we hypothesize that only a few features are required to begin the exploration and interpretation of embeddings on R^2, and that algorithms that can order features according to relevance can filter out a lot of irrelevant information.

RA3: We proposed a novel hierarchical dimensionality reduction technique that uses a tunable hyper-parameter to choose between global and local preservation. In addition, the novel technique advances the state-of-the-art in preserving the mental map when projecting subsequent hierarchy levels. To specifically answer Q3, we hypothesize that by using different neighborhood concepts (local and global) and by fixing previously projected data points in the optimization step of the hierarchical dimensionality reduction, we can reduce these specific limitations.

Based on the above, this thesis considers the following main hypothesis:

Main Hypothesis.
Hierarchical exploration and interpretation of embeddings improve the analysis power of dimensionality reduction techniques by helping to identify and emphasize the various facets present in data. While mental map preservation in hierarchical embeddings allows for context preservation throughout exploration, dimensionality reduction interpretation explains feature contributions to layout formation in the visual space.

1.4 CONTRIBUTIONS

The main contribution of this thesis is to make progress on dimensionality reduction interpretation and hierarchical exploration. In terms of interpretation, we present two methods that use visualization approaches to highlight feature contributions to layout formation, primarily in the form of clusters. Then, to deal with large datasets, we propose a novel hierarchical dimensionality reduction technique that overcomes a few limitations of existing approaches. Despite their differences, these contributions can be combined to deal with large-scale datasets while providing interpretation for their structures.

The first contribution is concerned with the information summarization of scatterplot representations of dimensionality reduction results (see Chapter 2). We contribute a fast sampling method in visual space (R2) that preserves class boundary structures while keeping outliers visible. Aside from the method itself, we also conducted experiments to investigate how changing the parameters affects the user's ability to assess the information conveyed. Finally, the method serves as the foundation for ExplorerTree (MARCÍLIO-JR et al., 2021a), for which we conducted a user experiment comparing it to a few other sampling techniques.

The second contribution is concerned with the visual summarization problem and the interpretation of DR results (see Chapter 3). We address the problem of understanding cluster formation by leveraging statistical tests on the feature values of the input datasets. After calculating separation measures for each feature, we use a focus+context visualization tool to assist analysts in determining which features contributed the most to layout formation after dimensionality reduction. Finally, the technique enables the understanding of global positioning and intra-cluster differentiation through various interaction mechanisms.

The third contribution deals with the issue of explaining dimensionality reduction results (see Chapter 4). We advance the state of the art by adapting SHAP to explain cluster formation after dimensionality reduction. Unlike other methods for interpreting DR results in the literature, our approach provides explanations using additive contributions that are directly related to the distance measure used to convey similarity on the scatterplot representation. In addition to the core technique, we propose a visualization tool to support the exploratory and explanatory capabilities of scatterplots.

The fourth contribution advances hierarchical dimensionality reduction by employing a tunable hyper-parameter to control the trade-off between global and local structure preservation (see Chapter 5). HUMAP employs two neighborhood concepts to capture both local and global information about manifolds, while the hierarchy is built using a finite Markov chain. Finally, HUMAP preserves the mental map when projecting different hierarchy levels for fine-grained information.
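Returning to the third contribution above, the general recipe of additive cluster explanations can be made concrete with a minimal sketch. The snippet below is not the thesis's formulation, which ties Shapley values directly to the distance measure (see Chapter 4 and the ClusterShapley repository); it merely illustrates how Shapley values can attribute embedding clusters to input features by explaining a surrogate classifier. It assumes the shap, umap-learn, and scikit-learn packages, and all variable names are ours.

```python
# Minimal sketch (not the thesis's ClusterShapley formulation): explain
# clusters found in a 2D embedding via SHAP values of a surrogate classifier.
import numpy as np
import shap
import umap
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(500, 10)                          # stand-in dataset
emb = umap.UMAP(n_components=2).fit_transform(X)     # non-linear DR
labels = KMeans(n_clusters=4, n_init=10).fit_predict(emb)

# Surrogate model: predict the visual-space clusters from the original features.
clf = RandomForestClassifier(n_estimators=200).fit(X, labels)

# Additive explanations: per-feature contributions to each cluster assignment.
sv = shap.TreeExplainer(clf).shap_values(X)
# Depending on the shap version, sv is a list (one array per cluster) or a
# 3D array; normalize to shape (clusters, samples, features).
sv = np.stack(sv) if isinstance(sv, list) else np.moveaxis(np.asarray(sv), -1, 0)

# Rank features by mean |contribution| for cluster 0.
print(np.argsort(np.abs(sv[0]).mean(axis=0))[::-1])
```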
Aside from the contributions discussed in this document, Section 6.3 presents other first-authored publications that were part of the research efforts during the development of this thesis.

1.5 DOCUMENT STRUCTURE

The main structure of this document is based on the research activities carried out to answer the questions in Section 1.3, previously presented as our main contributions. As a result, each chapter, aside from the Introduction and Conclusion, is devoted to detailing one of this thesis's research contributions. In summary, the structure of this document is as follows:

• Chapter 2 introduces the SADIRE sampling technique for scatterplot representations. It covers the following article, which is related to Q2:

– Marcilio-Jr, W.E., Eler, D.M. SADIRE¹: a context-preserving sampling technique for dimensionality reduction visualizations. Journal of Visualization 23, 999–1013 (2020).

• Chapter 3 introduces two techniques for visual summarization and interpretation of dimensionality reduction plots. It covers the following article, which is related to Q1 and Q2:

– Marcílio-Jr, W.E., Eler, D.M., Garcia, R.E. Contrastive analysis for scatterplot-based representations of dimensionality reduction. Computers & Graphics 101, 46–58 (2021).

• Chapter 4 addresses the problem of interpretation of dimensionality reduction techniques. It covers the following article, which is related to Q1 and Q2:

– Marcílio-Jr, W.E., Eler, D.M. Explaining dimensionality reduction results using Shapley values². Expert Systems with Applications 178, 2021.

• Chapter 5 describes a novel technique for hierarchical dimensionality reduction that addresses the issues of local/global structure and mental map preservation. It covers the following article, which is related to Q3:

– Marcílio-Jr, W.E., Eler, D.M., Paulovich, F.V., Martins, R.M. HUMAP³: Hierarchical Uniform Manifold Approximation and Projection. arXiv preprint 2106.07718, 2021. Submitted to IEEE Transactions on Visualization and Computer Graphics.

This document is concluded in Chapter 6, which presents final discussions about the research efforts highlighted in this thesis, along with practical implications and future research.

¹ https://github.com/wilsonjr/SADIRE
² https://github.com/wilsonjr/ClusterShapley
³ https://github.com/wilsonjr/humap

2 SADIRE: SAMPLING FROM DIMENSIONALITY REDUCTION

Wilson E. Marcílio-Jr, Danilo M. Eler. SADIRE: a context-preserving sampling technique for dimensionality reduction visualizations, Journal of Visualization, Volume 23, 2020.

Although DR techniques can represent high-dimensional data structures and relations in embedded spaces (typically R2), the ability to scale computational methods to massive datasets is a frequently debated topic of increasing importance. Scientists have long recognized that simply increasing raw processing power cannot guarantee method interactivity (HELLERSTEIN et al., 1999). To address this issue, sampling mechanisms are frequently used, because approximate answers are often just as useful as exact ones (KWON et al., 2017). The visualization community also makes use of sampling techniques, for instance to propose summary visualizations that depict large amounts of data (SARIKAYA; GLEICHER; SZAFIR, 2018). Well-known sampling techniques such as Knuth's (KNUTH, 1997) and Reservoir (VITTER, 1985) rely on probabilistic mechanisms to extract samples that depict the dominant structures of a dataset.
On the other hand, a deterministic method called Column Selection Method (CSM), proposed by Joia et al. (JOIA; PETRONETTO; NONATO, 2015), was shown to be effective in selecting representative samples based on a correlation index computed using matrix decomposition. Although it outperforms commonly used sampling mechanisms, it selects unbalanced samples when classes have different coefficients of variation; in this case, important structures of the sampled dataset, such as class boundaries or outliers, are lost during the sampling process. Furthermore, when data is not uniformly distributed, random techniques (e.g., Knuth's and Reservoir) cannot reflect all data structures, impairing visual assessment of the relationship among data points.

The problem of reducing the visual clutter of scatterplot representations of dimensionality reduction results in R2 is the focus of this chapter. SADIRE, which stands for SAmpling based on DImensionality REduction, takes a dataset embedded in 2D space and selects a limited number of points from each cell of a grid built on the embedded space. As demonstrated in the experiments, this process reduces the dataset size while preserving the overall structure and class boundaries of a scatterplot-visualized projection. Furthermore, this chapter discusses a user experiment conducted to better understand how users perceive the use of representative data points for visual summarization.

The remainder of this chapter is organized as follows: Section 2.1 discusses related works; Section 2.2 details the sampling technique; Section 2.3 demonstrates the effectiveness of the proposed approach through experiments; Section 2.4 discusses the results of the user experiment; Section 2.5 presents a practical application; and Section 2.6 concludes the chapter with final discussions.

2.1 RELATED WORKS

Sampling techniques are becoming increasingly important. They provide a means to quickly infer patterns and trends in data, in addition to reducing the burden of running complex methods on large datasets. Well-known sampling algorithms (TILLÉ, 2011) can be used for general purposes; that is, they can be adapted to most problems requiring the reduction of a dataset to a representative sample. This section focuses on sampling mechanisms designed to solve specific problems. These techniques are intended to help computational methods and interactive approaches scale when dealing with large datasets.

Random sampling mechanisms are frequently used in machine learning to split datasets into train and test sets or to perform cross-validation to fine-tune hyperparameters. To reduce the computational burden of performing projections, dimensionality reduction techniques also frequently employ sampling. LSP (PAULOVICH et al., 2008) uses either random sampling or clustering medoids in its first stage to define control points, projecting the sampled instances so that the remaining dataset can be projected using interpolation. LAMP (JOIA et al., 2011) constructs a family of orthogonal affine mappings, one for each instance to be projected, using the same kind of randomly selected control points. Other dimensionality reduction techniques, such as PLMP (PAULOVICH; SILVA; NONATO, 2010b), Pekalska's (PEKALSKA et al., 1999), and PLP (PAULOVICH et al., 2011), also use sampling as preprocessing. Sampling is also a technique for reducing the cognitive overload caused by large datasets represented by traditional visualizations.
In fact, such strategies are often used in conjunction with DR methods. The HSNE (PEZZOTTI et al., 2016) technique projects dominant structures, found through Monte Carlo sampling on a finite Markov chain, to provide an exploration process based on information demand while reducing visual clutter. Nguyen and Song (NGUYEN; SONG, 2016) adopt centrality-based cluster sampling approaches, which employ centrality measures to obtain more informed samples than random sampling methods. Joia et al. (JOIA; PETRONETTO; NONATO, 2015) proposed a sampling strategy based on correlation scores generated by singular value decomposition that is sensitive to class variability.

Unlike the methods discussed previously, the one presented in this chapter builds on the advances made by dimensionality reduction techniques over the years. The method is based on the assumption that DR techniques preserve the similarity and class structures of the multidimensional space. As a result, unlike other techniques, our sampling mechanism can accurately reflect the data's structures and class boundaries when given a reliable projection.

2.2 SAMPLING BASED ON DIMENSIONALITY REDUCTION

When the target dimension is in visual space, dimensionality reduction results are typically represented by scatterplots, with R2 being the most common choice. This type of visualization uses spatial proximity to depict similarity, so that similar data points are close together and dissimilar data points are far apart. To select samples, SADIRE considers such a projection X ⊂ R2, and its operation can be divided into two main stages. It begins by shrinking the dataset and selecting candidates for the representative set using a grid structure on the projection plane. Second, it eliminates redundant data instances from the sampled set. These steps are described in detail in the sections that follow.

2.2.1 SELECTING CANDIDATES FOR THE REPRESENTATIVE SET

The first stage of the sampling strategy involves sampling from an R2 projection. SADIRE divides the projected space by generating a grid with cells of size k × k (note that k is measured in pixels). The sampled set is then created by using only the cells with at least one data point, from which a number m < n of data points can be retrieved, where n is the number of data points in each cell. While our explanation and experiments are geared toward m = 1, note that using m > 1 conveys the density of the clusters. These procedures are as follows:

1. Divide the projected space (R2) with a grid whose cells have size k × k, where k is measured in pixels;
2. Add to the set S every cell s that contains at least one data point (|s| ≥ 1);
3. Add to the set A m data points from each cell s ∈ S (if m = 1, add the medoid to A), and store the density |s| of each cell.

Fig. 1 depicts how sampling works when m = 1, i.e., when retrieving one data point from each grid cell. In the first step (a), a grid structure with cells of size k × k (in pixels) is used to group data points; in the second step (b), the density and m data points are stored for each cell; finally, using the density information, explained in the following section, SADIRE reduces the redundancy of the embedding.

Figure 1 – Defining the grid and selecting samples. (a) A grid with cells of size k × k is used to divide the R2 space; (b) only cells with at least one data point are retrieved; (c) the resulting representative set after redundancy removal.
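The following is a minimal sketch of this candidate-selection stage, assuming the projection is given as an (n, 2) NumPy array; the function and variable names are illustrative and do not come from the released SADIRE implementation.

```python
# Hedged sketch of SADIRE's first stage: bin a 2D projection into k x k cells
# and keep m = 1 representative (the medoid) per non-empty cell.
from collections import defaultdict
import numpy as np

def select_candidates(P: np.ndarray, k: float):
    """P: (n, 2) projected coordinates; k: cell size (pixels)."""
    cells = defaultdict(list)                       # grid cell -> point indices
    for i, (x, y) in enumerate(P):
        cells[(int(x // k), int(y // k))].append(i)

    reps, density = [], {}
    for cell, idx in cells.items():                 # only non-empty cells exist
        pts = P[idx]
        # Medoid: the point minimizing the sum of distances to the others.
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1).sum(axis=0)
        reps.append(idx[int(d.argmin())])
        density[cell] = len(idx)                    # stored for stage two
    return reps, density
```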
Because the width and height of a cell are smaller than the diagonal distance between two of its opposite vertices, using a grid structure to sample data points from the scatterplot representation may lose visual connections at 45 degrees. While we argue that such a problem is too insignificant to mislead users' exploration, the parameter k can be used to control it: lowering the value of k yields a representative set that is more similar to the original set.

2.2.2 REMOVING REDUNDANT DATA INSTANCES

A windowing process is used in the second main step of the algorithm to reduce redundancy among neighboring cells. For a window of size α = 3, Fig. 2 shows how the density information is used to keep useful information about the embedding structures while reducing redundancy. Starting with the densest cell (marked 1 in the figure), all of its neighboring cells are eliminated from the competition for inclusion in the resulting representative set; the result in the scatterplot for m = 1 is indicated by a black arrow. The same procedure is applied to the second densest cell (marked 2), and all of its neighboring cells are eliminated from the competition. This process is repeated for all active cells in the scatterplot that were not removed.

Figure 2 – Filtering redundant points, in this case with a window of size α = 3.

Algorithm 1 describes the process depicted in Fig. 2, which takes as input the cells ordered decreasingly by density (indicated by A). If a cell aᵢ was not removed from the set A (aᵢ.active is true in line 3), the algorithm looks for at least one cell with data points within the α-sized window centered on aᵢ (aᵢ is an outlier if it has no neighboring cells). If neighboring cells exist, they are removed (line 7) and aᵢ is added to the representative set (line 6). The output of the algorithm, G, is a set of cells that contains information about position and density.

Algorithm 1 – Removing redundant data points.
1: procedure mergeInstances(A)
2:     for each aᵢ in A do
3:         if aᵢ.active then
4:             g ← cells with data points inside the window centered on aᵢ
5:             if g ≠ ∅ then
6:                 G ← G ∪ {aᵢ}
7:                 g.active ← false
8:                 aᵢ.active ← false
9:     return G

While it is important to reduce redundancy among representative samples, outliers must be preserved on the projection plane (i.e., added to the representative set) because they can be useful in analytic tasks. Note that we consider data points that are far from clustered data points to be outliers. Thus, given the α-sized window used to remove redundant data points, our heuristic for detecting and adding outliers works as follows: because the candidate outliers are in the cells of set A that remain active (the cells with no neighboring cells), we select as outliers only the data points that are truly far away from a representative data point, for example, at least twice the cell size (2k) away.
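A possible reading of Algorithm 1 together with the outlier heuristic is sketched below. This is our illustrative interpretation, not the released implementation; in particular, approximating a cell's position by its grid coordinates scaled by k is an assumption of the sketch.

```python
# Hedged sketch of Algorithm 1 plus the outlier heuristic: cells are visited
# in decreasing density order; a kept cell deactivates its alpha-window
# neighborhood, and isolated cells survive only if far (>= 2k) from every kept cell.
import numpy as np

def merge_instances(cells: dict, alpha: int, k: float):
    """cells: {(cx, cy): density}; returns kept cells and outlier cells."""
    active = dict.fromkeys(cells, True)
    order = sorted(cells, key=cells.get, reverse=True)     # densest first
    kept, candidates = [], []
    r = alpha // 2                                         # window radius
    for c in order:
        if not active[c]:
            continue
        neigh = [(c[0] + dx, c[1] + dy)
                 for dx in range(-r, r + 1) for dy in range(-r, r + 1)
                 if (dx, dy) != (0, 0)
                 and (c[0] + dx, c[1] + dy) in cells
                 and active[(c[0] + dx, c[1] + dy)]]
        if neigh:                                          # lines 5-8 of Alg. 1
            kept.append(c)
            for nb in neigh:
                active[nb] = False
            active[c] = False
        else:                                              # isolated: outlier candidate
            candidates.append(c)
    # Outlier heuristic: keep candidates at least 2k away from every kept cell.
    kc = np.array(kept, dtype=float) * k                   # approximate positions
    outliers = [c for c in candidates
                if kc.size == 0
                or np.linalg.norm(kc - np.array(c, dtype=float) * k, axis=1).min() >= 2 * k]
    return kept, outliers
```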
2.2.2.1 Time complexity

In the worst-case scenario, when all points are far from each other and none has neighbors, the complexity of Algorithm 1 is O(α²|A|), where |A| is the size of the sampled set retrieved in the first step and α is the window used to reduce redundant points; in the same scenario, the complexity of searching for outliers is O(|A|²). Because the process of creating the grid and retrieving all points can be performed using a quad-tree, whose worst-case complexity for insertion and search operations is known to be O(n), the overall complexity of our method is O(|A|²). However, because of the similarity constraint imposed by DR techniques, Algorithm 1 reduces the number of instances in the set A. As a result, encountering the worst-case scenario is unlikely in practice.

2.3 EXPERIMENTS

This section evaluates SADIRE on its ability to extract representative data points that inherently reflect the structures imposed by a DR technique. SADIRE was compared against three sampling mechanisms: Reservoir (VITTER, 1985), Column Selection Method (CSM) (JOIA; PETRONETTO; NONATO, 2015), and Knuth (KNUTH, 1997). In the following, we provide a description of the datasets used in the experiments (see Table 1):

• CBR-ILP-IR is a document collection from three fields of artificial intelligence;
• corel-1k and corel-5k are image datasets of natural scenes and artificial objects;
• segmentation is a dataset of seven outdoor images manually segmented;
• mammals-8k and mammals-20k are datasets composed of mammal features;
• photographers-9k and photographers-36k are image datasets of different photographers²;
• shuttle is a dataset of shuttle features³;
• fiber is a dataset obtained from the 2009 Pittsburgh Brain Competition⁴;
• new-wavelet is an x-ray image dataset⁵.

Table 1 – Datasets used for analysis.

Dataset             Size     Dimensions   Classes   DR technique
CBR-ILP-IR          574      35           3         LSP
corel-1k            1000     150          10        LSP
segmentation        2100     19           8         LAMP
corel-5k            5000     4096         50        t-SNE
mammals-8k          8000     72           4         LAMP
new-wavelet         9000     1024         116       LAMP
photographers-9k    9077     4096         41        LAMP
mammals-20k         20000    72           4         LAMP
photographers-36k   36364    4096         41        PLMP
shuttle             43500    9            7         PLMP
fiber               250000   30           9         PLMP

² http://people.cs.pitt.edu/~chris/photographer/
³ https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
⁴ http://pbc.lrdc.pitt.edu/
⁵ http://ir.shef.ac.uk/imageclef/2006/

For the projection of these datasets, four dimensionality reduction techniques were used: LSP (PAULOVICH et al., 2008), t-SNE (MAATEN; HINTON, 2008), LAMP (JOIA et al., 2011), and PLMP (PAULOVICH; SILVA; NONATO, 2010b). Then, since there is currently no way to control the number of data points returned by SADIRE, the size of its resulting set was used to define the number of data points selected by CSM, Reservoir, and Knuth. We used k = 5 and α = 3 as SADIRE's parameters. The experiments were performed on a computer with the following configuration: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 32GB RAM, Windows 10 64 bits.
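As a usage note, this protocol can be expressed with the illustrative sketches from Section 2.2 (the function names below come from those sketches, not from the released code); the baselines simply receive SADIRE's output size as their sampling budget.

```python
# Hedged sketch of the comparison protocol: SADIRE with k = 5 and alpha = 3
# fixes the sample size, and each baseline draws the same number of points.
# P is the (n, 2) projected dataset from the earlier sketches.
import numpy as np

reps, density = select_candidates(P, k=5)                # stage one (Sec. 2.2.1)
kept, outliers = merge_instances(density, alpha=3, k=5)  # stage two (Sec. 2.2.2)
budget = len(kept) + len(outliers)

# Stand-in for the probabilistic baselines (Knuth's and Reservoir):
rng = np.random.default_rng(0)
baseline_idx = rng.choice(len(P), size=budget, replace=False)
```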
Fig. 3 shows the R2 projections and the data points retrieved by each sampling technique. While the points retrieved by the Knuth and Reservoir techniques concentrate in denser areas due to their probabilistic design, the CSM technique prioritizes the elements with the highest correlation scores with respect to the original dataset, a characteristic that retrieves more representatives from classes with higher variation (JOIA; PETRONETTO; NONATO, 2015). Thus, a few structures may not be preserved and the retrieved data points are not well suited to be representatives; see, e.g., the sampled points of mammals-8k, mammals-20k, shuttle, and fiber. On the other hand, SADIRE reflects the boundaries imposed by the class separation and shows the layout structure of the initial projection, preserving the structures imposed by the DR technique. In this sense, SADIRE resembles the structure of the dataset, provided that the projection technique truly reflects the structure of the high-dimensional space (RAUBER et al., 2017). It is important to emphasize that the result of SADIRE is tied to the R2 embedding generated by the dimensionality reduction technique; that is, the main purpose of SADIRE is to preserve the structures of the projected dataset in R2. Consequently, the evaluation of how well a method reflects the high-dimensional space must be performed considering the high-dimensional space and the dimensionality reduction technique.

Figure 3 – Initial projections of the datasets and their respective sampled sets according to the sampling techniques. The color of the circles depicts the different classes in the dataset.

The numerical evaluation uses the concept of representativeness (MA; WEI; CHEN, 2011), given by the Redundancy and Coverage metrics. While the Redundancy metric is used to assess how redundant a sampled set is, the Coverage metric assesses how well a sampled set represents the information in the original set. The following sections provide analyses based on the Redundancy and Coverage of the representative sets extracted from DR results visualized by scatterplots. Then, SADIRE is evaluated on a clustering task using the silhouette coefficient.

2.3.1 REDUNDANCY

The idea behind measuring the redundancy of a sampled set is to verify whether sampled points can be removed without information loss. That is, if two or more points are too close to each other, a single point can represent them due to their similarity. Equation 2.1 shows how to compute Redundancy, which is simply the mean of the lowest distances from each sampled point to the other points in the sampled set. In the equation, $\lVert \cdot \rVert$ stands for the Euclidean distance, X is the sampled set with N = |X| points, and x and x′ are distinct sampled points in the visual space. Notice that higher values encode better quality, since they mean the sampled points are far from each other.

$$\frac{1}{N} \sum_{x \in X} \min_{x' \in X \setminus \{x\}} \lVert x' - x \rVert \qquad (2.1)$$

Fig. 4 shows the results for the redundancy metric. Due to their design, the Knuth and Reservoir techniques perform poorly according to redundancy: the sampled points are too close to each other. Notice in Fig. 3 that the denser the area of the projection, the greater the number of points retrieved by these techniques. The CSM technique, in turn, tends to select more points from high-variability classes (see the sampled points of mammals-8k, mammals-20k, and shuttle in Fig. 3). Our technique is designed not to return redundant points and therefore achieved the best results.

Figure 4 – Redundancy metric, which measures how redundant a sampled set is through the distance from each sampled point to its nearest sampled neighbor. The higher the value, the less redundant the set.
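Equation 2.1 can be computed directly; the following is a minimal NumPy sketch with names of our choosing.

```python
# Hedged sketch: Redundancy (Eq. 2.1) -- mean nearest-neighbor distance
# within the sampled set. Higher is better (less redundancy).
import numpy as np

def redundancy(X: np.ndarray) -> float:
    """X: (N, 2) sampled points in visual space."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)                           # exclude x' == x
    return float(D.min(axis=1).mean())
```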
2.3.2 COVERAGE

We also rely on spatial proximity to assess the coverage of a given representative set. In this case, we evaluate how far each non-sampled point is from the nearest sampled point. Equation 2.2 shows the Coverage metric, where lower values encode better quality; x represents a sampled point in the sampled set X, y represents a non-sampled point in the original set Y, and N = |Y|. Besides analyzing the returned value, the standard deviation of the distances in Equation 2.2 is used to assess whether there are points too far from the sampled set.

$$\frac{1}{N} \sum_{y \in Y} \min_{x \in X} \lVert y - x \rVert \qquad (2.2)$$

Fig. 5 shows the results for the coverage metric. For a few cases (fiber and photographers-36k, for example), SADIRE presents higher values than Reservoir and Knuth. This is due to the fact that the majority of the points selected by those techniques are concentrated in denser areas, so the mean of the lowest distances yields low values. However, the standard deviation shows that a few points are too far from any representative (sampled point), which implies bad results for the Knuth and Reservoir techniques. CSM's results are related to its formulation, in which there is no way to control where to focus when extracting sample points.

Figure 5 – Coverage metric, which assesses whether a sampled set is able to represent all data structures. The standard deviation is represented by error bars; a greater standard deviation indicates that some instances are far from a representative (an instance from the sampled set).

2.3.3 CLUSTERING

As in Joia et al.'s work (JOIA; PETRONETTO; NONATO, 2015), we also evaluated our technique on a clustering task. The k-Means (MACQUEEN, 1967) and Expectation Maximization (EM) (DEMPSTER; LAIRD; RUBIN, 1977) algorithms were applied to the SADIRE and CSM sampled points, and each non-sampled point was then associated with the cluster of the closest medoid; we only compared our technique against CSM because Joia et al. (JOIA; PETRONETTO; NONATO, 2015) had already compared CSM with other methods. The number of clusters was set to the number of classes of each dataset. Fig. 6 shows the silhouette coefficient (TAN; STEINBACH; KUMAR, 2006) for the techniques combined with the EM and k-Means algorithms; values range from −1 to 1, and greater values indicate better cohesion and separation among clusters.

Figure 6 – Silhouette coefficient of the techniques applied for clustering data.

SADIRE performed similarly to the plain clustering processes, reflecting the structures of these datasets. The CSM technique, on the other hand, tends to neglect classes with lower variation, which negatively affects the clustering process.

2.3.4 RUNNING TIME

Fig. 7 shows the running time of SADIRE for the datasets in Table 1. Note that rather than correlating with the size of the datasets, the running times are more related to the density of the clusters imposed by the dimensionality reduction techniques on the R2 embedding; see, for example, the time taken for sampling from the shuttle dataset, which is the second biggest dataset in terms of number of data points. This analysis gives us the insight that, although the worst-case complexity of our algorithm lies in the search for outliers, in practice the complexity of inserting and searching data points in a quad-tree dominates the other steps.

Figure 7 – Runtime of the SADIRE algorithm. The algorithm takes more time when sampling from high-density embeddings.
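For completeness, the Coverage metric of Section 2.3.2 (Equation 2.2), including the standard deviation shown as error bars in Fig. 5, admits the same direct treatment as Equation 2.1; again a minimal sketch with illustrative names.

```python
# Hedged sketch: Coverage (Eq. 2.2) -- mean distance from each non-sampled
# point to its nearest representative. Lower is better; a large standard
# deviation flags points left far from any representative.
import numpy as np

def coverage(X: np.ndarray, Y: np.ndarray):
    """X: (|X|, 2) sampled points; Y: (|Y|, 2) non-sampled points."""
    d = np.linalg.norm(Y[:, None] - X[None, :], axis=-1).min(axis=1)
    return float(d.mean()), float(d.std())
```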
2.3.5 PARAMETER ANALYSIS

This section inspects the influence of SADIRE's k and α parameters. We used the 20 Newsgroups⁶ and MNIST (LECUN; CORTES, 2010) datasets projected with UMAP (MCINNES; HEALY; MELVILLE, 2018) and t-SNE (MAATEN; HINTON, 2008), as shown in Fig. 8.

Figure 8 – UMAP and t-SNE projections: (a) 20 Newsgroups projected using UMAP; (b) MNIST projected using t-SNE.

The projection of the 20 Newsgroups dataset is messier in terms of class boundaries. However, as seen in Fig. 9a, SADIRE conveys the structures and even the outliers of the embedding for lower values of k and α. Figs. 9a and 9b show the influence of the parameters k and α for the 20 Newsgroups and MNIST datasets, respectively. The first characteristic is that the k and α parameters are inversely proportional to the number of representatives and to the redundancy. Moreover, the layout formed by the representative set loses its similarity to the original embedding for greater values of these parameters, misleading user exploration. The second interesting pattern is that all of the layouts using k = 5 show high similarity to the original embedding, while combinations of higher values of k and α produce layouts with poor similarity to the original embedding. These characteristics indicate that k is the most critical parameter.

Figure 9 – Influence of the k and α parameters on the returned representative set: (a) influence of the parameters on the 20 Newsgroups dataset; (b) influence of the parameters on the MNIST dataset.

Depending on the application, SADIRE can be used with various combinations of parameters. However, a suggestion that balances similarity to the original embedding with low redundancy is to use k = 5 or k = 10. To support this guideline, the following section provides a user evaluation of the similarity between SADIRE samples and the original scatterplots.

⁶ http://qwone.com/~jason/20Newsgroups/

2.4 USER EXPERIMENT

In this user experiment, the representative sets were evaluated according to their similarity to the initial embeddings and according to visual clutter. We used four datasets with increasing visual disorder. For each dataset, we used six SADIRE parameter configurations: k = 5, α = 3; k = 10, α = 3; k = 10, α = 5; k = 10, α = 7; k = 20, α = 5; and k = 20, α = 7. The initial embeddings and the layouts produced by sampling with SADIRE are shown in Fig. 10, where circle area encodes density. Due to their similarity to the other layouts, a few configurations were omitted from the figure, such as k = 5, α = 3 and k = 20, α = 3.

Figure 10 – Initial embeddings and the layouts that result from reducing visual disorder with representative selection. SADIRE was applied with several configurations by varying the k and α parameters.

The experiment was performed by 42 participants: two graduate students and 40 computer science students. The participants were given two tasks based on the six representative layouts in Fig. 10:

T1: To order the representative layouts from the least visually cluttered to the most visually cluttered;

T2: To rate (from 0 to 10) the representative layouts according to their similarity to the initial embedding, where 0 stands for the least similar and 10 for the most similar.
For T1, the six layout configurations were presented all at once for the participants to order; we did not provide the initial embedding, to avoid biasing the participants. To facilitate the rating in T2, the six layouts were presented to the participants one by one, after showing the initial embedding.

Fig. 11 depicts the users' perceptions of the layouts produced by the sampling technique. The rows in each heatmap represent the six representative layouts (see Fig. 10) generated by varying SADIRE's parameters, while the columns represent the visual clutter ordering (from least to most visually cluttered) and the similarity rating. The color of each cell encodes how many times participants placed a representative layout in that position (for visual disorder) or assigned that rating to it (for similarity). It should be noted that the majority of the participants agreed on the position or rating assigned to each layout.

In the heatmaps for T1, we can see a pattern indicating that, according to the participants, the visual disorder decreases as the number of representatives decreases. In other words, the participants related the visual clutter of the projection to the number of circles on the plane. For T2, we can see that the similarity of the representative layouts to the initial layout decreases as more points are removed, implying that as the number of data instances projected onto the plane decreases, information is lost. Given these two main patterns, representative layouts that achieve a balance between visual disorder and layout similarity are the best option. Another intriguing pattern can be seen in the lower triangular portion of the heatmaps depicting visual disorder. This pattern can be explained by the fact that users may struggle to analyze representative layouts with fewer data points, at which point the embeddings begin to lose information.

Figure 11 – The outcomes of the user experiments. Each row denotes a unique representative-selection configuration. The heatmap cells encode how many times a representative layout was placed in each order position (visual disorder) and how many times it was assigned each rating (similarity) for each dataset.

There is a link between coverage and layout similarity, as well as between redundancy and visual clutter/disorder. According to the participants' responses, as the number of representatives increases, the resulting set becomes more similar to the initial embedding (a natural observation). Furthermore, the initial embedding is better covered because it is represented by a greater number of data instances. On the other hand, we can see that redundancy is higher in visually cluttered layouts (i.e., very similar data instances are projected close to each other). This analysis backs up our claim that a good representative-selection technique imposes a balance between coverage (as layout similarity) and redundancy (as visual clutter).

2.5 PRACTICAL APPLICATION

Fig. 12 depicts SADIRE's suitability for representing the first level of ExplorerTree (MARCÍLIO-JR et al., 2021a), a hierarchical scatterplot navigation approach.
On the left is an LSP (PAULOVICH et al., 2008) projection of the Corel dataset, with 5,000 data points colored by class. On the right, we show the first level of ExplorerTree using different sampling strategies on the same dataset. Unlike the other techniques, SADIRE uses the sampled data points to cover the entire projected space, while the circle area accurately depicts the densest parts of the projection. Many structures encoded by SADIRE are lost with Reservoir and CSM; these structures are highlighted in orange. Finally, bisecting k-means fails to encode density information well, aggregating data points in a way that does not correspond to the density of the projection; see the structures highlighted in blue.

Figure 12 – ExplorerTree using different sampling algorithms. On the left, the traditional scatterplot representation. On the right, the first level of ExplorerTree using four sampling strategies.

Finally, Appendix A documents the results of additional user experiments and the rationale for using SADIRE in the ExplorerTree technique.

2.6 FINAL REMARKS

Sampling mechanisms are an important tool for dealing with increasing data availability, the high complexity of computational methods, and the interaction requirements of user-centered approaches. In the preceding sections, we showed how our sampling method can be used to generate meaningful representative sets from visual representations of dimensionality reduction (DR) results. As a consequence, SADIRE depends on the ability of DR techniques to represent the underlying data structures of the multidimensional space. This means that our technique is appropriate for visual tasks in which datasets are represented by R2 embeddings and a thorough examination of the DR technique and the feature space has been performed. Another interesting feature of our proposal is that the sampling mechanism can be modified to convey cluster density; this can be accomplished by returning a percentage of data points (or simply more than one data point) from each cell (see Section 2.2.1). Finally, the parameters k and α can be used to adjust the size and redundancy of the representative set.

3 CONTRASTIVE ANALYSIS FOR SCATTERPLOT REPRESENTATIONS OF DIMENSIONALITY REDUCTION

Wilson E. Marcílio-Jr, Danilo M. Eler, Rogério E. Garcia. Contrastive analysis for scatterplot-based representations of dimensionality reduction, Computers & Graphics, Volume 101, 2021.

Analysts inspect clusters in scatterplot representations of Dimensionality Reduction (DR) results to understand data nuances and the features' contributions to the embedding in the projected space. Contrastive analysis (FUJIWARA; KWON; MA, 2019; LE; AKOGLU, 2019) is a promising strategy for analyzing DR results; it consists of understanding how features differentiate clusters and contribute to their formation in the visual space (R2). Such an approach has a wide range of practical applications. A tool for labeling textual data, for example, would benefit from contrastive analysis to highlight the differences among clusters (e.g., topics); due to their distinct characteristics, these clusters can represent candidates for new classes. Another important application is determining which characteristics describe two distinct groups of patients after medical experiments (ABID et al., 2018). Only a few studies provide contrastive analysis (FUJIWARA; KWON; MA, 2019; LE; AKOGLU, 2019).
Instead, the literature presents a small number of studies that support the analysis of DR results via visualization of global information (TURKAY et al., 2012; JOIA; PETRONETTO; NONATO, 2015; COIMBRA et al., 2016; STAHNKE et al., 2016; PAGLIOSA; PAGLIOSA; NONATO, 2016), emphasizing the importance of the features (attributes) on the embedded space. The main approaches use the principal components (PCs) from PCA (JOLLIFFE, 1986) to determine which features contribute to cluster formation (TURKAY et al., 2012; JOIA; PETRONETTO; NONATO, 2015). These contributions, however, do not highlight the clusters' distinct characteristics. ccPCA (FUJIWARA; KWON; MA, 2019), on the other hand, uses contrastive PCA (ABDI; VALENTIN, 2007) to find contrastive information (i.e., the unique characteristics) of each cluster. The main limitation of ccPCA stems from PCA itself, specifically the prohibitively long execution time for high-dimensional datasets: it takes O(m³) on the number of dimensions m. ContraVis (LE; AKOGLU, 2019), another approach, is only applicable to textual data and cannot assist in the interpretation of DR layouts because it is itself a dimensionality reduction approach. Furthermore, that study does not present a strategy for comprehending the contributions of the terms to the embedding organization. In this chapter, we discuss cExpression, a method for analyzing DR results that employs contrastive analysis and a