ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Information Sciences 0 0 0 (2018) 1–21 Contents lists available at ScienceDirect Information Sciences journal homepage: www.elsevier.com/locate/ins An optimized unsupervised manifold learning algorithm for manycore architectures Alexandro Baldassin a , Ying Weng b , ∗, Daniel Carlos Guimarães Pedronette a , Jurandy Almeida c a Department of Statistics, Applied Mathematics and Computing, São Paulo State University – UNESP, Rio Claro/SP, 13506-900, Brazil b School of Computer Science, Bangor University, Bangor, LL57 1UT, United Kingdom c Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo – UNIFESP, São José dos Campos/SP, 12247-014, Brazil a r t i c l e i n f o Article history: Received 30 January 2018 Revised 27 May 2018 Accepted 8 June 2018 Available online xxx Keywords: Multimedia retrieval Unsupervised learning Efficiency Scalability Parallelism a b s t r a c t Multimedia data, such as images and videos, has become very popular in people’s daily life as a result of the widespread use of mobile devices. The ever-increasing amount of such data, along with the necessity for real-time retrieval, has lead to the development of new methods that can process them in a timely fashion with acceptable accuracy. In this paper, we study the performance of ReckNN, an unsupervised manifold learning algorithm based on the reciprocal neighbourhood and the authority of ranked lists. Most of the related work in this field do not fully investigate optimization strategies, an aspect that is becom- ing more important with the high availability of manycore machines. In order to address that issue, we fully investigate optimization opportunities in this article and make the fol- lowing three main contributions. Firstly, we develop an efficient and scalable method for storing and accessing the distances between objects (e.g., video or image) based on dic- tionaries. Secondly, we employ memoization to speed up the computation of authority scores, leading to a significant performance gain even on single-core architectures. Lastly, we devise and implement several parallelization strategies and show that they are scalable on a 72-core Intel machine. The experimental results with MPEG-7, Corel5k and MediaE- val benchmarks show that the optimized ReckNN delivers both efficiency and scalability, highlighting the importance of the proposed optimizations for manycore machines. © 2018 Published by Elsevier Inc. 1. Introduction Multimedia data have become the mean of communication for a growing number of people sharing social content via mobile devices, social networks, and cloud environments. This change in human lifestyle and behavior has increased the availability of information in form of multimedia, enabling the creation of large multimedia collections. In this scenario, there is a growing demand for effective and efficient systems capable of managing such repositories and reduce the work and information overload when seeking a given content of interest. In the last years, several research effort s have been focused on the investigation of approaches to exploit visual or audio properties and a similarity model aiming at supporting users to perform search on multimedia collections in a content-based fashion [6,48] . Usually, these approaches require a user to provide a given multimedia object as input, named as query. ∗ Corresponding author. E-mail addresses: alex@rc.unesp.br (A. Baldassin), yingweng@gmail.com (Y. Weng). https://doi.org/10.1016/j.ins.2018.06.023 0020-0255/© 2018 Published by Elsevier Inc. Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 http://www.ScienceDirect.com http://www.elsevier.com/locate/ins mailto:alex@rc.unesp.br mailto:yingweng@gmail.com https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 2 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Then, it is computed the similarity between the query and each object in the collection based on the distance between their corresponding feature representations. Many different f eatures have been used to represent multimedia content. The first approaches were mostly based on low-level features, such as color, shape, and texture [24,39] . Motion patterns [1] and spatio-temporal properties [13] are some examples of other features used for encoding video content. Recently, mid- and high-level features based on local descriptors [3] and deep learning [46] have emerged as promising alternative solutions. Regardless of the features used for encoding multimedia content, a key issue for content-based multimedia retrieval sys- tems is to determine the similarity among multimedia objects. In general, it relies on distance functions used for comparing feature vectors [7] . However, pairwise distance measures, like the Manhattan or Euclidean distance, often fail to produce effective results. In fact, discovering and understanding the similarity structure of multidimensional patterns, especially in unsupervised scenarios, is of crucial relevance in data mining, pattern recognition, and machine learning applications [22] . Many data mining and pattern recognition employ high-dimensional representations, in which the curse of dimensionality phenomena is often an obstacle. Dimensionality reduction and manifold learning techniques are commonly used in such cases [20] . In retrieval scenarios, this has motivated the use of unsupervised post-processing methods for computing a more effec- tive distance measure among multimedia objects. The basic idea of such methods is to replace pairwise similarities by more global affinity measures [5,31,50] . Diffusion process [18,50] , graph-based learning methods [47] , and iterative re-ranking techniques [36,37] are some approaches proposed in the literature. The intrinsic structure of the datasets has been also ex- ploited by other recent manifold learning methods [2,25,33,34] , yielding significant effectiveness gains in different scenarios. In spite of the importance of unsupervised distance learning methods for improving the effectiveness of content-based multimedia retrieval systems, little attention has been given to efficiency and scalability issues [31] . Apart from the quality of the retrieval process, the time spent to obtain the results and the capability of handling growing multimedia collections are also indispensable [45] . Usually, distance learning methods need to be recomputed whenever significant changes are made in the datasets. Additionally, with the use of unsupervised algorithms as steps of semi-supervised interactive approaches [35] , efficiency aspects became even more relevant. Semi-supervised learning methods exploit both the dataset manifold and user interaction, therefore requiring online performance of unsupervised stages. Efficient and scalable implementations of such methods may be very challenging, demanding parallel and high- performance techniques and trade-off analysis. Some of these challenges are discussed by Valem et al. [45] in the con- text of unsupervised distance learning. Other works provide solutions based on General Purpose Graphics Processing Unit (GPGPU) [44,49] . There are also some works in other fields that propose optimizations such as artificial neural networks [9,15,42,43,51] . In most of the existing works, however, a detailed analysis on efficiency and scalability issues is missing. This article aims to fill the gap by proposing several optimizations concerning the efficient use of memory and paral- lelism. Here, we focus on the Unsupervised Manifold Reciprocal k-Nearest Neighbors Graph [34] algorithm ( ReckNN ), which is based on the reciprocal neighborhood and a graph-based analysis of ranking references. The algorithm estimates the au- thority of ranked lists, spreading the similarity information throughout the dataset by a collaborative score. We opted for ReckNN because it represents a class of unsupervised ranking algorithms. Nonetheless, the optimizations discussed in this work should be equally applied to other ranking algorithms in the literature [32,36] with the same degree of effectiveness. Moreover, the proposed optimizations are aimed at the wide group of manycore processors, virtually present in very com- puter today. In addition, the ReckNN algorithm admit an implementation of linear complexity and storage requirements. However, such implementation depends directly on optimizations aspects discussed in this paper. The major contributions of this paper are: • Starting from a baseline implementation of ReckNN , it proposes a series of optimizations devised for large scale set- tings, such as the use of dictionaries to store the distances and the use of memoization to compute the authority scores ( Section 4 ); • It presents a shared-memory parallelization strategy for ReckNN , providing a solution using Open Multi-Processing (OpenMP) [8] and different synchronization methods ( Section 5 ); • It provides a thorough analysis of the performance with the different optimizations and synchronization methods using a modern 72-core Intel machine ( Section 6 ). Our results show the feasibility of the proposed methods in a large scale learning setting, with speedups of up to 42x. The remainder of this paper is organized as follows. Section 2 presents the problem formulation. Section 3 describes the Manifold Reciprocal kNN Graph algorithm. Section 4 discusses the proposed optimizations. Section 5 describes the parallelization and tradeoffs involved. Section 6 presents the experimental evaluation and discusses the results. Finally, Section 7 draws our conclusions. 2. Problem formulation A formal definition of the rank model considered along the paper is described in this section. General multimedia re- trieval tasks are considered, in which the goal is to retrieve the most similar objects to a given query contained in a col- lection of multimedia objects. A distance is computed among objects and the system response is often defined in terms of ranked lists, which provide an unified representation for distinct multimedia content (e.g., images, videos). Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 3 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Formally, let C = { ob j 1 , obj 2 , . . . , ob j N } be a multimedia collection, where N is its number of objects. Let D be a descriptor which defines a distance function between two objects obj i and obj j as ρ( obj i , obj j ). A common data structure used to store the distances among objects in a collection is a squared distance matrix A , such that A i j = ρ( i, j ). Based on the distance function ρ , a ranked list τ q can be computed in response to a multimedia query obj q . The ranked lists can encode information from the entire collection, although the top positions of ranked lists contain the most relevant information. In this way, it is desirable that the ranked list τ q considers only a subset of the top- n s positions, such that n s � N and n s is a constant value. In order to achieve efficiency and scalability requirements, such limitation is crucial, special for large-scale collections, where the full ranked lists are expensive to compute. The ranked list τq = (ob j 1 , obj 2 , . . . , ob j n s ) can be defined as a permutation of the subset C S ⊂ C, which contains the most similar multimedia objects to the query obj q , such that |C S | = n s . A permutation τ q is a bijection from the set C S onto the set [ n s ] = { 1 , 2 , . . . , n s } . For a permutation τ q , we interpret τ q ( i ) as the position (or rank) of the object obj i in the ranked list τ q . We can say that, if obj i is ranked before obj j in the ranked list of obj q , that is, τ q ( i ) < τ q ( j ), then ρ( q, i ) ≤ ρ( q, j ). Note that, if the position of obj i in the ranked list of obj q is higher than the constant n s , then τq (i ) = n s . Every object ob j i ∈ C can be taken as a query obj q , producing a set of ranked lists T = { τ 1 , τ 2 , . . . , τ n }. The unsupervised learning algorithm aims at computing a new and more effective distance function ˆ ρ taking as input the set T . The objective is to exploit the information encoded in the ranked lists in order to improve the effectiveness of the multimedia retrieval tasks. Formally, the Manifold Learning Reciprocal kNN Graph [34] algorithm can be defined as a function f r which takes a set of ranked lists R as the input and computes a new and more effective set of ranked lists ˆ R = f r (R ) . For readability purposes, we refer to this algorithm as simply ReckNN in future sections. 3. Unsupervised Manifold Reciprocal kNN Graph This section presents the Unsupervised Manifold Reciprocal kNN Graph [34] algorithm. An overview and its main mo- tivations are discussed in Section 3.1 . Section 3.2 presents a formal definition and Section 3.3 provides an algorithmically view for computing the method. 3.1. Overview and motivation The Manifold Reciprocal kNN Graph [34] is an unsupervised manifold learning algorithm which improves the effective- ness of retrieval tasks without the need of any user intervention. The algorithm exploits information about the dataset structure based on the information given by top positions of the ranked lists. In this way, the algorithm is able to propagate the similarity among neighbors taking into account the geometry of the dataset manifold. The ranked lists represent a relevant source of similarity information, since they define relationships not only between pairs of multimedia objects (as distance functions), but also among all multimedia objects represented in a ranked list. The manifold learning approach analyzes the reciprocal references among ranked lists using a graph approach. The distances among multimedia objects are recomputed based on the graph structure, considering all references among multimedia ob- jects at top positions of ranked lists. 3.2. Formal definition The algorithm is mainly defined in terms of three rank-based scores, which are used for computing a new distance measure. The scores and the distance measure are briefly discussed and formally defined below. 3.2.1. Reciprocal kNN score Once the nearest neighbor relationships are not symmetric [17] , the reciprocal nearest neighborhood is a much stronger indicator of similarity than the unidirectional nearest neighborhood [40] , reducing the risk of false positives at top positions of ranked lists. The Reciprocal kNN Score R s considers the position from which objects become reciprocal neighbors, as follows: R s (q, i ) = max (τq (i ) , τi (q )) n s . (1) 3.2.2. Authority score The Authority Score gives an unsupervised estimation of the effectiveness (quality) of each ranked list. The score is computed by analysing the Reciprocal kNN Graph’s density, which represents the quantity of reciprocal references among objects at top positions of ranked lists. The approach is analogous to the PageRank [30] algorithm, considering that an effective ranked list has the best returned objects for a query referencing each other at the top positions of their ranked lists. The ranked lists encode similarity information of the whole dataset, which can be exploited for improving effectiveness of other ranked lists. In this scenario, it is important to estimate the quality of the information provided. The Authority Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 4 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Score is used for this purpose, representing an estimation of the authority of a given ranked list for contributing with other ranked lists. Let N (q, k ) denote a neighborhood set which contains the k most similar multimedia objects to obj q . The score is com- puted in terms of reciprocal references among objects at top- k positions, and is formally defined as follows: A s (q, k ) = ∑ i ∈N (q,k ) ∑ j∈N (i,k ) f in ( j, q ) k 2 , (2) where f in returns 1 if ob j j ∈ N (q, k ) and 0 otherwise. The score A s is defined in the interval [0,1], where 1 represents a perfect score. 3.2.3. Collaborative score A collaborative analysis is conducted, aiming at discovering similarity information encoded in the dataset manifold in or- der to compute a more global affinity measure. In this way, the Reciprocal kNN Collaborative Score exploits the information encoded in the ranked list according to their corresponding authority. The score considers the co-occurrence of objects in ranked lists as an indication of similarity. For example, if two objects appear at the top positions of other ranked lists, they are probably similar to each other. The increase of similarity, however, is defined proportionally to the authority of ranked list which present the co-occurrence. With the aim of giving higher weights to references at top positions, the collaborative score is computed at different values of k (varying from 1 , 2 , . . . , k max ). The collaborative score C s between two objects obj q and obj i is defined as: C s (q, i, k max ) = k max ∑ k =1 ∑ j∈C A s ( j, k ) 2 × f in (q, i, j) , (3) where f in returns 1 if ob j q , ob j i ∈ N ( j, k max ) and 0 otherwise. 3.2.4. Reciprocal kNN distance measure While the Reciprocal kNN Score ( R s ) provides a locally similarity information, the Collaborative Score ( C s ) gives a more globally measure considering the whole dataset. In order to compute a new and more effective distance measure ρr , both information are combined. All objects obj q , obj i ∈ C that present collaborative score C s ( q, i, k max ) > 0 have the distance be- tween them updated as: ρr (q, i ) = R s (q, i ) 1 + C s (q, i, k max ) . (4) The multimedia objects with C s = 0 keep their current ranking information. Formally, the distance is defined as ρr (q, i ) = τq (i ) /n s , if τ q ( i ) ≤ n s or 1, otherwise. The set of ranked lists is updated based on the distance ρr and the process can be repeated iteratively. Additionally, the value of k max grows along iterations, since it is expected that non-relevant objects are moved out from the first positions of the ranked lists and, hence, k max can be increased for considering more objects. 3.2.5. Convergence The iterative redefinition of distances based on rankings are repeated until convergence. For the ReckNN method, a con- vergence criterion was defined in order to determine the number of iterations. Once the Authority Score gives an estimation of the quality of ranked lists, the convergence criterion exploits such information. The process is computed while the authority of ranked lists is increasing more than a threshold ε per iteration. Aiming at verifying the convergence criterion, an average authority score between all objects in the collection is computed as follows: G s = ∑ k max k =1 ∑ j∈C A s ( j, k ) k max × n (5) Then, the difference between iterations is compared with the threshold ε in order to define the convergence criterion: (G (t+1) s − G (t) s ) > ε. 3.3. The algorithm As discussed in previous work [34] , the constrained size of ranked lists (to top- n s positions) allows the development of an algorithm which assumes computational and storage requirements of only O ( n ). However, for ensuring the complexity of O ( n ), various efficiency aspects, such as data structure, should be properly considered. Since the focus of [34] was the effectiveness evaluation, only a direct and non-optimized implementation of the method was initially devised. This section describes the algorithm used in [34] for computing the ReckNN method, while next sections present the proposed optimiza- tions, a detailed discussion about efficiency aspects and the evaluation on manycore architectures. Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 5 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Algorithm 1: Baseline ReckNN Algorithm. Input : R : set of ranked lists; C: Collection of all objects Output : ˆ R : set of improved ranked lists 1 ConvScore ← 0 ; 2 ˆ R ← R ; 3 repeat 4 Initialize distance matrix A ; 5 PrevConvScore ← ConvScore ; 6 AcumScore ← 0 ; 7 foreach τq ∈ ˆ R do 8 ob j q ← τq (1) ; 9 for k ← 1 to k max do 10 AuthScore ← 0 ; 11 foreach ob j i ∈ N (q, k ) do // A s (q, k ) 12 foreach ob j j ∈ N (i, k ) do 13 if ob j j ∈ N (q, k ) then AuthScore + + ; 14 end 15 end 16 AuthScore ← AuthScore /k 2 ; 17 foreach ob j i ∈ N (q, k ) do // C s (q, i, k ) , j = q 18 foreach ob j j ∈ N (q, k ) do 19 A [ i, j] ← A [ i, j] + AuthScore 2 ; 20 end 21 end 22 AcumScore ← AcumScore + AuthScore ; 23 end 24 end 25 foreach ob j q ∈ C do // ρr (q, i ) 26 foreach ob j i ∈ C do 27 if A [ q, i ] > 0 then 28 R s (q, i ) ← max (τq (i ) , τi (q )) /n s ; 29 A [ q, i ] ← R s (q, i ) / (1 + A [ q, i ]) ; 30 end 31 end 32 end 33 foreach τq ∈ ˆ R do 34 Sort RankedList (τq ) ; 35 end 36 ConvScore ← AcumScore /( k max * n) // G s 37 k max ++; 38 until ( ConvScore − PrevConvScore ) ≤ ε; Algorithm 1 outlines a direct solution for computing the ReckNN method. The algorithm takes a set of ranked lists R as input and produces ˆ R , a new set of improved ranked lists using the Reciprocal kNN Graph context information. A dis- tance matrix A , initialized in line 4, is used as the data structure for storing the distances and scores used throughout the algorithm. The main goal of each iteration is to compute a new distance among multimedia objects, resulting in a new set of ranked lists. The distances are computed based on the scores defined in Section 3 . Therefore, the first step consists in computing the Authority Score ( Eq. (2) ), given by lines 11–22. The Authority Score is computed for each ranked list (loop of line 7) and considering different values of k (loop of line 9). The Authority Score is used for computing the Collaborative Score, defined in lines 17–22. The Reciprocal kNN Score is computed in line 28 and the new distance in line 29. The steps from line 4 to line 37 are repeated until a convergence criterion is reached, calculated by subtracting the previous convergence score from the current one and comparing the result against the threshold ε. 4. Optimizations In this section, we introduce two optimizations over the baseline version of ReckNN presented in Section 3.3 . Firstly, the use of a dictionary is suggested in Section 4.1 , followed by a technique based on memoization to speedup the computation of the authority scores in Section 4.2 . While the former optimization aims at decreasing space complexity, the latter is designed to decrease execution time, thus improving performance. Both optimizations were guided by an earlier profiling stage. Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 6 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Fig. 1. A dictionary is implemented as an array of hash tables. Given a pair of objects ( obj q , obj i ), the index of the first object is used to select the q th hash table, while the second object is hashed and gives the position of the slot of the selected hash table in which the distance is to be stored. 4.1. Storing the distances As discussed previously, the ReckNN algorithm computes the distance ρr ( q, i ) between any pair of objects, obj q , obj i ∈ C , that appear in the set of ranked lists R . Based on the distance ρr , the set R of ranked lists is updated by ensuring that, if ρr ( q, i ) ≤ρr ( q, j ) then τ q ( i ) < τ q ( j ), thus generating a new set ˆ R of ranked lists. The baseline ReckNN algorithm makes use of a n × n matrix to store the distances, as also shown in Algorithm 1 (lines 19 and 28). Although this seems reasonable for small collections, it clearly does not scale to larger sets as its space complexity is O ( n 2 ). Besides the space complexity, there are other unsatisfactory side effects resulting from a matrix representation. Firstly, notice that all pairs of objects that present a collaborative score greater than zero need to have their distances updated, as shown in lines 25–32 of Algorithm 1 . Since the same matrix A is also used to store the collaborative score in the first part of the algorithm (line 19), one cannot predict in advance which pairs of objects have a nonzero distance, and thus all pairs must be checked, resulting in a time complexity of O ( n 2 ). Secondly, the sorting procedure (line 34) may promote objects not initially in a given ranked list if a new computed distance is significative (not shown in the algorithm). When a matrix is used, for each ranked list τ q , it is necessary to check the distances ρr ( q, i ) for every obj i ∈ C , whose time complexity is also O ( n 2 ). Note also that most entries in the matrix A will be empty since the number of objects in a ranked list is n s , which tends to be much smaller than n . The matrix design clearly trades off easy of implementation and fast accesses for space and increased time complexity. Therefore, a better strategy is needed in order for the ReckNN algorithm to be usable in a large scale learning setting. Our optimization employs a dictionary data type for that purpose, also known in the literature by map or associative array [26] . A dictionary is composed of a collection of ( key, value ) pairs, such that a key appears at most once in the collection. For the ReckNN algorithm, a key is a pair ( obj q , obj i ) of objects, and the value represents the distance between them. For good performance, it is necessary to have fast operations for searching a key and also updating it with a new value. Typical implementations of a dictionary make use of either a hash table or a search tree [11] . Considering the requirements for good performance and space limitations, our solution implements a dictionary based on an array of hash tables, with one hash table for each object, as illustrated in Fig. 1 . Given a pair of objects ( obj q , obj i ), its respective distance is located within the dictionary in two steps: (i) the q th hash table is selected using the index of the first object in the pair ( obj q ); (ii) a hash function is applied to the index of the second object ( hash ( i )), yielding a slot within the selected hash table with the distance (notice that the hashed key is also stored in the hash table, but it is not shown in Fig. 1 ). For performance reasons, each hash table is implemented as an array with open addressing. When a collision happens (i.e., the key of the located slot does not match the hashed key), a linear probing with a interval of 1 is used to locate a next spot. For this kind of structure, it is very important to have a good hash function, since open addressing has the tendency of creating clustering of hash values that are consecutive in the probe order. Our implementation makes use of Knuth’s multiplicative hash function since it is known to be simple and fast [19] , as required by our optimization. Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 7 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Another important issue in the design of hash tables with open addressing is the load factor: the ratio between the number of entries and the total number of slots in the hash table. A small value is desired, since a value closer to 1 means that the hash table is full. The number of slots in each hash table in our implementation is proportional to n s , as indicated by Fig. 1 . Recall that n s is the number of objects in a ranked list and therefore is a good approximation for the size of each hash table. Compared to the O ( n 2 ) space complexity of the matrix implementation, the dictionary approach is only O ( n × n s ). Since n s is independent and smaller than n , the dictionary space complexity is practically O ( n ). Also, the time complexity to update the distances and promote new objects during the sorting procedure is now proportional to O ( n × n s ), since it is only required to traverse the hash table of size O ( n s ) for a given ranked list. As an example, Algorithm 2 is equivalent to lines 25–32 of Algorithm 1 when a dictionary is used. Notice that for each Algorithm 2: Computation of Collaborative Authority Scores Using a Dictionary 1 ... foreach ob j q ∈ C do 2 foreach (key, v alue ) ∈ HT [ q ] do 3 R s (q, key ) ← max (τq (key ) , τkey (q )) /n s ; 4 v alue ← R s (q, key ) / (1 + v alue ) ; 5 end 6 end 7 ... object obj q (line 2), we traverse each pair ( key, value ) in its respectivehash table, HT [ q ] (line 3), computing and storing the distance back in the dictionary (lines 4 and 5). We give more details about the parameters of our implementation, as well as comment on different values for n and n s , when discussing the performance results in Section 6 . 4.2. Memoization of authority scores Lines 11–22 of Algorithm 1 are responsible for computing the authority score A s ( q, k ) of ranked list τ q (lines 11–16) for a given k , and then using this value to update the collaborative score C s for each object pair (ob j i , ob j j ) ∈ N (q, k ) (lines 17– 22). This process is repeated for increasing values of k according to the outer loop of line 9. In particular, the computation of the authority score A s ( q, k ) requires going through the first k objects of the ranked list of q ( N (q, k ) ) and, for each object obj i , to check if the first k objects of its ranked list ( N (i, k ) ) is present in N (q, k ) . Notice that, for increasing values of k , much of the computation already performed previously (e.g., for k − 1 , k − 2 , ...) is computed again. In other words, when computing C s ( q, k ), if we could remember previous values for C s (q, k − 1) , the time efficiency of C s ( q, k ) could be improved. The goal of this optimization is to memoize (i.e., remember) previously computed values of C s (q, l) , l ∈ { 0 , 1 , 2 , . . . , k − 1 } , in order to compute C s ( q, k ). Lines 4–12 of Algorithm 3 presents the memoized version of A s ( q, k ) and replaces lines 11–16 Algorithm 3: Memoization Optimization 1 ... LastK ← 0 ; 2 for k ← 1 to k max do 3 ob j l ← τq (k ) ; 4 foreach ob j i ∈ N (q, k − 1) do 5 if ob j l ∈ N (i, k − 1) then LastK + + ; 6 if τi (k ) ∈ N (q, k ) then LastK + + ; 7 end 8 foreach ob j i ∈ N (l, k ) do 9 if ob j i ∈ N (q, k ) then LastK + + ; 10 end 11 AuthScore ← LastK /k 2 ; 12 ... 13 end 14 ... Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 8 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Fig. 2. Computation of A s ( q , 4) using memoized values of A s ( q , 3). Each of the parts represent the new (thick lines) and memoized (regular lines) connec- tions for each object in N (q, 4) : a) ranked list of obj j ; b) ranked list of obj x ; and c) ranked list of obj l . of Algorithm 1 . The memoized values are accumulated incrementally and stored in a new variable named LastK , reset before entering the loop that increments k (line 2). To better understand Algorithm 3, Fig. 2 illustrates the extra work (thick lines) necessary to calculate A s ( q , 4), assuming the value of A s ( q , 3) is already memoized (regular lines). Initially, notice that the extra computation refers to the new kth object (line 4), which is obj l in the given example (last object of τ q ). Two main steps are required. Firstly, lines 5–8, it is necessary to check in the ranked lists of the first k − 1 objects if: (i) obj l is now present in the top k − 1 positions (line 6) – this is the case for τ j (3) in Fig. 2 a (represented with a full thick line); (ii) the last object ( τ i ( k )) is in the ranked list of obj q ( N ( q, k )) (line 7) – this is the case for obj q and obj l in Fig. 2 a and Fig. 2 b, respectively (represented with a dashed thick line). Finally, the algorithm checks whether each object in N (l, k ) is present in N (q, k ) (lines 9–11), incrementing the value of LastK if the condition is true. This is the case for the first three objects of τ l , as shown in Fig. 2 c. Line 12 then updates the authority score by dividing the accumulated values by k 2 . It should be noticed that Fig. 2 does not show the trivial connections of τ q with itself (which is always k , from which k − 1 has been already memoized), only with τ j ( Fig. 2 a), τ x ( Fig. 2 b), and τ l ( Fig. 2 c). Therefore, for the given example, the authority score A s ( q , 4) is equal to 14/4 2 . The time complexity for the computation of an authority score in Algorithm 1 (lines 11–16) is O ( k max 3 ), since the membership test in line 13 also needs to iterate through every object in N (q, k ) . On the other hand, the memoized version of the authority score computation given by Algorithm 3 is only O ( k max 2 ). Although k max can be considered a constant, the authority score is computed for every object in the collection, with different values for k , resulting in a significant portion of the execution time. As such, this optimization provides an important performance boost, as discussed in Section 6 . 5. Parallelization In this section a parallel version of ReckNN aimed at multicore processors is devised. Most of current processors provide at least two execution cores, and thus exploiting parallelism is an important characteristic for large scale learning algorithms. In particular, the parallelization techniques we present in this section are orthogonal to the optimizations described in the previous section ( Section 4 ). We start by giving a brief background on parallel computing in Section 5.1 , followed by the parallelization strategy in Section 5.2 , thread management in Section 5.3 , and synchronization techniques in Section 5.4 . 5.1. Background Parallelization involves splitting the work evenly among different threads such that they can execute concurrently and independently of each other. Ideally, if N threads are used to do the same job that required T units of time sequentially, one would expect a parallel execution time of T / N . In this case, we say that the speedup achieved by the parallelization is T T /N , or simply N . Notice that the speedup is calculated by dividing the original sequential time by the new parallel time. As a consequence, speedup numbers are expected to be higher than one, for otherwise the parallel version is slower than the sequential one, and the term slowdown is used instead. Achieving a perfect speedup is, however, not achievable in practice due to a host of different factors. Firstly, it is not always easy to split the work evenly across the working threads, in which case it is said that the parallelization suffers from load imbalance . Secondly, not all parts of a serial program may be amenable to parallelization. The impact on the speedup due to serial fragments is given by Amdahl’s Law [4] : S(N) = 1 (1 −p)+(p/N) , where p is the fraction of the program that was parallelized, and N is the number of threads or processors. Amdahl’s law states that the speedup is limited by the serial fragment ( 1 − p) so that, if 10% of the program is inherently serial, the maximum speedup is 10 regardless of the number of processors N employed. Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 9 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Finally, speedup can be affected by the amount of communication performed by the working threads. For instance, a given part of the code may only be executed after all threads communicate their partial results to others. Communication is particularly important in distributed systems as the link latencies are more notable. However, current shared memory archi- tectures are adopting non-uniform memory access (NUMA) interconnect interfaces and communication is starting to become a bottleneck. This is particularly important for the parallelization strategies presented in this paper, since the focus of the work is on multicore architectures. These architectures use the shared memory paradigm, wherein a thread may communi- cate to others by reading/writing to the same address space. When multiple threads are acessing the same position in this address space, and at least one of them is writing, consistency errors may appear. One typical consistency error is known as the lost update , manifested when two or more threads update the same memory position and some values are overwrit- ten by other threads. These types of consistency errors do not always manifest themselves, as they are timing-dependent, a scenario known in the literature as race condition . Consistency errors are usually avoided by means of synchronization . A popular form of synchronization is to acquire a lock before attempting to update the data, releasing the lock afterwards. The region of code protected by the lock is commonly referred to as the critical section . 5.2. Parallelization strategy A brief look at Algorithm 1 reveals a lot of potential for parallelism. For instance, there are three big loops whose bodies could be executed concurrently with almost no interference: (i) authority score loop (lines 7–24); (ii) distance computation loop (lines 25–32); and (iii) sorting loop (lines 33–35). Notice, however, that these loops need to be executed in order: it is not possible to start the distance computation loop before completing the authority score computation loop. Since the focus of our optimizations are on manycore architectures, our parallelization strategy assumes a shared- memory model. A natural way of devising a parallelization of ReckNN is to evenly split the data (i.e., the ranked lists or collection of objects) among the threads that perform the same task which, in our case, is simply the actions contained in the body of the three main loops identified previously. Such strategy is known as data parallelism in the literature. Another alternative would be to use task parallelism , with threads performing different tasks simultaneously. However, as can be seen from the description of Algorithm 1 , there is not a large number of different tasks to justify using task parallelism. One way to perform the required parallel computation is known as fork-join : before entering each of the loops, N threads are forked and the data is split evenly among each one – each of the threads perform the computation on R /N distinct ranked lists or C/N distinct objects; when the loop is over, the threads are joined back together before continuing to the execution of the next step. This particular type of parallelism is also known as loop-level parallelism , since what happens in practice is that a particular thread is responsible to execute a given number of iterations of each loop. As long as iterations are independent from each other, they can be executed by the threads concurrently. It is easy to see that the iterations of the second (distance computation) and third (sorting) loops are completely inde- pendent: the former because the computation performed in lines 28–29 only affects A [ q, i ], which can only be written by the thread assigned to object q ; the latter because the sorting of one ranked list in line 34 does not depend on any other ranked list. Unfortunately, the same cannot be said about the authority score loop. Notice that the update performed to A [ i, j ] in line 19 may be executed simultaneously by multiple threads. For instance, there might be objects ( i, j ) in multiple ranked lists assigned to different threads, in which case the threads will attempt to update A [ i, j ] concurrently, given rise to the lost update problem discussed earlier in this section. Moreover, the update of AcumScore in line 22 also exhibits the lost update problem. To summarize the preceding discussion, a concurrent implementation of ReckNN requires a parallel infrastructure that allows: (i) the fork and join of threads so that the iterations of the three main loops can be evenly assigned to different threads; (ii) a way to force thread ordering between the loops, so that no thread is allowed to enter the next loop until all of them have finished the previous one; (iii) a synchronization strategy to avoid the lost update problem presented by the authority score loop. OpenMP (Open Multi-Processing) [8] is an Application Programming Interface (API) for shared memory multiprocessing that meets all these requirements. It consists of a set of compiler directives, library procedures and environment variables, and is supported by a group of major software and hardware vendors (e.g., Intel, AMD, Nvidia). Pthreads [28] could also be used in the parallelization process, but OpenMP has a clear advantage given its high-level API which further translates into higher productivity and portability. 5.3. Thread management and affinity One of the main advantages of using OpenMP is that one can start from a working sequential version of the code and incrementally add compiler directives so as to build a parallel version. In order to show how OpenMP can be employed to implement a parallel version of ReckNN , we will use a syntax close to the one used by the C language. In this language, OpenMP uses pragmas do specify parallel constructs. All OpenMP directives must start with #pragma omp . Of particular interest here is the #pragma omp parallel for directive, since it specifies that the iterations of the loop following the directive must be executed in parallel. Algorithm 4 highlights how this directive can be applied to the sequential version of Algorithm 1 . Notice that the pragmas are used just before the loops (lines 7, 10, 13). Recall from Algorithm 1 that the variable AcumScore is summed up in the first loop and then used to calculate the convergence score. By default, the scope of all variables entering a parallel region in OpenMP is global. As a result, AcumScore may also suffer Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 10 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Algorithm 4: Parallelization of ReckNN through OpenMP directives. Input : R : set of ranked lists; C: Collection of all objects Output : R ′ : set of improved ranked lists 1 ConvScore ← 0 ; 2 R ′ ← R ; 3 repeat 4 Initialize distance matrix A ; 5 PrevConvScore ← ConvScore ; 6 AcumScore ← 0 ; 7 #pragma omp parallel for reduction(+: AcumScore ) 8 foreach τq ∈ R ′ do // calculation of authority and collaborative authority scores 9 end 10 #pragma omp parallel for 11 foreach ob j q ∈ C do // ρr (q, i ) // calculation of the distance 12 end 13 #pragma omp parallel for 14 foreach τq ∈ R ′ do 15 Sort RankedList (τq ) ; 16 end 17 ConvScore ← AcumScore /( k max *n) // G s 18 k max + + 19 until ( ConvScore − PrevConvScore ) ≤ ε; from the lost update problem. This can easily be solved by adding the OpenMP clause reduction . It specifies the type of the reduction (e.g., addition, multiplication) and the variable (of scalar type) to be reduced. In line 7 of Algorithm 4 , OpenMP will create a private copy of AcumScore for each thread and add their values at the end of the parallel loop. The parallel for directive also implicitly forces the threads that finished their computation to wait for the others before proceeding, thus solving the thread ordering problem. Another important aspect of the parallel for construct is how the iterations of the loops are divided among the threads. OpenMP provides a clause, schedule (not shown), that allows different forms of scheduling. In our case, this division is done statically since the number of ranked lists R and objects C are known for each dataset. Roughly, with a total of N threads, each one will process a total of |R| /N or |C| /N items. Notice that the amount of work performed by each thread is virtually the same, thus providing a good load balancing. One final important detail about thread management involves the assignment of threads to cores at runtime. Usually the operating system is responsible for binding threads to cores but, as setting the right thread affinity can have a dramatic effect on performance depending on the machine topology, parallel runtimes (such as OpenMP) also provide ways to pin threads to physical processing units in order to make the most out of multicore machines. Fig. 3 exhibits a machine with two sockets or processors, each providing two cores with two thread contexts (TC) each. A thread context is a processing element as seen by the operating system. In the referred figure, threads can be bound to eight different TCs. The figure also shows 3 different thread affinity settings: (i) Compact (C), assigns a thread to a free TC as close as possible to a previous one (same core); (ii) Scatter (S), distributes the threads as evenly as possible across the entire machine; and (iii) Balanced (B), binds threads to at most one TC per core, with cores as close as possible to a previous one (same socket) – a mix of Compact and Scatter . While Compact provides better locality, Scatter benefits threads which do not need to share data (each one can have the entire cache for itself, for instance). The reasoning behind Balanced is that TCs belonging to the same core are usually implemented as simultaneous multithreading (e.g., Intel’s Hyperthreading), which allows some functional units to be shared among TCs. Therefore, allocating only one TC per core tends to maximize its performance. The best affinity is application-dependent and we discuss the case of ReckNN when analyzing its performance in Section 6.1 . 5.4. Synchronization Recall that the main synchronization issue with the parallelization of ReckNN concerns the update of the distance matrix (line 19 of Algorithm 1 ). Recall also that the update of AcumScore (line 22 of Algorithm 1 ) was solved by using a reduction mechanism, wherein private copies of AcumScore are created for each thread, and their values summed up at the end. While in theory it would be possible to apply the reduction strategy to solve the problem with the update of the distance Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 11 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Fig. 3. Different thread affinities for a dual socket, 4-core machine, each with 2 thread contexts (TC) per core: Compact (C), Scatter (S), and Balanced (B). matrix, notice that it would require the creation of multiples copies of the matrix (or the dictionaries in the case of the optimization discussed in Section 4.1 ), which is really not an option in a scalable setting. The creation of multiple copies of large data structures also tends to pollute the processor cache, thus decreasing performance. One possible approach to synchronize the access to the distance matrix is to use critical sections. OpenMP provides the #pragma omp critical directive for that purpose. One downside of using critical sections is that there is an overhead to acquire and release the lock. Moreover, critical sections tend to serialize the code with, according to Amdahl’s Law discussed earlier, negatively impact the performance. Current processors provide atomic instructions that can read, modify, and write a single datum atomically, that is, with- out the interference of any other instructions. One such popular instruction is called Compare-And-Swap (CAS), virtually present in all existing processors. As long as the data to be modified fits the word size of the processor architecture (i.e., 64 bits), a CAS instruction can be used. Each element of the distance matrix is a floating point, single-precision, data, which on most 64-bit architectures are 4 bytes, or 32 bits. Therefore, the update could be implemented using a CAS instruction. OpenMP provides the #pragma omp atomic for that purpose and was the preferred method employed to synchronize the access to the distance matrix. We did run some experiments using the first two approaches (reduction and critical sections), but their performance were subpar. The synchronization for the update of the distances using the dictionary-based optimization is slightly more complicated. While it is possible to protect each update using a critical section, this solution does not scale because it serializes the entire operation. Thus, our solution uses a non-blocking algorithm for updates in the dictionary. Non-blocking algorithms usually make use of CAS operations briefly introduced earlier. A CAS instruction takes three arguments: a memory location, an expected value, and a new value to be written. It then compares the contents of location with the expected value and, if they are equal, it atomically assigns the new value to location. Otherwise, location is left untouched. A boolean value is returned indicating whether the replacement occurred. The pseudo-code for the lock-free dictionary update operation is presented in Algorithm 5 . It is a simplification of similar approaches in the literature [27,29] , particularly tailored to our problem. In particular, we make the following simplifications: (i) the value zero (0) is used to represent empty slots (keys must be non-zero); (ii) each hash table contains a fixed amount of entries; and (iii) there is no delete operation. Also, notice that the only time that updating the distances require synchro- nization is when the authority score is calculated in the first loop. The search and insert operations performed as part of the second and third loops are private to a specific hash table, and thus they do not need to be thread-safe. Algorithm 5 takes as input a key, comprised of an object pair ( obj q , obj i ), and a value to be added ( val ). Initially, all entries (keys and values) are zero. In the first step (line 1), the correct hash table is selected using obj q as index. Next, the position inside the selected hash table ( Index ) is calculated according to a hash function (line 2). The loop from lines 4 to 18 is responsible for searching a key in the selected hash table (line 6) such that its value is either zero (empty) or matches obj i (line 7). If that is not the case, it probes the next position until all slots are eventually inspected (lines 17 and 18). If an empty slot is found (line 8), it attempts to atomically update the position using obj i as the key (line 9). If it fails, it retries by returning to the beginning of the loop (line 10). Otherwise, the CAS operation succeeded and the new key obj i was correctly written in the hash table. It then proceeds to atomically update the value (line 14). Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 12 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Algorithm 5: Lock-free dictionary update. Input : (ob j q , ob j i ) : key comprised of the object pair; v al: distance between the pair (ob j q , ob j i ) ; Output : true if success; false if dictionary is full; 1 HTable ← HT [ ob j q ] ; 2 Index ← hash (ob j i ) ; 3 Start ← Index ; 4 repeat 5 Index ← Index % NK; 6 Key ← HTable [ Index ] .key ; 7 if ( Key == 0) || ( Key == ob j i ) then 8 if Key == 0 then 9 if ! CAS( HTable [ Index ] .key, 0 , ob j i ) then 10 continue ; 11 end 12 end 13 #pragma omp atomic 14 HTable [ Index ] . v alue = HTable [ Index ] . v alue + v al; 15 return true ; 16 end 17 Index ← Index + 1 ; 18 until Index ! = Start ; 19 return false ; To understand how Algorithm 5 provides correct synchronization, consider two different scenarios. Firstly, assume that two different threads are trying to update the same entry in the hash table with different keys. Because of the CAS operation, at most one will succeed. The other will retry and both comparisons at line 7 will fail. It will then try to update another slot. Now consider two threads updating the same entry with the same key. Again, only one will succeed, but when the other thread retries, the second comparison at line 7 will now evaluate to true, and the thread will correct update the value at line 14. Notice also that the update in line 14 needs to be atomic, for there is a chance of two or more threads with the same key to update the value at the same time. The last synchronization alternative we employed in the parallelization of the ReckNN algorithm is Hardware Lock Elision (HLE). This synchronization approach was proposed about 15 years ago [41] , but only recently appeared on Intel processors as an instruction set extension [12] . From a programmer’s point of view, the code is synchronized using lock and unlock operations. At runtime, the hardware first elides the acquisition of the lock and execute the critical section optimistically. If no other thread changes any of the memory locations accessed by the thread executing the critical section, the unlock operation is also elided and the changes are made permanent. If one or more threads also enter the critical section and their accesses conflict with other threads, the hardware will abort the execution of the conflicting threads and retry pes- simistically by acquiring the respective lock. With HLE, as long as there are no conflicts at runtime, threads can execute the critical section concurrently. 6. Experimental evaluation In this section, we provide a quantitative analysis of the optimizations described in previous sections on a modern mul- ticore machine. Section 6.1 describes the methodology, platform and datasets employed in the experiments, followed by the analysis of the impact of the optimizations in Section 6.2 . The performance achieved by the proposed parallelization strategy and different synchronization methods is discussed in Section 6.3 . Finally, Section 6.4 discusses the impact of not using any synchronization. 6.1. Setup For the experiments, four specific versions of the ReckNN algorithm were implemented using the C language, according to the different optimizations proposed in this paper: • Naive : the baseline implementation using a distance matrix ( Algorithm 1 ). Notice that this is the original version as proposed in [34] ; • Dictionary : replaces the distance matrix with a dictionary, as discussed in Section 4.1 . After some tests, we opted to use Knuth’s multiplicative hash function [19] since it provided the best results. The size for each hash table was set to n s /1.5, providing a load factor of about 50%; Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 13 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] • Memoiz : makes use of the memoization optimization described in Section 4.2 , while still using the distance matrix; • Optimized : both dictionary and memoization are present. To help in the analysis, the results present execution time for three different phases of ReckNN : • Initialization : responsible for initializing the main data structures – roughly the distance matrix in Naive and Memoiz , and the dictionary in Dictionary and Optimized ; • AuthScore : calculates the collaborative authority score – the first loop of Algorithm 1 (lines 7–24); • DistSort : updates the distances and sorts the ranked lists – second and third loops of Algorithm 1 (lines 25–36). The experimental evaluation was conducted on three different datasets, with different characteristics. Table 1 presents a general description of the datasets used. Diverse multimedia content is also considered, involving two image datasets and one video dataset. The MPEG-7 and Corel5k image datasets became de facto benchmarks and are commonly used in the literature for evaluation and comparison of image retrieval methods. The use of such common benchmarks allow the comparison and reproducibility of experiments. Since the video retrieval literature is more recent, the MediaEval dataset was chosen due to its use in a relevant competition in the area. Three different visual descriptors are considered, defined according to the dataset, including color, shape, and motion features. Since the evaluation is focused on efficiency aspects, the same parameters settings are used for all datasets. The ReckNN algorithm consider three parameters: k max , ε, and n s . The parameters k max and ε are related with the neighborhood size and convergence, which have direct impact on the effectiveness results. In order to evaluate the optimizations and efficiency aspects, without affecting the effectiveness of the method, the values suggested by the original paper [34] are used ( k max = 15 and ε = 0 . 0125 ). Regarding the n s parameter, which defines a trade-off between effectiveness and efficiency, different values were used according to the dataset ( Table 1 ) with the aim of evaluating distinct scenarios and the impact on the efficiency results. The machine and system configurations used in the experiments are described in Table 2 . The machine topology is similar to that in Fig. 3 , but each of the 2 sockets has 18 cores, each with 2 thread contexts. As a result, there are 72 different cores from the operating system point of view. Like most Intel processors, there are 2 private levels of cache (L1 and L2), with the last level (L3) shared between all cores in the socket. There is also a total of 128 GB of DRAM available. The access to memory is not uniform (NUMA), that is, if a memory position accessed by a core in the first socket is present in the DRAM slot assigned to the second socket, the latency will be higher than if the memory was present in the first socket. The Linux kernel version used (3.10.0) automatically deals with distributing and migrating memory pages around in order to reduce the NUMA impact (this feature is commonly known as AutoNUMA [10] ). Finally, the GNU C Compiler was used to compile all versions of ReckNN with optimization flag -O3 . As can be seen, our evaluation makes use of a high-end machine with an Intel 72-core processor. This is very important in order to assess the effectiveness of the proposed optimizations since they were devised for manycore architectures. We expect to see the same performance trends for lower configurations but with reduced levels of parallelism. Each experiment was executed 10 times and the results discussed in the next subsections represent the averaged ex- ecution time. In all cases, the confidence interval (95%) was calculated and the results are all consistent. The scalability evaluation performed in Section 6.3 uses the following increasing thread configurations: 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, and 72. This large variation is enough to appropriately assess the scalability of the system given the chosen machine. We also considered the thread affinity configurations discussed in Section 5.3 in our experiments. Overall, the execution time we measured with all of them were similar when the confidence interval is considered. Therefore, the results presented here use the Balanced policy as default. The only exception we found concerning the choice of thread affinity was for the synchronization method based on HLE, which is further explained in Section 6.3 . 6.2. Impact of optimizations In this section, we quantity the impact of the several optimizations proposed in this paper. Fig. 4 shows the sequential execution time for the three datasets: MPEG-7 ( Fig. 4 a), Corel5k ( Fig. 4 b), and MediaEval ( Fig. 4 c). Firstly, notice that the execution time for each dataset is proportional to their collection size. Therefore, MPEG-7 exhibited the fast execution time, followed by Corel5k and MediaEval . Another important behavior displayed by Fig. 4 is that the impact of the Initialization phase is negligible for MPEG-7 and Corel5k , but it starts to become relevant for MediaEval when the distance matrix is used ( Naive and Memoiz ) due to its larger size. On the other hand, the dictionary used by Dictionary and Optimized does not exhibit any significant overhead since the time required to initialize this data structure is O( n ) instead of O( n 2 ). For all the datasets, the reduced execution time provided by the memoization optimization ( Memoiz and Optimized ) in the AuthScore phase is very noticeable. Consider, for instance, Corel5k : the AuthScore phase of Memoiz is 4.9x faster than the same phase of Naive (1.36s versus 6.63s). If we compare the AuthScore phase of Memoiz against Optimized , it can be seen that the latter is slower in all datasets, because the time required to access the dictionary is higher than accessing the distance matrix. However, the contrary is true if we consider the DistSort phase: the use of the dictionary in Dictionary and Optimized makes this phase faster as the time required to update the distance and sort the rank lists is proportinal to O ( n × n s ) instead of O ( n 2 ). Since values of n and n s are similar in Corel5k , it does not benefit from this feature provided by the dictionary. Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 1 4 A . B a ld a ssin et a l. / In fo rm a tio n Scien ces 0 0 0 (2 0 18 ) 1 – 2 1 A R T IC L E IN P R E S S JID : IN S [m 3 G sc; Ju n e 2 0 , 2 0 1 8 ;1 4 :4 9 ] Table 1 Datasets, descriptors and parameters used in the experimental evaluation. Dataset Content Type General Description Descriptor Dataset Size ( n ) Rk. List Size ( n s ) # Classes Iterations MPEG-7 Images of shapes A well-known dataset [21] composed of 1400 shapes divided in 70 classes. Commonly used for evaluation of unsupervised learning approaches. Aspect Shape Context (ASC) [23] 1400 200 70 6 Corel5K Color scenes A heterogeneous sub-set [14] of Corel image dataset, consisting of similar semantic content (such as beach, bird, mountain, jewelry, sunset, etc). Auto Color Correlograms (ACC) [16] 50 0 0 50 0 0 50 12 MediaEval Videos Provided by the MediaEval 2012 organizers, the dataset contains 3288 h of video collected from the blip.tv, distributed among 26 genre categories. Histogram of Motion Patterns (HMP) [1] 14,838 10 0 0 26 12 P le a se cite th is a rticle a s: A . B a ld a ssin e t a l., A n o p tim ize d u n su p e rv ise d m a n ifo ld le a rn in g a lg o rith m fo r m a n y co re a rch i- te ctu re s, In fo rm a tio n S cie n ce s (2 0 1 8 ), h ttp s://d o i.o rg /1 0 .1 0 1 6 /j.in s.2 0 1 8 .0 6 .0 2 3 https://doi.org/10.1016/j.ins.2018.06.023 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 15 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Table 2 Machine/System configuration used in the experiments. Component Configuration Processor Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30 GHz Sockets 2 Cores per socket 18 Threads per core 2 L1 instruction cache 32 KB, 8-way set associative (private) L1 data cache 32 KB, 8-way set associative (private) L2 (unified) 256 KB, 8-way set associative (private) L3 (unified) 45 MB, 20-way set associative (shared) RAM 128 GB DDR4-2400, 8 ∗16 GB DIMMS Max bandwidth 71.53 GB/s Operating system Linux kernel 3.10.0 (64-bit) – Oracle Server 7.2 Compiler gcc (GCC) 4.8.5 (Red Hat 4.8.5-4) Optimization flag -O3 Fig. 4. Broadwell Optimization Results (sequential version). Overall, the results indicate that, for smaller collections, the use of the distance matrix and memoization is the best al- ternative. However, as the collection size grows, replacing the distance matrix with a dictionary is the best choice. Since our goal is to provide an efficient implementation of ReckNN for large-scale distance learning, the results strongly indicate that the proposed optimizations (i.e., memoization and dictionary) are effective. For instance, with MediaEval , the Optimized version is already faster than Memoiz (25.1s versus 26.6s, respectively) 6.3. Parallelization results In this section, the parallelization strategy and different synchronization methods presented in Section 5 are analysed. In particular, the following configurations are considered: • serial : the serial version as presented in the previous section; • nosync : in this version we execute the code in parallel, but do not make any effort to avoid the lost update problem. As a result, the accuracy of ReckNN maybe compromised. We analyze the impact of this version on the accuracy in Section 6.4 ; • atomic : this version refers to the atomic instructions used to avoid the lost update problem in Naive and Memoiz , and to the lock-free insertion in Dictionary and Optimized ; Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 16 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Fig. 5. Overhead introduced by the different synchronization methods for each optimization. Fig. 6. Results for MediaEval with the nosync method: (a) Execution time, (b) Speedup. For each cluster, from left to right: Naive , Dictionary , Memoiz , and Optimized . • hle : it uses Intel’s hardware lock elision when updating the authority scores (both in the distance matrix and the dictionary versions). In the following paragraphs, we discuss the overhead introduced by the various synchronization mechanisms, as well as their scalability. Due to space limitations, we focus on the Corel5k and, in particular, the MediaEval datasets. The results for MPEG-7 follow the same trends. 6.3.1. Single core overhead We start by analysing the overhead introduced by the different synchronization methods in the single core case compared with the sequential execution. The results for Corel5k and MediaEval under different optimizations are shown in Fig. 5 . Notice that the overhead comes from the AuthScore phase, since this is the phase where the update is performed. As expected, nosync has virtually no overhead since no synchronization is being performed. This is not the case for atomic and hle , however. In particular, the overhead is more prominent in the Memoiz optimization, as the updates are executed more frequently. For the atomic method, the overhead can reach 20% in Corel5k and 17% in MediaEval . The situation is even worse for hle , with an overhead up to 66% in Corel5k and 50% in MediaEval . This initial analysis shows that there is an important price to be paid for the consistency of the updates. 6.3.2. No synchronization scalability In this configuration, the lost update problem is allowed to happen. Fig. 6 shows the execution time (left) and speedup (right) for MediaEval as the number of cores is increased up to 72. For each cluster in Fig. 6 a, the bars (from left to right) correspond to Naive , Dictionary , Memoiz , and Optimized . As can be seen, the execution time decreases as more cores are added. With 36 cores, the running time with Optimized is about 0.79s. It is important to notice that, with 36 cores and the Balanced policy, there is at most one thread context being used per core (hyperthreading is not used). Initially, as the number of cores is increased to 42, a slowdown is observed because of the extra overhead caused by the additional thread contexts. However, as the number of cores keeps increasing, the execution time starts to decrease again. This behavior can be easily seen in the speedup graph shown in Fig. 6 b. With 72 cores, when all thread contexts are being fully utilized, the Optimized execution time is 0.65s, a gain of 21% when compared with the same configuration with 36 cores. The speedup curves of Fig. 6 b also show that the improvements are much more significant when going from 6 to 36 cores. As mentioned before, when going from 36 to 72 cores, more thread contexts per core are used and the performance Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 17 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Fig. 7. Results for MediaEval with the atomic method: (a) Execution time, (b) Speedup. For each cluster, from left to right: Naive , Dictionary , Memoiz , and Optimized . Fig. 8. Results for MediaEval with the hle method: (a) Execution time, (b) Speedup. For each cluster, from left to right: Naive , Dictionary , Memoiz , and Optimized . does not improve at the same pace. This is because the Intel processor utilized in our experiments use a form of simulta- neous multithreading (called hyperthreading in Intel’s terminology) that forces multiple thread contexts to share hardware resources. For instance, the Dictionary optimization achieves a speedup of 32.6x with 36 cores and 42.5x with 72 cores. It is also important to notice that the speedup numbers are calculated individually for each optimization (they have different sequential execution time). Therefore, an optimization with a speedup higher than others does not imply that its respective execution time is also faster. As an example, Dictionary achieves a 42.5x speedup with 72 cores while Optimized ’s speedup is 38.7x, although Optimized execution time is 0.65s compared to Dictionary ’s 0.93s. Overall, the results with inconsistent updates reveal a good scalability and performance, with the Optimized version providing the best execution time with 72 cores (0.65s). Also, Optimized was 1.69x faster than Naive with that core configuration (0.65s and 1.10s, respectively). 6.3.3. Atomic scalability As with the previous figure for nosync , Fig. 7 shows execution time and speedup results for atomic . Both Fig. 7 a and Fig. 7 b look very similar structurally to the ones showed previously for nosync , but the absolute values are somewhat lower due to the increased overhead added by the atomic instructions. Optimized is still the best version with a running time of 0.79s with 72 cores. Compared to nosync ’s 0.65s with the same configuration, the slowdown is 17.7%. The speedups are also less accentuated, with Optimized achieving 31.8x with 72 cores, against nosync ’s 38.7x. Nonetheless, the performance still scales and the lost updated problem is solved with the atomic synchronization. Finally, Optimized was 1.52x faster than Naive with 72 cores (0.79s versus 1.20s), showing the effectiveness of the proposed optimizations. 6.3.4. HLE scalability The last synchronization strategy considered makes use of Intel’s hardware lock elision feature. For this mechanism, the best thread affinity strategy was Compact , as both Balanced and Scatter tend to worsen the performance very early. As illustrated by Fig. 8 , Intel’s HLE is very sensitive to locality: the closest the thread contexts are, the less the overhead to elide the locks. With 36 cores and the Compact policy, all thread contexts of a single socket are fully utilized. Therefore, the system performs reasonably well from 1 to 36 cores but, after that amount, cores from the other socket start to be utilized and the performance degrades considerably. With a high thread count, the situation is worse for Dictionary and Optimized because the region of code that is covered by the lock elision mechanism is larger (recall that a dictionary update is required). Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 18 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Looking at Fig. 8 b, it is possible to notice that Memoiz and Optimized do not scale after 18 cores. These versions execute update operations more frequently due to the memoization optimization. Although Naive and Dictionary still scale from 24 to 36 cores, their running time do not improve by a large margin. The best execution time for HLE was achieved with the Optimized version and 18 cores, totalizing 2.48s. 6.3.5. Summing up For all the synchronization methods described, the Optimized version provided the best execution time with MediaEval , showing the effectiveness of the proposed parallelization technique and optimizations for datasets with large collections. Although we did not show the results for the MPEG-7 and Corel5k datasets due to space limitations, the trends are similar but with Memoiz providing the best performance since those datasets are smaller. As expected, the best results were achieved when no synchronization is performed ( nosync ), although the price is a possible loss of accuracy of the classifier (quantified in Section 6.4 ). When atomic operations are used to synchronize the access to the distance matrix/dictionary, a slowdown of about 17.7% compared to nosync was observed with 72 cores. Unfortunately, the hardware lock elision mechanism did not scale after 36 cores and, even when only one socket is used, the slowdown is about 73.8% and 68.1% compared to nosync and atomic , respectively. 6.4. Impact of no synchronization Previous subsections have shown that the best performance is achieved when no synchronization is used to avoid the lost update problem. Earlier works [38,45] have argued that, in similar scenarios, the problem does not affect the quality of the distance learning method and thus the solution should be accepted. However, no quantitative analysis has been performed on machines with a large number of cores to investigate whether that is really the case. In order to shed some light into that issue, Table 3 provides, for each dataset, the mean average precision (MAP) and lost updates percentage as the number of cores are increased. Except for the single core case (when all values collected by the 10 executions are exactly the same), the table shows the MAP values with the largest difference to the base MAP (the one collected with 1 core) and the maximum number of lost updates among the 10 executions, for each core configuration. Consider, initially, the quality loss measured by MAP. The base value is given when the number of cores is one, since the lost update problem cannot occur in this case. For each dataset, MAP scores for each optimization are presented: N for Naive , D for Dictionary , M for Memoiz , and O for Optimized . Notice that, for MPEG-7 , the MAP for Naive and Memoiz (93.10%) slightly differs from that of Dictionary and Optimized (93.11%), even when only 1 core is used. The reason for this mismatch is due to the different positions objects with the same weight can occupy in the ranked list after the sorting procedure. While in the distance matrix ( Naive and Memoiz ) this position is given by the order they appear in the matrix, the position when using a dictionary ( Dictionary and Optimized ) is determined by the hash function. It is important to note, however, that both solutions are semantically equivalent and the resulting value should not diverge by a large margin (which is the case here). Table 3 also shows that the MAP values for increasing number of cores tend not to differ a lot from the base value (single core). This is particularly true for MediaEval , when the difference is only 0.01%. For MPEG-7 (93.14%) and Corel5k (35.47%), the difference is never greater than 0.04%. The results indicate that the version with no synchronization does not seem to lose accuracy compared to the base value, even with a large number of cores. In order to get further insight into this behavior, we instrumented the code so as to measure the number of times a core failed to update the distance due to another core also updating it. These values are shown in the second big column of Table 3 (Lost Updates) for the versions of ReckNN that use the distance matrix ( Naive and Memoiz ). As it can be seen, the amount of lost updates indeed increase with the number of cores, but very slowly. Notice that the amount of lost updates with memoization (M) tends to increase faster due to the fact that the updates are performed more frequently with this optimization. Even then, the number of lost updates was never larger than 1.63% of the total updates performed. 7. Conclusion The massive volumes of multimedia data have evidenced the necessity of efficient and scalable retrieval methods. With the goal of meeting those demands, this paper proposed and analysed a series of optimizations for the Unsupervised Mani- fold Reciprocal kNN Graph algorithm ( ReckNN ). In particular, it was investigated the use of dictionaries to store the distances between the objects in ReckNN , along with the memoization of authority scores. To effectively tackle scalability, a paralleliza- tion of ReckNN was presented together with several synchronization methods. A thorough analysis using a modern 72-core Intel machine showed that the proposed optimizations are both efficient and scalable, with speedups in the order of 40x. One possible limitation of the current study is that most of the optimization strategies were devised and evaluated in shared memory machines. Therefore, in future works we plan to investigate the optimizations for distributed environments (such as clouds). As future work, we also intend to investigate the possibility of applying the proposed strategies to other manifold learning methods. Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 https://doi.org/10.1016/j.ins.2018.06.023 A . B a ld a ssin et a l. / In fo rm a tio n Scien ces 0 0 0 (2 0 18 ) 1 – 2 1 1 9 A R T IC L E IN P R E S S JID : IN S [m 3 G sc; Ju n e 2 0 , 2 0 1 8 ;1 4 :4 9 ] Table 3 MAP values and lost update percentages for different core configurations. MAP (%) Lost updates (%) MPEG-7 Corel5k MediaEval MPEG-7 Corel5k MediaEval Cores N D M O N D M O N D M O N M N M N M 1 93.10 93.11 93.10 93.11 35.51 35.51 35.51 35.51 3.89 3.89 3.89 3.89 0.00 0.00 0.00 0.00 0.00 0.00 6 93.10 93.11 93.11 93.11 35.51 35.50 35.49 35.51 3.89 3.89 3.90 3.89 0.03 0.05 0.03 0.08 0.05 0.09 12 93.12 93.11 93.12 93.11 35.50 35.52 35.48 35.49 3.90 3.89 3.90 3.89 0.08 0.18 0.07 0.17 0.11 0.21 18 93.11 93.11 93.12 93.12 35.52 35.49 35.52 35.51 3.89 3.89 3.89 3.90 0.10 0.20 0.11 0.26 0.18 0.33 24 93.11 93.12 93.10 93.10 35.48 35.49 35.52 35.48 3.89 3.89 3.90 3.89 0.28 0.45 0.12 0.29 0.22 0.43 30 93.10 93.11 93.10 93.13 35.48 35.51 35.47 35.51 3.89 3.89 3.89 3.89 0.19 0.38 0.17 0.38 0.31 0.58 36 93.11 93.12 93.12 93.10 35.50 35.49 35.48 35.49 3.89 3.89 3.89 3.89 0.18 0.35 0.21 0.47 0.37 0.72 42 93.11 93.11 93.10 93.11 35.49 35.49 35.51 35.49 3.90 3.89 3.90 3.90 0.24 0.48 0.19 0.46 0.36 0.71 48 93.10 93.12 93.14 93.11 35.48 35.49 35.51 35.52 3.90 3.89 3.90 3.90 0.36 0.74 0.20 0.54 0.40 0.76 54 93.10 93.13 93.11 93.13 35.51 35.49 35.47 35.52 3.90 3.89 3.90 3.90 0.43 0.84 0.25 0.69 0.43 0.86 60 93.12 93.12 93.13 93.12 35.51 35.49 35.51 35.51 3.89 3.89 3.90 3.89 0.41 0.90 0.30 0.80 0.48 0.92 66 93.12 93.11 93.10 93.11 35.48 35.50 35.47 35.49 3.90 3.90 3.90 3.89 0.36 0.90 0.34 0.92 0.45 0.90 72 93.10 93.12 93.11 93.10 35.48 35.51 35.50 35.52 3.89 3.89 3.90 3.89 0.74 1.63 0.38 1.01 0.56 1.06 P le a se cite th is a rticle a s: A . B a ld a ssin e t a l., A n o p tim ize d u n su p e rv ise d m a n ifo ld le a rn in g a lg o rith m fo r m a n y co re a rch i- te ctu re s, In fo rm a tio n S cie n ce s (2 0 1 8 ), h ttp s://d o i.o rg /1 0 .1 0 1 6 /j.in s.2 0 1 8 .0 6 .0 2 3 https://doi.org/10.1016/j.ins.2018.06.023 20 A. Baldassin et al. / Information Sciences 0 0 0 (2018) 1–21 ARTICLE IN PRESS JID: INS [m3Gsc; June 20, 2018;14:49 ] Acknowledgments We thank the São Paulo Research Foundation- FAPESP (grants 2013/08645-0 and 2016/06441-7), CNPq (grants 446160/2014-8, 423228/2016-1, 313122/2017-2, and 308194/2017-9), and Royal Academy of Engineering(grant NRCP1516/1/33) for funding. References [1] J. Almeida , N.J. Leite , R.S. Torres , Comparison of video sequences with histograms of motion patterns, in: Proceedings of the IEEE International Confer- ence on Image Processing (ICIP), 2011, pp. 3673–3676 . [2] J. Almeida , D.C.G. Pedronette , O.A.B. Penatti , Unsupervised manifold learning for video genre retrieval, in: Proceedings of the Iberoamerican Congress on Pattern Recognition (CIARP), 2014, pp. 604–612 . [3] J. Almeida , A. Rocha , R.S. Torres , S. Goldenstein , Making colors worth more than a thousand words, in: Proceedings of the ACM International Sympo- sium on Applied Computing (ACM-SAC), 2008, pp. 1180–1186 . [4] G.M. Amdahl , Validity of the single processor approach to achieving large scale computing capabilities, in: Proceedings of the AFIPS Spring Joint Computer Conference, 1967, pp. 4 83–4 85 . [5] X. Bai , S. Bai , X. Wang , Beyond diffusion process: neighbor set similarity for fast re-ranking, Inf. Sci. 325 (2015) 342–354 . [6] C. Beecks , Distance-based Similarity models for content-based multimedia retrieval, Fakultät für Mathematik, Informatik und Naturwissenschaften, RWTH Aachen University, 2013 Ph.D. thesis . [7] C. Beecks , M.S. Uysal , T. Seidl , A comparative study of similarity measures for content-based multimedia retrieval, in: Proceedings of the IEEE Interna- tional Conference on Multimedia and Expo, 2010, pp. 1552–1557 . [8] O.A.R. Board, OpenMP Application Program Interface, Version 4.0, 2013 https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf . [9] K.-w. Chau , Use of meta-heuristic techniques in rainfall-runoff modelling, Water 9 (3) (2017) . [10] J. Corbet, AutoNUMA: The Other Approach to NUMA Scheduling, 2012 https://lwn.net/Articles/488709/ . [11] T. Cormen , C. Leiserson , R. Rivest , C. Stein , Introduction to Algorithms, third ed., The MIT Press, 2009 . [12] I. Corporation , Intel Architecture Instruction Set Extensions Programming Reference, Intel Corporation, 2012 . [13] L.A . Duarte , O.A .B. Penatti , J. Almeida , Bag of genres for video retrieval, in: Proceedings of the Conference on Graphics, Patterns and Images (SIBGRAPI), 2016, pp. 257–264 . [14] P. Duygulu , K. Barnard , J.F.G.d. Freitas , D.A. Forsyth , Object recognition as machine translation: learning a lexicon for a fixed image vocabulary, in: Proceedings of the 7th European Conference on Computer Vision-Part IV, ECCV, 2002, pp. 97–112 . [15] V. Gholami , K. Chau , F. Fadaee , J. Torkaman , A. Ghaffari , Modeling of groundwater level fluctuations using dendrochronology in alluvial aquifers, J. Hydrol. 529 (2015) 1060–1069 . [16] J. Huang , S.R. Kumar , M. Mitra , W.-J. Zhu , R. Zabih , Image indexing using color correlograms, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1997, pp. 762–768 . [17] H. Jegou , C. Schmid , H. Harzallah , J. Verbeek , Accurate image search using the contextual dissimilarity measure, Proceedings of the IEEE Trans. Pattern Anal. Mach. Intell. 32 (1) (2010) 2–11 . [18] J. Jiang , B. Wang , Z. Tu , Unsupervised metric learning by self-smoothing operator, in: Proceedings of the International Conference on Computer Vision (ICCV), 2011, pp. 794–801 . [19] D. Knuth , The art of computer programming, vol. 3, sorting and searching, second ed., Addison-Wesley, 1998 . [20] A. Kuleshov , A. Bernstein , Manifold learning in data mining tasks, in: P. Perner (Ed.), Machine Learning and Data Mining in Pattern Recognition, 2014, pp. 119–133 . Cham [21] L.J. Latecki , R. Lakamper , U. Eckhardt , Shape descriptors for non-rigid shapes with a single closed contour, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 20 0 0, pp. 424–429 . [22] M.H.C. Law , A.K. Jain , Incremental nonlinear dimensionality reduction by manifold learning, IEEE Trans. Pattern Anal. Mach. Intell. 28 (3) (2006) 377–391 . [23] H. Ling , X. Yang , L.J. Latecki , Balancing deformability and discriminability for shape matching, in: Proceedings of the European Conference on Computer Vision (ECCV), 3, 2010, pp. 411–424 . [24] P. Liu , J.-M. Guo , K. Chamnongthai , H. Prasetyo , Fusion of color histogram and lbp-based features for texture image retrieval and classification, Inf. Sci. 390 (2017) 95–111 . [25] S. Liu , J. Wu , L. Feng , H. Qiao , Y. Liu , W. Luo , W. Wang , Perceptual uniform descriptor and ranking on manifold for image retrieval, Inf. Sci. 424 (2018) 235–249 . [26] K. Mehlhorn , P. Sanders , Algorithms and Data Structures, Springer, 2008 . [27] M.M. Michael , High performance dynamic lock-free hash tables and list-based sets, in: Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, 2002, pp. 73–82 . [28] B. Nichols , D. Buttlar , J.P. Farrell , Pthreads Programming - A POSIX Standard for Better Multiprocessing, O’Reilly, 1996 . [29] J.P. Nielsen , S. Karlsson , A scalable lock-free hash table with open addressing, in: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016, pp. 33:1–33:2 . [30] L. Page , S. Brin , R. Motwani , T. Winograd , The PageRank citation ranking: bringing order to the web., Technical Report, Stanford InfoLab, 1999 . Previous number = SIDL-WP-1999-0120 [31] D.C.G. Pedronette , J. Almeida , R. da S. Torres , A scalable re-ranking method for content-based image retrieval, Inf. Sci. 265 (1) (2014) 91–104 . [32] D.C.G. Pedronette , J. Almeida , R.S. Torres , A graph-based ranked-list model for unsupervised distance learning on shape retrieval, Pattern Recognit. Lett. 83, Part 3 (2016) 357–367 . [33] D.C.G. Pedronette , F.M.F. Gonalves , I.R. Guilherme , Unsupervised manifold learning through reciprocal knn graph and connected components for image retrieval tasks, Pattern Recognit. 75 (2018) 161–174 . [34] D.C.G. Pedronette , O.A. Penatti , R.d.S. Torres , Unsupervised manifold learning using reciprocal knn graphs in image re-ranking and rank aggregation tasks, Image Vis. Comput. 32 (2) (2014) 120–130 . [35] D.C.G. Pedronette , R.T.C.O.A.B. Penatti , R. da S. Torres , A semi-supervised learning algorithm for relevance feedback and collaborative image retrieval, EURASIP J. Image Video Process. 2015 (1) (2015) 27 . [36] D.C.G. Pedronette , R.d.S. Torres , Exploiting pairwise recommendation and clustering strategies for image re-ranking, Inf. Sci. 207 (2012) 19–34 . [37] D.C.G. Pedronette , R.d.S. Torres , Image re-ranking and rank aggregation based on similarity of ranked lists, Pattern Recognit. 46 (8) (2013) 2350–2360 . [38] D.C.G. Pedronette , R.d.S. Torres , E. Borin , M. Breternitz , Efficient image re-ranking computation on GPUs, in: Proceedings of the International Sympo- sium on Parallel and Distributed Processing (ISPA), 2012, pp. 95–102 . [39] O.A.B. Penatti , E. Valle , R.d.S. Torres , Comparative study of global color and texture descriptors for web image retrieval, J. Vis. Commun. Image Repre- sent. 23 (2) (2012) 359–380 . [40] D. Qin , S. Gammeter , L. Bossard , T. Quack , L. van Gool , Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 777–784 . [41] R. Rajwar , J.R. Goodman , Speculative lock elision: Enabling highly concurrent multithreaded execution, in: Proceedings of the 34th annual ACM/IEEE International Symposium on Microarchitecture, 2001, pp. 294–305 . Please cite this article as: A. Baldassin et al., An optimized unsupervised manifold learning algorithm for manycore archi- tectures, Information Sciences (2018), https://doi.org/10.1016/j.ins.2018.06.023 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0001 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0001 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0001 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0001 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0002 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0002 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0002 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0002 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0003 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0003 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0003 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0003 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0003 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0004 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0004 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0005 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0005 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0005 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0005 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0006 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0006 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0007 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0007 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0007 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0007 https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0009 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0009 https://lwn.net/Articles/488709/ http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0011 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0011 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0011 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0011 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0011 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0012 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0012 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0013 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0013 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0013 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0013 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0014 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0014 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0014 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0014 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0014 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0015 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0015 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0015 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0015 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0015 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0015 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0016 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0016 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0016 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0016 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0016 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0016 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0017 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0017 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0017 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0017 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0017 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0018 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0018 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0018 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0018 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0019 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0019 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0020 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0020 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0020 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0020 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0021 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0021 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0021 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0021 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0022 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0022 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0022 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0023 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0023 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0023 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0023 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0024 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0024 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0024 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0024 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0024 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0025 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0025 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0025 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0025 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0025 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0025 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0025 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0025 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0026 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0026 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0026 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0027 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0027 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0028 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0028 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0028 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0028 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0029 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0029 http://refhub.elsevier.com/S0020-0255(18)30463-8/sbref0029 http://refhub.elsevier.com/S0020-0255(